Data engineering with Azure Databricks

Microsoft Learn
Free Online Course
English
10-11 hours of material
Self-paced

Overview

  • Module 1: Describe Azure Databricks
  • In this module, you will:

    • Understand the Azure Databricks platform
    • Create your own Azure Databricks workspace
    • Create a notebook inside your home folder in Databricks
    • Understand the fundamentals of Apache Spark notebooks
    • Create, or attach to, a Spark cluster
    • Identify the types of tasks well-suited to the unified analytics engine Apache Spark
  • Module 2: Spark architecture fundamentals
  • In this module, you will:

    • Understand the architecture of an Azure Databricks Spark Cluster
    • Understand the architecture of a Spark Job
  • Module 3: Read and write data in Azure Databricks
  • In this module, you will:

    • Use Azure Databricks to read multiple file types, both with and without a schema.
    • Combine inputs from files and data stores, such as Azure SQL Database.
    • Transform and store that data for advanced analytics.
  • Module 4: Work with DataFrames in Azure Databricks
  • In this module, you will:

    • Use the count() method to count rows in a DataFrame
    • Use the display() function to display a DataFrame in the Notebook
    • Cache a DataFrame for quicker operations if the data is needed a second time
    • Use the limit function to display a small set of rows from a larger DataFrame
    • Use select() to select a subset of columns from a DataFrame
    • Use distinct() and dropDuplicates() to remove duplicate data
    • Use drop() to remove columns from a DataFrame
  • Module 5: Describe lazy evaluation and other performance features in Azure Databricks
  • In this module, you will:

    • Describe the difference between eager and lazy execution
    • Define and identify transformations
    • Define and identify actions
    • Describe the fundamentals of how the Catalyst Optimizer works
    • Differentiate between wide and narrow transformations
  • Module 6: Work with DataFrame columns in Azure Databricks
  • In this module, you will:

    • Learn the syntax for specifying column values for filtering and aggregations
    • Understand the use of the Column Class
    • Sort and filter a DataFrame based on Column Values
    • Use collect() and take() to return records from a DataFrame to the driver of the cluster
  • Module 7: Work with advanced DataFrame methods in Azure Databricks
  • In this module, you will:

    • Manipulate date and time values in Azure Databricks
    • Rename columns in Azure Databricks
    • Aggregate data in Azure Databricks DataFrames
  • Module 8: Describe platform architecture, security, and data protection in Azure Databricks
  • In this module, you will:

    • Learn the Azure Databricks platform architecture and how it is secured.
    • Use Azure Key Vault to store secrets used by Azure Databricks and other services.
    • Access Azure Storage with Key Vault-based secrets.
  • Module 9: Build and query a Delta Lake
  • In this module, you will:

    • Learn about the key features and use cases of Delta Lake.
    • Use Delta Lake to create, append, and upsert tables.
    • Perform optimizations in Delta Lake.
    • Compare different versions of a Delta table using Time Travel.
  • Module 10: Process streaming data with Azure Databricks structured streaming
  • In this module, you will:

    • Learn the key features and uses of Structured Streaming.
    • Stream data from a file and write it out to a distributed file system.
    • Use sliding windows to aggregate over chunks of data rather than all data.
    • Apply watermarking to discard stale data that is too old to keep state for.
    • Connect to Event Hubs to read and write streams.
  • Module 11: Describe Azure Databricks Delta Lake architecture
  • In this module, you will:

    • Process batch and streaming data with Delta Lake.
    • Learn how Delta Lake architecture enables unified streaming and batch analytics with transactional guarantees within a data lake.
  • Module 12: Create production workloads on Azure Databricks with Azure Data Factory
  • In this module, you will:

    • Create an Azure Data Factory pipeline with a Databricks activity.
    • Execute a Databricks notebook with a parameter.
    • Retrieve and log a parameter passed back from the notebook.
    • Monitor your Data Factory pipeline.
  • Module 13: Implement CI/CD with Azure DevOps
  • In this module, you will:

    • Learn about CI/CD and how it applies to data engineering.
    • Use Azure DevOps as a source code repository for Azure Databricks notebooks.
    • Create build and release pipelines in Azure DevOps to automatically deploy a notebook from a development to a production Azure Databricks workspace.
  • Module 14: Integrate Azure Databricks with Azure Synapse
  • In this module, you will:

    • Access Azure Synapse Analytics from Azure Databricks by using the SQL Data Warehouse connector.
  • Module 15: Describe Azure Databricks best practices
  • In this module, you will learn best practices in the following categories:

    • Workspace administration
    • Security
    • Tools & integration
    • Databricks runtime
    • HA/DR
    • Clusters

Syllabus

  • Module 1: Describe Azure Databricks
    • Introduction
    • Explain Azure Databricks
    • Create an Azure Databricks workspace and cluster
    • Understand Azure Databricks Notebooks
    • Exercise: Work with Notebooks
    • Knowledge check
    • Summary
  • Module 2: Spark architecture fundamentals
    • Introduction
    • Understand the architecture of an Azure Databricks Spark cluster
    • Understand the architecture of a Spark job
    • Knowledge check
    • Summary
  • Module 3: Read and write data in Azure Databricks
    • Introduction
    • Read data in CSV format
    • Read data in JSON format
    • Read data in Parquet format
    • Read data stored in tables and views
    • Write data
    • Exercises: Read and write data
    • Knowledge check
    • Summary
  • Module 4: Work with DataFrames in Azure Databricks
    • Introduction
    • Describe a DataFrame
    • Use common DataFrame methods
    • Use the display function
    • Exercise: Distinct articles
    • Knowledge check
    • Summary
  • Module 5: Describe lazy evaluation and other performance features in Azure Databricks
    • Introduction
    • Describe the difference between eager and lazy execution
    • Describe the fundamentals of how the Catalyst Optimizer works
    • Define and identify actions and transformations
    • Describe performance enhancements enabled by shuffle operations and Tungsten
    • Knowledge check
    • Summary
  • Module 6: Work with DataFrame columns in Azure Databricks
    • Introduction
    • Describe the column class
    • Work with column expressions
    • Exercise: Washingtons and Marthas
    • Knowledge check
    • Summary
  • Module 7: Work with advanced DataFrame methods in Azure Databricks
    • Introduction
    • Perform date and time manipulation
    • Use aggregate functions
    • Exercise: Deduplication of data
    • Knowledge check
    • Summary
  • Module 8: Describe platform architecture, security, and data protection in Azure Databricks
    • Introduction
    • Describe the Azure Databricks platform architecture
    • Perform data protection
    • Describe Azure Key Vault and Databricks security scopes
    • Secure access with Azure IAM and authentication
    • Describe security
    • Exercise: Access Azure Storage with Key Vault-backed secrets
    • Knowledge check
    • Summary
  • Module 9: Build and query a Delta Lake
    • Introduction
    • Describe the open source Delta Lake
    • Exercise: Work with basic Delta Lake functionality
    • Describe how Azure Databricks manages Delta Lake
    • Exercise: Use Delta Lake Time Travel and perform optimization
    • Knowledge check
    • Summary
  • Module 10: Process streaming data with Azure Databricks structured streaming
    • Introduction
    • Describe Azure Databricks structured streaming
    • Perform stream processing using structured streaming
    • Work with Time Windows
    • Process data from Event Hubs with structured streaming
    • Knowledge check
    • Summary
  • Module 11: Describe Azure Databricks Delta Lake architecture
    • Introduction
    • Describe bronze, silver, and gold architecture
    • Perform batch and stream processing
    • Knowledge check
    • Summary
  • Module 12: Create production workloads on Azure Databricks with Azure Data Factory
    • Introduction
    • Schedule Databricks jobs in a data factory pipeline
    • Pass parameters into and out of Databricks jobs in data factory
    • Knowledge check
    • Summary
  • Module 13: Implement CI/CD with Azure DevOps
    • Introduction
    • Describe CI/CD
    • Create a CI/CD process with Azure DevOps
    • Knowledge check
    • Summary
  • Module 14: Integrate Azure Databricks with Azure Synapse
    • Introduction
    • Integrate with Azure Synapse Analytics
    • Knowledge check
    • Summary
  • Module 15: Describe Azure Databricks best practices
    • Introduction
    • Understand workspace administration best practices
    • List security best practices
    • Describe tools and integration best practices
    • Explain Databricks runtime best practices
    • Understand cluster best practices
    • Knowledge check
    • Summary