Cloud Hadoop: Scaling Apache Spark

LinkedIn Learning
Free Trial Available
English
Certificate Available
3-4 hours of material
Self-paced

Overview

Generate genuine business insights from big data. Learn to implement Apache Hadoop and Spark workflows on AWS.

Apache Hadoop and Spark make it possible to generate genuine business insights from big data. The Amazon cloud is a natural home for this powerful toolset, providing a variety of services for running large-scale data-processing workflows. In this course, big data architect Lynn Langit shows you how to implement your own Apache Hadoop and Spark workflows on AWS.

Explore deployment options for production-scaled jobs using virtual machines with EC2, managed Spark clusters with EMR, or containers with EKS. Learn how to configure and manage Hadoop clusters and Spark jobs with Databricks, and use Python or the programming language of your choice to import data and execute jobs. Plus, learn how to use Spark libraries for machine learning, genomics, and streaming. Each lesson helps you understand which deployment option is best for your workload.
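To give a concrete flavor of that workflow, here is a minimal PySpark sketch of importing data and executing a job, the pattern the course builds on. It assumes a local PySpark install (pip install pyspark); the bucket path and column names are hypothetical placeholders.

    # Minimal PySpark job: import data and run a simple aggregation.
    # The S3 path and column names below are hypothetical placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("sales-summary").getOrCreate()

    # Load a CSV file into a DataFrame, inferring column types.
    df = spark.read.csv("s3://my-bucket/sales.csv", header=True, inferSchema=True)

    # Aggregate revenue per region and print the result.
    df.groupBy("region").agg(F.sum("revenue").alias("total_revenue")).show()

    spark.stop()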

Syllabus

Introduction
  • Scaling Apache Hadoop and Spark
  • What you should know
  • Using cloud services
1. Hadoop and Spark Fundamentals
  • Modern Hadoop and Spark
  • File systems used with Hadoop and Spark
  • Apache or commercial Hadoop distros
  • Hadoop and Spark libraries
  • Hadoop on Google Cloud Platform
  • Spark job on Google Cloud Platform
2. AWS Cloud Spark Environments
  • Sign up for Databricks Community Edition
  • Add Hadoop libraries
  • Databricks AWS Community Edition
  • Load data into tables
  • Hadoop and Spark cluster on AWS EMR
  • Run Spark job on AWS EMR (see the sketch after this section)
  • Review batch architecture for ETL on AWS
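
As a rough sketch of what the EMR lessons automate, the boto3 call below launches a small Spark cluster and submits a job as a step. The region, instance sizing, and S3 paths are hypothetical placeholders, and the default EMR IAM roles are assumed to already exist in the account.

    # Hedged sketch: launch an EMR cluster with Spark and submit one job step.
    # Region, instance types/counts, and S3 paths are hypothetical placeholders.
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.run_job_flow(
        Name="spark-demo",
        ReleaseLabel="emr-6.15.0",
        Applications=[{"Name": "Spark"}],
        Instances={
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            "KeepJobFlowAliveWhenNoSteps": False,  # terminate when the step finishes
        },
        Steps=[{
            "Name": "run-spark-job",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",  # EMR's wrapper for running spark-submit
                "Args": ["spark-submit", "s3://my-bucket/jobs/etl_job.py"],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",  # assumes the default roles already exist
        ServiceRole="EMR_DefaultRole",
    )
    print(response["JobFlowId"])
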
3. Spark Basics
  • Apache Spark libraries
  • Spark data interfaces
  • Select your programming language
  • Spark session objects (see the sketch after this section)
  • Spark shell
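
For a taste of the session and shell lessons: the SparkSession is the entry point to Spark's DataFrame and SQL interfaces, and the interactive pyspark shell creates one for you as the variable spark. A minimal sketch, assuming a local PySpark install:

    # Create a SparkSession, the entry point to Spark's data interfaces.
    # In the pyspark shell this object already exists as `spark`.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("basics-demo")
        .master("local[*]")  # run locally on all cores; on a cluster the launcher sets this
        .getOrCreate()
    )

    df = spark.range(5)  # a tiny DataFrame with an `id` column, 0..4
    df.show()
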
4. Using Spark
  • Tour the Databricks environment
  • Tour the notebook
  • Import and export notebooks
  • Calculate Pi on Spark
  • Run WordCount on Spark with Scala (see the sketch after this section)
  • Import data
  • Transformations and actions
  • Caching and the DAG
  • Architecture: Streaming for prediction
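
The WordCount lesson itself uses Scala; as a companion sketch only, here is the same pipeline in PySpark, illustrating lazy transformations, the actions that trigger the DAG, and caching. The input path is a hypothetical placeholder.

    # PySpark WordCount sketch (the course lesson uses Scala).
    # The input path is a hypothetical placeholder.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wordcount").getOrCreate()
    lines = spark.sparkContext.textFile("s3://my-bucket/books/sample.txt")

    # Transformations are lazy: nothing executes until an action runs.
    counts = (
        lines.flatMap(lambda line: line.split())
             .map(lambda word: (word, 1))
             .reduceByKey(lambda a, b: a + b)
    )

    counts.cache()          # keep the result in memory for reuse
    print(counts.take(10))  # first action triggers the DAG
    print(counts.count())   # second action reuses the cached data

    spark.stop()
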
5. Spark Libraries
  • Spark SQL
  • SparkR
  • Spark ML: Preparing data (see the sketch after this section)
  • Spark ML: Building the model
  • Spark ML: Evaluating the model
  • Advanced machine learning on Spark
  • MXNet
  • Spark with ADAM for genomics
  • Spark architecture for genomics
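
To illustrate the three Spark ML lessons (preparing data, building the model, evaluating the model), here is a compact sketch on a toy DataFrame; the column names and values are invented for illustration only.

    # Hedged Spark ML sketch: prepare features, fit a model, evaluate it.
    # The toy data and column names are invented for illustration.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    spark = SparkSession.builder.appName("ml-demo").getOrCreate()

    data = spark.createDataFrame(
        [(0.0, 1.2, 0.3), (1.0, 3.1, 1.7), (0.0, 0.8, 0.1), (1.0, 2.9, 2.2)],
        ["label", "f1", "f2"],
    )

    # Prepare data: combine raw columns into a single feature vector.
    assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
    prepared = assembler.transform(data)

    # Build the model.
    model = LogisticRegression(featuresCol="features", labelCol="label").fit(prepared)

    # Evaluate the model (area under the ROC curve).
    predictions = model.transform(prepared)
    auc = BinaryClassificationEvaluator(labelCol="label").evaluate(predictions)
    print(f"AUC: {auc:.3f}")
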
6. Spark Streaming
  • Reexamine streaming pipelines
  • Spark Streaming (see the sketch after this section)
  • Streaming ingest services
  • Advanced Spark Streaming with MLeap
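
As one way to express the ideas in this chapter, the sketch below uses Spark's Structured Streaming API to count words arriving on a socket and print running totals. Host and port are placeholders for any streaming ingest source; locally you could feed it with nc -lk 9999.

    # Hedged Structured Streaming sketch: running word counts from a socket.
    # Host and port are placeholders for any streaming ingest source.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

    lines = (
        spark.readStream
        .format("socket")
        .option("host", "localhost")
        .option("port", 9999)
        .load()
    )

    # Split each line into words and maintain a running count per word.
    counts = (
        lines.select(F.explode(F.split(lines.value, " ")).alias("word"))
             .groupBy("word")
             .count()
    )

    query = (
        counts.writeStream
        .outputMode("complete")  # emit the full updated table on each trigger
        .format("console")
        .start()
    )
    query.awaitTermination()
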
7. Scaling Spark on AWS and GCP
  • Scale Spark on the cloud by example
  • Build a quick start with Databricks AWS
  • Scale Spark cloud compute with VMs (see the sketch after this section)
  • Optimize cloud Spark virtual machines
  • Use AWS EKS containers and data lake
  • Optimize Spark cloud data tiers on Kubernetes
  • Build reproducible cloud infrastructure
  • Scale on GCP Dataproc or on Terra.bio
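
This chapter is about matching Spark's resource settings to the VM or container shapes underneath the cluster. As an illustrative (not prescriptive) sketch, these are the kinds of knobs involved when sizing executors; the values are placeholders, not recommendations.

    # Illustrative executor sizing; the numbers are placeholders, and the
    # right values depend on the VM or container shape backing the cluster.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("scaled-job")
        .config("spark.executor.instances", "10")  # initial count when dynamic allocation is on
        .config("spark.executor.cores", "4")
        .config("spark.executor.memory", "8g")
        .config("spark.dynamicAllocation.enabled", "true")           # let Spark add/remove executors
        .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")  # needed without an external shuffle service
        .getOrCreate()
    )
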
Conclusion
  • Continue learning for scaling

Taught by

Lynn Langit