Cloud Hadoop: Scaling Apache Spark

LinkedIn Learning
Free Trial Available
English
Certificate Available
3-4 hours of material
Self-paced

Overview

Generate genuine business insights from big data. Learn to implement Apache Hadoop and Spark workflows on AWS.

Apache Hadoop and Spark make it possible to generate genuine business insights from big data. The Amazon cloud is a natural home for this powerful toolset, providing a variety of services for running large-scale data-processing workflows. In this course, big data architect Lynn Langit shows you how to implement your own Apache Hadoop and Spark workflows on AWS.

Explore deployment options for production-scaled jobs using virtual machines with EC2, managed Spark clusters with EMR, or containers with EKS. Learn how to configure and manage Hadoop clusters and Spark jobs with Databricks, and use Python or the programming language of your choice to import data and execute jobs. Plus, learn how to use Spark libraries for machine learning, genomics, and streaming. Each lesson helps you understand which deployment option is best for your workload.
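To give a concrete flavor of that workflow, here is a minimal PySpark sketch of importing data and executing a job, the pattern the course builds on. It assumes a local PySpark install (pip install pyspark); the bucket path and column names are hypothetical placeholders.

    # Minimal PySpark job: import data and run a simple aggregation.
    # The S3 path and column names below are hypothetical placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("sales-summary").getOrCreate()

    # Load a CSV file into a DataFrame, inferring column types.
    df = spark.read.csv("s3://my-bucket/sales.csv", header=True, inferSchema=True)

    # Aggregate revenue per region and print the result.
    df.groupBy("region").agg(F.sum("revenue").alias("total_revenue")).show()

    spark.stop()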

Syllabus

Introduction
  • Scaling Apache Hadoop and Spark
  • What you should know
  • Using cloud services
1. Hadoop and Spark Fundamentals
  • Modern Hadoop and Spark
  • File systems used with Hadoop and Spark
  • Apache or commercial Hadoop distros
  • Hadoop and Spark libraries
  • Hadoop on Google Cloud Platform
  • Spark job on Google Cloud Platform
2. AWS Cloud Spark Environments
  • Sign up for Databricks Community Edition
  • Add Hadoop libraries
  • Databricks AWS Community Edition
  • Load data into tables
  • Hadoop and Spark cluster on AWS EMR
  • Run Spark job on AWS EMR (see the sketch after this section)
  • Review batch architecture for ETL on AWS
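
As a rough sketch of what the EMR lessons automate, the boto3 call below launches a small Spark cluster and submits a job as a step. The region, instance sizing, and S3 paths are hypothetical placeholders, and the default EMR IAM roles are assumed to already exist in the account.

    # Hedged sketch: launch an EMR cluster with Spark and submit one job step.
    # Region, instance types/counts, and S3 paths are hypothetical placeholders.
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.run_job_flow(
        Name="spark-demo",
        ReleaseLabel="emr-6.15.0",
        Applications=[{"Name": "Spark"}],
        Instances={
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            "KeepJobFlowAliveWhenNoSteps": False,  # terminate when the step finishes
        },
        Steps=[{
            "Name": "run-spark-job",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",  # EMR's wrapper for running spark-submit
                "Args": ["spark-submit", "s3://my-bucket/jobs/etl_job.py"],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",  # assumes the default roles already exist
        ServiceRole="EMR_DefaultRole",
    )
    print(response["JobFlowId"])
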
3. Spark Basics
  • Apache Spark libraries
  • Spark data interfaces
  • Select your programming language
  • Spark session objects (see the sketch after this section)
  • Spark shell
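
For a taste of the session and shell lessons: the SparkSession is the entry point to Spark's DataFrame and SQL interfaces, and the interactive pyspark shell creates one for you as the variable spark. A minimal sketch, assuming a local PySpark install:

    # Create a SparkSession, the entry point to Spark's data interfaces.
    # In the pyspark shell this object already exists as `spark`.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("basics-demo")
        .master("local[*]")  # run locally on all cores; on a cluster the launcher sets this
        .getOrCreate()
    )

    df = spark.range(5)  # a tiny DataFrame with an `id` column, 0..4
    df.show()
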
4. Using Spark
  • Tour the Databricks environment
  • Tour the notebook
  • Import and export notebooks
  • Calculate Pi on Spark
  • Run WordCount on Spark with Scala (see the sketch after this section)
  • Import data
  • Transformations and actions
  • Caching and the DAG
  • Architecture: Streaming for prediction
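
The WordCount lesson itself uses Scala; as a companion sketch only, here is the same pipeline in PySpark, illustrating lazy transformations, the actions that trigger the DAG, and caching. The input path is a hypothetical placeholder.

    # PySpark WordCount sketch (the course lesson uses Scala).
    # The input path is a hypothetical placeholder.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wordcount").getOrCreate()
    lines = spark.sparkContext.textFile("s3://my-bucket/books/sample.txt")

    # Transformations are lazy: nothing executes until an action runs.
    counts = (
        lines.flatMap(lambda line: line.split())
             .map(lambda word: (word, 1))
             .reduceByKey(lambda a, b: a + b)
    )

    counts.cache()          # keep the result in memory for reuse
    print(counts.take(10))  # first action triggers the DAG
    print(counts.count())   # second action reuses the cached data

    spark.stop()
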
5. Spark Libraries
  • Spark SQL
  • SparkR
  • Spark ML: Preparing data (see the sketch after this section)
  • Spark ML: Building the model
  • Spark ML: Evaluating the model
  • Advanced machine learning on Spark
  • MXNet
  • Spark with ADAM for genomics
  • Spark architecture for genomics
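
To illustrate the three Spark ML lessons (preparing data, building the model, evaluating the model), here is a compact sketch on a toy DataFrame; the column names and values are invented for illustration only.

    # Hedged Spark ML sketch: prepare features, fit a model, evaluate it.
    # The toy data and column names are invented for illustration.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    spark = SparkSession.builder.appName("ml-demo").getOrCreate()

    data = spark.createDataFrame(
        [(0.0, 1.2, 0.3), (1.0, 3.1, 1.7), (0.0, 0.8, 0.1), (1.0, 2.9, 2.2)],
        ["label", "f1", "f2"],
    )

    # Prepare data: combine raw columns into a single feature vector.
    assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
    prepared = assembler.transform(data)

    # Build the model.
    model = LogisticRegression(featuresCol="features", labelCol="label").fit(prepared)

    # Evaluate the model (area under the ROC curve).
    predictions = model.transform(prepared)
    auc = BinaryClassificationEvaluator(labelCol="label").evaluate(predictions)
    print(f"AUC: {auc:.3f}")
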
6. Spark Streaming
  • Reexamine streaming pipelines
  • Spark Streaming (see the sketch after this section)
  • Streaming ingest services
  • Advanced Spark Streaming with MLeap
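
As one way to express the ideas in this chapter, the sketch below uses Spark's Structured Streaming API to count words arriving on a socket and print running totals. Host and port are placeholders for any streaming ingest source; locally you could feed it with nc -lk 9999.

    # Hedged Structured Streaming sketch: running word counts from a socket.
    # Host and port are placeholders for any streaming ingest source.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

    lines = (
        spark.readStream
        .format("socket")
        .option("host", "localhost")
        .option("port", 9999)
        .load()
    )

    # Split each line into words and maintain a running count per word.
    counts = (
        lines.select(F.explode(F.split(lines.value, " ")).alias("word"))
             .groupBy("word")
             .count()
    )

    query = (
        counts.writeStream
        .outputMode("complete")  # emit the full updated table on each trigger
        .format("console")
        .start()
    )
    query.awaitTermination()
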
7. Scaling Spark on AWS and GCP
  • Scale Spark on the cloud by example
  • Build a quick start with Databricks AWS
  • Scale Spark cloud compute with VMs (see the sketch after this section)
  • Optimize cloud Spark virtual machines
  • Use AWS EKS containers and data lake
  • Optimize Spark cloud data tiers on Kubernetes
  • Build reproducible cloud infrastructure
  • Scale on GCP Dataproc or on Terra.bio
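
This chapter is about matching Spark's resource settings to the VM or container shapes underneath the cluster. As an illustrative (not prescriptive) sketch, these are the kinds of knobs involved when sizing executors; the values are placeholders, not recommendations.

    # Illustrative executor sizing; the numbers are placeholders, and the
    # right values depend on the VM or container shape backing the cluster.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("scaled-job")
        .config("spark.executor.instances", "10")  # initial count when dynamic allocation is on
        .config("spark.executor.cores", "4")
        .config("spark.executor.memory", "8g")
        .config("spark.dynamicAllocation.enabled", "true")           # let Spark add/remove executors
        .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")  # needed without an external shuffle service
        .getOrCreate()
    )
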
Conclusion
  • Continue learning for scaling

Taught by

Lynn Langit