Learn how to build data pipelines using Apache Spark with Python and AWS cloud through a fully case-study-based, learn-by-doing approach.
Apache Spark is a fast, general-purpose distributed computing system. It provides high-level APIs in Scala, Java, Python and R, and an optimised engine that supports general execution graphs (DAGs). It also offers a rich set of higher-level tools: DataFrame for structured data processing using a Domain Specific Language (DSL) and SQL, Structured Streaming for real-time stream processing with Apache Kafka, Databricks Delta Lake for an ACID-compliant data lake, MLlib for machine learning and GraphX for graph processing. Spark is also available as a managed service (Spark as a Service), e.g. Databricks, AWS Glue, etc.
Note: This is not just an introductory or theory-based course. It is full of real-world case studies, starting from basic data transformation using RDD/DataFrame/Dataset/StreamDataset and moving on to deploying full-fledged big data pipelines on a multi-node cluster on AWS cloud, then monitoring and tuning jobs in production.
Learning Outcomes. By the end of this course,
- You will be able to set up the development environment on your local machine (IntelliJ, Python, Git, etc.), start working on any given big data application, and deploy it on AWS cloud using an EMR cluster, Lambda scripts/CloudFormation and Step Functions. You will also be able to monitor and tune it.
- You will be able to identify whether a pipeline is batch-based (DataFrame/Dataset) or streaming (Structured Streaming with Kafka).
- Based on the nature of the data (confidential/PII or not) and the client you’ll be working with, you’ll be able to decide whether to go with an on-premises (Kubernetes, Hortonworks or Cloudera) or cloud-based (AWS, Azure, GCP or Databricks) solution. You will also be able to estimate the computational resources required for a given data volume or pipeline complexity.
- This ~50-hour programme with ~100 hands-on exercises aims to prepare you to work at the level of a Big Data developer with ~3 years of experience.
Recommended background: You should have some basic programming knowledge in any language, e.g. variable declaration, conditional expressions (if-else, switch statement), control statements, collections, OOP, etc.
PART – 1: Getting Started with Spark Core – RDD using Databricks Notebook
Setting up your 1st big data cluster using Databricks Community Edition. Introduction to Spark RDD. Transformations and Actions. Distributed key-value pairs (Pair RDD).
- Writing your 1st Spark Program – Word Count example using Databricks Notebook
- What is RDD?
- How to create an RDD?
- By reading data from external sources – different file formats (text file, sequence file, object file, etc.), Python Collections, etc.
- By applying transformation to existing RDDs
- Applying transformation – map(), flatMap(), filter(), mapPartitions(), mapPartitionsWithIndex(), etc.
- Transformation and Actions
- Pair RDD – groupByKey(), reduceByKey(), aggregateByKey(), cogroup(), foldByKey(), join(), etc.
- Narrow vs wide transformations
- Spark job execution model: application_id -> jobs -> stages -> tasks
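As a rough illustration of the word count exercise and of how transformations differ from actions, here is a minimal sketch. It assumes an existing SparkContext named `sc`, which a Databricks notebook normally provides; the function names `tokenize` and `word_count` are hypothetical and not taken from the course material.

```python
import re


def tokenize(line):
    """Split a line into lowercase words; used as the flatMap function."""
    return re.findall(r"[a-z0-9']+", line.lower())


def word_count(sc, lines):
    """Count words with a flatMap -> map -> reduceByKey chain.

    `sc` is assumed to be an existing SparkContext (hypothetical caller-supplied).
    """
    return (
        sc.parallelize(lines)               # create an RDD from a Python collection
        .flatMap(tokenize)                  # narrow transformation: one line -> many words
        .map(lambda word: (word, 1))        # narrow transformation: build a pair RDD
        .reduceByKey(lambda a, b: a + b)    # wide transformation: shuffles data by key
        .collect()                          # action: this is what triggers the job
    )
```

In a notebook, `word_count(sc, ["to be or not to be", "to do"])` would return a list of (word, count) pairs. Nothing is computed until `collect()` runs, and the shuffle introduced by `reduceByKey` is what splits the job into two stages.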
PART – 2: Setting up your local environment for programming Python and Spark
Setting up Python 3.7 and PyCharm IDE. Python basics – Functional Programming, OOPs Concepts, Collections, Exception Handling. Challenges with RDD.
- Introduction to Python – variable declaration, conditional expression, pattern matching, iterations, functions, lambda, higher order functions, closures
- Object Orientation – class, object, decorators
- Generators and comprehensions
- Python Collections – List, Set, Dict (map), Tuple, higher order functions on collections.
- Exception Handling
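The Python topics above fit together in a few lines. The following is only an illustrative sketch with made-up names (`make_multiplier`, `parse_ints`), showing a closure, a generator, exception handling, a higher-order function and a comprehension.

```python
def make_multiplier(factor):
    """Closure: the returned function remembers `factor`."""
    def multiply(value):
        return value * factor
    return multiply


def parse_ints(tokens):
    """Generator: lazily yield the tokens that parse as integers."""
    for token in tokens:
        try:
            yield int(token)
        except ValueError:
            continue  # skip malformed input instead of failing the whole run


double = make_multiplier(2)
raw = ["1", "2", "x", "4"]

doubled = list(map(double, parse_ints(raw)))         # higher-order function: map
evens = [n for n in parse_ints(raw) if n % 2 == 0]   # list comprehension with a filter

print(doubled)  # [2, 4, 8]
print(evens)    # [2, 4]
```

The same ideas reappear in Spark: functions passed to map() or filter() are closures that get shipped to the executors.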
PART – 3: Working with Structured Data using DataFrame
Implementing ETL/data pipelines using Spark’s DataFrame/Dataset API in three steps:
- Data Ingestion
- Data Curation
- Data Provisioning
- Data Ingestion (Reading data and creating DataFrame)
- Converting RDDs to DataFrames through implicit schema inference or by specifying schema explicitly.
- Reading structured data using the spark.read.<file_format>() and spark.read.format("<file_format>") family of functions
- spark.sql("select * from parquet.`path`").
- Reading data from 3rd party sources like RDBMS (like MySQL), NoSQL DB (like MongoDB), SFTP Server, Cloud Storage (like Amazon S3), etc.
- Data Curation (Applying data cleansing and business transformations)
- Using the Domain Specific Language (DSL), e.g. df.select(), df.groupBy("col1").sum("col2"), etc.
- Spark SQL, e.g. spark.sql("select * from temp_vw")
- Applying window/analytic functions using the DataFrame DSL and Spark SQL, e.g. rank(), dense_rank(), lead(), lag(), etc.
- Data Provisioning (Making the curated data available for consumption by the downstream applications)
- In order to make the curated data available for querying using SQL, DataFrames can be written to Hive or MPP databases like Impala, Presto, AWS Redshift or AWS Athena
- If the curated data is semi-structured, it can be written to a NoSQL DB (like MongoDB)
- Put it in HDFS or any cloud storage if a number of downstream Spark applications use this data.
- Dataset (typed DataFrame)
- Addressing DataFrame limitations: compile-time type safety, domain object information and functional programming.
- Typed and untyped transformations
- Interoperability – Converting RDD to Dataset and vice versa, DataFrame to Dataset and vice versa
- Deploying and monitoring Spark applications using AWS EMR, Lambda, Step Functions and CloudWatch
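To make the ingestion, curation and provisioning steps concrete, here is a minimal PySpark sketch. The column names, the CSV input format and all function names are assumptions chosen for illustration; the course case studies may use different sources and schemas. It needs pyspark installed and an existing SparkSession passed in as `spark`.

```python
# Hypothetical schema, given as a DDL string so nothing is inferred at read time.
SALES_SCHEMA = "region STRING, product STRING, amount DOUBLE"


def ingest(spark, path):
    """Ingestion: read CSV files with an explicit schema."""
    return (
        spark.read.format("csv")
        .option("header", "true")
        .schema(SALES_SCHEMA)
        .load(path)
    )


def curate(df):
    """Curation: total sales per region/product, ranked within each region."""
    # Imported here so the module can be loaded on machines without pyspark.
    from pyspark.sql import Window, functions as F

    totals = df.groupBy("region", "product").agg(F.sum("amount").alias("total"))
    by_region = Window.partitionBy("region").orderBy(F.col("total").desc())
    return totals.withColumn("rank", F.dense_rank().over(by_region))


def provision(df, path):
    """Provisioning: write the curated data as Parquet for downstream readers."""
    df.write.mode("overwrite").parquet(path)


def run_pipeline(spark, source_path, target_path):
    """Chain the three steps; both paths are supplied by the caller."""
    provision(curate(ingest(spark, source_path)), target_path)
```

Keeping each step as its own function makes it easy to swap the source (for example JDBC or S3) or the sink (for example Redshift) without touching the curation logic.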
PART – 4: Deploying Spark applications on AWS cloud, Structured Streaming, AWS Glue and Delta Lake
- Productionizing Spark jobs on AWS cloud
- Manual approach
- Create a 2-node EMR cluster with m5.xlarge EC2 instance type
- Build your application as a jar file and copy it to the master node
- Establish an SSH session with the master node and run the spark-submit command
- Through AWS Lambda script
- Build the application as a jar file and copy it to AWS S3 bucket
- Write a Lambda script in Python using the boto3 library (it’s easy, and a template is available on the AWS site) to create a 2-node EMR cluster, run the job and terminate the cluster once the job execution has completed
- We can trigger it from AWS CloudWatch
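The Lambda approach could look roughly like the sketch below. It uses boto3's EMR `run_job_flow` call; the cluster name, the release label, the default IAM role names and the event keys (`jar`, `main_class`, `log_uri`) are all assumptions for illustration and must be adapted to your account. It is not the AWS template mentioned above.

```python
def build_job_flow(jar_s3_path, main_class, log_uri):
    """Return keyword arguments for boto3's EMR run_job_flow call."""
    return {
        "Name": "spark-job-sketch",          # hypothetical cluster name
        "ReleaseLabel": "emr-6.15.0",        # assumed release; pick the one you need
        "LogUri": log_uri,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            # 2-node cluster: one master and one core node
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
            ],
            # False means the cluster terminates once all steps have finished
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "spark-submit",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster",
                         "--class", main_class, jar_s3_path],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default instance profile
        "ServiceRole": "EMR_DefaultRole",      # assumed default service role
    }


def lambda_handler(event, context):
    """Lambda entry point: start the cluster and return its id."""
    import boto3  # available in the AWS Lambda Python runtime

    emr = boto3.client("emr")
    response = emr.run_job_flow(
        **build_job_flow(event["jar"], event["main_class"], event["log_uri"])
    )
    return {"cluster_id": response["JobFlowId"]}
```

Building the request in a separate function keeps the cluster definition testable without AWS credentials.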
- Through AWS Step Functions
- For complex pipelines where multiple Spark jobs run in parallel or in sequence, we can orchestrate the jobs using AWS Step Functions (an alternative to Apache Oozie or Airflow).
- Trigger the above AWS Lambda scripts as different tasks from a state machine
- We can trigger it from AWS CloudWatch as well
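A state machine that runs two Lambda-backed jobs in sequence can be described in the Amazon States Language. The sketch below builds such a definition as a Python dictionary; the state names, the retry settings and the placeholder ARNs are hypothetical.

```python
import json


def build_state_machine(ingest_lambda_arn, curate_lambda_arn):
    """Return an Amazon States Language definition with two sequential tasks."""
    return {
        "Comment": "Run two Spark jobs in sequence (illustrative sketch)",
        "StartAt": "Ingest",
        "States": {
            "Ingest": {
                "Type": "Task",
                "Resource": ingest_lambda_arn,
                "Next": "Curate",
            },
            "Curate": {
                "Type": "Task",
                "Resource": curate_lambda_arn,
                # Retry the task twice, with backoff, before failing the execution
                "Retry": [{
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 60,
                    "MaxAttempts": 2,
                    "BackoffRate": 2.0,
                }],
                "End": True,
            },
        },
    }


# Placeholder ARNs; replace REGION and ACCOUNT_ID with real values.
definition = build_state_machine(
    "arn:aws:lambda:REGION:ACCOUNT_ID:function:ingest-job",
    "arn:aws:lambda:REGION:ACCOUNT_ID:function:curate-job",
)
definition_json = json.dumps(definition, indent=2)
```

The resulting JSON string can be pasted into the Step Functions console or passed as the `definition` argument of boto3's `create_state_machine` call. Jobs that should run side by side would use a `Parallel` state instead of chaining with `Next`.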
- AWS Glue – Spark as a Service
- Creating crawlers and Glue catalogs
- Understanding DynamicFrames and their interoperability with Spark’s DataFrame
- Creating Glue pipelines
- Databricks’ Delta Lake
- Understanding the challenges of a traditional data lake
- Working with Delta tables, enabling ACID properties
- Understanding schema enforcement
- Structured Streaming
- Understanding the High Level Streaming API in Spark 2.x
- Triggers and Output modes
- Unified APIs for Batch and Streaming
- Building Advanced Streaming Pipelines Using Structured Streaming
- Stateful window operations
- Tumbling and Sliding windows
- Watermarks and late data
- Windowed joins
- Integrating Apache Kafka with Structured Streaming
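The streaming topics above (Kafka source, sliding windows, watermarks, triggers and output modes) can be combined in one short query. This is a minimal sketch under several assumptions: pyspark plus the spark-sql-kafka connector package are available, each Kafka message value is treated as a single word, and the durations and function names are illustrative only.

```python
WATERMARK_DELAY = "10 minutes"   # how late an event may arrive and still be counted
WINDOW_LENGTH = "10 minutes"     # window size
SLIDE_INTERVAL = "5 minutes"     # shorter than the window, so windows overlap (sliding)


def windowed_counts(spark, bootstrap_servers, topic):
    """Count messages per word and sliding event-time window from a Kafka topic."""
    # Imported here so the module can be loaded on machines without pyspark.
    from pyspark.sql import functions as F

    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", topic)
        .load()
        # The Kafka source exposes binary `value` and a `timestamp` column
        .select(F.col("value").cast("string").alias("word"), F.col("timestamp"))
    )
    return (
        events.withWatermark("timestamp", WATERMARK_DELAY)
        .groupBy(F.window("timestamp", WINDOW_LENGTH, SLIDE_INTERVAL), "word")
        .count()
    )


def start_console_query(counts):
    """Emit updated counts to the console once a minute."""
    return (
        counts.writeStream.outputMode("update")
        .format("console")
        .trigger(processingTime="1 minute")
        .start()
    )
```

Setting the slide interval equal to the window length would give tumbling windows instead. The watermark lets Spark drop state for windows that can no longer receive late data.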
PART – 5: Real-world case study
To make things fun and interesting, we will introduce multiple datasets coming from disparate data sources (SFTP, MS SQL Server, Amazon S3 and Google Analytics) and create an industry-standard ETL pipeline to populate a data mart implemented on Amazon Redshift.