Big Data Pipeline using Apache Spark and AWS


Learn how to deploy and productionize big data pipelines (Apache Spark with Scala projects) on the AWS cloud through a completely case-study-driven, learn-by-doing approach. The pipeline can also be triggered as a REST API.

Learning Outcomes: By the end of this course, you will be able to:

  • Set up the development environment on your local machine (IntelliJ, Scala/Python, Git, etc.) and start any given big data project from scratch, including unit testing and validating data from source to staging and from source to the final data mart/warehouse.
  • Set up the Jenkins CI/CD pipeline manually, as well as containerise it with Docker.
  • Automate the pipeline:
    1. Creating an AWS EMR cluster through an AWS CloudFormation template
    2. Deploying a Spark job as a step in the EMR cluster through AWS Lambda (boto3 library)
    3. Orchestrating multiple Spark jobs using an AWS State Machine (AWS Step Functions)
    4. Scheduling the state machine using an AWS CloudWatch rule (cron expression)
  • Create an API to trigger the above pipeline:
    1. Creating a REST API using AWS API Gateway + AWS Lambda
    2. Notifying the job queue of an API hit (a spark-submit request via Postman or curl) using AWS SNS
    3. Controlling the number of jobs executing in the cluster (based on cluster capacity) using AWS SQS
    4. Logging the job status in AWS DynamoDB: Queued, Processing, Successful, Failed
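As a minimal sketch of what a client triggering the pipeline through the API might send, the snippet below builds a job-request payload. The field names (`job_name`, `batch_id`, `create_full_batch`) and the endpoint URL shape are illustrative assumptions, not the course's actual schema:

```python
import json

def build_job_request(job_name, batch_id, create_full_batch=False):
    """Build the JSON payload a client would POST to a (hypothetical)
    API Gateway endpoint to trigger a Spark job.

    Field names are illustrative assumptions, not the course's schema.
    """
    return {
        "job_name": job_name,
        "batch_id": batch_id,
        "create_full_batch": create_full_batch,
    }

# A client would then POST this payload via Postman or curl, e.g.:
#   curl -X POST https://<api-id>.execute-api.<region>.amazonaws.com/prod/jobs \
#        -H "Content-Type: application/json" -d "$PAYLOAD"
payload = json.dumps(build_job_request("company-attributes", 42))
```

API Gateway then forwards such a request to Lambda, which publishes it to the SNS topic feeding the SQS job queue.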

Recommended background: You should have some background in Big Data technologies or Apache Spark with Scala/Python/Java, plus the AWS cloud. This project can help anyone aiming for Lead Data Engineer, Architect, or similar roles.

PART – 1: Develop an application using Apache Spark with Scala/Python

Problem Statement: This is a typical batch-based data pipeline, (i) Data Ingestion to (ii) Data Curation to (iii) Data Provisioning, designed to simulate the processing of both incremental and full batches.
The input data contains company attributes from many data sources:

  • Each company will have a set containing one attribute value per attribute type
  • Each attribute has a numeric attribute type indicating the type of data in the attribute. For example, attribute_type 49 stores the industry of the company
  • Each unique source has a numeric source id
  • In every input batch, data is loaded from a subset of sources and may only cover part of the companies in the dataset. In other words, each input batch contains parts of the final company records you will produce in the output.
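The merge behaviour described above can be sketched in plain Python: later batches override earlier values for the same (company, attribute type) pair, and the result is one attribute set per company. The record shape `(company_id, attribute_type, value, source_id)` is an illustrative assumption, not the course's actual schema:

```python
def merge_batches(batches):
    """Combine a sequence of input batches into final company records.

    Each batch is a list of (company_id, attribute_type, value, source_id)
    tuples; records from later batches override earlier ones for the same
    (company_id, attribute_type) key.
    """
    merged = {}
    for batch in batches:
        for company_id, attribute_type, value, source_id in batch:
            merged[(company_id, attribute_type)] = (value, source_id)
    # Regroup into one attribute set per company
    companies = {}
    for (company_id, attribute_type), (value, _src) in merged.items():
        companies.setdefault(company_id, {})[attribute_type] = value
    return companies

# Example: attribute_type 49 (industry) updated for company 1 in batch 2
batch1 = [(1, 49, "Retail", 7)]
batch2 = [(1, 49, "E-commerce", 8), (2, 49, "Banking", 7)]
result = merge_batches([batch1, batch2])
# → {1: {49: "E-commerce"}, 2: {49: "Banking"}}
```

In the actual course the same idea would be expressed with Spark DataFrames over the staged batches; this sketch only shows the override semantics.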

Processing – Application Parameters:

  • inputPath: String – Full path to the input data for each run.
  • outputPath: String – Full path to the output location where data will be written.
  • batchId: Long – The batch id to process.
  • lastFullBatchId: Option[Long] – The batch id of the most recent full batch available; None for the first run.
  • createFullBatch: Boolean – If true, create a full batch by combining the previous full batch with the incremental batches since then. If false, only create an incremental batch output with the changes in the current batch.
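A possible command-line mapping of these parameters, sketched with Python's `argparse` (the flag names mirror the parameter list above, but the exact CLI is an assumption, not the course's):

```python
import argparse

def parse_args(argv):
    """Parse the application parameters documented above.

    Flag names are a plausible mapping of the listed parameters,
    not the course's exact CLI.
    """
    parser = argparse.ArgumentParser(description="Company-attributes batch job")
    parser.add_argument("--inputPath", required=True,
                        help="Full path to the input data for this run")
    parser.add_argument("--outputPath", required=True,
                        help="Full path to the output location")
    parser.add_argument("--batchId", type=int, required=True,
                        help="The batch id to process")
    parser.add_argument("--lastFullBatchId", type=int, default=None,
                        help="Most recent full batch id; omit on the first run (None)")
    parser.add_argument("--createFullBatch", action="store_true",
                        help="Combine the previous full batch with incrementals since")
    return parser.parse_args(argv)
```

In Scala the same role would typically be played by a small case class populated from `args`; `Option[Long]` corresponds to the optional `--lastFullBatchId` flag here.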

PART – 2: Setting up the Jenkins CI/CD pipeline

Setting up a Jenkins Docker container and creating a CI/CD job.

  • Setting up a Jenkins Docker container along with the required plugins – Git, SBT, AWS S3
  • Creating a CI/CD job to build the Spark-with-Scala application as a jar file. Note: no build step is required for a PySpark application.
  • Copying the jar file to an AWS S3 bucket, so that the same jar can later be picked up by the AWS Lambda script that submits a step via the AWS State Machine.
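The final publish step can be sketched as below. The bucket layout in `artifact_s3_key` is an illustrative assumption; the `upload_file` call itself is boto3's real S3 API (and requires AWS credentials to actually run):

```python
def artifact_s3_key(app_name, version):
    """Build the S3 key under which the CI job uploads the assembled jar.

    The artifacts/<app>/<version>/ layout is an illustrative assumption,
    not the course's actual bucket structure.
    """
    return f"artifacts/{app_name}/{version}/{app_name}-assembly-{version}.jar"

def upload_jar(bucket, app_name, version, jar_path):
    """Upload the jar so the Lambda submit script can reference it later.

    Uses boto3's S3 upload_file; needs AWS credentials configured.
    """
    import boto3  # imported lazily so the key helper stays dependency-free
    boto3.client("s3").upload_file(jar_path, bucket,
                                   artifact_s3_key(app_name, version))
```

In the Jenkins job this would run right after `sbt assembly`, with the version typically taken from the build number or Git tag.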

PART – 3: Setting up the AWS stacks and making the job execute through an API


  • Creating AWS CloudFormation templates to create,
    1. An AWS EMR cluster with auto-scaling enabled
    2. An AWS State Machine that submits a Spark job to an active AWS EMR cluster and polls the job status every 10 minutes
    3. An AWS SQS queue to control the number of jobs running in the cluster
    4. An SNS topic to notify the above queue that a Spark job submission request has been raised
    5. An AWS API Gateway to trigger a Spark job
    6. An AWS DynamoDB table to log the job statuses (Queued, Submitted, Running, Successful, Failed, etc.)
  • Orchestrate the above components together to build a complete flow
  • Monitor the job using YARN's ResourceManager UI and Ganglia monitoring
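To make the submission piece concrete, here is a hedged sketch of the Lambda-side logic that adds a Spark job as an EMR step. `add_job_flow_steps` and the `command-runner.jar` step shape are real EMR/boto3 APIs; the jar path, main class `com.example.CompanyAttributesJob`, and step naming are hypothetical placeholders:

```python
def build_emr_step(jar_s3_path, batch_id, create_full_batch):
    """Build the step definition the Lambda function would pass to
    boto3's emr.add_job_flow_steps.

    The class name and argument names are illustrative assumptions.
    """
    args = [
        "spark-submit", "--deploy-mode", "cluster",
        "--class", "com.example.CompanyAttributesJob",  # hypothetical main class
        jar_s3_path,
        "--batchId", str(batch_id),
    ]
    if create_full_batch:
        args.append("--createFullBatch")
    return {
        "Name": f"company-attributes-batch-{batch_id}",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }

def submit_step(cluster_id, step):
    """Submit the step to a running EMR cluster.

    Real boto3 call; requires AWS credentials and an active cluster.
    """
    import boto3  # imported lazily so build_emr_step stays dependency-free
    emr = boto3.client("emr")
    return emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
```

In the full flow, this handler would be invoked from the SQS queue, write a Queued/Submitted status row to DynamoDB, and hand the step id to the state machine for its 10-minute status polling.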