Course Description
The Big Data Hadoop and Spark Developer course provides in-depth knowledge of Big Data processing with Hadoop and Spark on Amazon Elastic MapReduce (EMR). The hands-on portions of the course, including all demos, take place in the Amazon Web Services (AWS) EMR environment.
This is Part 1.
Learning Outcomes
- Describe the main components of the Hadoop ecosystem, such as Hadoop 3, YARN, Pig, and Hive.
- Explain the functionality and architecture of the Hadoop Distributed File System (HDFS) and YARN resource management.
- Explain how MapReduce works and how it is implemented in the Hadoop environment.
- Explain the file formats used in Big Data (Avro, Parquet, and ORC) and how to use them in the EMR environment.
- Explain the difference between traditional RDBMS and Hive tables.
- Explain the architecture and functionality of Spark.
- Use Resilient Distributed Datasets (RDDs) for data processing in Spark.
- Implement and build Spark applications.
- Write basic functional code in Scala to run a Spark application (see the brief sketch after this list).
- Explain parallel processing in Spark and Spark optimization using Catalyst and Tungsten.
- Use Spark SQL to create, transform, and query DataFrames.
- Explain the differences and use cases for Spark RDDs, DataFrames, and Datasets.
- Create and deploy AWS EC2 instances, EBS volumes, and S3 storage.
- Explain the pricing models for AWS storage and compute resources.
- Configure and deploy an EMR Cluster in AWS.
- Run Spark and MapReduce applications in batch and interactive mode on EMR.
- Create and use EMR notebooks.
- Explain the differences between IaaS and PaaS resources in AWS.
- Create and manage a personal AWS account.
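
For orientation, here is a minimal, illustrative Scala sketch (not taken from the course materials) of the kind of Spark code these outcomes cover: building an RDD, converting it to a DataFrame, and querying it with Spark SQL. The object name and sample data are placeholders.

```scala
import org.apache.spark.sql.SparkSession

object SparkEmrSketch {
  def main(args: Array[String]): Unit = {
    // SparkSession is the entry point for DataFrame and Spark SQL functionality
    val spark = SparkSession.builder()
      .appName("SparkEmrSketch")
      .getOrCreate()
    import spark.implicits._

    // RDD API: a classic word count over a small in-memory collection
    val lines = spark.sparkContext.parallelize(Seq("big data on emr", "spark on emr"))
    val counts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // DataFrame / Spark SQL API over the same results
    val df = counts.toDF("word", "total")
    df.createOrReplaceTempView("word_counts")
    spark.sql("SELECT word, total FROM word_counts ORDER BY total DESC").show()

    spark.stop()
  }
}
```

On EMR, an application like this would typically be packaged as a JAR and submitted with spark-submit in batch mode, or run interactively in an EMR notebook, both of which are covered in the course.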
Prerequisites
- A basic conceptual understanding of data warehouses and an awareness of core SQL functionality are assumed.
Who Should Attend
- Data Analysts
- Data Engineers
- Data Scientists
- Database Architects
- Database Administrators