COURSE INTRODUCTION
Spark is an open-source engine for processing large datasets, and it integrates well with the Python programming language. PySpark is the interface that gives you access to Spark from Python. This course starts with an overview of the Spark stack and shows you how to leverage Python within the Spark ecosystem. It then takes a deeper look at Apache Spark's architecture and at how to set up a Python environment for Spark. You'll learn techniques for collecting data, how RDDs compare with DataFrames, how to read data from files and HDFS, and how to work with schemas.
Finally, the course teaches you how to use SQL to interact with DataFrames. By the end of this PySpark course, you will know how to process data with Spark DataFrames and will have mastered data collection techniques for distributed data processing.
COURSE OBJECTIVES
After completing this course, students will have the knowledge and skills to:
AUDIENCE
COURSE CONTENT
Lesson 1 - A Brief Primer on PySpark
Lesson 2 - Resilient Distributed Datasets
Lesson 3 - Resilient Distributed Datasets and Actions
Lesson 4 - DataFrames and Transformations
Lesson 5 - Data Processing with Spark DataFrames