COURSE INTRODUCTION
Spark is an open-source engine for processing large datasets, and it integrates well with the Python programming language. PySpark is the interface that gives you access to Spark from Python. This course starts with an overview of the Spark stack and shows you how to leverage Python within the Spark ecosystem. It then takes a deeper look at Apache Spark's architecture and at how to set up a Python environment for Spark. You'll learn techniques for collecting data, how RDDs compare with DataFrames, how to read data from files and HDFS, and how to work with schemas.
Finally, the course teaches you how to use SQL to interact with DataFrames. By the end of this PySpark course, you will know how to process data with Spark DataFrames and will have mastered data collection techniques for distributed data processing.
COURSE OBJECTIVES
After completing this course, students will have the knowledge and skills to:
AUDIENCE
COURSE CONTENT
Lesson 1 - A Brief Primer on PySpark
Lesson 2 - Resilient Distributed Datasets
Lesson 3 - Resilient Distributed Datasets and Actions
Lesson 4 - DataFrames and Transformations
Lesson 5 - Data Processing with Spark DataFrames