PySpark

COURSE INTRODUCTION

Spark is an open-source engine for processing large datasets, and it integrates well with the Python programming language. PySpark is the interface that gives access to Spark from Python. This course starts with an overview of the Spark stack and shows you how to leverage the functionality of Python as you deploy it in the Spark ecosystem. The course then takes a deeper look at the Apache Spark architecture and how to set up a Python environment for Spark. You'll learn various techniques for collecting data, work with RDDs and contrast them with DataFrames, read data from files and HDFS, and work with schemas.
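As a taste of what the course covers, here is a minimal sketch (the file path and data are illustrative assumptions) of creating a SparkSession, loading a CSV file into a DataFrame, and dropping down to the underlying RDD:

    from pyspark.sql import SparkSession

    # Entry point for DataFrame and SQL functionality
    spark = SparkSession.builder.appName("intro").getOrCreate()

    # Read a CSV file into a DataFrame, letting Spark infer the schema
    # ("data/people.csv" is a placeholder path)
    df = spark.read.csv("data/people.csv", header=True, inferSchema=True)
    df.printSchema()  # inspect the inferred column names and types

    # The same data is also available as a low-level RDD of Row objects
    print(df.rdd.take(3))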

Finally, the course will teach you how to use SQL to interact with DataFrames. By the end of this PySpark course, you will have learned how to process data using Spark DataFrames and mastered data collection techniques through distributed data processing.
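For instance, a DataFrame can be registered as a temporary view and queried with ordinary SQL; a minimal sketch, using made-up data and column names:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql-demo").getOrCreate()

    # A small in-memory DataFrame (illustrative data)
    df = spark.createDataFrame([("Alice", 34), ("Bob", 17)], ["name", "age"])

    # Register the DataFrame as a SQL temporary view
    df.createOrReplaceTempView("people")

    # Query it with standard SQL; the result is itself a DataFrame
    spark.sql("SELECT name, age FROM people WHERE age >= 18").show()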

COURSE OBJECTIVES

After completing this course, students will have the knowledge and skills to:

  • Get an overview of Apache Spark and the Spark 2.0 architecture
  • Obtain a comprehensive knowledge of the tools in the Spark ecosystem, such as Spark SQL, Spark MLlib, and Spark Streaming, along with complementary tools such as Sqoop, Kafka, and Flume
  • Understand RDD schemas, lazy evaluation, and transformations, and learn how to change the schema of a DataFrame
  • Build and interact with Spark DataFrames using Spark SQL
  • Create and explore various APIs to work with Spark DataFrames
  • Learn how to aggregate, transform, filter, and sort data with DataFrames (see the sketch after this list)
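
As a preview of the last objective, here is a minimal sketch of filtering, aggregating, and sorting with the DataFrame API (the data and column names are illustrative assumptions):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("df-ops").getOrCreate()

    # Illustrative sales data
    sales = spark.createDataFrame(
        [("north", 120.0), ("south", 80.0), ("north", 45.0)],
        ["region", "amount"],
    )

    result = (
        sales.filter(F.col("amount") > 50)         # keep larger sales
             .groupBy("region")                    # aggregate per region
             .agg(F.sum("amount").alias("total"))  # total amount per region
             .orderBy(F.col("total").desc())       # sort descending
    )
    result.show()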

AUDIENCE

  • Freshers willing to start a career in Big Data
  • Developers and architects
  • BI/ETL/DW professionals
  • Mainframe professionals
  • Big Data architects, engineers, and developers
  • Data scientists and analytics professionals

COURSE CONTENT

Lesson 1 - A Brief Primer on PySpark

Lesson 2 - Resilient Distributed Datasets

Lesson 3 - Resilient Distributed Datasets and Actions

Lesson 4 - DataFrames and Transformations

Lesson 5 - Data Processing with Spark DataFrames
