Big Data

This course takes a detailed look at how to implement Big Data solutions using Apache Spark. The course describes the problems that Big Data is designed to solve, and explains how Hadoop addresses these issues via HDFS, Yarn, and the Spark API.

We show plenty examples to help you understand how to create and use RDDs from various data sources, such as flat files, NoSQL databases, and relational databases. We also explore the key Spark APIs layered on top of RDDs, including Spark Streaming via DataFrames, Spark SQL, and Spark Machine Learning and Spark Graph Processing.

Duration

4 days

 

Prerequisites

  • Solid experience in Scala (or Python/Java)

What you'll learn

  • Big Data principles
  • Creating and using RDDs
  • Spark Streaming
  • Spark SQL
  • Spark Machine Learning
  • Spark Graph Processing

Course details

Introduction to Big Data

  • Introduction to Hadoop
  • Data serialization
  • Column-based storage
  • Messaging systems
  • NoSQL
  • Distributed SQL query engine

Introduction to Apache Spark

  • Key features of Spark
  • Spark architecture
  • Application execution
  • Resilient Distributed Datasets
  • Spark API
  • Caching
  • Spark jobs

Interactive Data Analysis with Spark Shell

  • Key concepts
  • REPL commands
  • Using Scala
  • Number analysis
  • Log analysis

Writing Spark Applications

  • Writing a Hello world application
  • Compiling and running an application
  • Monitoring and debugging an application

Spark Streaming

  • Overview of Spark streaming
  • Spark streaming API
  • Creating a discretized stream
  • Processing a discretized stream
  • Output operations

Spark SQL

  • Overview of Spark SQL
  • Performance considerations
  • Usage scenarios
  • Spark SQL API
  • Built-in functions

Machine Learning with Spark

  • Overview of Machine Learning
  • Spark Machine Learning Libraries (MLllb API)
  • Spark ML

Graph Processing with Spark

  • Overview of graphs
  • Overview of GraphX API
  • Using GraphX API

Cluster Managers

  • Standalone cluster manager
  • Apache Mesos
  • YARN