Big Data

This course takes a detailed look at how to implement Big Data solutions using Apache Spark. The course is taught in Scala, although it can also be delivered in Python or Java if required.

Duration

4 days

Prerequisites

  • Solid experience in Scala (or Python/Java)

What you'll learn

  • Big Data principles
  • Creating and using RDDs
  • Spark Streaming
  • Spark SQL
  • Spark Machine Learning
  • Spark Graph Processing

Course details

Introduction to Big Data

  • Introduction to Hadoop
  • Data serialization
  • Column-based storage
  • Messaging systems
  • NoSQL
  • Distributed SQL query engine

Introduction to Apache Spark

  • Key features of Spark
  • Spark architecture
  • Application execution
  • Resilient Distributed Datasets
  • Spark API
  • Caching
  • Spark jobs

Interactive Data Analysis with Spark Shell

  • Key concepts
  • REPL commands
  • Using Scala
  • Number analysis
  • Log analysis

Writing Spark Applications

  • Writing a Hello World application
  • Compiling and running an application
  • Monitoring and debugging an application

Spark Streaming

  • Overview of Spark Streaming
  • Spark Streaming API
  • Creating a discretized stream
  • Processing a discretized stream
  • Output operations

Spark SQL

  • Overview of Spark SQL
  • Performance considerations
  • Usage scenarios
  • Spark SQL API
  • Built-in functions

Machine Learning with Spark

  • Overview of Machine Learning
  • Spark Machine Learning Libraries (MLlib API)
  • Spark ML

Graph Processing with Spark

  • Overview of graphs
  • Overview of GraphX API
  • Using GraphX API

Cluster Managers

  • Standalone cluster manager
  • Apache Mesos
  • YARN