
Understanding Apache Spark

Learning curve

If you know how to use Python's Pandas library, then migrating to PySpark will be a natural transition.
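
To get a sense of how close the two APIs feel, here is a minimal side-by-side sketch. The file name "people.csv" and its columns are assumptions made for illustration only.

    # Pandas vs. PySpark: the same filter-and-select, expressed in each API.
    import pandas as pd
    from pyspark.sql import SparkSession

    # Pandas: eager, in-memory, single machine
    pdf = pd.read_csv("people.csv")
    adults_pd = pdf[pdf["age"] >= 18][["name"]]

    # PySpark: the same logic against a distributed DataFrame
    spark = SparkSession.builder.appName("pandas-to-pyspark").getOrCreate()
    sdf = spark.read.csv("people.csv", header=True, inferSchema=True)
    adults_sp = sdf.filter(sdf["age"] >= 18).select("name")
    adults_sp.show()  # an action; nothing runs until this point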

Understand that Apache Spark doesn't provide a data storage system; you have to supply that yourself. For example, you can use an RDBMS such as Postgres and load the data into Spark over JDBC, or by any other process you prefer.
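
As a sketch of what that might look like, the snippet below reads a Postgres table over JDBC. The connection URL, table name, credentials, and driver-jar path are all placeholders, and the Postgres JDBC driver has to be available to Spark.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("postgres-jdbc-load")
        .config("spark.jars", "/path/to/postgresql.jar")  # assumed driver location
        .getOrCreate()
    )

    df = (
        spark.read
        .format("jdbc")
        .option("url", "jdbc:postgresql://localhost:5432/mydb")  # placeholder
        .option("dbtable", "public.orders")                      # placeholder
        .option("user", "spark_user")                            # placeholder
        .option("password", "secret")                            # placeholder
        .option("driver", "org.postgresql.Driver")
        .load()
    )

    df.printSchema()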

Spark Core

One important concept in Apache Spark is the Resilient Distributed Dataset (RDD), a distributed collection of the elements of your data. When you call a transformation method such as map or filter, you create a new RDD, but that RDD isn't operated on yet; it is only defined. When you call an action method such as count or reduce, the chain of RDDs you defined is actually computed. In practice you'll rarely want to work with RDDs directly; consider opting for the DataFrame API instead.
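
A short sketch of that behaviour: the transformations below only define new RDDs, and nothing actually runs until an action is called.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-sketch").getOrCreate()
    sc = spark.sparkContext

    numbers = sc.parallelize(range(1, 11))        # base RDD
    evens = numbers.filter(lambda n: n % 2 == 0)  # transformation: only defined
    squares = evens.map(lambda n: n * n)          # another transformation

    print(squares.count())                     # action: triggers the chain, prints 5
    print(squares.reduce(lambda a, b: a + b))  # action: 4+16+36+64+100 = 220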

Spark SQL

Here you access the data as a DataFrame, which is similar to an RDD but organized into named columns: you define your data frames, and the operations on them are executed on demand.
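
A minimal sketch of the DataFrame side, with made-up rows, showing both the DataFrame methods and an equivalent SQL query; execution only happens when an action such as show is called.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark-sql-sketch").getOrCreate()

    df = spark.createDataFrame(
        [("alice", 34), ("bob", 17), ("carol", 25)],
        ["name", "age"],
    )

    # DataFrame operations are defined lazily, like RDD transformations...
    adults = df.filter(df.age >= 18).select("name")

    # ...and you can also register the DataFrame and query it with SQL.
    df.createOrReplaceTempView("people")
    same_result = spark.sql("SELECT name FROM people WHERE age >= 18")

    adults.show()        # actions: execution happens on demand, here
    same_result.show()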

Spark Streaming

A lightweight API that allows developers to perform batch processing and real-time streaming of data.
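
As a hedged illustration, here is a word-count sketch using Structured Streaming, the DataFrame-based streaming API. The socket source, host, and port are assumptions, and something has to feed text to that socket (for example `nc -lk 9999`).

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

    lines = (
        spark.readStream
        .format("socket")
        .option("host", "localhost")  # placeholder source
        .option("port", 9999)
        .load()
    )

    # Split each incoming line into words and keep a running count per word.
    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    query = (
        counts.writeStream
        .outputMode("complete")
        .format("console")
        .start()
    )
    query.awaitTermination()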


DAG

Spark represents a computation as a DAG. When you apply transformations, Spark uses a directed acyclic graph to track what to do and how to derive the transformed data from its original source. However, it doesn't execute the commands immediately: evaluation is lazy, which helps optimize run time.

Two types of commands

  • Transformations -> update the DAG
  • Actions -> run a computation on the DAG and return the result to you
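
To see that split in code: in the sketch below the transformations only extend the plan, explain() prints that plan without running anything, and the final count() action triggers execution. The column names are illustrative.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("dag-sketch").getOrCreate()

    df = spark.range(1_000_000)                        # a DataFrame of ids
    transformed = (
        df.withColumn("doubled", F.col("id") * 2)      # transformation
          .filter(F.col("doubled") % 3 == 0)           # transformation
    )

    transformed.explain()       # shows the planned DAG; nothing has executed yet
    print(transformed.count())  # action: Spark now runs the optimized plan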