Apache Spark Architecture, Design and Overview

Apache Spark is a fast, open-source, general-purpose cluster computing system with an in-memory data processing engine. It is written in Scala and provides high-level APIs in Java, Scala, Python, and R, along with an optimized engine that supports general execution graphs. The Apache Spark architecture is designed so that you can use it for ETL (Spark SQL), analytics, machine learning (MLlib), graph processing (GraphX), or building streaming applications (Spark Streaming). Spark is often called a cluster computing engine or simply an execution engine.

Apache Spark is one of the most actively contributed Apache open source projects, with more than 500 contributors from over 200 organizations. Companies such as Alibaba, Amazon, Baidu, and Yandex use Apache Spark at scale.

In this article, we will give an overview of the Apache Spark architecture.

Apache Spark Architecture

(Figure: Apache Spark Architecture)

Spark Core is the main execution engine of the Apache Spark project; Spark SQL (SQL and HiveQL), Spark Streaming, MLlib (machine learning), and GraphX (graph processing) are built on top of it. You access each of them through the APIs Spark provides.

As you can see in the above diagram, you can run Apache Spark in three different ways (a short configuration sketch follows this list):

Standalone – Spark ships with its own simple cluster manager. A Hadoop cluster can host Spark alongside MapReduce, with both running in parallel on the same nodes. Standalone mode is the easiest to set up and provides almost all the features of the other cluster managers if you are only running Spark.

On Hadoop YARN – Spark can run on top of YARN without any additional installation. YARN lets you dynamically share and centrally configure the same pool of cluster resources between all frameworks that run on YARN.

Apache Mesos – Apache Mesos, a distributed systems kernel, provides high availability (HA) for masters and agents, can manage resources per application, and supports Docker containers. It can run Spark jobs, Hadoop MapReduce, or any other service application.
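As a minimal sketch in Scala, the cluster manager is selected through the master URL; the host names and ports below are placeholders, and in practice the URL is usually passed to spark-submit with --master rather than hard-coded:

import org.apache.spark.{SparkConf, SparkContext}

// The master URL decides which cluster manager runs the application.
// Host names and ports are placeholders, not values from this article.
val conf = new SparkConf()
  .setAppName("DeploymentModeExample")
  .setMaster("spark://standalone-master:7077") // Standalone cluster manager
  // .setMaster("yarn")                        // Hadoop YARN
  // .setMaster("mesos://mesos-master:5050")   // Apache Mesos
  // .setMaster("local[*]")                    // local testing
val sc = new SparkContext(conf)
println(sc.version)
sc.stop()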

Apache Spark Core – Resilient Distributed Datasets (RDD)

The main abstraction Spark provides is the resilient distributed dataset (RDD), a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant way: RDDs automatically recover from node failures.

RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations.
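A minimal Scala sketch of both ways of creating an RDD described above; the file path and application name are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("RDDCreation").setMaster("local[*]"))

// 1. Create an RDD from a file in a Hadoop-supported file system (placeholder path).
val lines = sc.textFile("hdfs:///data/input.txt")

// 2. Create an RDD from an existing Scala collection in the driver program.
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// Ask Spark to keep an RDD in memory so it can be reused across parallel operations.
lines.persist()

println(numbers.sum()) // 15.0
sc.stop()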

Different Features of RDDs

Below are some of the important features of RDDs.

Immutable

RDDs are immutable: you can't change them once created. However, you can transform one RDD into another by using transformations such as map, filter, and join.
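For example (a minimal sketch using a local SparkContext), a transformation such as filter leaves the original RDD untouched and returns a new one:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("ImmutableRDD").setMaster("local[*]"))

val numbers = sc.parallelize(1 to 10)

// filter returns a new RDD; `numbers` itself is never modified.
val evens = numbers.filter(_ % 2 == 0)

println(numbers.count()) // 10
println(evens.count())   // 5
sc.stop()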

Partitioned

An RDD contains records that are split into partitions, and these partitions are the basic units of parallelism. Each RDD is divided into smaller logical chunks of data known as partitions; new partitions are produced by applying transformations to existing ones.
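A minimal sketch showing how the number of partitions can be set and changed (the partition counts here are arbitrary):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("Partitions").setMaster("local[*]"))

// Explicitly create the RDD with 4 partitions.
val data = sc.parallelize(1 to 100, numSlices = 4)
println(data.getNumPartitions) // 4

// A transformation produces a new RDD with its own partitioning.
val wider = data.repartition(8)
println(wider.getNumPartitions) // 8
sc.stop()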

Persistence

You can specify which RDDs you want to reuse and store them in memory or on disk.
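A minimal sketch, assuming a placeholder HDFS path, of persisting an RDD with an explicit storage level so that later operations reuse it:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("Persistence").setMaster("local[*]"))

val words = sc.textFile("hdfs:///data/words.txt") // placeholder path
  .flatMap(_.split("\\s+"))

// Keep the RDD in memory, spilling to disk if it does not fit.
words.persist(StorageLevel.MEMORY_AND_DISK)

println(words.count())            // the first action materialises and caches the RDD
println(words.distinct().count()) // later operations reuse the cached data
words.unpersist()
sc.stop()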

Fault Tolerance

Spark RDDs are fault tolerant. They track lineage information so that lost or damaged data can be rebuilt automatically.
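You can inspect the lineage that would be used for recovery; a minimal sketch:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("Lineage").setMaster("local[*]"))

val result = sc.parallelize(1 to 1000)
  .map(_ * 2)
  .filter(_ % 3 == 0)

// Print the lineage graph Spark would use to recompute lost partitions.
println(result.toDebugString)
sc.stop()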

In-memory Computation

Spark RDDs support in-memory computation. Required intermediate results are kept in memory (RAM) instead of being written to disk.

Lazy Evaluation

All transformations in Apache Spark are lazy, meaning they do not compute their results immediately. Instead, they just remember the transformations applied to some base dataset. Data inside an RDD is not transformed until an action that triggers the execution of those transformations is invoked.
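A minimal sketch illustrating laziness: the transformations are only recorded, and the work happens when the action runs:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("LazyEvaluation").setMaster("local[*]"))

// Nothing is computed here: map and filter only record the transformations.
val transformed = sc.parallelize(1 to 1000000)
  .map(_ * 2)
  .filter(_ > 100)

// The action triggers execution of the whole chain.
println(transformed.count())
sc.stop()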

Parallel

Spark processes RDD data in parallel across partitions.

Typed

You can create RDDs of different types, such as RDD[String], RDD[Int], and RDD[Long].
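A minimal sketch showing how the element type is part of the RDD's type and changes with each transformation:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val sc = new SparkContext(new SparkConf().setAppName("TypedRDDs").setMaster("local[*]"))

val strings: RDD[String] = sc.parallelize(Seq("1", "2", "3"))
val ints: RDD[Int]       = strings.map(_.toInt)
val longs: RDD[Long]     = ints.map(_.toLong)

println(longs.sum()) // 6.0
sc.stop()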