Apache Spark is a fast, distributed processing engine. According to the official documentation, Spark can be up to 100x faster than traditional MapReduce processing. Another motivation for using Spark is its ease of use: you can work with Apache Spark from your favorite programming language, such as Scala, Java, Python, or R. In this article, we will look at how to improve the performance of iterative applications using the Spark RDD cache and persist methods.
Spark RDD Cache and Persist
Spark RDD caching and persistence are optimization techniques for iterative and interactive Spark applications.
Caching and persistence store interim results in memory or in more durable storage such as disk so that they can be reused in subsequent stages. For example, interim results are reused when running an iterative algorithm like PageRank.
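As a rough sketch of that reuse pattern (assuming the standard sc SparkContext from the PySpark shell; the file path and the loop are illustrative placeholders, not a real PageRank implementation):
# links is read and parsed once, then reused by every iteration
links = sc.textFile("hdfs:///path/to/edges").map(lambda line: line.split())
links.cache()
for i in range(10):
    # each pass runs an action over the same cached partitions
    print(links.count())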
Why Call Cache or Persist on an RDD?
Most RDD operations are lazy. Spark simply builds a DAG of transformations; only when you call an action does Spark execute the series of operations and produce the required results.
RDD.cache is also a lazy operation. But when you execute an action for the first time, Spark persists the RDD in memory so it can be reused by any subsequent actions.
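A minimal sketch of this behaviour in the PySpark shell (assuming the usual sc SparkContext is available):
rdd = sc.parallelize(range(1000)).map(lambda x: x * 2)   # transformation only, no job runs
rdd.cache()                                              # also lazy: nothing is stored yet
rdd.count()                                              # first action: computes the RDD and persists it
rdd.count()                                              # later actions read the cached partitions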
Spark RDD Cache
Cache stores the intermediate results in memory only, i.e. the default storage level of an RDD cache is memory. In fact, cache() is simply persist() with the default storage level MEMORY_ONLY.
Spark RDD Cache() Example
Below is an example of caching using PySpark. The same technique, with minor syntactic differences, applies to Scala caching as well.
# Create DataFrame for Testing
>>> df = sqlContext.createDataFrame([(10, 'ZZZ')],["id", "name"])
# Cache the DataFrame
>>> df.cache()
DataFrame[id: bigint, name: string]
# Test cached dataFrame
>>> df.count()
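If you want to confirm that the DataFrame is marked as cached, PySpark exposes an is_cached attribute on the DataFrame:
>>> df.is_cached
True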
Spark RDD Persist
cache() is a synonym of persist() or persist(pyspark.StorageLevel.MEMORY_ONLY). Because the difference between the caching and persistence methods of RDDs is so small and purely syntactic, the two are often used interchangeably.
With the persist() method, we can use various storage levels such as MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER and DISK_ONLY. (Note that the _SER variants are no longer available in PySpark 2.0 and later, since data is always serialized on the Python side.)
Spark RDD Persist() Example
Below are some examples of using Spark RDD persist with different storage levels in PySpark.
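Note that these examples reference pyspark.StorageLevel, so the pyspark module (or the StorageLevel class itself) has to be importable in your shell session:
>>> import pyspark
>>> from pyspark import StorageLevel   # alternative, if you prefer shorter names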
>>> df.persist()
DataFrame[id: bigint, name: string]
>>> df.persist(pyspark.StorageLevel.MEMORY_ONLY)
DataFrame[id: bigint, name: string]
>>> df.persist(pyspark.StorageLevel.DISK_ONLY)
DataFrame[id: bigint, name: string]
>>> df.persist(pyspark.StorageLevel.MEMORY_AND_DISK)
DataFrame[id: bigint, name: string]
>>> df.persist(pyspark.StorageLevel.MEMORY_ONLY_SER)
DataFrame[id: bigint, name: string]
>>> df.persist(pyspark.StorageLevel.MEMORY_AND_DISK_SER)
DataFrame[id: bigint, name: string]
Spark RDD Unpersist
You can call the unpersist() method to remove the RDD from the list of persisted RDDs and free up its storage.
For example,
>>> df.unpersist()
19/07/08 10:51:16 INFO MapPartitionsRDD: Removing RDD 53 from persistence list
19/07/08 10:51:16 INFO BlockManager: Removing RDD 53
DataFrame[id: bigint, name: string]
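By default the removal happens asynchronously; if you need the call to wait until the blocks are actually dropped, the DataFrame unpersist() method also accepts a blocking flag:
>>> df.unpersist(blocking=True)
DataFrame[id: bigint, name: string]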
Benefits of RDD Caching and Persistence in Spark
There are several benefits to the RDD caching and persistence mechanism in Spark.
- Cost efficient – results do not need to be recomputed at every stage.
- Time efficient – computations on reused data are faster.
- It helps reduce the execution time of iterative algorithms on very large datasets.
You don't need to cache a DataFrame that holds only a small amount of data; caching pays off mainly with large datasets.
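A rough way to see the benefit on a larger dataset (the file path is a placeholder and actual timings depend on your cluster):
import time
big = sc.textFile("hdfs:///path/to/large/file")
start = time.time()
big.count()                      # first pass reads the whole file
print("uncached:", time.time() - start)
big.cache()
big.count()                      # materializes the cache
start = time.time()
big.count()                      # second pass is served from memory
print("cached:", time.time() - start)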
Related Articles
- Spark SQL Performance Tuning – Improve Spark SQL Performance
- Python Pyspark Iterator-How to create and Use?
Hope this helps 🙂