Apache Spark is a fast, distributed processing engine. According to the official documentation, Spark can be up to 100x faster than traditional MapReduce processing. Another motivation for using Spark is its ease of use: you can work with Apache Spark from your favorite programming language, such as Scala, Java, Python, or R. In this article, we will look at how to improve the performance of iterative applications using the Spark RDD cache and persist methods.
Spark RDD Cache and Persist
Spark RDD caching and persistence are optimization techniques for iterative and interactive Spark applications.
Caching and persistence store interim results in memory or in more durable storage such as disk so that they can be reused in subsequent stages. For example, interim results are reused when running an iterative algorithm like PageRank.
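As a rough sketch of that reuse pattern (assuming the standard sc SparkContext from the PySpark shell; the file path and the loop are illustrative placeholders, not a real PageRank implementation):
# links is read and parsed once, then reused by every iteration
links = sc.textFile("hdfs:///path/to/edges").map(lambda line: line.split())
links.cache()
for i in range(10):
    # each pass runs an action over the same cached partitions
    print(links.count())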
Why Call Cache or Persist on an RDD?
Most RDD operations are lazy. Spark simply builds a DAG of transformations; only when you call an action does Spark execute the series of operations and produce the required results.
RDD.cache is also a lazy operation. But when you execute an action for the first time, Spark persists the RDD in memory so it can be reused by any subsequent actions.
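A minimal sketch of this behaviour in the PySpark shell (assuming the usual sc SparkContext is available):
rdd = sc.parallelize(range(1000)).map(lambda x: x * 2)   # transformation only, no job runs
rdd.cache()                                              # also lazy: nothing is stored yet
rdd.count()                                              # first action: computes the RDD and persists it
rdd.count()                                              # later actions read the cached partitions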
Spark RDD Cache
Cache stores the intermediate results in memory only, i.e. the default storage level of an RDD cache is memory. In fact, cache() is simply persist() with the default storage level MEMORY_ONLY.
Spark RDD Cache() Example
Below is an example of caching using PySpark. The same technique, with minor syntactic differences, applies to Scala caching as well.
# Create DataFrame for Testing
>>> df = sqlContext.createDataFrame([(10, 'ZZZ')],["id", "name"])
# Cache the DataFrame
>>> df.cache()
DataFrame[id: bigint, name: string]
# Test cached dataFrame
>>> df.count()
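If you want to confirm that the DataFrame is marked as cached, PySpark exposes an is_cached attribute on the DataFrame:
>>> df.is_cached
True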
Spark RDD Persist
cache() is a synonym of persist() or persist(pyspark.StorageLevel.MEMORY_ONLY). Because the difference between the caching and persistence methods of RDDs is so small and purely syntactic, the two are often used interchangeably.
With the persist() method, we can use various storage levels such as MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER and DISK_ONLY. (Note that the _SER variants are no longer available in PySpark 2.0 and later, since data is always serialized on the Python side.)
Spark RDD Persist() Example
Below are some examples of using Spark RDD persist with different storage levels in PySpark.
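Note that these examples reference pyspark.StorageLevel, so the pyspark module (or the StorageLevel class itself) has to be importable in your shell session:
>>> import pyspark
>>> from pyspark import StorageLevel   # alternative, if you prefer shorter names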
>>> df.persist()
DataFrame[id: bigint, name: string]
>>> df.persist(pyspark.StorageLevel.MEMORY_ONLY)
DataFrame[id: bigint, name: string]
>>> df.persist(pyspark.StorageLevel.DISK_ONLY)
DataFrame[id: bigint, name: string]
>>> df.persist(pyspark.StorageLevel.MEMORY_AND_DISK)
DataFrame[id: bigint, name: string]
>>> df.persist(pyspark.StorageLevel.MEMORY_ONLY_SER)
DataFrame[id: bigint, name: string]
>>> df.persist(pyspark.StorageLevel.MEMORY_AND_DISK_SER)
DataFrame[id: bigint, name: string]
Spark RDD Unpersist
You can call the unpersist() method to remove the RDD from the list of persisted RDDs and free up its storage.
For example,
>>> df.unpersist()
19/07/08 10:51:16 INFO MapPartitionsRDD: Removing RDD 53 from persistence list
19/07/08 10:51:16 INFO BlockManager: Removing RDD 53
DataFrame[id: bigint, name: string]
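By default the removal happens asynchronously; if you need the call to wait until the blocks are actually dropped, the DataFrame unpersist() method also accepts a blocking flag:
>>> df.unpersist(blocking=True)
DataFrame[id: bigint, name: string]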
Benefits of RDD Caching and Persistence in Spark
There are several benefits to the RDD caching and persistence mechanism in Spark.
- Cost efficient – results do not need to be recomputed at every stage.
- Time efficient – computations on reused data are faster.
- It helps reduce the execution time of iterative algorithms on very large datasets.
You don't need to cache a DataFrame that holds only a small amount of data; caching pays off mainly with large datasets.
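A rough way to see the benefit on a larger dataset (the file path is a placeholder and actual timings depend on your cluster):
import time
big = sc.textFile("hdfs:///path/to/large/file")
start = time.time()
big.count()                      # first pass reads the whole file
print("uncached:", time.time() - start)
big.cache()
big.count()                      # materializes the cache
start = time.time()
big.count()                      # second pass is served from memory
print("cached:", time.time() - start)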
Related Articles
- Spark SQL Performance Tuning – Improve Spark SQL Performance
- Python Pyspark Iterator-How to create and Use?
Hope this helps 🙂