Pyspark Storagelevel and Explanation

The basic building block of Apache Spark is the RDD. The main abstraction Apache Spark provides is the resilient distributed dataset (RDD), a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. In this article, we will check how to control the storage of an RDD using Pyspark Storagelevel. We will also check the various storage levels with some examples.

Pyspark Storagelevel Explanation

Pyspark storage levels are flags for controlling the storage of a resilient distributed dataset (RDD).

Each StorageLevel helps Spark decide whether to:

  • Use memory
  • Drop the RDD to disk if it does not fit in memory
  • Keep the data in memory in a Java-specific (JVM) serialized format
  • Replicate the RDD partitions on multiple nodes

The StorageLevel class also provides static constants for some commonly used storage levels, such as MEMORY_ONLY. The flags behind any constant can be inspected directly, as shown in the snippet below.
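
Here is a minimal sketch of inspecting those flags in the pyspark shell; it assumes StorageLevel is imported from the top-level pyspark package, and the attribute order mirrors the constructor (useDisk, useMemory, useOffHeap, deserialized, replication):

>>> from pyspark import StorageLevel
>>> lvl = StorageLevel.MEMORY_ONLY
>>> # Flags in order: useDisk, useMemory, useOffHeap, deserialized, replication
>>> (lvl.useDisk, lvl.useMemory, lvl.useOffHeap, lvl.deserialized, lvl.replication)
(False, True, False, False, 1)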

Pyspark Storagelevel

Below are the storage levels supported in Pyspark. You can use these storage levels based on how you want to store the Spark RDD. The boolean flags in each constructor call follow the order explained in the sketch after this list.

  • StorageLevel.DISK_ONLY = StorageLevel(True, False, False, False)
  • StorageLevel.DISK_ONLY_2 = StorageLevel(True, False, False, False, 2)
  • StorageLevel.MEMORY_ONLY = StorageLevel(False, True, False, False)
  • StorageLevel.MEMORY_ONLY_2 = StorageLevel(False, True, False, False, 2)
  • StorageLevel.MEMORY_AND_DISK = StorageLevel(True, True, False, False)
  • StorageLevel.MEMORY_AND_DISK_2 = StorageLevel(True, True, False, False, 2)
  • StorageLevel.OFF_HEAP = StorageLevel(True, True, True, False, 1)
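
The positional arguments above map to the StorageLevel constructor parameters: useDisk, useMemory, useOffHeap, deserialized and replication (which defaults to 1). As a minimal sketch, you can also build your own level with the same constructor; the variable name below is only for illustration:

# Flag order: useDisk, useMemory, useOffHeap, deserialized, replication
from pyspark import StorageLevel
my_disk_only_2 = StorageLevel(True, False, False, False, 2)   # same flags as DISK_ONLY_2 above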

Pyspark Storagelevel Example

For simplicity, I have used the pyspark shell to test the storage levels. Alternatively, you can create a Python file that builds a Spark context and persists an RDD with a storage level option; a minimal sketch of that approach follows the shell example below.

Here is a sample example of using a storage level in the pyspark shell.

# Create an RDD to test the Spark storage level.

>>> import pyspark   # only needed if the pyspark module is not already in the shell namespace
>>> test_rdd = sc.parallelize([1, 2, 3, 4, 5])
>>> test_rdd.persist(pyspark.StorageLevel.MEMORY_AND_DISK)
ParallelCollectionRDD[67] at parallelize at PythonRDD.scala:423
>>> test_rdd.getStorageLevel()
StorageLevel(True, True, False, True, 1)
>>> print(test_rdd.getStorageLevel())
Disk Memory Deserialized 1x Replicated
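
And here is a minimal sketch of the standalone-file alternative mentioned above. The file name and application name are placeholders; run it with spark-submit.

# storage_level_example.py  (hypothetical file name)
from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="StorageLevelExample")

# Create an RDD and persist it on disk only
test_rdd = sc.parallelize([1, 2, 3, 4, 5])
test_rdd.persist(StorageLevel.DISK_ONLY)

# Run an action so the persisted data is actually computed
print(test_rdd.count())
print(test_rdd.getStorageLevel())

sc.stop()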

The Pyspark storage level option is used with the persist() method, which is available to cache the Spark RDD. You can find more about this method in my other post, Spark RDD Cache and Persist to Improve Performance.
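
Note that cache() is simply persist() with the default storage level (MEMORY_ONLY for RDDs), and Spark does not allow changing the storage level of an RDD that is already persisted; you have to unpersist it first. A minimal sketch, assuming the same shell session as above:

>>> test_rdd.unpersist()                                # release the current storage level first
>>> test_rdd.persist(pyspark.StorageLevel.DISK_ONLY)    # re-persist with a different level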

Hope this helps 🙂