The basic building block of Apache Spark is the RDD. The main abstraction Spark provides is the resilient distributed dataset (RDD), a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. In this article, we will check how to store an RDD using the PySpark StorageLevel. We will also check the various storage levels with some examples.
PySpark StorageLevel Explanation
PySpark storage levels are flags for controlling the storage of a resilient distributed dataset (RDD).
Each StorageLevel helps Spark decide whether to:
- Use memory
- Drop the RDD to disk if it falls out of memory
- Keep the data in memory in a Java-specific (JVM) serialized format
- Replicate the RDD partitions on multiple nodes
The StorageLevel class also contains static constants for some commonly used storage levels, such as MEMORY_ONLY.
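The flags above map directly onto the StorageLevel constructor, whose arguments are useDisk, useMemory, useOffHeap, deserialized, and an optional replication count. Here is a minimal sketch of building a custom level; the flag values are just for illustration:

from pyspark import StorageLevel

# StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication=1)
# Use both disk and memory, on-heap, serialized, with 2x replication.
custom_level = StorageLevel(True, True, False, False, 2)
print(custom_level)  # e.g. Disk Memory Serialized 2x Replicated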
PySpark StorageLevel
Below are the storage levels that PySpark supports. You can choose a storage level based on how you want Spark to store the RDD.
- StorageLevel.DISK_ONLY = StorageLevel(True, False, False, False)
- StorageLevel.DISK_ONLY_2 = StorageLevel(True, False, False, False, 2)
- StorageLevel.MEMORY_ONLY = StorageLevel(False, True, False, False)
- StorageLevel.MEMORY_ONLY_2 = StorageLevel(False, True, False, False, 2)
- StorageLevel.MEMORY_AND_DISK = StorageLevel(True, True, False, False)
- StorageLevel.MEMORY_AND_DISK_2 = StorageLevel(True, True, False, False, 2)
- StorageLevel.OFF_HEAP = StorageLevel(True, True, True, False, 1)
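To see what these flags mean in practice, you can print the constants from the pyspark shell; the trailing 2 in levels such as DISK_ONLY_2 is the replication factor. Output shown as it appears on recent Spark versions:

>>> from pyspark import StorageLevel
>>> print(StorageLevel.DISK_ONLY)
Disk Serialized 1x Replicated
>>> print(StorageLevel.DISK_ONLY_2)
Disk Serialized 2x Replicated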
PySpark StorageLevel Example
For simplicity, I have used the pyspark shell to test storage levels. Alternatively, you can create a Python file that creates a SparkContext and an RDD, and persists the RDD with a storage level option.
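Here is a minimal sketch of such a standalone script; the file name, application name, and sample data are illustrative only:

# storage_level_demo.py - persist an RDD from a standalone Python file
from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="StorageLevelDemo")

# Create an RDD and persist it with an explicit storage level.
rdd = sc.parallelize([1, 2, 3, 4, 5])
rdd.persist(StorageLevel.MEMORY_AND_DISK)

print(rdd.getStorageLevel())  # e.g. Disk Memory Serialized 1x Replicated
print(rdd.sum())              # 15; the action triggers computation and caching

sc.stop()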
Here is a sample example of using a storage level in the pyspark shell.
# Create an RDD to test Spark storage levels.
>>> import pyspark
>>> test_rdd = sc.parallelize([1, 2, 3, 4, 5])
>>> test_rdd.persist(pyspark.StorageLevel.MEMORY_AND_DISK)
ParallelCollectionRDD[67] at parallelize at PythonRDD.scala:423
>>> test_rdd.getStorageLevel()
StorageLevel(True, True, False, True, 1)
>>> print(test_rdd.getStorageLevel())
Disk Memory Deserialized 1x Replicated
The PySpark storage level option is used with the persist() method, which is available to cache a Spark RDD. You can find more about this method in my other post, Spark RDD Cache and Persist to Improve Performance.
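As a related note, cache() is simply shorthand for persisting with the default storage level (MEMORY_ONLY), and unpersist() releases the stored partitions. A quick sketch continuing the shell session above:

>>> test_rdd.unpersist()   # release the stored partitions
>>> test_rdd.cache()       # same as persist(pyspark.StorageLevel.MEMORY_ONLY)
>>> print(test_rdd.getStorageLevel())  # e.g. Memory Serialized 1x Replicated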
Hope this helps 🙂