Basic Spark Transformations and Actions using pyspark

Apache Spark provides two kinds of operations: transformations and actions. In this post, we will walk through the commonly used basic Spark transformations and actions using PySpark.

Create RDD from Local File

You can use the textFile method of the Spark context to create an RDD from the local or HDFS file system.

rdd = sc.textFile("file:////home/impadmin/test.txt")

Related Articles: Apache Spark Architecture, Design and Overview

Create RDD from HDFS File

rdd = sc.textFile("hdfs:/localhost:8020/home/impadmin/test.txt")

Basic Spark Transformations

Transformations are Spark operations that transform one RDD into another. A transformation always creates a new RDD from the original one.

Below are some basic transformations in Spark:

  • map()
  • flatMap()
  • filter()
  • groupByKey()
  • reduceByKey()
  • sample()
  • union()
  • distinct()

map()

The “map” transformation applies a function to every element of the RDD and returns a new RDD.

For example, convert all values in the RDD to upper case. You can either create a separate function to convert values to uppercase or write a lambda function in the map transformation.

rdd1 = rdd.map(lambda x: x.upper())
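
Alternatively, a minimal sketch using a separate named function instead of an inline lambda (the helper name to_upper is illustrative, not from the original post):

# Hypothetical named helper passed to map()
def to_upper(x):
    return x.upper()

rdd1 = rdd.map(to_upper)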

As per the above examples, we have transformed rdd into rdd1.

flatMap()

The “flatMap” transformation will return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.
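
For example, a minimal sketch that splits each line of a text RDD into words (assuming rdd holds lines of text, as created above):

# Split each line into words; the per-line lists are flattened into one RDD of words
rdd_words = rdd.flatMap(lambda line: line.split(" "))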

filter()

To remove unwanted values, you can use the “filter” transformation, which returns a new RDD containing only the elements that satisfy the given condition(s).

remove_values = ['ERTE','SADFS']
rdd2 = rdd1.filter(lambda x: x not in remove_values)

The above example applies the filter transformation on rdd1 to remove the values mentioned in remove_values and creates a new RDD.

groupByKey() and reduceByKey()

You can apply the groupByKey and reduceByKey transformations on key-value pair RDDs. The “groupByKey” transformation groups the values for each key in the original RDD, creating a new pair where the original key corresponds to the collected group of values.

rdd_tmp = rdd.map(lambda x: (x,1))
rdd_grp = rdd_tmp.groupByKey()
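
The “reduceByKey” transformation instead merges the values for each key using an associative function. A minimal word-count-style sketch on the same rdd_tmp pair RDD:

# Sum the counts per key; the lambda combines two values at a time
rdd_cnt = rdd_tmp.reduceByKey(lambda a, b: a + b)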

sample()

The “sample” transformation allows you to work on a sample of the data set.

rdd_sample = rdd.sample(False, .2, 4)

The sample method returns a new RDD containing a statistical sample of the original RDD. The example above selects roughly 20% (0.2) of the original RDD’s values; the first argument (False) samples without replacement, and the third argument (4) is the random seed.

union()

You can combine the contents of two different RDDs using the “union” transformation. The result is a new RDD with all values from both RDDs.

rdd_sample1 = rdd.sample(False, .2, 4)
rdd_sample2 = rdd.sample(False, .4, 4)
union_samples = rdd_sample1.union(rdd_sample2)

distinct()

You can select distinct elements from an RDD using the “distinct” transformation.

rdd_distinct = union_samples.distinct()

The distinct transformation creates a new RDD containing the distinct elements from the original RDD.

Basic Spark Actions

Actions in Spark are operations that return non-RDD values. Unlike transformations, actions do not create a new RDD; instead, they return a result to the driver program.

Below are some of the commonly used actions in Spark:

  • collect()
  • take(n)
  • count()
  • max()
  • min()
  • sum()
  • variance()
  • stdev()
  • reduce()

collect()

Collect is a simple Spark action that returns the entire RDD content to the driver program.

rdd_distinct.collect()

take(n)

You can use the “take” action to retrieve sample elements from the RDD. For example, you can check the first 5 values from the RDD using ‘take’:

rdd.take(5)

count()

The “count” action returns the number of elements in the RDD.

rdd_distinct.count()
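
The numeric actions below expect an RDD of numbers. The rdd_int used in the following examples is not defined earlier in the post; a minimal sketch of such an RDD could be:

# Hypothetical numeric RDD used in the max/min/sum/variance/stdev examples below
rdd_int = sc.parallelize([10, 20, 30, 40, 50])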

max()

The “max” action returns the maximum element of the RDD.

rdd_int.max()

min()

The “min” action returns the minimum element of the RDD.

rdd_int.min()

sum()

The “sum” action returns the sum of all elements of the RDD.

rdd_int.sum()

variance()

The “variance” action computes the variance of the elements of the RDD.

rdd_int.variance()

stdev()

The “stdev” action computes the standard deviation of the elements of the RDD.

rdd_int.stdev()

reduce()

The “reduce” action takes two elements at a time from the RDD and combines them with the given function, producing an output of the same type as the input elements.

num_rdd = sc.parallelize(range(0,100))
num_rdd.reduce(lambda x,y: x+y)
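
Here, reduce sums the integers 0 through 99 and returns 4950.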

Hope this helps!