An iterator is an object in Python that represents a stream of data. You can create an iterator object by applying the iter() built-in function to an iterable, such as a list or a tuple. For example, a list is an iterable, so you can get an iterator from it and run a for loop over it. In this article, we will check the Python Pyspark iterator, and how to create and use it.
Python Pyspark Iterator
As you know, Spark is a fast distributed processing engine. It uses RDDs to distribute the data across all machines in the cluster, which is why the Python iter() built-in will not work on a Pyspark RDD or DataFrame.
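For example, calling iter() on an RDD raises a TypeError. Here is a minimal sketch, assuming an active SparkContext named sc (as in the pyspark shell):

# Assumes an active SparkContext named 'sc', e.g. the pyspark shell
rdd = sc.parallelize([1, 2, 3])

# iter() fails because an RDD does not implement the iterator
# protocol -- the data lives on the cluster, not on one machine
try:
    iter(rdd)
except TypeError as err:
    print(err)   # 'RDD' object is not iterable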
Instead, Pyspark provides its own method, toLocalIterator(), which you can use to create an iterator from a Spark DataFrame.
Pyspark toLocalIterator
The toLocalIterator method returns an iterator that contains all of the elements in the given RDD. The iterator will consume as much memory as the largest partition in this RDD.
toLocalIterator is used to collect the data of an RDD scattered across your cluster onto a single node, the one on which the driver program runs, so that you can perform a task with all of the data on that node. It is similar to the collect method, but instead of returning a List it returns an Iterator object.
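The difference is easy to see in a minimal sketch (assuming an RDD named rdd already exists):

# collect() materializes every partition in driver memory at once
all_elements = rdd.collect()   # returns a Python list

# toLocalIterator() pulls one partition at a time, so the driver
# only needs enough memory for the largest partition
for element in rdd.toLocalIterator():
    print(element)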
Syntax
Below is the syntax that you can use to create an iterator in Python Pyspark:
rdd.toLocalIterator()
Pyspark toLocalIterator Example
You can directly create the iterator from a Spark DataFrame using the above syntax. Below is an example for your reference:
# Create DataFrame
sample_df = sqlContext.sql("select * from sample_tab1")
# Create iterator
iter_var = sample_df.rdd.toLocalIterator()
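If you are on Spark 2.x or later, you can build the same iterator from a SparkSession instead of the older sqlContext. Below is a self-contained sketch; the session name and sample data are assumptions standing in for the sample_tab1 table:

from pyspark.sql import SparkSession

# Create a SparkSession (the entry point for Spark 2.x and later)
spark = SparkSession.builder.appName("IteratorExample").getOrCreate()

# Hypothetical sample data standing in for the sample_tab1 table
sample_df = spark.createDataFrame([(1, 'AAA'), (2, 'BBB'), (3, 'CCC')], ['id', 'name'])

# Create iterator
iter_var = sample_df.rdd.toLocalIterator()

Note that from Spark 2.0 onwards you can also call toLocalIterator() directly on the DataFrame itself.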
How to Get Data from a Pyspark Iterator?
You can use the next() built-in function to get the data out of the Pyspark iterator. Note, however, that next() returns only one Row object at a time.
>>> next(iter_var)
Row(id=1, name=u'AAA')
You can access an individual value by qualifying the Row object with a column name. You can use either of the below methods to get the data for a given column.
>>> next(iter_var).id
2
>>> next(iter_var)['id']
3
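If you want to process all the rows rather than fetching them one at a time, you can simply run a for loop over the same iterator. A short sketch, assuming the iter_var created above:

# Loop over the remaining rows, one partition at a time
for row in iter_var:
    print(row.id, row['name'])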
Related Articles,
- Spark RDD Cache and Persist to Improve Performance
- Create Pyspark sparkContext within python Program
- Execute Pyspark Script from Python and Examples
Hope this helps 🙂