Import CSV file to Pyspark DataFrame – Example

  • Post last modified: May 4, 2022
  • Post category: Apache Spark

Many organizations use a flat file format such as CSV or TSV to offload their tables. Flat files are easy to manage and can be transported over any electronic medium. In this article, we will check how to import a CSV file into a PySpark DataFrame, with some examples.

Import CSV file to Pyspark DataFrame

There are many methods you can use to import a CSV file into a PySpark or Spark DataFrame, but the following methods are the easiest to use.

  • Read Local CSV using com.databricks.spark.csv Format
  • Run Spark SQL Query to Create Spark DataFrame

Now, let us check these methods in detail with some examples.

Read Local CSV using com.databricks.spark.csv Format

This is one of the easiest methods you can use to import a CSV file into a Spark DataFrame. However, this method depends on the “com.databricks:spark-csv_2.10:1.2.0” package.

Start PySpark with the dependent package added. This step is mandatory if you want to use com.databricks.spark.csv.

For example, execute the following command on the command line interface to start the PySpark shell with the dependent package:

pyspark --packages com.databricks:spark-csv_2.10:1.2.0

Read CSV file using Spark CSV Package

Now that the required package is available, you can directly start importing CSV files.

Spark DataFrame Read CSV with Header

You can read a CSV file with or without a header. For example, consider the following command to read a CSV file with a header.

>>> df = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('file:///home/vithal/sparkFiles/file1.csv')
>>> df.show()
+---+-----+-----+
| id| name| city|
+---+-----+-----+
|  1|'ABC'|'BNG'|
|  2|'BCD'| 'ND'|
|  3|'CDE'|  'M'|
|  4|'DEF'|'BNG'|
+---+-----+-----+

Spark DataFrame Read CSV without Header

You can also read a CSV file without a header row.

For example, consider the following command to read a CSV file without a header.

>>> df = sqlContext.read.format('com.databricks.spark.csv').options(header='false').load('file:///home/vithal/sparkFiles/file2.csv')
>>> df.show()
+---+-----+-----+
| C0|   C1|   C2|
+---+-----+-----+
|  1|'ABC'|'BNG'|
|  2|'BCD'| 'ND'|
|  3|'CDE'|  'M'|
|  4|'DEF'|'BNG'|
+---+-----+-----+

Read HDFS File using com.databricks.spark.csv Format

You can also read a CSV file that is stored in an HDFS directory.

For example, consider the following command, which reads a CSV file from HDFS.

>>> df2 = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('hdfs:///vithal/file1.csv')
>>> df2.show()
+---+-----+-----+
| id| name| city|
+---+-----+-----+
|  1|'ABC'|'BNG'|
|  2|'BCD'| 'ND'|
|  3|'CDE'|  'M'|
|  4|'DEF'|'BNG'|
+---+-----+-----+

Note that you may want to provide a fully qualified HDFS file path if you are connecting from a remote source or an edge node.

Run Spark SQL Query to Create Spark DataFrame

If you have already created a permanent or external table on top of the CSV file, you can simply execute a query to load the contents of that table into a Spark DataFrame.

For example, consider the following Spark SQL example.

>>> df = sqlContext.sql("select * from test1_new")
>>> df.show()
+---+-----+-----+
| id| name| city|
+---+-----+-----+
|  1|'ABC'|'BNG'|
|  2|'BCD'| 'ND'|
|  3|'CDE'|  'M'|
|  4|'DEF'|'BNG'|
+---+-----+-----+

Hope this helps 🙂