Data plays an important role in today's decision-making processes. Online bookstores, e-commerce websites, and food delivery applications all use customer data to provide better service. Many organizations share data with decision-making systems, typically delivering it as flat files or through direct access to the source system. Many of these companies use Spark as an execution engine. In this article, we will check how to export Spark SQL results to a CSV flat file. The resulting CSV files can then be transferred using any mode of electronic transport, such as email, FTP, or SFTP.
Decision makers then analyze the flat files provided to them, so saving the results of a query or a table from a relational database or Hive is a basic but important task.
Note that the examples demonstrated in this article were tested using PySpark.
How to Export Spark-SQL Results to CSV?
There are many methods you can use to export Spark SQL results to a flat file. I will discuss the commonly used ones in this article.
- Convert Spark DataFrame to pandas DataFrame and save to CSV
- Use CSV Data Source to Export Spark DataFrame to CSV
Convert Spark DataFrame to pandas DataFrame and save to CSV
This is one of the easiest methods you can follow to export Spark SQL results to a flat file in CSV format.
If the Spark DataFrame fits in the Spark driver's memory and you want to save it to the local file system, you can convert it to a local pandas DataFrame using the toPandas method and then simply call to_csv.
Note that pandas must be installed in your Python distribution to use this method.
sparkdf.toPandas().to_csv('test.csv')
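A minimal, runnable sketch of the to_csv step is below. It uses a plain pandas DataFrame as a stand-in for the output of sparkdf.toPandas(), since the conversion itself just yields a pandas DataFrame. One gotcha worth knowing: to_csv writes the pandas row index as an extra first column by default, which downstream consumers of the flat file rarely expect, so index=False is usually what you want.

```python
import pandas as pd

# Stand-in for sparkdf.toPandas(); in a real job this DataFrame
# would come from a Spark SQL query pulled to the driver.
pdf = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# index=False drops the pandas row index column from the output file.
pdf.to_csv("test.csv", index=False)

print(open("test.csv").read())
```

Without index=False, the file would start with an unnamed index column (`,id,name`), which often confuses loaders on the receiving side.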
CSV Data Source to Export Spark DataFrame to CSV
Another easy method is to use the Spark CSV data source to save your Spark DataFrame content in CSV format. Note that write.csv produces a directory of part files rather than a single CSV file.
sparkdf.write.csv('test.csv')
Note that built-in Spark CSV data source support is available in Spark version 2.0 and above.
In Spark 1.3, use the Databricks spark-csv package with the method below:
sparkdf.save('test.csv', 'com.databricks.spark.csv')
In Spark 1.4 and above (but below 2.0), use the method below:
sparkdf.write.format('com.databricks.spark.csv').save('test.csv')
Related Articles
- Basic Spark Transformations and Actions using pyspark
- How to Update Spark DataFrame Column Values using Pyspark?
- Running SQL using Spark-SQL Command line Interface-CLI
- Import CSV file to Pyspark DataFrame – Example
- How to Access Azure Blob Storage Files from Databricks?
Hope this helps 🙂