Apache Spark SQL Introduction and Features

  • Post last modified: March 14, 2019
  • Post category: Apache Spark
  • Reading time: 7 mins read

In Apache Spark, Spark SQL is a module for working with structured and semi-structured data. Any data that has a schema is considered structured data, for example JSON, Hive tables, Parquet file formats, etc. Semi-structured data, on the other hand, carries its schema embedded with the data itself, with no separation between the two. In this article, we will check the Apache Spark SQL introduction and its features.

Apache Spark SQL Introduction

As mentioned earlier, Spark SQL is a module for working with structured and semi-structured data. Spark SQL works well with huge amounts of data because it supports distributed in-memory computation. You can either create tables in the Spark warehouse or connect to the Hive metastore and read existing Hive tables.

You can read data from JSON files or from Hive tables. Spark SQL returns result sets as DataFrames. You can interact with Spark SQL using the command line interface or the Spark shell. Hadoop distributions such as HDP also provide a Beeline interface to interact with Hive tables through Spark.

Spark SQL provides three main capabilities for working with structured and semi-structured data:

  • Spark SQL provides DataFrames as an abstraction in Python, Java and Scala. DataFrames simplify working with structured or semi-structured datasets. DataFrames are similar to tables in a traditional relational database (see the sketch after this list).
  • It can read data from Hive metastores, JSON and Parquet file formats.
  • You can execute your queries directly through the Spark SQL interface or through JDBC connectors.
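
For example, here is a minimal sketch of the first two capabilities in the PySpark shell (where sqlContext is created automatically), assuming a hypothetical people.json file with name and age fields:

>>> # Read a JSON file into a DataFrame (file name and fields are hypothetical)
>>> df = sqlContext.read.json("people.json")
>>> df.printSchema()
>>> # DataFrame operations look like operations on a relational table
>>> df.select("name", "age").filter(df.age > 21).show()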

Related post: Steps to Connect HiveServer2 using Apache Spark JDBC Driver and Python

Access Hive Tables from Apache Spark

For large-scale data warehouse systems working with petabytes of data, the time taken to process aggregate functions can be reduced by executing complex queries efficiently. Spark SQL is one of the options you can use to process large data sets. Spark SQL performs distributed in-memory computation and has a built-in Catalyst optimizer that speeds up query processing. You can also join external DataFrames with Hive tables.

To establish a connection between Spark SQL and the Hive metastore, simply copy the hive-site.xml file to the conf directory of your Spark installation. The hive-site.xml file is available in the Hive configuration directory.
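
Once hive-site.xml is in place, a HiveContext can query Hive tables directly. Below is a minimal standalone PySpark sketch, assuming a hypothetical Hive table named default.employees:

from pyspark import SparkContext
from pyspark.sql import HiveContext

# Create a Spark context and wrap it in a HiveContext
sc = SparkContext(appName="hive-example")
hive_context = HiveContext(sc)

# Query a Hive table (database and table names are hypothetical)
df = hive_context.sql("SELECT * FROM default.employees LIMIT 10")
df.show()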

Spark SQL Basic Query Examples

To query any Hive table, we call the sql() method on either the HiveContext or the SQLContext. In the Spark shell, a HiveContext is created automatically and exposed as sqlContext. We can use it directly to execute any query on Spark.

Python code example:

>>> qry = "select 1"
>>> results = sqlContext.sql(qry)
>>> results.show()
+---+
|_c0|
+---+
|  1|
+---+
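
You can also register a DataFrame as a temporary table and query it with SQL in the same way. A short sketch, reusing the hypothetical people.json file from earlier:

>>> df = sqlContext.read.json("people.json")
>>> # Register the DataFrame so it can be referenced in SQL (table name is hypothetical)
>>> df.registerTempTable("people")
>>> sqlContext.sql("select name from people where age > 21").show()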

Related Article: Spark SQL Performance Tuning – Improve Spark SQL Performance

DataFrame in Spark

DataFrames are like tables in a traditional relational database. A DataFrame is an RDD of Row objects. A DataFrame in Spark is also aware of the schema of each of its rows. DataFrames store data more efficiently than native RDDs by taking advantage of the schema.
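
Here is a minimal sketch of building a DataFrame from an RDD of Row objects in the PySpark shell (the field names and values are hypothetical):

>>> from pyspark.sql import Row
>>> rdd = sc.parallelize([Row(name="Alice", age=30), Row(name="Bob", age=25)])
>>> df = sqlContext.createDataFrame(rdd)
>>> # The DataFrame knows the schema of its rows
>>> df.printSchema()
>>> df.show()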

Loading and Accessing Data using Spark SQL

Spark SQL supports various structured data sources. Built-in data sources include Hive tables, JSON, Parquet files, etc.
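
Built-in sources are read through the DataFrameReader. A short sketch with hypothetical file paths and table name:

>>> json_df = sqlContext.read.json("/data/events.json")
>>> parquet_df = sqlContext.read.parquet("/data/events.parquet")
>>> # An existing Hive table can also be loaded as a DataFrame
>>> hive_df = sqlContext.table("default.events")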

In addition, Spark SQL provides a data source API that allows integration with other data sources such as Avro, Apache HBase, etc.
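
External sources are loaded through the generic format()/load() interface. For example, a sketch that reads Avro files, assuming the third-party spark-avro package has been added to the job (the package name and file path are assumptions):

>>> avro_df = sqlContext.read.format("com.databricks.spark.avro").load("/data/events.avro")
>>> avro_df.show()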

Spark SQL JDBC Connection

Spark also provides a JDBC driver that you can use to connect to Hive tables or to any business intelligence tool.

Note that CDH does not ship a Spark JDBC driver. On CDH, you must create a Spark context (and a HiveContext) to connect to Hive tables.
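
Where a JDBC driver is available, one way to connect from Python is through the JayDeBeApi bridge. A minimal sketch, assuming a HiveServer2 (or Spark Thrift server) endpoint on port 10000 and a locally available Hive JDBC jar (the host, credentials and jar path are hypothetical):

import jaydebeapi

# Connect through the Hive JDBC driver (URL, credentials and jar path are hypothetical)
conn = jaydebeapi.connect(
    "org.apache.hive.jdbc.HiveDriver",
    "jdbc:hive2://example-host:10000/default",
    ["username", "password"],
    "/path/to/hive-jdbc-standalone.jar")

cursor = conn.cursor()
cursor.execute("select 1")
print(cursor.fetchall())
cursor.close()
conn.close()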

Spark SQL Features

Below are some of the important Spark SQL features:

Heterogeneous Data Source

Spark SQL can integrate data from various data sources such as Hive, JSON and Parquet file formats. Using Spark SQL, you can query both structured and semi-structured data sets.

Unified Data Access

You can use the DataFrame API and SQL together to combine results from various data sources.
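
For example, a sketch that joins a Hive table with a JSON file in one query (the table name, file name and columns are hypothetical):

>>> orders = sqlContext.read.json("orders.json")
>>> orders.registerTempTable("orders")
>>> # Join the temporary table with a Hive table in a single SQL statement
>>> sqlContext.sql("select c.name, o.amount from customers c join orders o on c.id = o.customer_id").show()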

Query Compatibility

Spark SQL is highly compatible with Hive queries. You can run existing Hive queries without modification.

Industry Standard Connectors

Spark SQL supports JDBC/ODBC connectors. You can use these connectors to get data from various data sources.

Hope this helps 🙂