Running SQL using Spark-SQL Command Line Interface (CLI)

In my other post, we have seen how to connect to Spark SQL using a Beeline JDBC connection. You can execute SQL queries in many ways: programmatically, from the spark-shell or pyspark shell, or through the Beeline JDBC client. Many do not know that Spark also ships with a spark-sql command line interface, which you can use to run the Hive metastore service in local mode and execute queries from the command line.

What is Spark-SQL command line Interface (CLI)?

The Spark SQL command line interface, or simply CLI, is a convenient tool to run the Hive metastore service in local mode and execute queries typed on the command line. Note that the Spark SQL CLI cannot talk to the Thrift JDBC server.

How to start Spark-SQL CLI?

If you want to connect to an existing Hive metastore, you must first configure Hive. This is done by placing your hive-site.xml file in the conf/ folder of the Spark installation directory.
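As a minimal sketch, assuming Hive's configuration lives under /etc/hive/conf and SPARK_HOME points at your Spark installation (both paths will differ on your setup), the copy is a single command:

# copy the Hive client configuration so spark-sql can find the metastore
cp /etc/hive/conf/hive-site.xml $SPARK_HOME/conf/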

To start the Spark-SQL CLI, run the command below from your Spark installation directory:

./bin/spark-sql
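Because spark-sql also accepts the usual spark-submit options (listed in the table later in this post), you can pass a master URL or override a Hive property at startup. The master and warehouse path below are only illustrative values:

# start the CLI locally with 4 cores and a custom warehouse location
./bin/spark-sql --master local[4] --hiveconf hive.metastore.warehouse.dir=/tmp/warehouse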

Once the connection is successful, you will see a spark-sql command prompt that looks like this:

spark-sql>

You can execute all Hive queries from this CLI.

For example,

spark-sql> select c_customer_ID, c_first_name from employees limit 10;
19/03/13 15:26:49 INFO ParseDriver: Parsing command: select c_customer_ID, c_first_name from employees limit 10
...
...
19/03/13 15:27:14 INFO DAGScheduler: Job 6 finished: processCmd at CliDriver.java:376, took 0.351280 s
AAAAAAAADAAAAAAA        Latisha
AAAAAAAAHAAAAAAA        Fonda
AAAAAAAALAAAAAAA        Betty
AAAAAAAAPAAAAAAA        Tonya
AAAAAAAADBAAAAAA        Andre
....
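Any statement that Hive accepts can be typed at the prompt; for instance, you can browse the metastore before querying. The statements below are only an illustration (employees is the table from the example above), and quit; exits the shell:

spark-sql> show databases;
spark-sql> show tables;
spark-sql> describe employees;
spark-sql> quit;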

Spark-SQL Command Line Interface (CLI) Options

Below are the Spark-SQL CLI options; a few example invocations using some of them are shown after the table.

Option                          Description
CLI options:
  -d,--define <key=value>       Variable substitution to apply to Hive commands
  --database <databasename>     Specify the database to use
  -e <quoted-query-string>      SQL from command line
  -f <filename>                 SQL from files
  -H,--help                     Print help information
  --hiveconf <property=value>   Use value for given property
  --hivevar <key=value>         Variable substitution to apply to Hive commands
  -i <filename>                 Initialization SQL file
  -S,--silent                   Silent mode in interactive shell
  -v,--verbose                  Verbose mode
  --master MASTER_URL           Spark master URL (e.g. spark://host:port, yarn, or local)
  --deploy-mode DEPLOY_MODE     Whether to launch the driver program locally ("client") or on one of the worker machines inside the cluster ("cluster"). Default: client
  --class CLASS_NAME            Application main class
  --name NAME                   Application name
  --jars JARS                   Comma-separated list of jars to include on the driver and executor classpaths
  --packages                    Comma-separated list of Maven coordinates of jars to include on the driver and executor classpaths
  --exclude-packages            Comma-separated list of groupId:artifactId to exclude while resolving the dependencies provided in --packages, to avoid dependency conflicts
  --repositories                Comma-separated list of additional remote repositories to search for the Maven coordinates given with --packages
  --py-files PY_FILES           Comma-separated list of files to place on PYTHONPATH
  --files FILES                 Comma-separated list of files to be placed in the working directory of each executor
  --conf PROP=VALUE             Spark configuration property
  --properties-file FILE        Path to a file from which to load extra properties
  --driver-memory MEM           Memory for driver (Default: 1024MB)
  --driver-java-options         Extra Java options to pass to the driver
  --driver-library-path         Extra library path entries to pass to the driver
  --driver-class-path           Extra class path entries to pass to the driver
  --executor-memory MEM         Memory per executor (Default: 1GB)
  --proxy-user NAME             User to impersonate when submitting the application
  --help, -h                    Show this help message and exit
  --verbose, -v                 Print additional debug output
  --version                     Print version information
Spark standalone with cluster deploy mode only:
  --driver-cores NUM            Cores for driver (Default: 1)
Spark standalone or Mesos with cluster deploy mode only:
  --supervise                   If given, restarts the driver on failure
  --kill SUBMISSION_ID          If given, kills the driver specified
  --status SUBMISSION_ID        If given, requests the status of the driver specified
Spark standalone and Mesos only:
  --total-executor-cores NUM    Total cores for all executors
Spark standalone and YARN only:
  --executor-cores NUM          Number of cores per executor
YARN-only options:
  --driver-cores NUM            Number of cores used by the driver, only in cluster mode (Default: 1)
  --queue QUEUE_NAME            The YARN queue to submit to
  --num-executors NUM           Number of executors to launch (Default: 2)
  --archives ARCHIVES           Comma-separated list of archives to be extracted into the working directory of each executor
  --principal PRINCIPAL         Principal to be used to log in to KDC, while running on secure HDFS
  --keytab KEYTAB               The full path to the file that contains the keytab for the principal specified above
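As a quick illustration of a few of these options, the commands below run SQL non-interactively. The query text, the query.sql file, and the mydb database are placeholders for your own objects, not something shipped with Spark:

# run a single query and exit (assumes the employees table from the earlier example exists)
./bin/spark-sql -e "select c_first_name from employees limit 5"

# execute the statements in a file against a specific database
./bin/spark-sql --database mydb -f query.sql

# start the interactive shell on YARN with 2 GB of memory per executor
./bin/spark-sql --master yarn --executor-memory 2G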


Hope this helps 🙂