Running SQL using Spark-SQL Command Line Interface (CLI)

In my other post, we have seen how to connect to Spark SQL using a Beeline JDBC connection. You can execute SQL queries in many ways: programmatically, from the spark-shell or pyspark shell, or through the Beeline JDBC client. Many do not know that Spark also ships with a spark-sql command line interface, which you can use to run the Hive metastore service in local mode and execute queries from the command line.

What is Spark-SQL command line Interface (CLI)?

The Spark SQL command line interface, or simply CLI, is a convenient tool to run the Hive metastore service in local mode and execute queries typed on the command line. Note that the Spark SQL CLI cannot talk to the Thrift JDBC server.

How to start Spark-SQL CLI?

If you want to connect to an existing Hive metastore, you must first configure Hive. This is done by placing your hive-site.xml file in the conf/ folder of the Spark installation directory.
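As a minimal sketch, assuming Hive's configuration lives under /etc/hive/conf and SPARK_HOME points at your Spark installation (both paths will differ on your setup), the copy is a single command:

# copy the Hive client configuration so spark-sql can find the metastore
cp /etc/hive/conf/hive-site.xml $SPARK_HOME/conf/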

To start the Spark-SQL CLI, run the command below from your Spark installation directory:

./bin/spark-sql
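Because spark-sql also accepts the usual spark-submit options (listed in the table later in this post), you can pass a master URL or override a Hive property at startup. The master and warehouse path below are only illustrative values:

# start the CLI locally with 4 cores and a custom warehouse location
./bin/spark-sql --master local[4] --hiveconf hive.metastore.warehouse.dir=/tmp/warehouse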

Once the connection is successful, you will see a spark-sql command prompt that looks like this:

spark-sql>

You can execute all Hive queries from this CLI.

For example,

spark-sql> select c_customer_ID, c_first_name from employees limit 10;
19/03/13 15:26:49 INFO ParseDriver: Parsing command: select c_customer_ID, c_first_name from employees limit 10
...
...
19/03/13 15:27:14 INFO DAGScheduler: Job 6 finished: processCmd at CliDriver.java:376, took 0.351280 s
AAAAAAAADAAAAAAA        Latisha
AAAAAAAAHAAAAAAA        Fonda
AAAAAAAALAAAAAAA        Betty
AAAAAAAAPAAAAAAA        Tonya
AAAAAAAADBAAAAAA        Andre
....
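Any statement that Hive accepts can be typed at the prompt; for instance, you can browse the metastore before querying. The statements below are only an illustration (employees is the table from the example above), and quit; exits the shell:

spark-sql> show databases;
spark-sql> show tables;
spark-sql> describe employees;
spark-sql> quit;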

Spark-SQL Command Line Interface (CLI) Options

Below are the Spark-SQL CLI options; a few example invocations using some of them are shown after the table.

Option                          Description
CLI options:
  -d,--define <key=value>       Variable substitution to apply to Hive commands
  --database <databasename>     Specify the database to use
  -e <quoted-query-string>      SQL from command line
  -f <filename>                 SQL from files
  -H,--help                     Print help information
  --hiveconf <property=value>   Use value for given property
  --hivevar <key=value>         Variable substitution to apply to Hive commands
  -i <filename>                 Initialization SQL file
  -S,--silent                   Silent mode in interactive shell
  -v,--verbose                  Verbose mode
  --master MASTER_URL           Spark master URL (e.g. spark://host:port, yarn, or local)
  --deploy-mode DEPLOY_MODE     Whether to launch the driver program locally ("client") or on one of the worker machines inside the cluster ("cluster"). Default: client
  --class CLASS_NAME            Application main class
  --name NAME                   Application name
  --jars JARS                   Comma-separated list of jars to include on the driver and executor classpaths
  --packages                    Comma-separated list of Maven coordinates of jars to include on the driver and executor classpaths
  --exclude-packages            Comma-separated list of groupId:artifactId to exclude while resolving the dependencies provided in --packages, to avoid dependency conflicts
  --repositories                Comma-separated list of additional remote repositories to search for the Maven coordinates given with --packages
  --py-files PY_FILES           Comma-separated list of files to place on PYTHONPATH
  --files FILES                 Comma-separated list of files to be placed in the working directory of each executor
  --conf PROP=VALUE             Spark configuration property
  --properties-file FILE        Path to a file from which to load extra properties
  --driver-memory MEM           Memory for driver (Default: 1024MB)
  --driver-java-options         Extra Java options to pass to the driver
  --driver-library-path         Extra library path entries to pass to the driver
  --driver-class-path           Extra class path entries to pass to the driver
  --executor-memory MEM         Memory per executor (Default: 1GB)
  --proxy-user NAME             User to impersonate when submitting the application
  --help, -h                    Show this help message and exit
  --verbose, -v                 Print additional debug output
  --version                     Print version information
Spark standalone with cluster deploy mode only:
  --driver-cores NUM            Cores for driver (Default: 1)
Spark standalone or Mesos with cluster deploy mode only:
  --supervise                   If given, restarts the driver on failure
  --kill SUBMISSION_ID          If given, kills the driver specified
  --status SUBMISSION_ID        If given, requests the status of the driver specified
Spark standalone and Mesos only:
  --total-executor-cores NUM    Total cores for all executors
Spark standalone and YARN only:
  --executor-cores NUM          Number of cores per executor
YARN-only options:
  --driver-cores NUM            Number of cores used by the driver, only in cluster mode (Default: 1)
  --queue QUEUE_NAME            The YARN queue to submit to
  --num-executors NUM           Number of executors to launch (Default: 2)
  --archives ARCHIVES           Comma-separated list of archives to be extracted into the working directory of each executor
  --principal PRINCIPAL         Principal to be used to log in to KDC, while running on secure HDFS
  --keytab KEYTAB               The full path to the file that contains the keytab for the principal specified above
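As a quick illustration of a few of these options, the commands below run SQL non-interactively. The query text, the query.sql file, and the mydb database are placeholders for your own objects, not something shipped with Spark:

# run a single query and exit (assumes the employees table from the earlier example exists)
./bin/spark-sql -e "select c_first_name from employees limit 5"

# execute the statements in a file against a specific database
./bin/spark-sql --database mydb -f query.sql

# start the interactive shell on YARN with 2 GB of memory per executor
./bin/spark-sql --master yarn --executor-memory 2G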


Hope this helps 🙂