Running SQL using Spark-SQL Command line Interface-CLI

In my other post, we have seen how to connect to Spark SQL using a Beeline JDBC connection. You can execute SQL queries in many ways: programmatically, from the spark or pyspark shell, or through a Beeline JDBC client. Many do not know that Spark also ships with a spark-sql command line interface, which you can use to run the Hive metastore service in local mode. Related Articles: Methods to access Hive Tables from Apache Spark; Spark SQL Cumulative Sum Function and Examples. What is the Spark-SQL command line interface (CLI)? The Spark SQL command line interface or…

Steps to Import Oracle Tables using Sqoop

Oracle database is one of the most widely used databases in the world. Most financial organizations use Oracle for their transaction processing. As mentioned in my other post, import Netezza tables using Apache Sqoop, with growing data volumes organizations are moving their computation to the Hadoop ecosystem. In this post, we will check the steps to import Oracle tables using Sqoop commands. Steps to Import Oracle Tables using Sqoop: Most organizations and people trying to get data into the Hadoop ecosystem use various options, such as creating flat files and…

Sqoop Export Hive Tables into Netezza

Hadoop systems are best suited for batch processing. Reporting is not recommended on Hadoop Hive or Impala. To enable faster reporting, organizations sometimes transfer the processed data from the Hadoop ecosystem to high-performance relational databases such as Netezza. In this article, we will check how to Sqoop export Hive tables into Netezza with working examples. Sqoop Export Hive Tables into Netezza: In some cases, data processed by the Hadoop ecosystem may be needed in production systems hosted on relational databases to help run additional critical business functions and generate reports. Sqoop can export…

How to Import Netezza Tables using Sqoop?

With growing data, organizations are moving computation to the Hadoop ecosystem. Apache Sqoop is an open source tool to import data from relational databases into Hadoop and vice versa, and it is one of the easiest tools for importing a relational database such as Netezza into the Hadoop ecosystem. The Sqoop command allows you to import all tables or a single table, execute a query, and store the result in Hadoop HDFS. In this article, we will check how to import Netezza tables using Sqoop with some practical examples. Sqoop uses a connector-based architecture which…

Steps to Connect Teradata Database from Spark – Examples

Apache Spark is one of the emerging big data technologies, thanks to its fast, in-memory distributed computation. You can analyze petabytes of data using Apache Spark's in-memory distributed computation. You can connect Spark to all major databases in the market, such as Netezza, Oracle, etc. In this article, we will check one of the methods to connect to a Teradata database from a Spark program. You can connect using either Scala or Python (PySpark). For all examples in this article, we will use Scala to read Teradata tables. You can even execute…
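The article's examples use Scala; as a rough PySpark sketch of the same JDBC approach, the snippet below reads a Teradata table into a DataFrame. The host, database, table and credentials are placeholders, and the Teradata JDBC driver jar is assumed to be on the Spark classpath (for example via --jars).

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("teradata-read-example").getOrCreate()

# Placeholder connection details -- replace with your own Teradata host, database and credentials.
# Assumes the Teradata JDBC driver jar (terajdbc4.jar) is available on the Spark classpath.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:teradata://td-host/database=sales_db")
      .option("driver", "com.teradata.jdbc.TeraDriver")
      .option("dbtable", "sales_db.transactions")
      .option("user", "td_user")
      .option("password", "td_password")
      .load())

df.show(5)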

Steps to Connect Oracle Database from Spark – Examples

Apache Spark is one of the emerging big data technologies, thanks to its fast, in-memory distributed computation. You can analyze petabytes of data using Apache Spark's in-memory distributed computation. In this article, we will check one of the methods to connect to an Oracle database from a Spark program. Preferably, we will use Scala to read Oracle tables. You can even execute queries and create a Spark DataFrame. Steps to Connect Oracle Database from Spark: Oracle database is one of the most widely used databases in the world. Almost all companies use Oracle as…
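The article prefers Scala; as a hedged PySpark sketch of the same idea, the snippet below runs a query against an Oracle table over JDBC and builds a DataFrame from the result. The host, service name, schema and credentials are placeholders, and the Oracle JDBC driver jar (e.g. ojdbc8.jar) is assumed to be on the Spark classpath.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("oracle-read-example").getOrCreate()

# Placeholder connection details -- replace with your Oracle host, service name and credentials.
# A subquery with an alias can be passed as "dbtable" to push the query down to Oracle.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCLPDB1")
      .option("driver", "oracle.jdbc.OracleDriver")
      .option("dbtable", "(SELECT employee_id, first_name, salary FROM hr.employees) emp")
      .option("user", "hr_user")
      .option("password", "hr_password")
      .load())

df.show(5)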

How to Connect Netezza Server from Spark? – Example

I was working on a Spark project where we had a requirement to connect to a Netezza server from Spark. Integrating Netezza and Apache Spark enables analytic capabilities using Apache Spark for data residing in a Netezza database. There are various ways to connect to a Netezza server from a Spark program. You can also connect to the Netezza server from PySpark. So, why connect to a Netezza server from Spark? This is an interesting question. To answer it, let us check how Apache Spark works. Apache Spark works on data…
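As a hedged illustration of one such method, the PySpark sketch below reads a Netezza table through the Netezza JDBC driver. All connection details are placeholders, and the nzjdbc.jar driver is assumed to be on the Spark classpath.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("netezza-read-example").getOrCreate()

# Placeholder connection details -- replace with your Netezza host, database and credentials.
# Assumes the Netezza JDBC driver (nzjdbc.jar) is on the Spark classpath.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:netezza://netezza-host:5480/analytics_db")
      .option("driver", "org.netezza.Driver")
      .option("dbtable", "customer_transactions")
      .option("user", "nz_user")
      .option("password", "nz_password")
      .load())

df.show(5)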

Basic Spark Transformations and Actions using pyspark

Apache Spark provides two kinds of operations: Transformations and Actions. We will check the commonly used basic Spark Transformations and Actions using pyspark.

Create RDD from Local File: you can use the textFile Spark context method to create an RDD from the local or HDFS file system:

rdd = sc.textFile("file:///home/impadmin/test.txt")

Related Articles: Apache Spark Architecture, Design and Overview

Create RDD from HDFS File:

rdd = sc.textFile("hdfs://localhost:8020/home/impadmin/test.txt")

Basic Spark Transformations: transformations are Spark operations that transform one RDD into another. A transformation always creates a new RDD from the original one. Below are some basic…
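For instance, here is a minimal sketch of a transformation followed by an action, assuming a running pyspark shell where sc is the SparkContext:

# Transformation: map() lazily returns a new RDD; nothing is computed yet.
rdd = sc.parallelize([1, 2, 3, 4, 5])
squared = rdd.map(lambda x: x * x)

# Action: collect() triggers the computation and returns the results to the driver.
print(squared.collect())   # [1, 4, 9, 16, 25]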

Apache Spark SQL Introduction and Features

In Apache Spark, Spark SQL is a module for working with structured and semi-structured data. Any data that has a schema is considered structured data, for example JSON, Hive tables, Parquet file formats, etc., whereas semi-structured data has no separation between the schema and the data. In this article, we will check the Apache Spark SQL introduction and its features. Apache Spark SQL Introduction: As mentioned earlier, Spark SQL is a module for working with structured and semi-structured data. Spark SQL works well with huge…
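As a small hedged sketch of the idea (the file path and column names below are placeholders), Spark SQL can load schema-carrying data such as JSON and query it with plain SQL:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-intro").getOrCreate()

# Placeholder path -- any JSON file with a consistent schema works here.
df = spark.read.json("/tmp/employees.json")
df.createOrReplaceTempView("employees")

spark.sql("SELECT name, salary FROM employees WHERE salary > 50000").show()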

Apache Hive User-defined Functions

Apache Hive is a data warehouse framework on top of the Hadoop ecosystem. The Apache Hive architecture is different compared to other available Hadoop tools. Being an open source project, Apache Hive has added a lot of functionality since its inception, but it still lacks some basic functionality that is available in traditional data warehouse systems such as Netezza, Teradata, Oracle, etc. In this post, we will check Apache Hive user-defined functions and how to use them to perform a specific task. Apache Hive User-defined Functions: When you start…
