A data warehouse, also known as an enterprise data warehouse (EDW), is a large, central store of data used to drive data-driven decisions, making it one of the centrepieces of an organization’s data infrastructure. Building a data warehouse on Hadoop was a challenge in the early days of the ecosystem, but with the many improvements since then it is now much easier to design a Hadoop data warehouse architecture. This article serves as a guide to Hadoop data warehouse system design.
Hadoop data warehouse integration has become very popular, and many companies are working on migration tools. In this article, we will walk through a Hadoop data warehouse example with an architecture design.
High-Level Hadoop Data Warehouse Architecture
Below is the high-level design of the Hadoop data warehouse architecture:
Hadoop Data Warehouse Architecture Explanation
Extract Data From Sources
The operational source systems are typically traditional OLTP database systems. Since this is a Hadoop ecosystem, you can also introduce multi-structured data such as weblogs, machine log data, and social media feeds from Facebook, Twitter, LinkedIn, etc.
You can use Sqoop as the ingestion mechanism when importing data from traditional OLTP database systems. Once processing is done in the Hadoop ecosystem, you can also use Sqoop to export data back to the OLTP systems.
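For example, a minimal Sqoop import into a raw HDFS landing directory might look like the sketch below. The JDBC URL, credentials, table name, and target directory are placeholder assumptions; adjust them to your environment.

```bash
# Hypothetical example: import an OLTP table into a raw HDFS landing directory.
# The JDBC URL, credentials, table, and paths are placeholders, not real systems.
sqoop import \
  --connect jdbc:mysql://oltp-host:3306/sales_db \
  --username etl_user \
  --password-file /user/etl/.db_password \
  --table orders \
  --target-dir /data/raw/sales_db/orders \
  --num-mappers 4 \
  --fields-terminated-by ','
```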
Transformation
Source data is ingested directly into designated HDFS directories before being transformed and loaded into the target systems. Transformations run on one of the processing frameworks supported on Hadoop, such as MapReduce, Spark, Hive, Impala, or Pig.
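As a sketch of the transformation step, the snippet below exposes the raw landing directory as an external Hive table and writes a cleaned daily aggregate into a transformed zone. The table names, columns, and paths are illustrative assumptions, not part of the reference architecture.

```bash
# Hypothetical transformation: expose the raw files as an external Hive table,
# then write a cleaned daily aggregate into a "transformed" zone on HDFS.
hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS raw_orders (
  order_id BIGINT, customer_id BIGINT, amount DOUBLE, order_ts STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/raw/sales_db/orders';

CREATE TABLE IF NOT EXISTS daily_sales (order_date STRING, total_amount DOUBLE)
STORED AS PARQUET
LOCATION '/data/transformed/sales_db/daily_sales';

INSERT OVERWRITE TABLE daily_sales
SELECT to_date(order_ts) AS order_date, SUM(amount) AS total_amount
FROM raw_orders
GROUP BY to_date(order_ts);
"
```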
Load Data to Target Systems
Once transformed, the data is moved into the target systems. In our example, this means hosting the data in Hadoop behind a SQL-like interface such as Hive or Impala. Additionally, you can export the processed data into other data warehouses for further analysis; a Sqoop export can transfer the data back to OLTP or data warehouse systems. You can stitch these Sqoop steps into a Hadoop data warehouse ETL process with the help of shell or Python scripting.
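A Sqoop export of the transformed data back to a relational warehouse might look like the following sketch; again, the connection details, target table, and export directory are assumptions for illustration.

```bash
# Hypothetical export: push a delimited text extract of the transformed data
# back to an RDBMS reporting table. URL, credentials, table, and directory
# are placeholders.
sqoop export \
  --connect jdbc:mysql://dw-host:3306/reporting_db \
  --username etl_user \
  --password-file /user/etl/.db_password \
  --table daily_sales \
  --export-dir /data/export/sales_db/daily_sales \
  --input-fields-terminated-by ','
```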
Additionally, you can access the data directly from the HDFS-backed schema using various BI, analytics, or visualisation tools.
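Most such tools connect through the HiveServer2 JDBC interface; the Beeline command below is a minimal sketch of that connection, with the hostname and query being assumptions.

```bash
# Hypothetical query through HiveServer2 (the same JDBC endpoint BI tools use).
# The hostname, port, and table are placeholders.
beeline -u jdbc:hive2://hiveserver-host:10000/default \
  -e "SELECT order_date, total_amount FROM daily_sales ORDER BY order_date LIMIT 10;"
```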
Hadoop Data Warehouse Advantages
Hadoop can help overcome some of the challenges that traditional data warehouse systems face today:
- It can process large volumes of complex data and complete ETL processing within the required time constraints.
- Its distributed processing model spreads the workload across the cluster, reducing excessive load on any single system.
- Hadoop is flexible: it can handle structured, semi-structured, and unstructured data without a rigid upfront schema.
- It is cheap compared to traditional data warehouse systems.
How Hadoop Complements a Traditional Data Warehouse
Using Hadoop alongside a traditional data warehouse also addresses a few long-standing challenges:
- Putting data on the Hadoop ecosystem provides access to potentially valuable data that might otherwise never be available in a traditional data warehouse ecosystem.
- The flexibility of Hadoop allows for evolving schemas and handling of semi-structured and unstructured data, which enables fast turnaround when downstream reports or schemas change.
- Using Hadoop as an online archive frees up valuable space and resources in the data warehouse and avoids expensive scaling of the data warehouse architecture.
Where Can You Build a Hadoop Data Warehouse?
You can build a Hadoop data warehouse in Hive or Impala. Because Impala is an MPP (massively parallel processing) engine, it generally gives better performance than Hive for interactive queries.
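Since Hive and Impala share the Hive metastore, a table defined once can be queried from either engine. The sketch below assumes the daily_sales table created earlier and an Impala daemon reachable on the cluster; the hostname is a placeholder.

```bash
# Query the same metastore table from Hive (batch-oriented).
hive -e "SELECT COUNT(*) FROM daily_sales;"

# Make Impala pick up the table from the shared metastore,
# then run the same query with lower latency.
impala-shell -i impala-host:21000 -q "INVALIDATE METADATA daily_sales"
impala-shell -i impala-host:21000 -q "SELECT COUNT(*) FROM daily_sales"
```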