Extraction is the first step of the ETL(Extract, Transform and Load) process. Once the data is extracted, you can transform it and load to target data warehouse. Extraction is the process of extracting data from the source system for further use in the data warehouse environment.
Related Reading:
- Data Warehouse Fact Constellation Schema and Design
- Star Schema model in Data Warehouse
- Snowflake Schem Model in Data Warehouse
Data warehouse is an OLAP system, typically source system includes the transaction business processing application. For example, it could be sales order entry system which has order details.
Designing and creating an extraction process is often most important and time consuming task in the data warehouse environment. This is because source system might be complex system and requires us to extract the data several times to keep the up-to date data in the data warehouse environment.
Whatever data warehouse extraction methods you choose, is dependent on the source system and business needs in the target data warehouse environment.
Types of Data Warehouse Extraction Methods
There are two types of data warehouse extraction methods: Logcal and Physical extraction methods.
Logical Extraction
Logical Extraction method in-turn has two methods:
Full Extraction
In this method, data is completly extracted from the source system. The source data will be provided as-is and no additional logical information is necessary on the source system. Since it is complete extraction, so no need to track source system for changes.
For example, exporting complete table in the form of flat file.
Incremental Extraction
In incremental extraction, the changes in source data need to be tracked since the last successful extraction. Only these changes in data will be extracted and then loaded. Identifying the last changed data itself is the complex process and involve many logic.
You can detect the changes in the source system from the spicific column in the source system that has the last changed timestamp. You can also create a change table in the source system, which keeps track of the changes in the source data.
Many Data warehouse system do not use change-capture technique. Instead they extract the entire table from the source system into stage area and compare the data with previous version table and identify the data which has changed.
Physical Extraction
Physical extraction has two methods: Online and Offline extraction:
Online Extraction
In this process, extraction process directly connect to the source system and extract the source data.
Offline Extraction
The data is not extracted directly from the source system but is staged explicitly outside the original source system.
You can consider the following common structure in offline extraction:
- Flat file: Generic format
- Dump file: Database specific file
Data Warehouse Extraction Methods Examples
Here are some of Netezza specific Extraction methods:
Extract into Flat file using External Tables: You can use the external table to extract in flat files.
Dump File: Netezza can also provides the dump of the table. Later, you can use the nzload to load data into data warehouse
Flat File using nzsql: You can create the file file with nzsql with -o option.
Awesome post.
Thank you 🙂