Dimensions in data warehousing contain relatively static data about entities such as customers, stores, locations etc. Slowly changing dimensions commonly known as SCD, usually captures the data that changes slowly but unpredictably, rather than regular bases. Slowly changing dimension type 2 is most popular method used in dimensional modelling to preserve historical data.
For example, lets take the example of patient details. The fact table may contains the information about patient expense details. The fact and dimensions are always linked by means of foreign keys. One of the dimension may contain the information about patient (say, patient dimension ) . The patient information will not change on day to day bases. It’ll change once in a while (slowly changes).
If you want to implement the Slowly Changing Dimension Type 2 in SQL (without ETL tools), it’s gonna take bit complex route but you’ll end up with best feeling in world of implementing SCD Type 2.
Okay!. Let’s get started with building Slowly Changing Dimension Type 2 on patient dimension table.
Let us consider the patient details table which has attributes like SK or PAT_ID, Phone number, name etc. Here we need to capture the updated phone number or name and store that historical data in same table.
In data warehouse environment, usually tables are divided into three tables viz. LOAD (just a data dump) , ITERMEDIATE (transformed data) and TARGET (replica of TARGET) tables, will be referred as LOAD, INT and TGT henceforth. In this SCD type 2 implementation , we will be using all these three tables.
Netezza version 7.2.1 or higher supports the Merge command, you can use that to implement the SCD Type 2 easily.
Read:
- Impala or Hive Slowly Changing Dimension – SCD Type 2 Implementation
- Netezza MERGE command to Manipulate Records from Table
Below are the content of LOAD and TARGET table at the time building Slowly Changing Dimension Type 2 Type 2.
LOAD : PAT_DTLS_LOAD – Hold new and updated data. Consider we received data as below and we have to load this data into patient dimension which has SCD Type2 implemented on it.
SK | PHONE_NO | NAME |
123 | 987654321 | Jhoney |
234 | 876543210 | Stuart |
567 | 345678901 | Max Stuart |
456 | 765432101 | Jeff |
TARGET: PAT_DTLS_TGT- Hold actual patient data – This is actual initial load i.e. there are no patient who updated their data yet.
SK | PHONE_NO | NAME | EFF_FR_DT | EFF_TO_DT | FLAG |
345 | 345678901 | Jhoney Stuart | 2016-06-01 | NULL | 1 |
234 | 234567890 | Stuart | 2016-06-01 | NULL | 1 |
123 | 123456789 | Jhoney | 2016-06-01 | NULL | 1 |
567 | 345678901 | Max Stuart | 2016-06-01 | NULL | 1 |
INT: PAT_DTLS_LOAD_INT- Table to hold intermediate results
Now we are all set with tables and data. Let’s get started with detailed steps…
Step 1: Load expired records from TGT to INT table
Copy all expired records INT table . This step just copying all records with non-null EFF_TO_DT and flag = 0 records. Since there are no expired records already in database, hence there won’t be any records in intermediate table.
Query: Below query can be used to perform this task
INSERT INTO PAT_DTLS_LOAD_INT
SELECT * FROM PAT_DTLS_TGT
WHERE FLAG = 0 and EFF_TO_DATE is not null;
Result: In first run there will not be any expired records
SK | PHONE_NO | NAME | EFF_FR_DT | EFF_TO_DT | FLAG |
Step 2: Get all records which are going to expire
These are the records that are updated and are recieved in LOAD table. Since these records are expired, You also need set the EFF_TO_DT to the ‘current_date-1’ that means record was active till yesterday and flag as ‘0’. The updated records will be active from the moment we load them so EFF_FR_DT will be current date
Query: Below query is very helpful to perform this task
INSERT into PAT_DTLS_LOAD_INT
SELECT TGT.SK ,
TGT.PHONE_NO,
TGT.NAME,
EFF_FR_DT,
now() – 1 as EFF_TO_DT,
‘0’ as flag
FROM PAT_DTLS_TGT TGT
WHERE TGT.flag = 1
AND EXISTS (SELECT 1 FROM
PAT_DTLS_LOAD SRC
WHERE TGT.sk = src.sk
and (
TGT.PHONE_NO <> src.PHONE_NO
or TGT.NAME <> src.NAME )
);
Result: Results shows expired records. Updated version of these records are available in LOAD table
SK | PHONE_NO | NAME | EFF_FR_DT | EFF_TO_DT | FLAG |
123 | 123456789 | Jhoney | 2016-06-01 | 2016-06-12 | 0 |
234 | 234567890 | Stuart | 2016-06-01 | 2016-06-12 | 0 |
Step 3: Copy active records from TGT to INT table
These are the records which are active and not updated. Since target table is going to be the truncate and load, You will have to copy all active records to intermediate table.
Query: Below query can perform the mentioned task
INSERT INTO PAT_DTLS_LOAD_INT
SELECT TGT.SK ,
TGT.PHONE_NO,
TGT.NAME,
EFF_FR_DT,
EFF_TO_DT,
FLAG
FROM PAT_DTLS_TGT TGT
WHERE FLAG = 1
AND NOT EXISTS (SELECT 1 FROM
PAT_DTLS_LOAD SRC
WHERE TGT.SK = SRC.SK )
UNION ALL –Include unchaged records
SELECT TGT.SK ,
TGT.PHONE_NO,
TGT.NAME,
EFF_FR_DT,
EFF_TO_DT,
FLAG
FROM PAT_DTLS_TGT TGT
WHERE TGT.FLAG = 1
AND EXISTS (SELECT 1 FROM
PAT_DTLS_LOAD SRC
WHERE TGT.SK = SRC.SK
AND (
TGT.PHONE_NO = SRC.PHONE_NO
AND TGT.NAME = SRC.NAME )
);
Results: Active records to be copied form TGT table is higlighted. These are the records which not updated in this load cycle
SK | PHONE_NO | NAME | EFF_FR_DT | EFF_TO_DT | FLAG |
123 | 123456789 | Jhoney | 2016-06-01 | 2016-06-12 | 0 |
234 | 234567890 | Stuart | 2016-06-01 | 2016-06-12 | 0 |
345 | 345678901 | Jhoney Stuart | 2016-06-01 | NULL | 1 |
567 | 345678901 | Max Stuart | 2016-06-01 | NULL | 1 |
I have a requirement to load historical data (for past 2 years) in the Sales fact table.Shall i load this historical data into the same fact table or do i have to create a duplicate fact table to load the historical data alone.
What will be the best solution.
Thanks in advance !
Hi,
If you are using that historic data in any report or KPI then better to load in single fact table. It is not recommended to maintain multiple fact tables to hold similar kind of data.
Thanks