Design Slowly Changing Dimension Type 2 in SQL

  • Post author:
  • Post last modified:February 28, 2018
  • Post category:Data Warehouse
  • Reading time:24 mins read

Dimensions in data warehousing contain relatively static data about entities such as customers, stores, locations etc. Slowly changing dimensions commonly known as SCD, usually captures the data that changes slowly but unpredictably, rather than regular bases.  Slowly changing dimension type 2 is most popular method used in dimensional modelling to preserve historical data.

SCD Type2

For example, lets take the example of patient details. The fact table may contains the information about patient expense details. The fact and dimensions are always linked by means of foreign keys. One of the dimension may contain the information about patient (say, patient dimension ) . The patient information will not change on day to day bases. It’ll change once in a while (slowly changes).

If you want to implement the Slowly Changing Dimension Type 2 in SQL (without ETL tools), it’s gonna take bit complex route but you’ll end up with best feeling in world of implementing SCD Type 2.

Okay!. Let’s get started with building Slowly Changing Dimension Type 2 on patient dimension table.

Let us consider the patient details table which has attributes like SK or PAT_ID, Phone number, name etc. Here we need to capture the updated phone number or name and store that historical data in same table.

In data  warehouse environment, usually tables are divided into three tables viz. LOAD (just  a data dump) , ITERMEDIATE (transformed data) and TARGET (replica of TARGET) tables, will be referred as LOAD, INT and TGT henceforth.  In this SCD type 2 implementation , we will be using all these three tables.

Netezza version 7.2.1 or higher supports the Merge command, you can use that to implement the SCD Type 2 easily.

Read:

Below are the content of LOAD and TARGET table at the time building Slowly Changing Dimension Type 2 Type 2.

LOAD : PAT_DTLS_LOAD –  Hold new and updated data. Consider we received data as below and we have to load this data into patient dimension which has SCD Type2 implemented on it.

SK PHONE_NO NAME
123 987654321 Jhoney
234 876543210 Stuart
567 345678901 Max Stuart
456 765432101 Jeff

TARGET: PAT_DTLS_TGT- Hold actual patient data – This is actual initial load i.e. there are no patient who updated their data yet.

SK PHONE_NO NAME EFF_FR_DT EFF_TO_DT FLAG
345 345678901 Jhoney Stuart 2016-06-01 NULL 1
234 234567890 Stuart 2016-06-01 NULL 1
123 123456789 Jhoney 2016-06-01 NULL 1
567 345678901 Max Stuart 2016-06-01 NULL 1

INT: PAT_DTLS_LOAD_INT- Table to hold intermediate results

Now we are all set with tables and data. Let’s get started with detailed steps…

Step 1: Load expired records from TGT to INT table

Copy all expired records INT table . This step just copying all records with non-null EFF_TO_DT and flag = 0 records. Since there are no expired records already in database, hence there won’t be any records in intermediate table.

Query: Below query can be used to perform this task

INSERT INTO PAT_DTLS_LOAD_INT
SELECT * FROM PAT_DTLS_TGT
WHERE FLAG = 0 and EFF_TO_DATE is not null;

Result: In first run there will not be any expired records

SK PHONE_NO NAME EFF_FR_DT EFF_TO_DT FLAG

Step 2: Get all records which are going to expire

These are the records that are updated  and are recieved in LOAD table. Since these records are expired, You also need set the EFF_TO_DT to the ‘current_date-1’ that means record was active till yesterday and flag as ‘0’. The updated records will be active from the moment we load them so EFF_FR_DT will be current date

Query: Below query is very helpful to perform this task

INSERT into PAT_DTLS_LOAD_INT
SELECT TGT.SK ,
TGT.PHONE_NO,
TGT.NAME,
EFF_FR_DT,
now() – 1 as EFF_TO_DT,
‘0’ as flag
FROM PAT_DTLS_TGT TGT
WHERE TGT.flag = 1
AND EXISTS (SELECT 1 FROM
PAT_DTLS_LOAD SRC
WHERE TGT.sk = src.sk
and (
TGT.PHONE_NO <> src.PHONE_NO
or TGT.NAME <> src.NAME )
);

Result: Results shows expired records. Updated version of these records are available in LOAD table

SK PHONE_NO NAME EFF_FR_DT EFF_TO_DT FLAG
123 123456789 Jhoney 2016-06-01 2016-06-12 0
234 234567890 Stuart 2016-06-01 2016-06-12 0

Step 3: Copy active records from TGT to INT table

These are the records which are active and not updated. Since target table is going to be the truncate and load, You will have to copy all active records to intermediate table.

Query: Below query can perform the mentioned task

INSERT INTO PAT_DTLS_LOAD_INT
SELECT TGT.SK ,
TGT.PHONE_NO,
TGT.NAME,
EFF_FR_DT,
EFF_TO_DT,
FLAG
FROM PAT_DTLS_TGT TGT
WHERE FLAG = 1
AND NOT EXISTS (SELECT 1 FROM
PAT_DTLS_LOAD SRC
WHERE TGT.SK = SRC.SK )
UNION ALL –Include unchaged records
SELECT TGT.SK ,
TGT.PHONE_NO,
TGT.NAME,
EFF_FR_DT,
EFF_TO_DT,
FLAG
FROM PAT_DTLS_TGT TGT
WHERE TGT.FLAG = 1
AND EXISTS (SELECT 1 FROM
PAT_DTLS_LOAD SRC
WHERE TGT.SK = SRC.SK
AND (
TGT.PHONE_NO = SRC.PHONE_NO
AND TGT.NAME = SRC.NAME )
);

Results: Active records to be copied form TGT table is higlighted. These are the records which not updated in this load  cycle

SK PHONE_NO NAME EFF_FR_DT EFF_TO_DT FLAG
123 123456789 Jhoney 2016-06-01 2016-06-12 0
234 234567890 Stuart 2016-06-01 2016-06-12 0
345 345678901 Jhoney Stuart 2016-06-01 NULL 1
567 345678901 Max Stuart 2016-06-01 NULL 1

 Design Slowly Changing Dimension Type 2 in SQL Cont…

This Post Has 2 Comments

  1. krishnendu

    I have a requirement to load historical data (for past 2 years) in the Sales fact table.Shall i load this historical data into the same fact table or do i have to create a duplicate fact table to load the historical data alone.
    What will be the best solution.

    Thanks in advance !

    1. Vithal S

      Hi,

      If you are using that historic data in any report or KPI then better to load in single fact table. It is not recommended to maintain multiple fact tables to hold similar kind of data.

      Thanks

Comments are closed.