Data Warehouse Surrogate Key Design – Advantages and Disadvantages

If you are working on Data warehouse project, than you might have heard lot about surrogate keys. Surrogate keys are widely accepted data warehouse design standard. In this article, we will check data warehouse surrogate key design, advantages and disadvantages.

Data Warehouse Surrogate Key

What are surrogate keys in Data warehouse?

If you are a data warehouse developer, that you might be thinking what is surrogate key? How and where it is being used? You will get answers to all your questions here.

Data warehouse surrogate keys are sequentially generated meaningless numbers associated with each and every record in the data warehouse. These surrogate keys are used to join dimension and fact tables.

  • Usually, database sequences are used to generate surrogate key so it is always unique number
  • Surrogate keys cannot be NULLs. Surrogate key are never populated with NULL values.
  • It does not hold any meaning in data warehouse, often called meaningless numbers. It is just sequentially generated INTEGER number for better lookup and faster joins.

Why surrogate keys are used in Data warehouse?

Basically, surrogate key is an artificial key that is used as a substitute for natural key (NK) defined in data warehouse tables. We can use natural key or business keys as a primary key for tables. However, it is not recommended because of following reasons:

  • Natural keys (NK) or Business keys are generally alphanumeric values that is not suitable for index as traversing become slower. For example, prod123, prod231 etc
  • Business keys are often reused after sometime. It will cause the problem as in data warehouse we maintain historic data as well as current data.

For example, product codes can be revised and reused after few years. It will become difficult to differentiate current products and historic products. To avoid such a situation, surrogate keys are used.

Data Warehouse Surrogate Key examples

Surrogate Keys are integers that are assigned sequentially in the dimension table which can be used as primary key. The surrogate key column could be identity column or database sequences are used.

Below is the sample example of surrogate key:

PATIENT_SK PATIENT_ID PATIENT_NAME PATIENT_AGE
1 P001 ABC 20
2 P002 BCD 25
3 P003 CDE 19
4 P004 DEF 45

Advantages of Surrogate Key

Below are some of advantages of using surrogate keys in data warehouse:

  • With help of surrogate keys, you can integrate heterogeneous data sources to data warehouse if they don’t have natural or business keys.
  • Joining tables (fact and dimensions) using surrogate key is faster hence better performance
  • Surrogate keys are very helpful for ETL transformations.
  • Data warehouse Surrogate keys are usually small integer numbers that makes smaller index and better performance
  • Surrogate keys are required if you are implementing slowly changing dimension (SCD)

Disadvantages of Surrogate Key

Below are some of disadvantages of using surrogate keys in data warehouse:

  • Surrogate key generation and assignment takes unnecessary burden on ETL framework
  • You should not over use the surrogate keys as they don’t have any meaning in data warehouse tables.
  • Data migration becomes difficult if you have database sequence associated with surrogate key columns. You should carefully take care of number surrogate key generation in new database otherwise you may end up with duplicate surrogate keys.

Related articles

This Post Has 2 Comments

  1. rahul

    Your way of explaining surrogate keys is way more better than wikipedia.
    Good job done sir 🙂

    1. Vithal S

      Thank you Rahul 🙂

Comments are closed.