Collocated Joins in Netezza: Optimizing Query Performance

  • Post author:
  • Post last modified:April 24, 2019
  • Post category:Netezza
  • Reading time:4 mins read

In Netezza, if two tables are distributed on same column then such tables are called collocated tables. If you join two collocated tables than each SPU in system works 100% independent to each other as the relevant rows required for joining these tables exist in the same SPU. These types of joins are called collocated joins.

Collocated Joins

Collocated Joins Example

Let’s take an examples, assume that we have two tables a patient and hospital, both are distributed on patient id. Netezza distributes the rows with same distribution key (column) to same data slice.

Read:

Let’s say, we have inserted sample 5 records to each table. If you verify the distribution, Netezza should distribute rows with same Patient id to same data slice.

Patient table:

SELECT DATASLICEID,PAT_ID,HOSP_ID,PAT_NAME
FROM PAT_DTL
GROUP BY DATASLICEID,PAT_ID,HOSP_ID,PAT_NAME ;
DATASLICEID PAT_ID HOSP_ID PAT_NAME
2 1 111 abc
3 2 111 bcd
4 3 222 cde
5 4 111 efg
6 5 222 fgh

Hospital table:

SELECT DATASLICEID,PAT_ID,HOSP_ID,HOSP_NAME
FROM HOSP_DTL
GROUP BY DATASLICEID,PAT_ID,HOSP_ID,HOSP_NAME;
DATASLICEID PAT_ID HOSP_ID HOSP_NAME
2 1 111 AAA
3 2 111 AAA
4 3 222 BBB
5 4 111 AAA
6 5 222 BBB

If you join these two tables, each SPU in the Netezza server works 100% independent to other SPU as required data is available in itself as shown in above example.

In Collocated joins, each SPU operates independently without any network traffic or communication between SPU’s hence improves the performance of the query.

Collocated Joins, Redistribute and Broadcast

If you join two tables on columns which are not distribution key, then system has to redistribute either of the table on the join column to make collocated joins. This process of redistribution will impact the performance.

The decision about the re-distribution or broadcast is made by the optimizer and is based on the expected table size. If the table is large, Netezza will re-distribute otherwise broadcast. In the broadcast, Netezza host will send the copy of small table to each SPU thus making collocated joins

This Post Has 2 Comments

  1. Tom Harrocks

    When we talk about the “Same” column does this mean a column named the same or does this mean that each table is distributed on a column within each table and then both those columns are used in the join.
    For example.. assuming T1 and T2 are both distributed respectively on C1 and C2. Further, we assume that C1 and C2 are the same data type…. The would a join of the form
    SELECT T1.C1,T2.C2
    FROM T1
    INNER JOIN T2 ON T1.C1 – T2.C2

    Would this yield a co-located join?

    1. Vithal S

      Hi Tom,

      If two tables are distributed on same column which has same data and if you join two tables on that column then tables will be co-located. For examples, consider T1 is distributed on C1 (say date_SK) and T2 is distributed on C2 (say same date_sk), both tables will be co-located if you join them on C1 and C2.

      Thanks,

Comments are closed.