In Netezza, if two tables are distributed on same column then such tables are called collocated tables. If you join two collocated tables than each SPU in system works 100% independent to each other as the relevant rows required for joining these tables exist in the same SPU. These types of joins are called collocated joins.
Collocated Joins Example
Let’s take an examples, assume that we have two tables a patient and hospital, both are distributed on patient id. Netezza distributes the rows with same distribution key (column) to same data slice.
Read:
- Importance of right Netezza Distribution Key
- Netezza Joins and its Algorithms
- Netezza Hash Functions and Examples
- nzload Command and its Usage
Let’s say, we have inserted sample 5 records to each table. If you verify the distribution, Netezza should distribute rows with same Patient id to same data slice.
Patient table:
SELECT DATASLICEID,PAT_ID,HOSP_ID,PAT_NAME
FROM PAT_DTL
GROUP BY DATASLICEID,PAT_ID,HOSP_ID,PAT_NAME ;
DATASLICEID | PAT_ID | HOSP_ID | PAT_NAME |
2 | 1 | 111 | abc |
3 | 2 | 111 | bcd |
4 | 3 | 222 | cde |
5 | 4 | 111 | efg |
6 | 5 | 222 | fgh |
Hospital table:
SELECT DATASLICEID,PAT_ID,HOSP_ID,HOSP_NAME
FROM HOSP_DTL
GROUP BY DATASLICEID,PAT_ID,HOSP_ID,HOSP_NAME;
DATASLICEID | PAT_ID | HOSP_ID | HOSP_NAME |
2 | 1 | 111 | AAA |
3 | 2 | 111 | AAA |
4 | 3 | 222 | BBB |
5 | 4 | 111 | AAA |
6 | 5 | 222 | BBB |
If you join these two tables, each SPU in the Netezza server works 100% independent to other SPU as required data is available in itself as shown in above example.
In Collocated joins, each SPU operates independently without any network traffic or communication between SPU’s hence improves the performance of the query.
Collocated Joins, Redistribute and Broadcast
If you join two tables on columns which are not distribution key, then system has to redistribute either of the table on the join column to make collocated joins. This process of redistribution will impact the performance.
The decision about the re-distribution or broadcast is made by the optimizer and is based on the expected table size. If the table is large, Netezza will re-distribute otherwise broadcast. In the broadcast, Netezza host will send the copy of small table to each SPU thus making collocated joins
When we talk about the “Same” column does this mean a column named the same or does this mean that each table is distributed on a column within each table and then both those columns are used in the join.
For example.. assuming T1 and T2 are both distributed respectively on C1 and C2. Further, we assume that C1 and C2 are the same data type…. The would a join of the form
SELECT T1.C1,T2.C2
FROM T1
INNER JOIN T2 ON T1.C1 – T2.C2
Would this yield a co-located join?
Hi Tom,
If two tables are distributed on same column which has same data and if you join two tables on that column then tables will be co-located. For examples, consider T1 is distributed on C1 (say date_SK) and T2 is distributed on C2 (say same date_sk), both tables will be co-located if you join them on C1 and C2.
Thanks,