Cloudera Impala Generate Sequence Numbers without UDF

  • Post author:
  • Post last modified:February 28, 2018
  • Post category:BigData
  • Reading time:2 mins read

If you are migrating from traditional database to Cloudera Impala then you might have noticed there is not sequence number function. In the process of Cloudera Impala Generate Sequence Numbers without UDF, you can use analytical function that are available in Cloudera Impala.

cloudera impala generate sequence numbers

If you want generate sequential sequences that automatically keep in sync with your table sequence number, you can do so with the help of Cloudera impala supported ROW_NUMBER analytical function.

Related reading:

Cloudera Impala Generate Sequence Numbers Example

Here is the example that will help you to generate Sequence number without using any UDF:

[quickstart.cloudera:21000:21000] > select tab2.max_seq + 1 + row_number() over (order by tab1.id) seq_num, tab1.*
from test tab1
 inner join (select max(id) max_seq from test) tab2;
Query: select tab2.max_seq + 1 + row_number() over (order by tab1.id) seq_num, tab1.*
from test tab1
 inner join (select max(id) max_seq from test) tab2
+---------+----+------+
| seq_num | id | name |
+---------+----+------+
| 10 | 1 | abc |
| 11 | 2 | bcd |
| 12 | 3 | cde |
| 13 | 4 | def |
| 14 | 5 | aaa |
| 15 | 6 | bbb |
| 16 | 7 | ccc |
| 17 | 8 | ddd |
+---------+----+------+
Fetched 8 row(s) in 1.08s
[quickstart.cloudera:21000:21000] >

Returns an ascending sequence of integers stating from the max of the id loaded in the table. The ORDER BY clause is required. The PARTITION BY clause is optional. The window clause is not allowed. We have not used the PARTITION BY clause here as we are partitioning the all records from table in single partition.

Please be advised that this method of sequence number generation using Cloudera Impala analytics function wouldn’t be safe if you are running multiple insert concurrently on the table. You may end up generating duplicate sequence numbers if multiple insert statements on same table.

This Post Has One Comment

  1. Manish

    This is not ideal as in distributed environment it will create duplicate values

Comments are closed.