Greenplum Interview Questions and Answers - Part1

Explain Greenplum Architecture.

Read Post: Greenplum Architecture
How data is distributed using hash algorithm?

Read Post : How Greenplum Hash Distribution Works
What are different ways to get data into Greenplum data warehouse?

COPY FROM

Gpload

INSERT statement

Create EXTERNAL TABLE

Explain how data is stored in Greenplu?

Data is stored based on selected field (s) which are used for distribution. When you have a Distribution Key by Hash the values of the Distribution Key are run through a Hash Formula. Then, a map is used to distribute the row to the correct segment. The formula is designed to be consistent so that all like values go to the same segment.

==Data(A) => Hash Function(B) => Logical Segment list(C) => Physical Segment list(D) => Storage(E).

When data arrives at the Greenplum, it is hashed based on field(s) and a hash function (B) is used for this purpose.

For example, Consider 4 node system, logical segment list has 4 unique entries. If there are 10 hashed data items from (B), there are 10 entries in (C), then all having only 4 segment entries. For example (C) has values [4,1,2,3,4,3,1,4,3,2]. Then, a map is used to distribute the row to the correct segment. The formula is designed to be consistent so that all like values go to the same segment.

Read : Greenplum Hash Distribution

What are the environmental variables required to connect to Greenplum?

There are five environmental variables required to connect to Greenplum

PGHOST, PGPORT, PGUSER, PGPASSWORD, PGDATABASE

How do you connect to Greenplum system without password prompt?

Assign the values to Greenplum environmental variables and export value in bash shell. You can provide the below statement in .bashrc file which is present in user $HOME directory. Values will be set everytime you login to system.

export PGHOST=master_host_name/IP address
export PGPORT=port. 
export PGUSER=username
export PGPASSWORD=password
export PGDATABASE=database

Read: Access Greenplum Database with No Password Prompt

What are the constraints supported by Greenplum?

Check, Not Null, Primary key, Unique.

Foreign key is not supported. You can mention but it is not enforced.

Read: Greenplum Constraints

How Greenplum updates records?

Greenplum does not update the records, it marks the delete flag.

Each record contains two slots, xmin and xmax.

Updating a row is treated as a delete and an insert, so the XID (transaction ID) is saved to the xmax of the current row and the xmin of the newly inserted row.

Now let us see how update works with help of an example;

Sample data:

[ROW id][xmin][xmax]

[R1][T1][0] —————-> First time record is inserted, R1

Updating same records after sometime

[R1][T1][T22] ————-> Record R1 is updated; T22 is the new row transaction id

[R22][T22][0] ————-> New updated record R33; Note 0 in xmax indicates new record.

If the record is deleted simply xmax will have the XID.

How do you change the distribution of the table?

You can change the table distribution by using ALTER TABLE statement along with SET option.

ALTER TABLE table_name SET DISTRIBUTED BY (new_column);

Create Table AS (CTAS), does it distribute data randomly or based on table on which it received data.

Newly created table from CTAS gets distribution from the original table.

Which one is better DELETE or TRUNCATE? Why?

Of course TRUNCATE. If you use DELETE, Greenplum will not remove the data; instead it’ll logically remove by flagging xmax with XID.

Can we update all columns in Greenplum?

No. Distribution key columns may not be updated.

When no distribution clause is used while creating a table, what distribution is used by Greenplum?

Uses hash distribution on first column. Same as Netezza and Teradata

What is the segment in Greenplum?

Database instances in the Greenplum system are called segments. Segment stores the data and carry out the query processing. In Greenplum distributed systems, each segment contains a distinct portion of the data.

What are gangs in Greenplum?The processes that are working on the same slice of the query plan but on different segments are called gangs.
Can you create zero-column table in Greenplum?

Yes. We can create a table with no columns. For example,

training=> create table test(); 
NOTICE: Table has no attributes to distribute on. 
CREATE TABLE

Upto how many columns can you provide in DISTRIBUTED BY clause?

You can provide as many column as you have in table. However, it is recommended to use max two columns in DISTRIBUTED BY clause

How do you generate the DDL for the particular table?

use pg_dump utility with option -s and -t.

For example,
```
pg_dump -U user_name -h host database -s -t test -f test.sql
```
How do you derive columns from previously computed column values?

You can use Greenplum WITH or Common Table Expressions (CTE) to derive columns from previously computed columns.

Read: Greenplum Computed Column Support and Alternative

This Post Has 4 Comments

Hobby August 28, 2016

Thabks a lot for sharing..
Nice set of questions.. real life interview questions.. very good collection
Gurupreet Singh Bhatia October 21, 2016

Thanks

1) Various backup/restore options available in Greenplum
2) Diff between parallel and non-parallel backup
3) Diff between insert, copy and gpload utility
4) how a query executed in Greenplum Environment
5) What are the limitations of Greenplum Database
6) What are different ways to copy/migrate data from one Greenplum cluster to another
Gurupreet Singh Bhatia October 21, 2016

Basic Questions
=> Append only Tables?
=> Resource Queue?
=> pg_hba.conf & posgresql.conf?

Advance Question
=> how to implement Security in Greenplum
=> Add/Remove any segment
=> kill any process
=> GPTRANSFER
=> Work load manager
1. Vithal Sampagar October 21, 2016
  
  Thanks for your inputs

Comments are closed.