-
Explain Greenplum Architecture.
Read Post: Greenplum Architecture
-
How data is distributed using hash algorithm?
Read Post : How Greenplum Hash Distribution Works
-
What are different ways to get data into Greenplum data warehouse?
COPY FROM
Gpload
INSERT statement
Create EXTERNAL TABLE
- Explain how data is stored in Greenplu?
Data is stored based on selected field (s) which are used for distribution. When you have a Distribution Key by Hash the values of the Distribution Key are run through a Hash Formula. Then, a map is used to distribute the row to the correct segment. The formula is designed to be consistent so that all like values go to the same segment.
==Data(A) => Hash Function(B) => Logical Segment list(C) => Physical Segment list(D) => Storage(E).
When data arrives at the Greenplum, it is hashed based on field(s) and a hash function (B) is used for this purpose.
For example, Consider 4 node system, logical segment list has 4 unique entries. If there are 10 hashed data items from (B), there are 10 entries in (C), then all having only 4 segment entries. For example (C) has values [4,1,2,3,4,3,1,4,3,2]. Then, a map is used to distribute the row to the correct segment. The formula is designed to be consistent so that all like values go to the same segment.
Read : Greenplum Hash Distribution
- What are the environmental variables required to connect to Greenplum?
There are five environmental variables required to connect to Greenplum
PGHOST, PGPORT, PGUSER, PGPASSWORD, PGDATABASE
- How do you connect to Greenplum system without password prompt?
Assign the values to Greenplum environmental variables and export value in bash shell. You can provide the below statement in .bashrc file which is present in user $HOME directory. Values will be set everytime you login to system.
export PGHOST=master_host_name/IP address export PGPORT=port. export PGUSER=username export PGPASSWORD=password export PGDATABASE=database
Read: Access Greenplum Database with No Password Prompt
- What are the constraints supported by Greenplum?
Check, Not Null, Primary key, Unique.
Foreign key is not supported. You can mention but it is not enforced.
Read: Greenplum Constraints
- How Greenplum updates records?
Greenplum does not update the records, it marks the delete flag.
Each record contains two slots, xmin and xmax.
Updating a row is treated as a delete and an insert, so the XID (transaction ID) is saved to the xmax of the current row and the xmin of the newly inserted row.
Now let us see how update works with help of an example;
Sample data:
[ROW id][xmin][xmax]
[R1][T1][0] —————-> First time record is inserted, R1
Updating same records after sometime
[R1][T1][T22] ————-> Record R1 is updated; T22 is the new row transaction id
[R22][T22][0] ————-> New updated record R33; Note 0 in xmax indicates new record.
If the record is deleted simply xmax will have the XID.
- How do you change the distribution of the table?
You can change the table distribution by using ALTER TABLE statement along with SET option.
ALTER TABLE table_name SET DISTRIBUTED BY (new_column);
- Create Table AS (CTAS), does it distribute data randomly or based on table on which it received data.
Newly created table from CTAS gets distribution from the original table.
- Which one is better DELETE or TRUNCATE? Why?
Of course TRUNCATE. If you use DELETE, Greenplum will not remove the data; instead it’ll logically remove by flagging xmax with XID.
- Can we update all columns in Greenplum?
No. Distribution key columns may not be updated.
- When no distribution clause is used while creating a table, what distribution is used by Greenplum?
Uses hash distribution on first column. Same as Netezza and Teradata
- What is the segment in Greenplum?
Database instances in the Greenplum system are called segments. Segment stores the data and carry out the query processing. In Greenplum distributed systems, each segment contains a distinct portion of the data.
- What are gangs in Greenplum?The processes that are working on the same slice of the query plan but on different segments are called gangs.
- Can you create zero-column table in Greenplum?
Yes. We can create a table with no columns. For example,
training=> create table test(); NOTICE: Table has no attributes to distribute on. CREATE TABLE
- Upto how many columns can you provide in DISTRIBUTED BY clause?
You can provide as many column as you have in table. However, it is recommended to use max two columns in DISTRIBUTED BY clause
-
How do you generate the DDL for the particular table?
use pg_dump utility with option -s and -t.
For example,
pg_dump -U user_name -h host database -s -t test -f test.sql
- How do you derive columns from previously computed column values?
You can use Greenplum WITH or Common Table Expressions (CTE) to derive columns from previously computed columns.
Read: Greenplum Computed Column Support and Alternative
Thabks a lot for sharing..
Nice set of questions.. real life interview questions.. very good collection
Thanks
1) Various backup/restore options available in Greenplum
2) Diff between parallel and non-parallel backup
3) Diff between insert, copy and gpload utility
4) how a query executed in Greenplum Environment
5) What are the limitations of Greenplum Database
6) What are different ways to copy/migrate data from one Greenplum cluster to another
Basic Questions
=> Append only Tables?
=> Resource Queue?
=> pg_hba.conf & posgresql.conf?
Advance Question
=> how to implement Security in Greenplum
=> Add/Remove any segment
=> kill any process
=> GPTRANSFER
=> Work load manager
Thanks for your inputs