Greenplum Data Loading Options

  • Post author:
  • Post last modified:February 28, 2018
  • Post category:Greenplum
  • Reading time:5 mins read

Being a MPP server, Greenplum supports parallel data loading for large amounts of data. It also supports single file, non-parallel import for small amounts of data. Greenplum data loading is supported by various methods as follows.

Read:

Greenplum Data Loading

Data Loading Options

Greenplum supports following tools for loading

1. Greenplum data Loading with gpload Command

The gpload Greenplum data loading utility is an interface to external table parallel loading feature. gpload uses a load specification or layout defined in a YAML formatted control file to load data into the target table in specified database.

Syntax:

gpload -f new_load.yml

where new_load.yml – is the YAML formatted control file which has all the required information to load file.

Greenplum gpload Command Example

Sample YAML formatted control file to load data into airport table, gpload.yaml

---
VERSION:1.0.0.1
#describe the Greenplum database parameters
DATABASE:TRAINING
USER:user1
HOST:gpdb-training
PORT:5432
#describe the location of the source files
#in this example, the database master lives on the same host as the source files
GPLOAD:
 INPUT:
 -SOURCE:
 LOCAL_HOSTNAME:
 -gpdb-training
 PORT:8081
 FILE:
 -/data/source_files/L_AIRPORT*.gz
 -FORMAT: csv
 -QUOTE: '"'
 -ERROR_LIMIT: 50000
 -ERROR_TABLE: training_load_errors
 OUTPUT:
 -TABLE: training.dim_airport
 -MODE: INSERT
 PRELOAD:
 -TRUNCATE: true
SQL:
 -BEFORE: "INSERT INTO audit VALUES('start', current_timestamp)"
 -AFTER: "INSERT INTO audit VALUES('end', current_timestamp)"

Start gpfdist utility. Use the –d switch to set the “home” directory used to search for files in the particular directory. Use the –p to set the port and background the process.

For Example

gpfdist -d ~/data/source_files -p 8081 > /tmp/gpfdist.log 2>&1 &

Once the environment is set then use below command to load data using gpload

gpload-f gpload.yaml

2. Greenplum data Loading with COPY Command

Greenplum COPY is non-parallel, with this data is loaded in a single process using the Greenplum master instance. COPY FROM copies data from a file or standard input into a table and appends the data to the table contents.

Syntax:

COPY table FROM {'file' | STDIN} [options]

Greenplum Copy Command Example

Load data to airport table

COPY airport FROM '/data/source_files'  WITH DELIMITER '|' LOG ERRORS INTO err_airport  SEGMENT REJECT LIMIT 10 ROWS;

3. Greenplum Data Loading External Tables

External tables allow you to access external file as if they are regular database tables. External tables provide full parallelism by using the resources of all Greenplum segments to load or unload data, if you use the external table with gpfdist, Greenplum parallel file distribution program. You can query external table by using SQL commands such as SELECT, JOIN etc.

Syntax:

You can use the below syntax to create readable EXTERNAL TABLE:

CREATE EXTERNAL TABLE table_name
 (column_name1 data_type1 , column_name2 data_type2 ... )
 LOCATION('file://seghost[:port]/path/file' ) or ('gpfdist://filehost[:port]/file_pattern [or file_name]' ) or ('gpfdists://filehost[:port]/file_pattern [or file_name]' ) or ('gphdfs://hdfs_host[:port]/path/file')
 FORMAT 'TEXT' | 'CSV' |'CUSTOM' [options]
[ENCODING 'encoding' ]
[[LOG ERRORS INTO error_table] SEGMENT REJECT LIMIT count [ROWS |PERCENT] ]

Note: If you want to use gpfdist or gpfdists, then you must first execute gpfdist. Use the –d switch to set the “home” directory used to search for files in the particular directory. Use the –p to set the port and background the process.

For Example,

gpfdist -d ~/data/source_files -p 8081 > /tmp/gpfdist.log 2>&1 &

Once the environment is set, read data from EXTERNAL TABLE and INSERT INTO target table
For example,

INSERT INTO target_table SELECT * FROM external_table;

Greenplum CREATE EXTERNAL TABLE example

Create external table to read L_AIRPORTS.csv file and insert to DIM_AIRPORTS table.

CREATE EXTERNAL TABLE AIRPORTS
(
AirportID text, 
Name text,
City text,
Country text,
IATA_FAA_code text,
ICOAcode text, 
Latitude text,
Longitude text, 
Altitude text,
TimeZoneOffset text,
DST_Flag text,
TZ text
)
LOCATION ('gpfdist://localhost:8081/L_AIRPORTS.csv')
FORMAT 'csv' (header)
LOG ERRORS INTO faa_load_errors SEGMENT REJECT LIMIT 10 rows;

Use the –d switch to set the “home” directory used to search for files in the particular directory. Use the –p to set the port and background the process.

gpfdist -d ~/data/source_files -p 8081 > /tmp/gpfdist.log 2>&1 &

Read external table and load target table

INSERT INTO DIM_AIRPORTS SELECT * FROM AIRPORTS;