Being a MPP server, Greenplum supports parallel data loading for large amounts of data. It also supports single file, non-parallel import for small amounts of data. Greenplum data loading is supported by various methods as follows.
Read:
Data Loading Options
Greenplum supports following tools for loading
1. Greenplum data Loading with gpload Command
The gpload Greenplum data loading utility is an interface to external table parallel loading feature. gpload uses a load specification or layout defined in a YAML formatted control file to load data into the target table in specified database.
Syntax:
gpload -f new_load.yml
where new_load.yml – is the YAML formatted control file which has all the required information to load file.
Greenplum gpload Command Example
Sample YAML formatted control file to load data into airport table, gpload.yaml
--- VERSION:1.0.0.1 #describe the Greenplum database parameters DATABASE:TRAINING USER:user1 HOST:gpdb-training PORT:5432 #describe the location of the source files #in this example, the database master lives on the same host as the source files GPLOAD: INPUT: -SOURCE: LOCAL_HOSTNAME: -gpdb-training PORT:8081 FILE: -/data/source_files/L_AIRPORT*.gz -FORMAT: csv -QUOTE: '"' -ERROR_LIMIT: 50000 -ERROR_TABLE: training_load_errors OUTPUT: -TABLE: training.dim_airport -MODE: INSERT PRELOAD: -TRUNCATE: true SQL: -BEFORE: "INSERT INTO audit VALUES('start', current_timestamp)" -AFTER: "INSERT INTO audit VALUES('end', current_timestamp)"
Start gpfdist utility. Use the –d switch to set the “home” directory used to search for files in the particular directory. Use the –p to set the port and background the process.
For Example
gpfdist -d ~/data/source_files -p 8081 > /tmp/gpfdist.log 2>&1 &
Once the environment is set then use below command to load data using gpload
gpload-f gpload.yaml
2. Greenplum data Loading with COPY Command
Greenplum COPY is non-parallel, with this data is loaded in a single process using the Greenplum master instance. COPY FROM copies data from a file or standard input into a table and appends the data to the table contents.
Syntax:
COPY table FROM {'file' | STDIN} [options]
Greenplum Copy Command Example
Load data to airport table
COPY airport FROM '/data/source_files' WITH DELIMITER '|' LOG ERRORS INTO err_airport SEGMENT REJECT LIMIT 10 ROWS;
3. Greenplum Data Loading External Tables
External tables allow you to access external file as if they are regular database tables. External tables provide full parallelism by using the resources of all Greenplum segments to load or unload data, if you use the external table with gpfdist, Greenplum parallel file distribution program. You can query external table by using SQL commands such as SELECT, JOIN etc.
Syntax:
You can use the below syntax to create readable EXTERNAL TABLE:
CREATE EXTERNAL TABLE table_name (column_name1 data_type1 , column_name2 data_type2 ... ) LOCATION('file://seghost[:port]/path/file' ) or ('gpfdist://filehost[:port]/file_pattern [or file_name]' ) or ('gpfdists://filehost[:port]/file_pattern [or file_name]' ) or ('gphdfs://hdfs_host[:port]/path/file') FORMAT 'TEXT' | 'CSV' |'CUSTOM' [options] [ENCODING 'encoding' ] [[LOG ERRORS INTO error_table] SEGMENT REJECT LIMIT count [ROWS |PERCENT] ]
Note: If you want to use gpfdist or gpfdists, then you must first execute gpfdist. Use the –d switch to set the “home” directory used to search for files in the particular directory. Use the –p to set the port and background the process.
For Example,
gpfdist -d ~/data/source_files -p 8081 > /tmp/gpfdist.log 2>&1 &
Once the environment is set, read data from EXTERNAL TABLE and INSERT INTO target table
For example,
INSERT INTO target_table SELECT * FROM external_table;
Greenplum CREATE EXTERNAL TABLE example
Create external table to read L_AIRPORTS.csv file and insert to DIM_AIRPORTS table.
CREATE EXTERNAL TABLE AIRPORTS ( AirportID text, Name text, City text, Country text, IATA_FAA_code text, ICOAcode text, Latitude text, Longitude text, Altitude text, TimeZoneOffset text, DST_Flag text, TZ text ) LOCATION ('gpfdist://localhost:8081/L_AIRPORTS.csv') FORMAT 'csv' (header) LOG ERRORS INTO faa_load_errors SEGMENT REJECT LIMIT 10 rows;
Use the –d switch to set the “home” directory used to search for files in the particular directory. Use the –p to set the port and background the process.
gpfdist -d ~/data/source_files -p 8081 > /tmp/gpfdist.log 2>&1 &
Read external table and load target table
INSERT INTO DIM_AIRPORTS SELECT * FROM AIRPORTS;