In this article, we will look at how to write Hadoop Streaming MapReduce jobs using Python.
Hadoop Streaming
First, let us look at what Hadoop streaming is. Hadoop streaming is a utility that ships with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer. Any language that supports standard input and output, such as Python or C#, can be used to write a Hadoop MapReduce job.
Read:
- Hadoop HDFS Schema Design for ETL Process
- Hadoop Data Warehouse and Design Considerations
- 7 Best Hadoop Books to Learn Bigdata Hadoop
- Import using Apache Sqoop
- Export using Apache Sqoop
Hadoop Streaming Syntax
Below is the basic syntax of the Hadoop streaming:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
  -Dmapred.reduce.tasks=1 \
  -input myInputDirs \
  -output myOutputDir \
  -mapper mapper.py \
  -reducer reducer.py
If you are working on the Cloudera Hadoop distribution, the Hadoop streaming jar file is located under /usr/lib/hadoop-mapreduce/.
Hadoop Streaming Options
You can list all the options of the Hadoop streaming jar file by running the command below:
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar --help
The input, output, mapper and reducer options are mandatory; you must provide all four.
Running a Basic Streaming Job Example
Running a MapReduce program developed in a language other than Java works just like a normal Java MapReduce job; the only difference is that you need to tell Hadoop which scripts to run against the HDFS input and output paths.
Below is sample map-reduce job using Hadoop streaming:
[user1@cdh-clstr1 ~]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
> -Dmapred.reduce.tasks=1 \
> -input /data/test1 \
> -output /data/test1/output \
> -mapper 'cat' \
> -reducer 'wc -l'
The output of the MapReduce job is written to the HDFS directory /data/test1/output.
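Because a streaming job is nothing more than the stdin/stdout contract plus a sort between the phases, you can check a mapper/reducer pair locally before submitting it to the cluster. The sketch below simulates the job above (identity mapper 'cat', line-count reducer 'wc -l') in Python; the helper name run_streaming_locally is an assumption for illustration.

```python
import subprocess

def run_streaming_locally(lines, mapper_cmd, reducer_cmd):
    # Pipe the input through the mapper command, sort the
    # intermediate records (roughly what the shuffle phase does),
    # then pipe them through the reducer command -- the same
    # contract Hadoop Streaming enforces on the cluster.
    mapped = subprocess.run(mapper_cmd, shell=True, text=True,
                            input="\n".join(lines) + "\n",
                            capture_output=True).stdout
    shuffled = "".join(sorted(mapped.splitlines(keepends=True)))
    return subprocess.run(reducer_cmd, shell=True, text=True,
                          input=shuffled, capture_output=True).stdout

# Same mapper/reducer as the job above: 'cat' passes lines through,
# 'wc -l' counts them in the single reducer.
result = run_streaming_locally(["a", "b", "c"], "cat", "wc -l")
```

This catches script errors in seconds instead of waiting for a failed job attempt on the cluster.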
Passing Arguments to Hadoop Streaming
You can pass arguments to the mapper and reducer options along with the script that you are executing.
The example below illustrates passing arguments to Hadoop streaming.
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
> -mapper 'mapreduce.py map' \
> -reducer 'mapreduce.py reduce' \
> -file mapreduce.py