In this article, we will look at how to write Hadoop Streaming MapReduce jobs using Python.
Hadoop Streaming
First, let us look at what Hadoop streaming is. Hadoop streaming is a utility that ships with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer. Any language that supports standard input and output, such as Python or C#, can be used to write a Hadoop MapReduce job.
Read:
- Hadoop HDFS Schema Design for ETL Process
- Hadoop Data Warehouse and Design Considerations
- 7 Best Hadoop Books to Learn Bigdata Hadoop
- Import using Apache Sqoop
- Export using Apache Sqoop
Hadoop Streaming Syntax
Below is the basic syntax of the Hadoop streaming:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
  -Dmapred.reduce.tasks=1 \
  -input myInputDirs \
  -output myOutputDir \
  -mapper mapper.py \
  -reducer reducer.py
If you are working on the Cloudera Hadoop distribution, the Hadoop streaming jar file is located under /usr/lib/hadoop-mapreduce/.
Hadoop Streaming Options
You can list all the options of the Hadoop streaming jar file by running the command below:
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar --help
The input, output, mapper and reducer options are mandatory; you must provide all four.
Running a Basic Streaming Job Example
Running a MapReduce program developed in a language other than Java works just like a normal Java MapReduce job; the only difference is that you need to tell Hadoop which scripts to run against the HDFS input and output paths.
Below is sample map-reduce job using Hadoop streaming:
[user1@cdh-clstr1 ~]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
> -Dmapred.reduce.tasks=1 \
> -input /data/test1 \
> -output /data/test1/output \
> -mapper 'cat' \
> -reducer 'wc -l'
The output of the MapReduce job is written to the HDFS directory /data/test1/output.
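Because a streaming job is nothing more than the stdin/stdout contract plus a sort between the phases, you can check a mapper/reducer pair locally before submitting it to the cluster. The sketch below simulates the job above (identity mapper 'cat', line-count reducer 'wc -l') in Python; the helper name run_streaming_locally is an assumption for illustration.

```python
import subprocess

def run_streaming_locally(lines, mapper_cmd, reducer_cmd):
    # Pipe the input through the mapper command, sort the
    # intermediate records (roughly what the shuffle phase does),
    # then pipe them through the reducer command -- the same
    # contract Hadoop Streaming enforces on the cluster.
    mapped = subprocess.run(mapper_cmd, shell=True, text=True,
                            input="\n".join(lines) + "\n",
                            capture_output=True).stdout
    shuffled = "".join(sorted(mapped.splitlines(keepends=True)))
    return subprocess.run(reducer_cmd, shell=True, text=True,
                          input=shuffled, capture_output=True).stdout

# Same mapper/reducer as the job above: 'cat' passes lines through,
# 'wc -l' counts them in the single reducer.
result = run_streaming_locally(["a", "b", "c"], "cat", "wc -l")
```

This catches script errors in seconds instead of waiting for a failed job attempt on the cluster.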
Passing Arguments to Hadoop Streaming
You can pass arguments to the mapper and reducer options along with the script that you are executing.
The example below illustrates passing arguments to Hadoop streaming.
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
> -mapper 'mapreduce.py map' \
> -reducer 'mapreduce.py reduce' \
> -file mapreduce.py