In this tutorial, I will explain how to set up a Hadoop single node cluster on Ubuntu 14.04. The single node cluster will sit on top of the Hadoop Distributed File System (HDFS).
Hadoop single node cluster setup on Ubuntu 14.04
Hadoop is a Java framework for running applications on large clusters of commodity hardware. The Hadoop framework allows us to run MapReduce programs on data stored in the highly fault-tolerant Hadoop Distributed File System.
- Related Readings: How to Learn Apache Hadoop
- Also: 7 Best Books to Learn Bigdata Hadoop
The main objective of this post is to get started with Hadoop single node cluster setup. This tutorial has been tested with Ubuntu 14.04 and Hadoop 2.7.0.
Hadoop single node cluster setup – Presteps
Install Ubuntu 14.04 on your system:
First, you need a system with Ubuntu 14.04 installed on it. Either of the following approaches will work:
- Download Ubuntu 14.04 from the official website, create a bootable CD or USB drive, and install it natively, or
- Download and install Oracle VirtualBox or VMware, then install Ubuntu 14.04 in a virtual machine (the easier way)
Install Java
Remember! The Hadoop framework is written in Java and requires a working Java 1.6+ installation. Here apt-get is used to install Java:
vdoop@vdoop-VirtualBox:~$ cd ~

# Update the source list
vdoop@vdoop-VirtualBox:~$ sudo apt-get update

# Install Java
vdoop@vdoop-VirtualBox:~$ sudo apt-get install default-jdk

# Verify Java installation by checking its version
vdoop@vdoop-VirtualBox:~$ java -version
java version "1.7.0_65"
OpenJDK Runtime Environment (IcedTea 2.5.3) (7u71-2.5.3-0ubuntu0.14.04.1)
OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)
Create Hadoop Group and add dedicated User
In this step, we will create a new group named hadoop and add a dedicated user, hduser, to it. That user will be the cluster admin: we will use it to install all the Hadoop-related software and to run any Hadoop applications.
vdoop@vdoop-VirtualBox:~$ sudo addgroup hadoop
Adding group `hadoop' (GID 1002) ...
Done.
vdoop@vdoop-VirtualBox:~$ sudo adduser --ingroup hadoop hduser
Adding user `hduser' ...
Adding new user `hduser' (1001) with group `hadoop' ...
Creating home directory `/home/hduser' ...
Copying files from `/etc/skel' ...
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
Changing the user information for hduser
Enter the new value, or press ENTER for the default
        Full Name []:
        Room Number []:
        Work Phone []:
        Home Phone []:
        Other []:
Is the information correct? [Y/n] Y
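You can optionally verify the new account and its group membership with the id command; the uid and gid should match the values printed above:

vdoop@vdoop-VirtualBox:~$ id hduser
uid=1001(hduser) gid=1002(hadoop) groups=1002(hadoop)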
Installing ssh
The Hadoop framework requires ssh access to manage all its nodes. Even for our single node cluster, we need to install ssh and set up passwordless access to localhost for our dedicated hduser.
ssh has two main components:
- ssh : The command we use to connect to remote machines – The client
- sshd : The daemon that is running on the server and allows clients to connect to the server
By default, the ssh client is available on Ubuntu, but in order to run the sshd daemon, we need to install the ssh server package first.
Use this command to do that:
vdoop@vdoop-VirtualBox:~$ sudo apt-get install ssh
The above command installs ssh on our machine. If the output looks similar to the following, ssh has been set up properly:
vdoop@vdoop-VirtualBox:~$ which ssh
/usr/bin/ssh
vdoop@vdoop-VirtualBox:~$ which sshd
/usr/sbin/sshd
Create and Setup SSH Certificate
As mentioned in the previous section, Hadoop requires ssh access to manage its nodes. We have already installed ssh; now we need to configure it to allow SSH public key authentication.
Hadoop uses ssh to access its nodes, which would normally require the user to enter a password. However, this requirement can be eliminated by creating and setting up an SSH key pair using the following commands. If asked for a filename, just leave it blank and press Enter to continue.
vdoop@vdoop-VirtualBox:~$ su hduser
Password:
vdoop@vdoop-VirtualBox:~$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
Generating public/private dsa key pair.
/home/hduser/.ssh/id_dsa already exists.
Overwrite (y/n)? y
Your identification has been saved in /home/hduser/.ssh/id_dsa.
Your public key has been saved in /home/hduser/.ssh/id_dsa.pub.
The key fingerprint is:
51:05:8f:20:1c:8e:fc:5e:92:f5:38:e0:1b:5c:13:da hduser@vithal-Inspiron-3558
The key's randomart image is:
+--[ DSA 1024]----+
| .o.o oo. |
| . o.+ + o |
| o + E . . |
| + = = |
| B S . |
| . = . |
| o |
| |
| |
+-----------------+
hduser@vdoop-VirtualBox:/home/vdoop$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
The above command appends the newly created public key to the list of authorized keys, so that Hadoop can use ssh without prompting for a password.
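If ssh still asks for a password later on, the permissions on the .ssh directory are a likely culprit, since sshd ignores key files that are group or world writable. A quick, optional fix (assuming the default ~/.ssh location):

hduser@vdoop-VirtualBox:~$ chmod 700 ~/.ssh
hduser@vdoop-VirtualBox:~$ chmod 600 ~/.ssh/authorized_keys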
We can check to see if ssh works:
hduser@vdoop-VirtualBox:/home/vdoop$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is e1:8b:a0:a5:75:ef:f4:b4:5e:a9:ed:be:64:be:5c:2f.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 14.04.1 LTS (GNU/Linux 3.13.0-40-generic x86_64)
...
Installing Hadoop
Download the latest stable Hadoop release and copy it to the Ubuntu machine, or download it directly on that machine:
hduser@vdoop-VirtualBox:~$ wget http://mirrors.sonic.net/apache/hadoop/common/hadoop-2.7.0/hadoop-2.7.0.tar.gz
hduser@vdoop-VirtualBox:~$ tar xvzf hadoop-2.7.0.tar.gz
Add hduser to sudo group:
hduser@vdoop-VirtualBox:~/hadoop-2.7.0$ su vdoop
Password:
vdoop@vdoop-VirtualBox:/home/hduser$ sudo adduser hduser sudo
[sudo] password for vdoop:
Adding user `hduser' to group `sudo' ...
Adding user hduser to group sudo
Done.
Log back in as hduser and move the Hadoop installation to the /usr/local/hadoop directory using the following commands:
vdoop@vdoop-VirtualBox:/home/hduser$ sudo su hduser
# create the target directory first, then move the extracted files into it
hduser@vdoop-VirtualBox:~/hadoop-2.7.0$ sudo mkdir -p /usr/local/hadoop
hduser@vdoop-VirtualBox:~/hadoop-2.7.0$ sudo mv * /usr/local/hadoop
# change directory ownership as well
hduser@vdoop-VirtualBox:~/hadoop-2.7.0$ sudo chown -R hduser:hadoop /usr/local/hadoop
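As a quick sanity check, list the target directory; for a Hadoop 2.7.0 tarball the output should look roughly like this:

hduser@vdoop-VirtualBox:~$ ls /usr/local/hadoop
bin  etc  include  lib  libexec  LICENSE.txt  NOTICE.txt  README.txt  sbin  share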
Setup Configuration Files
The following files will have to be modified to complete the Hadoop setup:
- ~/.bashrc
- /usr/local/hadoop/etc/hadoop/hadoop-env.sh
- /usr/local/hadoop/etc/hadoop/core-site.xml
- /usr/local/hadoop/etc/hadoop/mapred-site.xml (copied from mapred-site.xml.template)
- /usr/local/hadoop/etc/hadoop/hdfs-site.xml
- /usr/local/hadoop/etc/hadoop/yarn-site.xml
~/.bashrc
Before editing the .bashrc file in our hduser home directory, we need to find the path where Java was installed, in order to set the JAVA_HOME environment variable.
Use following command:
hduser@vdoop-VirtualBox:~$ update-alternatives --config java
There is only one alternative in link group java (providing /usr/bin/java): /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java
Nothing to configure.
Now we can append the Java path, via JAVA_HOME and the other Hadoop variables, to the end of ~/.bashrc:
hduser@vdoop-VirtualBox:~$ vi ~/.bashrc

#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib/native"
#HADOOP VARIABLES END

hduser@vdoop-VirtualBox:~$ source ~/.bashrc
Note: Be careful while copying commands to the terminal, as copying from a web page may introduce junk or special characters. For example, curly double quotes (“”) may replace straight quotes, and the blog page may display a double hyphen (--) as a single hyphen. Make sure the commands are correct before running them.
Note that JAVA_HOME should be set to the path just before '/bin/', i.e. only /usr/lib/jvm/java-7-openjdk-amd64.
Now let's quickly check whether javac is working properly:
hduser@ubuntu-VirtualBox:~$ javac -version
javac 1.7.0_75
hduser@ubuntu-VirtualBox:~$ which javac
/usr/bin/javac
hduser@ubuntu-VirtualBox:~$ readlink -f /usr/bin/javac
/usr/lib/jvm/java-7-openjdk-amd64/bin/javac
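The readlink output above also suggests a shortcut: you can derive the JAVA_HOME value automatically by stripping the trailing /bin/javac from the resolved path (a small sketch, assuming javac resolves as shown above):

hduser@ubuntu-VirtualBox:~$ readlink -f /usr/bin/javac | sed 's:/bin/javac::'
/usr/lib/jvm/java-7-openjdk-amd64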
/usr/local/hadoop/etc/hadoop/hadoop-env.sh
We need to set JAVA_HOME by modifying hadoop-env.sh file.
hduser@vdoop-VirtualBox:~$ vi /usr/local/hadoop/etc/hadoop/hadoop-env.sh

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
Adding the above statement in the hadoop-env.sh file ensures that the value of the JAVA_HOME variable is available to Hadoop whenever it runs.
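To confirm the edit took effect, you can grep the file; among the matches you should see the line you just added:

hduser@vdoop-VirtualBox:~$ grep 'export JAVA_HOME' /usr/local/hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64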
/usr/local/hadoop/etc/hadoop/core-site.xml
The /usr/local/hadoop/etc/hadoop/core-site.xml file contains configuration properties that Hadoop uses when starting.
This file can be used to override the default settings that Hadoop starts with. Before editing it, create the directory that will hold Hadoop's temporary files (the hadoop.tmp.dir set below) and give hduser ownership of it:
hduser@vdoop-VirtualBox:~$ sudo mkdir -p /app/hadoop/tmp
hduser@vdoop-VirtualBox:~$ sudo chown hduser:hadoop /app/hadoop/tmp
Open the file and enter the following in between the <configuration></configuration> tags:
hduser@vdoop-VirtualBox:~$ vi /usr/local/hadoop/etc/hadoop/core-site.xml

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
    <description>The name of the default file system. A URI whose
    scheme and authority determine the FileSystem implementation. The
    uri's scheme determines the config property (fs.SCHEME.impl) naming
    the FileSystem implementation class. The uri's authority is used to
    determine the host, port, etc. for a filesystem.</description>
  </property>
</configuration>
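Since the Hadoop binaries are already on our PATH (via ~/.bashrc), we can optionally verify that Hadoop picks these values up with the hdfs getconf utility. Note that fs.default.name is the older name for fs.defaultFS, so you may see a deprecation warning alongside the value:

hduser@vdoop-VirtualBox:~$ hdfs getconf -confKey fs.default.name
hdfs://localhost:54310
hduser@vdoop-VirtualBox:~$ hdfs getconf -confKey hadoop.tmp.dir
/app/hadoop/tmp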
/usr/local/hadoop/etc/hadoop/mapred-site.xml
By default, the /usr/local/hadoop/etc/hadoop/ folder contains a mapred-site.xml.template file, which has to be copied to mapred-site.xml:
hduser@vdoop-VirtualBox:~$ cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml
The mapred-site.xml file is used to specify which framework is being used for MapReduce.
We need to enter the following content in between the <configuration></configuration> tags:
hduser@vdoop-VirtualBox:~$ vi /usr/local/hadoop/etc/hadoop/mapred-site.xml

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
/usr/local/hadoop/etc/hadoop/hdfs-site.xml
The /usr/local/hadoop/etc/hadoop/hdfs-site.xml file needs to be configured for each host in the cluster. It is used to specify the directories that will hold the NameNode and DataNode data on that host.
Before editing this file, we need to create these two directories for our Hadoop installation.
This can be done using the following commands:
hduser@vdoop-VirtualBox:~$ sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
hduser@vdoop-VirtualBox:~$ sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
hduser@vdoop-VirtualBox:~$ sudo chown -R hduser:hadoop /usr/local/hadoop_store
Open the file and enter the following content in between the <configuration></configuration> tags:
hduser@vdoop-VirtualBox:~$ vi /usr/local/hadoop/etc/hadoop/hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication. The actual number of
    replications can be specified when the file is created. The default
    is used if replication is not specified in create time.</description>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
  </property>
  <property>
    <name>fs.trash.interval</name>
    <value>5</value>
  </property>
</configuration>
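Before moving on, it is worth double checking that both directories exist and are owned by hduser, since the NameNode and DataNode will fail to start if they cannot write to them:

hduser@vdoop-VirtualBox:~$ ls -ld /usr/local/hadoop_store/hdfs/namenode /usr/local/hadoop_store/hdfs/datanode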
/usr/local/hadoop/etc/hadoop/yarn-site.xml
Open the file and enter the following content in between the <configuration></configuration> tags:
<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>localhost:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>localhost:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>localhost:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>localhost:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>localhost:8088</value>
  </property>
</configuration>
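A malformed XML file is a common reason for daemons failing to start. If you install the libxml2-utils package (an optional extra, not required by Hadoop itself), xmllint can validate all four site files in one go; no output means they are well formed:

hduser@vdoop-VirtualBox:~$ sudo apt-get install libxml2-utils
hduser@vdoop-VirtualBox:~$ xmllint --noout /usr/local/hadoop/etc/hadoop/*-site.xml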
Format the New Hadoop Filesystem
We have modified all the configuration files; as the next step, the Hadoop file system needs to be formatted so that we can start using it. The format command should be issued by a user with write permission to /usr/local/hadoop_store/hdfs/namenode, since formatting creates a current directory under that folder.
hduser@vdoop-VirtualBox:~$ hadoop namenode -format
The above command formats HDFS. Note that in Hadoop 2.x this script form is deprecated; hdfs namenode -format is the preferred equivalent, and both work here. The next step is to start the cluster.
Starting Hadoop
Now that Hadoop is installed and configured, it's time to start it. We can use the start-all.sh script:
vdoop@vdoop-VirtualBox:~$ cd /usr/local/hadoop/sbin
vdoop@vdoop-VirtualBox:/usr/local/hadoop/sbin$ ls
distribute-exclude.sh    start-all.cmd        stop-balancer.sh
hadoop-daemon.sh         start-all.sh         stop-dfs.cmd
hadoop-daemons.sh        start-balancer.sh    stop-dfs.sh
hdfs-config.cmd          start-dfs.cmd        stop-secure-dns.sh
hdfs-config.sh           start-dfs.sh         stop-yarn.cmd
httpfs.sh                start-secure-dns.sh  stop-yarn.sh
kms.sh                   start-yarn.cmd       yarn-daemon.sh
mr-jobhistory-daemon.sh  start-yarn.sh        yarn-daemons.sh
refresh-namenodes.sh     stop-all.cmd
slaves.sh                stop-all.sh
vdoop@vdoop-VirtualBox:/usr/local/hadoop/sbin$ sudo su hduser
hduser@vdoop-VirtualBox:/usr/local/hadoop/sbin$ start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
15/04/18 16:43:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [localhost]
localhost: starting namenode, logging to /usr/local/hadoop/logs/hadoop-hduser-namenode-laptop.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-hduser-datanode-laptop.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-hduser-secondarynamenode-laptop.out
15/04/18 16:43:58 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn-hduser-resourcemanager-laptop.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-hduser-nodemanager-laptop.out
Use the jps command to check whether the Hadoop daemons are running:
hduser@vdoop-VirtualBox:/usr/local/hadoop/sbin$ jps
9026 NodeManager
7348 NameNode
9766 Jps
8887 ResourceManager
7507 DataNode
The above output indicates that our single node Hadoop cluster is up and running.
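As an optional smoke test, you can create a home directory in HDFS and list the filesystem root; the /user/hduser path here is just a conventional choice, not something the format step creates:

hduser@vdoop-VirtualBox:/usr/local/hadoop/sbin$ hdfs dfs -mkdir -p /user/hduser
hduser@vdoop-VirtualBox:/usr/local/hadoop/sbin$ hdfs dfs -ls /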
Stopping Hadoop
Follow the steps below to stop Hadoop.
$ pwd
/usr/local/hadoop/sbin
$ ls
distribute-exclude.sh  httpfs.sh                start-all.sh         start-yarn.cmd  stop-dfs.cmd        yarn-daemon.sh
hadoop-daemon.sh       mr-jobhistory-daemon.sh  start-balancer.sh    start-yarn.sh   stop-dfs.sh         yarn-daemons.sh
hadoop-daemons.sh      refresh-namenodes.sh     start-dfs.cmd        stop-all.cmd    stop-secure-dns.sh
hdfs-config.cmd        slaves.sh                start-dfs.sh         stop-all.sh     stop-yarn.cmd
hdfs-config.sh         start-all.cmd            start-secure-dns.sh  stop-balancer.sh  stop-yarn.sh
We run stop-all.sh to stop all the daemons running on our machine:
hduser@vdoop-VirtualBox:/usr/local/hadoop/sbin$ stop-all.sh
This script is Deprecated. Instead use stop-dfs.sh and stop-yarn.sh
15/04/18 15:46:31 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Stopping namenodes on [localhost]
localhost: stopping namenode
localhost: stopping datanode
Stopping secondary namenodes [0.0.0.0]
0.0.0.0: no secondarynamenode to stop
15/04/18 15:46:59 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
stopping yarn daemons
stopping resourcemanager
localhost: stopping nodemanager
no proxyserver to stop
NameNode Web UI
Use the below localhost http link to open the web UI of the NameNode daemon:
http://localhost:50070/ - web UI of the NameNode daemon
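If you are working on a headless machine without a browser, curl (assuming it is installed) can confirm that the UIs respond; the ResourceManager web UI configured earlier should likewise answer on port 8088:

hduser@vdoop-VirtualBox:~$ curl -sI http://localhost:50070/ | head -n 1
hduser@vdoop-VirtualBox:~$ curl -sI http://localhost:8088/ | head -n 1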
Congratulations! You have set up a single node Hadoop cluster.
Feel free to comment if you need anything.