In this tutorial, I will explain how to set up a Hadoop single node cluster on Ubuntu 14.04. The single node cluster will sit on top of the Hadoop Distributed File System (HDFS).
Hadoop single node cluster setup on Ubuntu 14.04
Hadoop is a Java framework for running applications on large clusters of commodity hardware. The Hadoop framework allows us to run MapReduce programs on data stored in the highly fault-tolerant Hadoop Distributed File System.
- Related Readings: How to Learn Apache Hadoop
- Also: 7 Best Books to Learn Bigdata Hadoop
The main objective of this post is to get started with Hadoop single node cluster setup. This tutorial has been tested with Ubuntu 14.04 and Hadoop 2.7.0.
Hadoop single node cluster setup – Presteps
Install Ubuntu 14.04 on your system:
First, you need a system with Ubuntu 14.04 installed on it. Either of the following approaches will work:
- Download Ubuntu 14.04 from the official website, create a bootable CD or USB drive, and install it natively, or
- Download and install Oracle VirtualBox or VMware, then install Ubuntu 14.04 in a virtual machine (the easier way)
Install Java
Remember! The Hadoop framework is written in Java and requires a working Java 1.6+ installation. Here apt-get is used to install Java:
vdoop@vdoop-VirtualBox:~$ cd ~

# Update the source list
vdoop@vdoop-VirtualBox:~$ sudo apt-get update

# Install Java
vdoop@vdoop-VirtualBox:~$ sudo apt-get install default-jdk

# Verify Java installation by checking its version
vdoop@vdoop-VirtualBox:~$ java -version
java version "1.7.0_65"
OpenJDK Runtime Environment (IcedTea 2.5.3) (7u71-2.5.3-0ubuntu0.14.04.1)
OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)
Create Hadoop Group and add dedicated User
In this step, we will create a new group named hadoop and add a dedicated user, hduser, to it. That user will be the cluster admin: we will use it to install all the Hadoop-related software and to run any Hadoop applications.
vdoop@vdoop-VirtualBox:~$ sudo addgroup hadoop
Adding group `hadoop' (GID 1002) ...
Done.
vdoop@vdoop-VirtualBox:~$ sudo adduser --ingroup hadoop hduser
Adding user `hduser' ...
Adding new user `hduser' (1001) with group `hadoop' ...
Creating home directory `/home/hduser' ...
Copying files from `/etc/skel' ...
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
Changing the user information for hduser
Enter the new value, or press ENTER for the default
        Full Name []:
        Room Number []:
        Work Phone []:
        Home Phone []:
        Other []:
Is the information correct? [Y/n] Y
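You can optionally verify the new account and its group membership with the id command; the uid and gid should match the values printed above:

vdoop@vdoop-VirtualBox:~$ id hduser
uid=1001(hduser) gid=1002(hadoop) groups=1002(hadoop)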
Installing ssh
The Hadoop framework requires ssh access to manage all its nodes. Even for our single node cluster, we need to install ssh and set up passwordless access to localhost for our dedicated hduser.
ssh has two main components:
- ssh : The command we use to connect to remote machines – The client
- sshd : The daemon that is running on the server and allows clients to connect to the server
By default, the ssh client is available on Ubuntu, but in order to run the sshd daemon, we need to install the ssh server package first.
Use this command to do that:
vdoop@vdoop-VirtualBox:~$ sudo apt-get install ssh
The above command installs ssh on our machine. If the output looks similar to the following, ssh has been set up properly:
vdoop@vdoop-VirtualBox:~$ which ssh
/usr/bin/ssh
vdoop@vdoop-VirtualBox:~$ which sshd
/usr/sbin/sshd
Create and Setup SSH Certificate
As mentioned in the previous section, Hadoop requires ssh access to manage its nodes. We have already installed ssh; now we need to configure it to allow SSH public key authentication.
Hadoop uses ssh to access its nodes, which would normally require the user to enter a password. However, this requirement can be eliminated by creating and setting up an SSH key pair using the following commands. If asked for a filename, just leave it blank and press Enter to continue.
vdoop@vdoop-VirtualBox:~$ su hduser
Password:
vdoop@vdoop-VirtualBox:~$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
Generating public/private dsa key pair.
/home/hduser/.ssh/id_dsa already exists.
Overwrite (y/n)? y
Your identification has been saved in /home/hduser/.ssh/id_dsa.
Your public key has been saved in /home/hduser/.ssh/id_dsa.pub.
The key fingerprint is:
51:05:8f:20:1c:8e:fc:5e:92:f5:38:e0:1b:5c:13:da hduser@vithal-Inspiron-3558
The key's randomart image is:
+--[ DSA 1024]----+
| .o.o oo. |
| . o.+ + o |
| o + E . . |
| + = = |
| B S . |
| . = . |
| o |
| |
| |
+-----------------+
hduser@vdoop-VirtualBox:/home/vdoop$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
The above command appends the newly created public key to the list of authorized keys, so that Hadoop can use ssh without prompting for a password.
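If ssh still asks for a password later on, the permissions on the .ssh directory are a likely culprit, since sshd ignores key files that are group or world writable. A quick, optional fix (assuming the default ~/.ssh location):

hduser@vdoop-VirtualBox:~$ chmod 700 ~/.ssh
hduser@vdoop-VirtualBox:~$ chmod 600 ~/.ssh/authorized_keys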
We can check to see if ssh works:
hduser@vdoop-VirtualBox:/home/vdoop$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is e1:8b:a0:a5:75:ef:f4:b4:5e:a9:ed:be:64:be:5c:2f.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 14.04.1 LTS (GNU/Linux 3.13.0-40-generic x86_64)
...
Installing Hadoop
Download the latest stable Hadoop release and copy it to the Ubuntu machine, or download it directly on that machine:
hduser@vdoop-VirtualBox:~$ wget http://mirrors.sonic.net/apache/hadoop/common/hadoop-2.7.0/hadoop-2.7.0.tar.gz
hduser@vdoop-VirtualBox:~$ tar xvzf hadoop-2.7.0.tar.gz
Add hduser to sudo group:
hduser@vdoop-VirtualBox:~/hadoop-2.7.0$ su vdoop
Password:
vdoop@vdoop-VirtualBox:/home/hduser$ sudo adduser hduser sudo
[sudo] password for vdoop:
Adding user `hduser' to group `sudo' ...
Adding user hduser to group sudo
Done.
Log back in as hduser and move the Hadoop installation to the /usr/local/hadoop directory using the following commands:
vdoop@vdoop-VirtualBox:/home/hduser$ sudo su hduser
# create the target directory first, then move the extracted files into it
hduser@vdoop-VirtualBox:~/hadoop-2.7.0$ sudo mkdir -p /usr/local/hadoop
hduser@vdoop-VirtualBox:~/hadoop-2.7.0$ sudo mv * /usr/local/hadoop
# change directory ownership as well
hduser@vdoop-VirtualBox:~/hadoop-2.7.0$ sudo chown -R hduser:hadoop /usr/local/hadoop
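As a quick sanity check, list the target directory; for a Hadoop 2.7.0 tarball the output should look roughly like this:

hduser@vdoop-VirtualBox:~$ ls /usr/local/hadoop
bin  etc  include  lib  libexec  LICENSE.txt  NOTICE.txt  README.txt  sbin  share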
Setup Configuration Files
The following files will have to be modified to complete the Hadoop setup:
- ~/.bashrc
- /usr/local/hadoop/etc/hadoop/hadoop-env.sh
- /usr/local/hadoop/etc/hadoop/core-site.xml
- /usr/local/hadoop/etc/hadoop/mapred-site.xml (copied from mapred-site.xml.template)
- /usr/local/hadoop/etc/hadoop/hdfs-site.xml
- /usr/local/hadoop/etc/hadoop/yarn-site.xml
~/.bashrc
Before editing the .bashrc file in our hduser home directory, we need to find the path where Java was installed, in order to set the JAVA_HOME environment variable.
Use following command:
hduser@vdoop-VirtualBox:~$ update-alternatives --config java
There is only one alternative in link group java (providing /usr/bin/java): /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java
Nothing to configure.
Now we can append the Java path, via JAVA_HOME and the other Hadoop variables, to the end of ~/.bashrc:
hduser@vdoop-VirtualBox:~$ vi ~/.bashrc

#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib/native"
#HADOOP VARIABLES END

hduser@vdoop-VirtualBox:~$ source ~/.bashrc
Note: Be careful while copying commands to the terminal, as copying from a web page may introduce junk or special characters. For example, curly double quotes (“”) may replace straight quotes, and the blog page may display a double hyphen (--) as a single hyphen. Make sure the commands are correct before running them.
Note that JAVA_HOME should be set to the path just before '/bin/', i.e. only /usr/lib/jvm/java-7-openjdk-amd64.
Now let's quickly check whether javac is working properly:
hduser@ubuntu-VirtualBox:~$ javac -version
javac 1.7.0_75
hduser@ubuntu-VirtualBox:~$ which javac
/usr/bin/javac
hduser@ubuntu-VirtualBox:~$ readlink -f /usr/bin/javac
/usr/lib/jvm/java-7-openjdk-amd64/bin/javac
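The readlink output above also suggests a shortcut: you can derive the JAVA_HOME value automatically by stripping the trailing /bin/javac from the resolved path (a small sketch, assuming javac resolves as shown above):

hduser@ubuntu-VirtualBox:~$ readlink -f /usr/bin/javac | sed 's:/bin/javac::'
/usr/lib/jvm/java-7-openjdk-amd64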
/usr/local/hadoop/etc/hadoop/hadoop-env.sh
We need to set JAVA_HOME by modifying hadoop-env.sh file.
hduser@vdoop-VirtualBox:~$ vi /usr/local/hadoop/etc/hadoop/hadoop-env.sh

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
Adding the above statement in the hadoop-env.sh file ensures that the value of the JAVA_HOME variable is available to Hadoop whenever it runs.
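To confirm the edit took effect, you can grep the file; among the matches you should see the line you just added:

hduser@vdoop-VirtualBox:~$ grep 'export JAVA_HOME' /usr/local/hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64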
/usr/local/hadoop/etc/hadoop/core-site.xml
The /usr/local/hadoop/etc/hadoop/core-site.xml file contains configuration properties that Hadoop uses when starting.
This file can be used to override the default settings that Hadoop starts with. Before editing it, create the directory that will hold Hadoop's temporary files (the hadoop.tmp.dir set below) and give hduser ownership of it:
hduser@vdoop-VirtualBox:~$ sudo mkdir -p /app/hadoop/tmp
hduser@vdoop-VirtualBox:~$ sudo chown hduser:hadoop /app/hadoop/tmp
Open the file and enter the following in between the <configuration></configuration> tags:
hduser@vdoop-VirtualBox:~$ vi /usr/local/hadoop/etc/hadoop/core-site.xml

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
    <description>The name of the default file system. A URI whose
    scheme and authority determine the FileSystem implementation. The
    uri's scheme determines the config property (fs.SCHEME.impl) naming
    the FileSystem implementation class. The uri's authority is used to
    determine the host, port, etc. for a filesystem.</description>
  </property>
</configuration>
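Since the Hadoop binaries are already on our PATH (via ~/.bashrc), we can optionally verify that Hadoop picks these values up with the hdfs getconf utility. Note that fs.default.name is the older name for fs.defaultFS, so you may see a deprecation warning alongside the value:

hduser@vdoop-VirtualBox:~$ hdfs getconf -confKey fs.default.name
hdfs://localhost:54310
hduser@vdoop-VirtualBox:~$ hdfs getconf -confKey hadoop.tmp.dir
/app/hadoop/tmp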
/usr/local/hadoop/etc/hadoop/mapred-site.xml
By default, the /usr/local/hadoop/etc/hadoop/ folder contains a mapred-site.xml.template file, which has to be copied to mapred-site.xml:
hduser@vdoop-VirtualBox:~$ cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml
The mapred-site.xml file is used to specify which framework is being used for MapReduce.
We need to enter the following content in between the <configuration></configuration> tags:
hduser@vdoop-VirtualBox:~$ vi /usr/local/hadoop/etc/hadoop/mapred-site.xml

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
/usr/local/hadoop/etc/hadoop/hdfs-site.xml
The /usr/local/hadoop/etc/hadoop/hdfs-site.xml file needs to be configured for each host in the cluster. It is used to specify the directories that will hold the NameNode and DataNode data on that host.
Before editing this file, we need to create these two directories for our Hadoop installation.
This can be done using the following commands:
hduser@vdoop-VirtualBox:~$ sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
hduser@vdoop-VirtualBox:~$ sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
hduser@vdoop-VirtualBox:~$ sudo chown -R hduser:hadoop /usr/local/hadoop_store
Open the file and enter the following content in between the <configuration></configuration> tags:
hduser@vdoop-VirtualBox:~$ vi /usr/local/hadoop/etc/hadoop/hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication. The actual number of
    replications can be specified when the file is created. The default
    is used if replication is not specified in create time.</description>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
  </property>
  <property>
    <name>fs.trash.interval</name>
    <value>5</value>
  </property>
</configuration>
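Before moving on, it is worth double checking that both directories exist and are owned by hduser, since the NameNode and DataNode will fail to start if they cannot write to them:

hduser@vdoop-VirtualBox:~$ ls -ld /usr/local/hadoop_store/hdfs/namenode /usr/local/hadoop_store/hdfs/datanode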
/usr/local/hadoop/etc/hadoop/yarn-site.xml
Open the file and enter the following content in between the <configuration></configuration> tags:
<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>localhost:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>localhost:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>localhost:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>localhost:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>localhost:8088</value>
  </property>
</configuration>
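A malformed XML file is a common reason for daemons failing to start. If you install the libxml2-utils package (an optional extra, not required by Hadoop itself), xmllint can validate all four site files in one go; no output means they are well formed:

hduser@vdoop-VirtualBox:~$ sudo apt-get install libxml2-utils
hduser@vdoop-VirtualBox:~$ xmllint --noout /usr/local/hadoop/etc/hadoop/*-site.xml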
Format the New Hadoop Filesystem
We have modified all the configuration files; as the next step, the Hadoop file system needs to be formatted so that we can start using it. The format command should be issued by a user with write permission to /usr/local/hadoop_store/hdfs/namenode, since formatting creates a current directory under that folder.
hduser@vdoop-VirtualBox:~$ hadoop namenode -format
The above command formats HDFS. Note that in Hadoop 2.x this script form is deprecated; hdfs namenode -format is the preferred equivalent, and both work here. The next step is to start the cluster.
Starting Hadoop
Now that Hadoop is installed and configured, it's time to start it. We can use the start-all.sh script:
vdoop@vdoop-VirtualBox:~$ cd /usr/local/hadoop/sbin
vdoop@vdoop-VirtualBox:/usr/local/hadoop/sbin$ ls
distribute-exclude.sh    start-all.cmd        stop-balancer.sh
hadoop-daemon.sh         start-all.sh         stop-dfs.cmd
hadoop-daemons.sh        start-balancer.sh    stop-dfs.sh
hdfs-config.cmd          start-dfs.cmd        stop-secure-dns.sh
hdfs-config.sh           start-dfs.sh         stop-yarn.cmd
httpfs.sh                start-secure-dns.sh  stop-yarn.sh
kms.sh                   start-yarn.cmd       yarn-daemon.sh
mr-jobhistory-daemon.sh  start-yarn.sh        yarn-daemons.sh
refresh-namenodes.sh     stop-all.cmd
slaves.sh                stop-all.sh
vdoop@vdoop-VirtualBox:/usr/local/hadoop/sbin$ sudo su hduser
hduser@vdoop-VirtualBox:/usr/local/hadoop/sbin$ start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
15/04/18 16:43:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [localhost]
localhost: starting namenode, logging to /usr/local/hadoop/logs/hadoop-hduser-namenode-laptop.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-hduser-datanode-laptop.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-hduser-secondarynamenode-laptop.out
15/04/18 16:43:58 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn-hduser-resourcemanager-laptop.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-hduser-nodemanager-laptop.out
Use the jps command to check whether the Hadoop daemons are running:
hduser@vdoop-VirtualBox:/usr/local/hadoop/sbin$ jps
9026 NodeManager
7348 NameNode
9766 Jps
8887 ResourceManager
7507 DataNode
The above output indicates that our single node Hadoop cluster is up and running.
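As an optional smoke test, you can create a home directory in HDFS and list the filesystem root; the /user/hduser path here is just a conventional choice, not something the format step creates:

hduser@vdoop-VirtualBox:/usr/local/hadoop/sbin$ hdfs dfs -mkdir -p /user/hduser
hduser@vdoop-VirtualBox:/usr/local/hadoop/sbin$ hdfs dfs -ls /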
Stopping Hadoop
Follow the steps below to stop Hadoop.
$ pwd
/usr/local/hadoop/sbin
$ ls
distribute-exclude.sh  httpfs.sh                start-all.sh         start-yarn.cmd  stop-dfs.cmd        yarn-daemon.sh
hadoop-daemon.sh       mr-jobhistory-daemon.sh  start-balancer.sh    start-yarn.sh   stop-dfs.sh         yarn-daemons.sh
hadoop-daemons.sh      refresh-namenodes.sh     start-dfs.cmd        stop-all.cmd    stop-secure-dns.sh
hdfs-config.cmd        slaves.sh                start-dfs.sh         stop-all.sh     stop-yarn.cmd
hdfs-config.sh         start-all.cmd            start-secure-dns.sh  stop-balancer.sh  stop-yarn.sh
We run stop-all.sh to stop all the daemons running on our machine:
hduser@vdoop-VirtualBox:/usr/local/hadoop/sbin$ stop-all.sh
This script is Deprecated. Instead use stop-dfs.sh and stop-yarn.sh
15/04/18 15:46:31 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Stopping namenodes on [localhost]
localhost: stopping namenode
localhost: stopping datanode
Stopping secondary namenodes [0.0.0.0]
0.0.0.0: no secondarynamenode to stop
15/04/18 15:46:59 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
stopping yarn daemons
stopping resourcemanager
localhost: stopping nodemanager
no proxyserver to stop
NameNode Web UI
Use the below localhost http link to open the web UI of the NameNode daemon:
http://localhost:50070/ - web UI of the NameNode daemon
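If you are working on a headless machine without a browser, curl (assuming it is installed) can confirm that the UIs respond; the ResourceManager web UI configured earlier should likewise answer on port 8088:

hduser@vdoop-VirtualBox:~$ curl -sI http://localhost:50070/ | head -n 1
hduser@vdoop-VirtualBox:~$ curl -sI http://localhost:8088/ | head -n 1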
Congratulations! You have set up a single node Hadoop cluster.
Feel free to comment if you need anything.