How to Learn Apache Hadoop

  • Post last modified:July 6, 2018
  • Post category:BigData
  • Reading time:11 mins read

Many of you want to know what Apache Hadoop is and how and where to start learning it. Here I'm going to share some of the steps I followed to learn Hadoop.

How to Learn Hadoop

Don't worry! You don't have to be a Java programmer to learn Hadoop. You should know a little bit of basic Linux commands. You will pick up the remaining programming languages once you log in to a cluster 🙂

First, let's understand what Hadoop is.

Apache Hadoop is an open-source framework for processing very large data sets (Big Data). Hadoop enables distributed storage and distributed processing on computer clusters built from commodity hardware. The framework is designed on the assumption that hardware failure is common on commodity machines, and that the framework should handle such failures by itself.

I learned Hadoop by following the steps below, and I'm sharing them with you.

Get to Know Apache Hadoop Architecture

First, make your basics strong. Understand the architecture of Apache Hadoop. One book that helped me a lot to get a basic understanding of Hadoop is Hadoop: The Definitive Guide. If you want to learn Hadoop, you should definitely consider buying a copy for yourself. You should also consider Hadoop Application Architectures; this book will give you a clear understanding of the best practices to follow when developing applications on the Hadoop platform.

Hadoop: The Definitive Guide

Hadoop Application Architectures

These books will definitely give you a basic understanding of everything you want to know about the Hadoop components.

Understand the HDFS Architecture

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. HDFS is highly fault-tolerant and can be deployed on low-cost hardware. HDFS provides high-throughput access to application data and is suitable for applications that have very large data sets (Big Data).

Hadoop: The Definitive Guide has a separate chapter on HDFS. You can also find more information on the HDFS architecture on the Apache website.
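To get an intuitive feel for what HDFS does with a large file, here is a small Python sketch (not real HDFS code; the 128 MB block size and replication factor of 3 are the common configurable defaults) that works out how a file would be split into blocks and how much raw storage the replicas consume:

```python
# Toy illustration of HDFS storage planning. Not actual HDFS code; the
# 128 MB block size and replication factor of 3 are the usual defaults
# (configurable via dfs.blocksize and dfs.replication).

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB default block size
REPLICATION = 3                  # default replication factor

def plan_blocks(file_size_bytes):
    """Return (number_of_blocks, raw_bytes_stored_including_replicas)."""
    num_blocks = -(-file_size_bytes // BLOCK_SIZE)  # ceiling division
    return num_blocks, file_size_bytes * REPLICATION

# A 1 GB file splits into 8 blocks and occupies 3 GB of raw cluster storage.
blocks, raw = plan_blocks(1024 * 1024 * 1024)
print(blocks, raw)  # 8 3221225472
```

Each block is stored on several different machines, which is why HDFS keeps working even when individual commodity nodes fail.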

Understand the MapReduce

MapReduce is a programming model for data processing in distributed environments. MapReduce programs are inherently parallel. Hadoop can run MapReduce programs written in various programming languages such as Java, Python, etc.

Even though Spark's in-memory computation may replace MapReduce in the near future, you should learn the basic concepts of MapReduce to understand how Hadoop actually works in distributed systems.

Like I mentioned earlier, Hadoop: The Definitive Guide has everything you need to know about the fundamental concepts of Hadoop. You can find more information on MapReduce in the MapReduce Design Patterns book.
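To make the MapReduce model concrete, here is a toy word count in plain Python that mimics the three phases a real job goes through: map (emit key/value pairs), shuffle (group values by key, which the framework does for you), and reduce (aggregate). This is only a local simulation of the idea, not Hadoop code:

```python
# Toy word count that mimics the MapReduce phases locally.
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as Hadoop does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Reduce: sum the counts for one word.
    return key, sum(values)

lines = ["hadoop stores big data", "hadoop processes big data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'hadoop': 2, 'stores': 1, 'big': 2, 'data': 2, 'processes': 1}
```

In a real Hadoop job, the map and reduce functions run on many machines in parallel and the shuffle happens over the network, but the logic is exactly this.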

Build Single Node Cluster

As a next step, set up and configure a single-node Hadoop installation so that you can quickly perform simple operations using Hadoop MapReduce and the Hadoop Distributed File System (HDFS).

Ubuntu Single Node Cluster

Below is some of the software required to build a single-node cluster:

  • A virtual machine (if you don't want to install Linux directly on your system) to create a Linux system, preferably Ubuntu
  • Ubuntu: download the latest stable version


Start Coding

Start creating your Hadoop MapReduce program in your favourite language, load some test data into HDFS, and execute it on the single-node cluster that you have created. You may also download the examples from the Hadoop: The Definitive Guide website and execute them to see how Hadoop processes data.
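If your favourite language happens to be Python, one common route is Hadoop Streaming, which runs any program that reads lines from stdin and writes tab-separated key/value lines to stdout. The sketch below shows what a streaming-style word-count mapper and reducer might look like; in a real job these would be two separate scripts passed via the streaming jar's -mapper and -reducer options, and here we simply simulate the map, sort, reduce pipeline locally:

```python
# Sketch of a Hadoop Streaming style word count. In a real job each half
# would be its own script reading sys.stdin; here we simulate the pipeline.
import sys

def mapper(lines):
    # Emit "word<TAB>1" for every word, the format Hadoop Streaming expects.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    # Hadoop sorts mapper output by key before the reducer sees it,
    # so all lines for one word arrive together.
    current, total = None, 0
    for line in sorted_lines:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"

if __name__ == "__main__":
    # Simulate map -> sort (shuffle) -> reduce on a couple of test lines.
    mapped = sorted(mapper(["big data", "big cluster"]))
    for out in reducer(mapped):
        print(out)   # big 2 / cluster 1 / data 1 (tab-separated)
```

Running the same logic on a cluster only changes how the scripts are launched; the per-line contract between mapper and reducer stays identical.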

Understand Hive

If you are from an RDBMS or data warehouse background, Hive is just a piece of cake for you. It gives you a database flavour on top of Hadoop.

Hive was created to make it possible for analysts with strong SQL skills to run queries on the huge volumes of data that Facebook stored in HDFS. Today, Hive is a successful Apache project used by many organisations as a general-purpose, scalable data processing platform.
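To give a flavour of what working with Hive looks like, here is a small, hypothetical HiveQL example (the table and column names are made up for illustration); Hive compiles queries like this into distributed jobs behind the scenes:

```sql
-- Hypothetical example: a table over tab-delimited log files in HDFS.
CREATE TABLE page_views (
  user_id STRING,
  url     STRING,
  ts      TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Familiar SQL, executed in parallel across the cluster.
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url
ORDER BY views DESC
LIMIT 10;
```

If you can already write SQL like this, the learning curve for Hive is mostly about table storage formats and partitioning, not the query language itself.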

One of my favourite books on Hive is Programming Hive. Consider buying it if you want to learn Hive.

Programming Hive

Consider reading about Flume if you want to perform near-real-time analytics, and the list goes on..

This should be enough to get started on your journey into the Hadoop world.

Remember!! Practice makes perfect. You need lots of practice and hands-on work to master Hadoop. Set up your system, start reading, and implement programs.

Feel free to comment if you need any information or if anything needs to be added.

And of course Happy reading.. 🙂