Many of you want to know what Apache Hadoop is and how and where to start learning it. Here I'm going to share some of the steps I followed to learn Hadoop.
Don't worry! You don't have to be a Java programmer to learn Hadoop. You should know a little bit about basic Linux commands. You will pick up the remaining programming skills once you log in to a cluster 🙂
First, let's understand what Hadoop is.
Apache Hadoop is an open-source framework for processing very large data sets (Big Data). Hadoop provides distributed storage and distributed processing on clusters built from commodity hardware. The framework is designed on the assumption that hardware failures are common on commodity hardware and that the framework should handle them by itself.
I learned Hadoop by following the steps below, and I'm sharing them with you.
Get to Know Apache Hadoop Architecture
First, make your basics strong. Understand the architecture of Apache Hadoop. One book that helped me a lot to get a basic understanding of Hadoop is Hadoop: The Definitive Guide. If you want to learn Hadoop, then you should definitely consider buying one for yourself. You should also consider Hadoop Application Architectures; this book will give you a clear understanding of the best practices you should follow when developing applications on the Hadoop platform.
These books will definitely give you a basic understanding of everything you want to know about the Hadoop components.
Understand the HDFS Architecture
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. HDFS is highly fault-tolerant and can be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have very large data sets (Big Data).
Hadoop: The Definitive Guide has a separate chapter on HDFS. You can also find more information on the HDFS architecture on the Apache website.
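To get an intuition for how HDFS stores data, here is a minimal Python sketch. This is not real HDFS code; the block size and replication factor mirror common HDFS defaults, and the DataNode names and round-robin placement are simplifying assumptions of mine. It only illustrates the idea that a big file is split into fixed-size blocks and each block is replicated on several DataNodes:

```python
# Toy illustration of HDFS-style storage: a file is split into
# fixed-size blocks, and each block is replicated on several DataNodes.
# This is a simplified simulation, not real HDFS code.

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, a common HDFS default
REPLICATION = 3                 # each block is stored on 3 DataNodes

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes needs."""
    blocks = []
    remaining = file_size
    while remaining > 0:
        blocks.append(min(block_size, remaining))
        remaining -= block_size
    return blocks

def place_blocks(blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` DataNodes, round-robin."""
    placement = {}
    for i, _ in enumerate(blocks):
        placement[i] = [datanodes[(i + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

# A 300 MB file needs three blocks: 128 MB + 128 MB + 44 MB.
blocks = split_into_blocks(300 * 1024 * 1024)
nodes = ["datanode1", "datanode2", "datanode3", "datanode4"]
print(len(blocks))                     # 3
print(place_blocks(blocks, nodes)[0])  # ['datanode1', 'datanode2', 'datanode3']
```

If one DataNode dies, the NameNode still has two other copies of every block that node held, which is why the framework tolerates the hardware failures mentioned above.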
Understand the MapReduce
MapReduce is a programming model for data processing in distributed environments. MapReduce programs are inherently parallel. Hadoop can run MapReduce programs written in various programming languages such as Java, Python, etc.
Even though Spark's in-memory computation may replace MapReduce in the near future, you should learn the basic concepts of MapReduce to understand how Hadoop actually works in a distributed system.
As I mentioned earlier, Hadoop: The Definitive Guide has everything you need to know about the fundamental concepts of Hadoop. You will find more information on MapReduce in the MapReduce Design Patterns book.
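To make the model concrete, here is the classic word-count example as a minimal Python sketch. The function names are my own, not part of any Hadoop API: the map phase emits (word, 1) pairs, the shuffle step (which Hadoop performs across the cluster) is simulated locally with a sort, and the reduce phase sums the counts per word:

```python
# Minimal word count in the MapReduce style, simulated locally.
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: sum the counts for each word."""
    # Hadoop guarantees each reducer sees its keys grouped together;
    # sorting the pairs simulates that shuffle-and-sort step here.
    for word, group in groupby(sorted(pairs, key=itemgetter(0)),
                               key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

data = ["hadoop stores big data", "hadoop processes big data"]
print(dict(reduce_phase(map_phase(data))))
# {'big': 2, 'data': 2, 'hadoop': 2, 'processes': 1, 'stores': 1}
```

On a real cluster the same two functions run as many parallel map and reduce tasks across machines; the programming model stays the same, which is what makes MapReduce worth learning.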
Build Single Node Cluster
Next, set up and configure a single-node Hadoop installation so that you can quickly perform simple operations using Hadoop MapReduce and the Hadoop Distributed File System (HDFS).
Below is some of the software required to build a single-node cluster:
- A virtual machine (if you don't want to install Linux directly on your system) to run a Linux distribution, preferably Ubuntu
- Ubuntu (download the latest stable version)
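To turn a fresh installation into a pseudo-distributed single-node cluster, the Apache setup guide has you edit two configuration files. A minimal sketch is below; the host and port follow the defaults in the official docs, so adjust them to your environment:

```xml
<!-- etc/hadoop/core-site.xml: point the default filesystem at local HDFS -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- etc/hadoop/hdfs-site.xml: one copy of each block is enough on one node -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```

Setting the replication factor to 1 makes sense only on a single node; on a real cluster you would leave it at the default of 3.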
Start Coding
Start writing your Hadoop MapReduce program in your favourite language, load some test data into HDFS, and execute it on the single-node cluster you have created. You may also download the examples from the Hadoop: The Definitive Guide website and execute them to see how Hadoop processes data.
Understand Hive
If you are from an RDBMS or data warehouse background, Hive is a piece of cake for you. It gives you a database flavour on top of Hadoop.
Hive was created to make it possible for analysts with strong SQL skills to run queries on the huge volumes of data that Facebook stored in HDFS. Today, Hive is a successful Apache project used by many organisations as a general-purpose, scalable data processing platform.
One of my favourite books on Hive is Programming Hive. Consider buying it if you want to learn Hive.
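To get a feel for that database flavour, here is a small HiveQL sketch. The table, columns, and file path are hypothetical examples of mine; the point is that Hive lets you define a table over plain files in HDFS and query it with familiar SQL:

```sql
-- Hypothetical example: expose a CSV-style file in HDFS as a table.
CREATE TABLE page_views (
  user_id   STRING,
  page      STRING,
  view_time TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Move sample data that already sits in HDFS into the table.
LOAD DATA INPATH '/user/hadoop/page_views.csv' INTO TABLE page_views;

-- Plain SQL, which Hive compiles into distributed jobs under the hood.
SELECT page, COUNT(*) AS views
FROM page_views
GROUP BY page;
```

Behind the scenes Hive translates the query into jobs on the cluster, so analysts get SQL without writing MapReduce code themselves.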
Consider reading about Flume if you want to perform near real-time analytics, and the list goes on..
This should be enough to get started on your journey into the Hadoop world.
Remember!! Practice makes perfect. You need lots of practice and hands-on work to master Hadoop. Set up your system, and start reading and implementing programs.
Feel free to comment if you need any information or if anything needs to be added.
And of course Happy reading.. 🙂