“We are all different. Don’t judge, understand instead.”
Roy T. Bennett, The Light in the Heart
Contents
1. Introduction
2. Single Node Operation
3. Distributed Processing on Single Node

1. Introduction

Hadoop is a toolkit for big-data processing. It uses a cluster of computers to split data into multiple chunks, process each chunk on one machine, and re-assemble the output.
Let us now get started with Hadoop.
Download the Hadoop binary distribution from Apache servers.
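Before diving in, the split-process-reassemble idea can be made concrete with a toy shell sketch. This is not Hadoop, just an illustration of the pattern with ordinary Unix tools; all file names here are made up for the example.

```shell
# A toy illustration of the split/process/merge pattern (not Hadoop).
mkdir -p demo

# Create a small input file.
printf 'to be or not to be\nthat is the question\n' > demo/input.txt

# 1. Split the input into chunks (here: one line per chunk).
split -l 1 demo/input.txt demo/chunk-

# 2. Process each chunk independently (the "map" step):
#    emit one line per word.
for c in demo/chunk-*; do
  tr ' ' '\n' < "$c" > "$c.words"
done

# 3. Re-assemble: merge the per-chunk outputs and count
#    occurrences of each word (the "reduce" step).
cat demo/chunk-*.words | sort | uniq -c | sort -rn > demo/counts.txt

cat demo/counts.txt
```

In a real Hadoop job, step 2 is what runs in parallel across the machines of a cluster; Hadoop handles the splitting, scheduling, and merging for you.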
2. Single Node Operation

Let us run a Hadoop job on a single node to understand the basics of Hadoop processing. In this mode, the Hadoop data processing job runs on a single node as a single Java process (no distributed computing).
We will run a Hadoop job that finds and counts the matches of a regular expression in a set of files.
2.1. Setup

Create a directory to be used as the working directory and change to that directory.
mkdir hadoop-getting-started && cd hadoop-getting-started

Extract the Hadoop binary tarball into this directory. A sub-directory such as hadoop-2.7.3 will be created containing the distribution.
tar xvzf hadoop-2.7.3.tar.gz

Ensure that JAVA_HOME is set and exported.
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_71

Run the Hadoop binary to ensure everything is set up correctly.
JAVA=java ./hadoop-2.7.3/bin/hadoop jar hadoop-2.7.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar

Note: The setting JAVA=java was required in the above invocation because the bin/hadoop executable shell script appears to use this variable, and the command would not run without it being set. However, there was no mention of this setting in the Hadoop docs (or at least, I didn't see any).
If everything has been set up correctly, you should see the following output from the above invocation.
An example program must be given as the first argument.
Valid program names are:
aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
dbcount: An example job that count the pageview counts from a database.
distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
...

2.2. Running the Job
Let us run a sample job: attempting to find regular expression matches in files.
mkdir input
cp hadoop-2.7.3/etc/hadoop/*.xml input/
JAVA=java ./hadoop-2.7.3/bin/hadoop jar hadoop-2.7.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'a+'
The job should now start, print a stream of diagnostic output, and finish within a few seconds. Check the output:
cat output/*

This command prints the contents of the output files, showing the regular expression matches found in the input files along with their counts.
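For comparison, the same computation that this Hadoop job performs — counting the matches of a regex across a set of files — can be sketched with plain Unix tools. The sample files below are made up for illustration:

```shell
# A plain-Unix analogue of the Hadoop "grep" example job:
# count occurrences of each distinct match of a regex across files.
mkdir -p grep-demo
printf 'aardvark banana\n' > grep-demo/one.txt
printf 'data llama\n'      > grep-demo/two.txt

# -o emits each match of 'a+' on its own line; -h suppresses
# file names. Then count occurrences of each distinct match.
grep -hoE 'a+' grep-demo/*.txt | sort | uniq -c | sort -rn
```

Hadoop's grep example does essentially this, except the matching runs as distributed map tasks and the counting as a reduce step.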
3. Distributed Processing on Single Node

This example runs as a distributed system (where each Hadoop daemon is a separate process), but on a single node. It serves to lay out the processing structure for a fully-distributed system on a cluster.
3.1. Password-less SSH

Generate a key-pair with the following command. It writes the private key of an RSA key-pair into ~/.ssh/id_rsa and the public key into ~/.ssh/id_rsa.pub.
You should enter a passphrase to protect the key. (And remember it!)
ssh-keygen

Append the public key file ~/.ssh/id_rsa.pub to the ~/.ssh/authorized_keys file.
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Ensure that ~/.ssh/authorized_keys is private (only you can read/write the file).
chmod 600 ~/.ssh/authorized_keys

Add the private key file to your list of SSH identities. You will need to enter your passphrase here to do that.
ssh-add ~/.ssh/id_rsa

You should now be able to log in to localhost without a password. (If it doesn't work, let me know in the comments.)
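The key-setup steps above can be combined into one script. The sketch below uses a throwaway directory so it will not clobber your real keys; in practice you would use ~/.ssh/id_rsa as described in the text, and an empty passphrase is used here only so the demo runs non-interactively (use a real passphrase as advised above).

```shell
# The SSH key-setup steps combined (demo version: throwaway key
# location, empty passphrase -- adjust both for real use).
KEYDIR=$(mktemp -d)

# Generate an RSA key-pair non-interactively.
ssh-keygen -q -t rsa -N '' -f "$KEYDIR/id_rsa"

# Authorize the public key and lock down the file permissions.
cat "$KEYDIR/id_rsa.pub" >> "$KEYDIR/authorized_keys"
chmod 600 "$KEYDIR/authorized_keys"
```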
ssh localhost

3.2. Format HDFS

Format the Hadoop Distributed File System (HDFS).
./hadoop-2.7.3/bin/hdfs namenode -format

3.3. Start Daemons

Ensure JAVA_HOME is properly set in hadoop-2.7.3/etc/hadoop/hadoop-env.sh. Put this line somewhere in the file.
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_71

This step assumes that password-less SSH is working properly.
./hadoop-2.7.3/sbin/start-dfs.sh

Something like the following will be printed as diagnostic output:
localhost: starting namenode, logging to /home/hadoop/server/hadoop-2.7.3/lo...
localhost: starting datanode, logging to /home/hadoop/server/hadoop-2.7.3/lo...
Starting secondary namenodes [0.0.0.0]
...
Check that the processes are running with ps -ef. You should see a namenode, a datanode and a secondarynamenode, as follows:
/usr/lib/jvm/jdk1.8.0_71/bin/java -Dproc_namenode ...
/usr/lib/jvm/jdk1.8.0_71/bin/java -Dproc_datanode ...
/usr/lib/jvm/jdk1.8.0_71/bin/java -Dproc_secondarynamenode ...

3.4. Copy Input Files
Create directories in HDFS to copy files to (assuming joe is your username).
hadoop-2.7.3/bin/hdfs dfs -mkdir /user
hadoop-2.7.3/bin/hdfs dfs -mkdir /user/joe
hadoop-2.7.3/bin/hdfs dfs -mkdir input
Check that the input directory has been created.
hadoop-2.7.3/bin/hdfs dfs -ls
# prints
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2017-03-16 17:33 input
Copy files to be processed to HDFS.
hadoop-2.7.3/bin/hdfs dfs -put hadoop-2.7.3/etc/hadoop/*.xml input

3.5. Run Hadoop Job

The job can now be run as follows (searching for the word "honour" in the input text files):
JAVA=java ./hadoop-2.7.3/bin/hadoop jar hadoop-2.7.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'honour'

View the generated output files with the command:
./hadoop-2.7.3/bin/hdfs dfs -cat output/*
# prints
59 honour
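Note that the job counts individual matches, not matching lines. If you try to double-check a figure like this with plain grep, remember that grep -c counts lines; to mimic Hadoop's per-match counting you need -o piped to wc -l. A small self-contained illustration (the sample text is made up):

```shell
# The grep job counts matches, not matching lines.
printf 'honour thy honour\nno match here\nhonour\n' > sample.txt

# grep -c counts matching LINES ...
grep -c 'honour' sample.txt           # -> 2

# ... while -o emits each match, so wc -l counts MATCHES,
# which is what the Hadoop grep job reports.
grep -o 'honour' sample.txt | wc -l   # -> 3
```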
And that’s it. Stop the daemons with the command:
./hadoop-2.7.3/sbin/stop-dfs.sh

Summary

Hadoop is a big-data processing platform that can run on clusters of commodity hardware. In this tutorial we learnt how to get started with Hadoop by running it as a single process, and then on a single node in distributed mode. In further articles, we will explore how to run it on a cluster of nodes.