“We are all different. Don’t judge, understand instead.”
Roy T. Bennett, The Light in the Heart
Contents
1. Introduction
2. Single Node Operation
3. Distributed Processing on Single Node

1. Introduction

Hadoop is a toolkit for big-data processing. It uses a cluster of computers to split data into multiple chunks, process each chunk on one machine, and re-assemble the output.
Let us now get started with Hadoop.
Download the Hadoop binary distribution from Apache servers.
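Before diving in, the split-process-reassemble idea can be made concrete with a toy shell sketch. This is not Hadoop, just an illustration of the pattern with ordinary Unix tools; all file names here are made up for the example.

```shell
# A toy illustration of the split/process/merge pattern (not Hadoop).
mkdir -p demo

# Create a small input file.
printf 'to be or not to be\nthat is the question\n' > demo/input.txt

# 1. Split the input into chunks (here: one line per chunk).
split -l 1 demo/input.txt demo/chunk-

# 2. Process each chunk independently (the "map" step):
#    emit one line per word.
for c in demo/chunk-*; do
  tr ' ' '\n' < "$c" > "$c.words"
done

# 3. Re-assemble: merge the per-chunk outputs and count
#    occurrences of each word (the "reduce" step).
cat demo/chunk-*.words | sort | uniq -c | sort -rn > demo/counts.txt

cat demo/counts.txt
```

In a real Hadoop job, step 2 is what runs in parallel across the machines of a cluster; Hadoop handles the splitting, scheduling, and merging for you.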
2. Single Node Operation

Let us run a Hadoop job on a single node to understand the basics of Hadoop processing. In this mode, the Hadoop data processing job runs on a single node as a single Java process (no distributed computing).
We will run a Hadoop job that finds and counts the matches of a regular expression in a set of files.
2.1. Setup

Create a directory to be used as the working directory and change to that directory.
mkdir hadoop-getting-started && cd hadoop-getting-started

Extract the Hadoop binary tarball into this directory. A sub-directory such as hadoop-2.7.3 will be created containing the distribution.
tar xvzf hadoop-2.7.3.tar.gz

Ensure that JAVA_HOME is set and exported.
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_71

Run the Hadoop binary to ensure everything is set up correctly.
JAVA=java ./hadoop-2.7.3/bin/hadoop jar hadoop-2.7.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar

Note: The setting JAVA=java was required in the above invocation because the bin/hadoop executable shell script appears to use this variable, and the command would not run without it being set. However, there was no mention of this setting in the Hadoop docs (or at least, I didn't see any).
If everything has been set up correctly, you should see the following output from the above invocation.
An example program must be given as the first argument.
Valid program names are:
aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
dbcount: An example job that count the pageview counts from a database.
distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
...

2.2. Running the Job
Let us run a sample job: attempting to find regular expression matches in files.
mkdir input
cp hadoop-2.7.3/etc/hadoop/*.xml input/
JAVA=java ./hadoop-2.7.3/bin/hadoop jar hadoop-2.7.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'a+'
The job should now start, print a stream of diagnostic output, and finish within a few seconds. Check the output:
cat output/*

This command prints the contents of the output files, showing the regular expression matches found in the input files along with their counts.
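For comparison, the same computation that this Hadoop job performs — counting the matches of a regex across a set of files — can be sketched with plain Unix tools. The sample files below are made up for illustration:

```shell
# A plain-Unix analogue of the Hadoop "grep" example job:
# count occurrences of each distinct match of a regex across files.
mkdir -p grep-demo
printf 'aardvark banana\n' > grep-demo/one.txt
printf 'data llama\n'      > grep-demo/two.txt

# -o emits each match of 'a+' on its own line; -h suppresses
# file names. Then count occurrences of each distinct match.
grep -hoE 'a+' grep-demo/*.txt | sort | uniq -c | sort -rn
```

Hadoop's grep example does essentially this, except the matching runs as distributed map tasks and the counting as a reduce step.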
3. Distributed Processing on Single Node

This example runs as a distributed system (where each Hadoop daemon is a separate process), but on a single node. It serves to lay out the processing structure for a fully-distributed system on a cluster.
3.1. Password-less SSH

Generate a key-pair with the following command. It writes the private key of an RSA key-pair into ~/.ssh/id_rsa and the public key into ~/.ssh/id_rsa.pub.
You should enter a passphrase to protect the key. (And remember it!)
ssh-keygen

Append the public key file ~/.ssh/id_rsa.pub to the ~/.ssh/authorized_keys file.
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Ensure that ~/.ssh/authorized_keys is private (only you can read/write the file).
chmod 600 ~/.ssh/authorized_keys

Add the private key file to your list of SSH identities. You will need to enter your passphrase here to do that.
ssh-add ~/.ssh/id_rsa

You should now be able to log in to localhost without a password. (If it doesn't work, let me know in the comments.)
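The key-setup steps above can be combined into one script. The sketch below uses a throwaway directory so it will not clobber your real keys; in practice you would use ~/.ssh/id_rsa as described in the text, and an empty passphrase is used here only so the demo runs non-interactively (use a real passphrase as advised above).

```shell
# The SSH key-setup steps combined (demo version: throwaway key
# location, empty passphrase -- adjust both for real use).
KEYDIR=$(mktemp -d)

# Generate an RSA key-pair non-interactively.
ssh-keygen -q -t rsa -N '' -f "$KEYDIR/id_rsa"

# Authorize the public key and lock down the file permissions.
cat "$KEYDIR/id_rsa.pub" >> "$KEYDIR/authorized_keys"
chmod 600 "$KEYDIR/authorized_keys"
```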
ssh localhost

3.2. Format HDFS

Format the Hadoop Distributed File System (HDFS).
./hadoop-2.7.3/bin/hdfs namenode -format

3.3. Start Daemons

Ensure JAVA_HOME is properly set in hadoop-2.7.3/etc/hadoop/hadoop-env.sh. Put this line somewhere in the file.
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_71

This step assumes that password-less SSH is working properly.
./hadoop-2.7.3/sbin/start-dfs.sh

Something like the following will be printed as diagnostic output:
localhost: starting namenode, logging to /home/hadoop/server/hadoop-2.7.3/lo...
localhost: starting datanode, logging to /home/hadoop/server/hadoop-2.7.3/lo...
Starting secondary namenodes [0.0.0.0]
...
Check that the processes are running with ps -ef. You should see a namenode, a datanode and a secondarynamenode, as follows:
/usr/lib/jvm/jdk1.8.0_71/bin/java -Dproc_namenode ...
/usr/lib/jvm/jdk1.8.0_71/bin/java -Dproc_datanode ...
/usr/lib/jvm/jdk1.8.0_71/bin/java -Dproc_secondarynamenode ...

3.4. Copy Input Files
Create directories in HDFS to copy files to (assuming joe is your username).
hadoop-2.7.3/bin/hdfs dfs -mkdir /user
hadoop-2.7.3/bin/hdfs dfs -mkdir /user/joe
hadoop-2.7.3/bin/hdfs dfs -mkdir input
Check that the input directory has been created.
hadoop-2.7.3/bin/hdfs dfs -ls
# prints
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2017-03-16 17:33 input
Copy files to be processed to HDFS.
hadoop-2.7.3/bin/hdfs dfs -put hadoop-2.7.3/etc/hadoop/*.xml input

3.5. Run Hadoop Job

The job can now be run as follows (searching for the word "honour" in the input text files):
JAVA=java ./hadoop-2.7.3/bin/hadoop jar hadoop-2.7.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'honour'

View the generated output files with the command:
./hadoop-2.7.3/bin/hdfs dfs -cat output/*
# prints
59 honour
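Note that the job counts individual matches, not matching lines. If you try to double-check a figure like this with plain grep, remember that grep -c counts lines; to mimic Hadoop's per-match counting you need -o piped to wc -l. A small self-contained illustration (the sample text is made up):

```shell
# The grep job counts matches, not matching lines.
printf 'honour thy honour\nno match here\nhonour\n' > sample.txt

# grep -c counts matching LINES ...
grep -c 'honour' sample.txt           # -> 2

# ... while -o emits each match, so wc -l counts MATCHES,
# which is what the Hadoop grep job reports.
grep -o 'honour' sample.txt | wc -l   # -> 3
```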
And that’s it. Stop the daemons with the command:
./hadoop-2.7.3/sbin/stop-dfs.sh

Summary

Hadoop is a big-data processing platform that can run on clusters of commodity hardware. In this tutorial we learnt how to get started with Hadoop by running it as a single process, and then on a single node in distributed mode. In further articles, we will explore how to run it on a cluster of nodes.