
Apache Hadoop Interview Questions - Set 1


1. What is the Hadoop ecosystem and what are its building block elements?

Answer :- The Hadoop ecosystem refers to the various components of the Apache Hadoop software library, as well as to the accessories and tools provided by the Apache Software Foundation for these types of software projects, and to the ways that they work together.

Core Hadoop elements are:

1. MapReduce - a framework for processing vast amounts of data in parallel.

2. Hadoop Distributed File System (HDFS) - a sophisticated distributed file system.

3. YARN - the Hadoop resource manager.

In addition to these core elements of Hadoop, Apache has also delivered other kinds of accessories or complementary tools for developers. These include Apache Hive, a data analysis tool; Apache Spark, a general engine for processing big data; Apache Pig, a data flow language; HBase, a database tool; and also Ambari, which can be considered a Hadoop ecosystem manager, as it helps to administer the use of these various Apache resources together.
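
To make the MapReduce building block concrete, below is a minimal word-count sketch written against the org.apache.hadoop.mapreduce Java API. The class names and the word-count logic are illustrative, not part of the original answer.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts that the shuffle grouped under each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}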

2. What is the fundamental difference between classic Hadoop 1.x and Hadoop 2.x?

Answer:-

- Cluster size: Hadoop 1.x is limited to about 4,000 nodes per cluster; Hadoop 2.x scales to potentially 10,000 nodes per cluster.
- Processing models: Hadoop 1.x supports only the MapReduce processing model; Hadoop 2.x adds support for other (non-MR) distributed computing models such as Spark, Hama, Giraph, MPI and HBase coprocessors.
- Resource management: in Hadoop 1.x the JobTracker is a bottleneck, being responsible for resource management, scheduling and monitoring (MapReduce does both processing and cluster resource management); in Hadoop 2.x, YARN (Yet Another Resource Negotiator) does cluster resource management while processing is done by the different processing models, giving more efficient cluster utilisation.
- Execution slots: Hadoop 1.x Map and Reduce slots are static, so a given slot can run either a Map task or a Reduce task only; Hadoop 2.x works on the concept of containers, which can run generic tasks.
- Namespaces: Hadoop 1.x has only one namespace for managing HDFS; Hadoop 2.x supports multiple namespaces.
- NameNode availability: in Hadoop 1.x the single NameNode is a single point of failure, and a NameNode failure needs manual intervention; in Hadoop 2.x this is overcome with a standby NameNode, configured for automatic recovery.

3. What are the JobTracker and TaskTracker, and how are they used in a Hadoop cluster?

Answer :- The JobTracker is a daemon that runs on the master node (typically alongside the NameNode) for submitting and tracking MapReduce jobs in Hadoop. Some typical tasks of the JobTracker are:

- It accepts jobs from clients.
- It talks to the NameNode to determine the location of the data.
- It locates TaskTracker nodes with available slots at or near the data.
- It submits the work to the chosen TaskTracker nodes and monitors the progress of each task by receiving heartbeat signals from the TaskTrackers.

The TaskTracker is a daemon that runs on the DataNodes. It accepts tasks (Map, Reduce and Shuffle operations) from the JobTracker. TaskTrackers manage the execution of individual tasks on the slave nodes. When a client submits a job, the JobTracker initialises the job, divides the work and assigns it to different TaskTrackers to perform the MapReduce tasks. While performing this work, each TaskTracker continuously communicates with the JobTracker by sending heartbeat signals. If the JobTracker does not receive a heartbeat from a TaskTracker within the specified time, it assumes the TaskTracker has crashed and assigns its tasks to another TaskTracker in the cluster.

4. What is the relationship between jobs and tasks in Hadoop?

Answer :- In Hadoop, jobs are submitted by the client, and each job is split into tasks such as Map, Reduce and Shuffle. A single job typically runs as many Map tasks (one per input split) and one or more Reduce tasks.
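
As an illustration, the driver sketch below shows this relationship: the client submits one Job, which Hadoop then splits into Map tasks, a shuffle phase and Reduce tasks. The input/output paths are assumptions, it reuses the WordCount mapper and reducer sketched earlier, and Job.getInstance is the Hadoop 2.x form (Hadoop 1.x used new Job(conf, name)).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");         // one job, submitted by the client
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);   // runs as many Map tasks, one per input split
        job.setReducerClass(WordCount.IntSumReducer.class);    // runs as one or more Reduce tasks after the shuffle
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/input"));      // illustrative paths
        FileOutputFormat.setOutputPath(job, new Path("/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);      // blocks until all tasks of the job finish
    }
}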

5. What is HDFS (Hadoop Distributed File System)? Why is HDFS termed a block-structured file system? What is the default HDFS block size?

Answer :- HDFS is a file system designed for storing very large files. It is highly fault-tolerant, offers high throughput, suits applications with large data sets and streaming access to file system data, and can be built out of commodity hardware (inexpensive machines without high-availability guarantees).

HDFS is termed a block-structured file system because individual files are broken into blocks of a fixed size (the default HDFS block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x). These blocks are stored across a cluster of one or more machines with data storage capacity. Changing the dfs.block.size property in hdfs-site.xml changes the default block size for all files subsequently placed into HDFS.
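
The block size can also be requested per file through the HDFS Java API. The following is a minimal sketch (the file path and the 128 MB size are illustrative assumptions) using the FileSystem.create overload that takes an explicit block size.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        long blockSize = 128L * 1024 * 1024;   // 128 MB instead of the cluster-wide default
        short replication = 3;                 // matches the default replication factor
        int bufferSize = 4096;

        // Create the file with a per-file block size instead of relying on dfs.block.size.
        FSDataOutputStream out = fs.create(new Path("/MyDir/bigfile.dat"),
                true, bufferSize, replication, blockSize);
        out.writeBytes("sample data");
        out.close();
        fs.close();
    }
}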

6. What is the significance of fault tolerance and high throughput in HDFS?

Answer :- Fault tolerance :- When we store a file in HDFS, it is automatically replicated at two other locations as well. So even if one or two of those systems fail, the file is still available on the remaining system.

Throughput :- Throughput is the amount of work done per unit of time. In HDFS, when we want to perform a task or an action, the work is divided and shared among different systems, so all the systems execute their assigned parts independently and in parallel. The work is therefore completed in a very short period of time, which is how HDFS achieves high throughput.


7.What does "Replication factor" mean in Hadoop? What is default replication factor in HDFS ? How to modify default replication factor in HDFS ?

Answer :- The number of times a file is replicated in HDFS is termed the replication factor.

The default replication factor in HDFS is 3. Changing the dfs.replication property in hdfs-site.xml changes the default replication for all files subsequently placed in HDFS.

The actual number of replicas can also be specified when a file is created; the default is used if no replication factor is given at create time.

We can change the replication factor on a per-file basis, or for all files under a directory, using the Hadoop FS shell:

$ hadoop fs -setrep -w 3 /MyDir/file

$ hadoop fs -setrep -R -w 3 /RootDir/Mydir
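
The same change can also be made programmatically. Below is a minimal sketch (the path is illustrative) using FileSystem.setReplication from the HDFS Java API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChangeReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Ask the NameNode to keep 3 replicas of this file's blocks.
        boolean changed = fs.setReplication(new Path("/MyDir/file"), (short) 3);
        System.out.println("Replication change accepted: " + changed);
        fs.close();
    }
}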

8. What are the DataNode and NameNode in HDFS?

Answer :- DataNodes are the slaves, deployed on each machine, that provide the actual storage. They are responsible for serving read and write requests from clients.

The NameNode is the master node (on which the JobTracker also runs) and it stores metadata about where data blocks are actually stored, so that it can manage the blocks present on the DataNodes. The NameNode should never be commodity hardware, because the entire HDFS relies on it; it has to be a high-availability machine.
