This tutorial on Hadoop MapReduce performance tuning covers best practices that will help you improve your Hadoop cluster performance and get the best results from your MapReduce programs. It covers concepts such as memory tuning in Hadoop, map disk spill in Hadoop, tuning mapper tasks, speculative execution in Hadoop, and other related topics.

2. Introduction to Hadoop MapReduce Performance Tuning
Performance tuning in Hadoop helps you optimize your Hadoop cluster so that it delivers the best results for your MapReduce programs. To do this, you repeat the process below until the desired output is achieved in an optimal way:
Run Job → Identify Bottleneck → Address Bottleneck
The first step is to run the Hadoop job, identify the bottlenecks, and address them using the methods below to get the highest performance. Repeat these steps until the desired level of performance is achieved.
3. Tuning Hadoop run-time parameters
Hadoop provides many options for tuning CPU, memory, disk, and network usage. Most Hadoop tasks are not CPU-bound; what matters most is optimizing memory usage and disk spills.
a. Memory Tuning
The most general and common rule for memory tuning is: use as much memory as you can without triggering swapping. The parameter controlling task memory is mapred.child.java.opts, which can be set in your configuration file.
You can also monitor memory usage on the servers using Ganglia, Cloudera Manager, or Nagios for better memory performance.
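As a sketch, the task JVM heap could be set in mapred-site.xml as follows; the 2 GB value is purely illustrative and must be tuned to your nodes' physical memory:

```xml
<!-- mapred-site.xml: illustrative value only -->
<property>
  <name>mapred.child.java.opts</name>
  <!-- Max heap per map/reduce task JVM; keep the sum across
       concurrent tasks below physical RAM to avoid swapping -->
  <value>-Xmx2048m</value>
</property>
```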
b. Minimize the Map Disk Spill
Disk I/O is usually the performance bottleneck in Hadoop. There are a lot of parameters you can tune to minimize spilling, such as:
- Compression of mapper output
- Usage of 70% of heap memory in the mapper for the spill buffer
But do you think frequent spilling is a good idea?
It’s highly suggested not to spill more than once: each additional spill forces all the data to be re-read and re-written, roughly tripling the I/O.
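As an illustrative sketch, the spill-related settings above might look like this in mapred-site.xml (the buffer size is an example value, not a recommendation):

```xml
<!-- mapred-site.xml: illustrative spill-tuning values -->
<property>
  <name>mapred.compress.map.output</name>
  <!-- compress mapper output to cut disk and network I/O -->
  <value>true</value>
</property>
<property>
  <name>io.sort.mb</name>
  <!-- in-memory spill buffer size in MB; larger means fewer spills -->
  <value>256</value>
</property>
<property>
  <name>io.sort.spill.percent</name>
  <!-- start spilling to disk when the buffer is 70% full -->
  <value>0.70</value>
</property>
```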
c. Tuning mapper tasks
Unlike the number of reducer tasks, the number of mapper tasks is set implicitly. The most common way to tune mappers is to control the number of mappers and the size of each one. When dealing with large files, Hadoop splits the file into smaller chunks so that mappers can run on them in parallel. However, initializing a new mapper task usually takes a few seconds, which is overhead to be minimized. Below are the suggestions for the same:
- Reuse JVM tasks.
- Aim for map tasks running 1–3 minutes each. If the average mapper running time is less than one minute, increase mapred.min.split.size to allocate fewer mappers per slot and thus reduce the mapper initialization overhead.
- Use CombineFileInputFormat for a bunch of smaller files.
4. Tuning application-specific performance
a. Minimize your mapper output
Minimizing the mapper output can improve performance a lot, as mapper output is sensitive to disk I/O and network I/O, and puts memory pressure on the shuffle phase.
To achieve this, below are the suggestions:
- Filter the records on the mapper side instead of the reducer side.
- Use minimal data to form your map output key and map output value in MapReduce.
- Compress mapper output.
b. Balancing reducer’s loading
Unbalanced reducer tasks create another performance issue. Some reducers receive most of the mapper output and run extremely long compared to the other reducers.
Below are methods to address this:
- Implement a better hash function in the Partitioner class.
- Write a preprocessing job to separate keys using MultipleOutputs, then use another MapReduce job to process the special keys that cause the problem.
c. Reduce Intermediate Data with a Combiner in Hadoop
Implement a combiner to reduce the intermediate data, which enables faster data transfer.
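To illustrate the partitioner idea from the reducer-balancing section above, here is a minimal plain-Java sketch (no Hadoop dependencies; the class and method names are hypothetical) of the hash-mod scheme that Hadoop's default HashPartitioner uses:

```java
// Hypothetical sketch mirroring the logic of Hadoop's default
// HashPartitioner, in plain Java with no Hadoop dependencies.
public class PartitionSketch {

    // Maps a key to a reducer index in [0, numReduceTasks).
    // Masking with Integer.MAX_VALUE keeps the result non-negative
    // even when hashCode() returns a negative value.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] keys = {"apple", "banana", "cherry", "date"};
        int reducers = 3;
        for (String k : keys) {
            System.out.println(k + " -> reducer " + partitionFor(k, reducers));
        }
    }
}
```

If one key dominates the data, a skew-aware partitioner could override this scheme to spread that hot key's records across several reducers, for example by appending a random salt to the hot key before hashing.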
d. Speculative Execution
MapReduce jobs are impacted when tasks take a long time to finish executing. This problem is addressed by speculative execution, which backs up slow tasks on alternate machines. To enable it, set the configuration parameters mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution to true. This will reduce the job execution time if task progress is slow due to memory unavailability.
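As a sketch, speculative execution can be enabled in the job configuration like this (shown with the classic mapred.* property names; newer Hadoop releases use mapreduce.map.speculative and mapreduce.reduce.speculative instead):

```xml
<!-- mapred-site.xml: enable speculative execution for both phases -->
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>true</value>
</property>
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>true</value>
</property>
```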
5. Conclusion
There are several performance tuning tips and tricks for a Hadoop cluster, and we have highlighted some of the important ones. For more ways to improve Hadoop cluster performance, check Job optimization techniques in Big Data.