Data Lake 3.0, Part 2: A Multi-Colored YARN

Thank you for reading our Data Lake 3.0 series! In part 1 of the series, we briefly introduced the power of leveraging prepackaged applications in Data Lake 3.0 and how the focus will shift from platform management to solving business problems. In this post, we elaborate on this idea and explain how a multi-colored YARN will play a critical role in building a successful Data Lake 3.0.

Introduction

Apache Hadoop™ YARN is the modern resource-management platform that enables applications to share a common infrastructure of servers and storage. YARN is now morphing into a multi-colored platform of choice! YARN’s vision has always been to enable Hadoop to run many different workloads. The next steps in the journey are about dialing up the workload diversity and making the creation and deployment of modern data apps easy. Without further ado, let’s first recap how YARN has acted as the platform of choice thus far, before elaborating on the evolution of a multi-colored YARN as part of Data Lake 3.0.

Towards a multi-colored YARN: Apps, Services and Assemblies

Data Lake 2.0: YARN as the Platform

Apache Hadoop YARN is built as a general-purpose resource management platform. YARN’s core concepts are applications, containers, and resources. A container is a virtualized execution environment where a set of processes or tasks utilize the physical resources of the underlying machine. Administrators set up a cluster of machines to support multiple such containers. Users then write applications, each consisting of a set of tasks or processes executing in a collection of containers.

“YARN’s core concepts: applications, containers and resources.”
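To make the three concepts concrete, here is a minimal Python sketch of how they relate to one another. It is a toy model, not YARN’s actual API: the class names, the `allocate` function and the first-fit placement rule are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Resource:
    """A bundle of physical resources (hypothetical: memory and vcores only)."""
    memory_mb: int
    vcores: int

    def fits_in(self, other: "Resource") -> bool:
        return self.memory_mb <= other.memory_mb and self.vcores <= other.vcores

    def minus(self, other: "Resource") -> "Resource":
        return Resource(self.memory_mb - other.memory_mb, self.vcores - other.vcores)


@dataclass
class Node:
    """A machine set up by an administrator, with some free capacity."""
    name: str
    available: Resource


@dataclass
class Container:
    """An execution environment holding a slice of one node's resources."""
    app_id: str
    node: str
    resource: Resource


@dataclass
class Application:
    """A user application: a set of tasks running in a collection of containers."""
    app_id: str
    containers: List[Container] = field(default_factory=list)


def allocate(app: Application, request: Resource, nodes: List[Node]) -> Optional[Container]:
    """Toy first-fit allocation (an assumption, not YARN's real scheduling policy)."""
    for node in nodes:
        if request.fits_in(node.available):
            node.available = node.available.minus(request)
            container = Container(app.app_id, node.name, request)
            app.containers.append(container)
            return container
    return None  # no node currently has enough free capacity


if __name__ == "__main__":
    cluster = [Node("n1", Resource(4096, 4)), Node("n2", Resource(8192, 8))]
    app = Application("app-1")
    for ask in (Resource(3072, 2), Resource(2048, 1)):
        print(allocate(app, ask, cluster))
```

The point of the sketch is only the relationship: resources belong to nodes, containers carve out a share of a node’s resources, and an application is the set of containers it has been granted.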

Making use of these concepts of applications and containers, YARN has been used successfully to run all sorts of data applications. These applications can all coexist on a shared infrastructure managed through YARN’s centralized scheduling.

YARN is being used in production at a wide variety of organizations to host a wide variety of data-intensive applications such as batch workloads (Hadoop MapReduce), interactive query processing (Apache Hive, Apache Tez, Apache Spark) and real-time processing (Apache Storm). For those of you who are familiar with its history, YARN originated out of a need to evolve Hadoop to support not just MapReduce but any arbitrary processing engine. As the engines mentioned above came to the fore over time, YARN’s core architectural design has served their needs well, requiring only occasional incremental improvements. Over the years, YARN has easily supported a wide spectrum of frameworks.

YARN’s key strengths

The power of YARN is not limited to just enabling all these different programming paradigms on shared datasets (typically over a distributed storage system like HDFS) and physical hardware. YARN brings to the table a variety of platform features that users rely on for an end-to-end big data success story. YARN can apply its key strengths (cost-effective resource management, powerful scheduling primitives, resource isolation and multi-tenancy) to a myriad of resources, varying from small pools of special-purpose machines to datacenter-scale infrastructure built out of commodity hardware.

Data Lake 3.0: A Multi-Colored YARN

YARN is the Data Operating System that powers our Data Lake 3.0 vision. While YARN initially focused on large-scale but short-running apps (often referred to simply as jobs), it is also the perfect platform to run long-running services as well as apps that have a mix of both. YARN’s scheduler and its key abstractions are general enough to support a variety of applications, including batch jobs, longer-running streaming apps and classical services. However, what sets YARN apart from other platforms is its special support for data-intensive applications.

The scheduler supports rich placement strategies like data locality so that applications can be placed close to their data. It has sophisticated algorithms to allow efficient and incremental exchange of data locality information from data-intensive applications as they progress through their multiple parallel phases, while dealing with the fact that data is often also replicated. YARN offers a distributed cache for caching both binaries and data on the local machines where the real work is done. YARN also provides local temporary storage that data processing apps can leverage. This is also useful for longer-running service-like applications.
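As a rough illustration of locality-aware placement over replicated data, the Python sketch below picks a node for a task by preferring a node that holds a replica, then a node on the same rack as a replica, then any free node. The function name `place_task`, its parameters and the level labels are hypothetical, and the logic is a deliberately simplified stand-in for what a real scheduler does.

```python
from typing import Dict, List, Optional, Set, Tuple


def place_task(
    replica_nodes: Set[str],
    free_nodes: List[str],
    rack_of: Dict[str, str],
) -> Tuple[Optional[str], str]:
    """Pick a node for a task that reads one replicated data block.

    replica_nodes: nodes holding a copy of the block (data is replicated)
    free_nodes:    nodes that currently have capacity, in preference order
    rack_of:       node name -> rack name, for every node mentioned above
    """
    # 1. Best case: run on a node that already stores a replica.
    for node in free_nodes:
        if node in replica_nodes:
            return node, "node-local"

    # 2. Next best: run on a node sharing a rack with some replica.
    replica_racks = {rack_of[n] for n in replica_nodes}
    for node in free_nodes:
        if rack_of[node] in replica_racks:
            return node, "rack-local"

    # 3. Fallback: any free node; the data crosses racks.
    if free_nodes:
        return free_nodes[0], "off-rack"

    return None, "unplaced"


if __name__ == "__main__":
    racks = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r3"}
    replicas = {"n1", "n3"}
    for free in (["n2", "n3"], ["n2", "n4"], ["n4"], []):
        print(free, "->", place_task(replicas, free, racks))
```

Because a block has several replicas, the first step has more than one chance to succeed, which is one reason replication and locality-aware scheduling work well together.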

Beyond YARN’s inherent capabilities for handling data-intensive applications, we are seeing significant signals of a perfect storm enabled by two major drivers. On the business front, our advanced users are looking to solve end-to-end business problems as the next phase in the big data maturity curve. On the technology front, we are seeing the wide adoption of containerized workloads that provide ease of distribution, packaging and isolation. We discuss both drivers next.

Market Drivers: End-to-end business use-cases

Let’s revisit how the Hadoop ecosystem has been built over time. Since the beginning, the Apache ecosystem has focused on singular storage and compute engines, each addressing a specific problem in the larger big-data space. This is akin to the Unix mantra of “doing one thing and doing it well”. So far, this approach has served the developer community and the user base well. Developers could focus on a single problem (or set of problems) with undivided attention and solve it all the way through. Users could then bring together these different but ultimately well-integrated tools to address their business use-cases.

Use-cases evolution

During the past few years though, end-to-end business use-cases have evolved to another level.

End-to-end business problems are now mostly solved by multiple applications working together. As the platform matured, users increasingly want to focus solely on the business application layers, and are impatient to get on with developing their main business logic. However, YARN, and for that matter any other related platform, hasn’t catered to this evolving need, leaving users to reluctantly get involved in the painstaking details of wiring applications together, keeping them up, manually scaling them as the need arises, and so on. Manual plumbing of all these different-colored services is tiresome! Further, there is a clear need for seamless aggregate deployment, lifecycle management and application wire-up. This is the gap that needs to be bridged between what these end-to-end business use-cases need from the platform and what the platform offers today. If these features are provided, then th
