Thank you for reading our Data Lake 3.0 series! In part 1 of the series, we briefly introduced the power of leveraging prepackaged applications in Data Lake 3.0 and how the focus will shift from platform management to solving business problems. In this post, we elaborate on this idea to help answer questions on how a multi-colored YARN will play a critical role in building a successful Data Lake 3.0.
Introduction
Apache Hadoop™ YARN is the modern resource-management platform that enables applications to share a common infrastructure of servers and storage. YARN is now morphing into a multi-colored platform of choice! YARN’s vision has always been to enable Hadoop to run many different workloads. The next steps in the journey are about dialing up the workload diversity and making the creation and deployment of modern data apps easy. Without further ado, let’s first recap how YARN has acted as the platform of choice thus far, before elaborating on the evolution of a multi-colored YARN as part of Data Lake 3.0.

Towards a multi-colored YARN: Apps, Services and Assemblies
Data Lake 2.0: YARN as the Platform
Apache Hadoop YARN is built as a general-purpose resource management platform. YARN’s core concepts are applications, containers, and resources. A container is a virtualized execution environment where a set of processes or tasks utilize the physical resources of the underlying machine. Administrators set up a group of machines to support multiple such containers. Users then write applications, each of which is a set of tasks or processes executing in a collection of containers.
“YARN’s core concepts: applications, containers and resources.”
Making use of these concepts of applications and containers, YARN has been used successfully to run all sorts of data applications. These applications can all coexist on a shared infrastructure managed through YARN’s centralized scheduling.
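To make these three concepts concrete, here is a minimal toy sketch in Python. It is not the YARN API: the names `Resource`, `Node`, `Container` and `ToyScheduler`, the memory and vcore units, and the first-fit placement rule are all assumptions made for illustration. It only shows how two different applications can be handed containers on the same shared machines by one central scheduler.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Resource:
    """A bundle of physical resources (hypothetical units)."""
    memory_mb: int
    vcores: int

    def fits_in(self, other: "Resource") -> bool:
        return self.memory_mb <= other.memory_mb and self.vcores <= other.vcores

    def minus(self, other: "Resource") -> "Resource":
        return Resource(self.memory_mb - other.memory_mb, self.vcores - other.vcores)


@dataclass
class Node:
    """A machine that an administrator has added to the shared pool."""
    name: str
    free: Resource


@dataclass
class Container:
    """A slice of one node's resources, granted to one application."""
    app: str
    node: str
    resource: Resource


@dataclass
class ToyScheduler:
    """Central scheduler: simple first-fit, an assumption for this sketch."""
    nodes: list
    containers: list = field(default_factory=list)

    def allocate(self, app: str, request: Resource):
        """Place one container on the first node with enough free capacity."""
        for node in self.nodes:
            if request.fits_in(node.free):
                node.free = node.free.minus(request)
                container = Container(app, node.name, request)
                self.containers.append(container)
                return container
        return None  # no node can host the request right now


scheduler = ToyScheduler([
    Node("node-1", Resource(8192, 4)),
    Node("node-2", Resource(8192, 4)),
])

# Two different applications share the same machines.
batch = [scheduler.allocate("batch-job", Resource(4096, 2)) for _ in range(3)]
stream = scheduler.allocate("stream-app", Resource(2048, 1))
too_big = scheduler.allocate("batch-job", Resource(16384, 1))

print([c.node for c in batch])  # ['node-1', 'node-1', 'node-2']
print(stream.node)              # node-2
print(too_big)                  # None
```

In this sketch the batch job fills node-1 and spills onto node-2, where the streaming app then lands next to it. The oversized request is refused because no single node can satisfy it.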
YARN is being used in production at a wide variety of organizations to host a wide variety of data-intensive applications such as batch workloads (Hadoop MapReduce), interactive query processing (Apache Hive, Apache Tez, Apache Spark) and real-time processing (Apache Storm). For those of you who are familiar with its history, YARN originated out of a need to evolve Hadoop to support not just MapReduce but any arbitrary processing engine. As more of the engines mentioned above came to the fore over time, YARN’s core architectural design has served their needs well, requiring only occasional incremental improvements. Over the years, YARN has easily supported a wide spectrum of frameworks.

YARN’s key strengths
The power of YARN is not limited to just enabling all these different programming paradigms on shared datasets (typically over a distributed storage system like HDFS) and physical hardware. YARN brings to the table a variety of platform features that users rely on for an end-to-end big data success story. YARN can apply its key strengths (cost-effective resource management, powerful scheduling primitives, resource isolation and multi-tenancy) to a myriad of resources, varying from small pools of special-purpose machines to datacenter-scale infrastructure built out of commodity hardware.
Data Lake 3.0: A Multi-Colored YARN
YARN is the Data Operating System that powers our Data Lake 3.0 vision. While YARN initially focused on large-scale but short-running apps (often referred to simply as jobs), it is also the perfect platform to run long-running services as well as apps that have a mix of both. YARN’s scheduler and its key abstractions are general enough to support running a variety of applications, including batch, longer-running streaming and classical services. However, what separates YARN from others is its special support for data-intensive applications.
The scheduler supports rich placement strategies like data locality so that applications can be placed close to their data. It has sophisticated algorithms to allow efficient and incremental exchange of data locality information from data-intensive applications as they progress through their multiple parallel phases, while dealing with the fact that data is often also replicated. YARN offers a distributed cache for caching both binaries and data on the local machines where the real work is done. YARN also provides local temporary storage that data processing apps can leverage. This is also useful for longer-running service-like applications.
Extending YARN’s inherent capabilities to handle data-intensive applications, we are seeing significant signals of a perfect storm enabled by two major drivers. On the business front, our advanced users are looking to solve end-to-end business problems as the next phase in the big data maturity curve. On the technology front, we are seeing the wide adoption of containerized workloads that provide ease of distribution, packaging and isolation. We next discuss both of these drivers in more detail.
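As a brief aside on the data-locality placement described above, the preference for running work close to replicated data can be sketched as a three-level fallback. This is an illustrative toy, not YARN's scheduler code: the function `pick_node`, the host and rack names, and the scan order are assumptions, and the sketch ignores queues, capacities and any waiting for a better-placed slot. The level names follow common Hadoop usage.

```python
def pick_node(replica_hosts, free_nodes, rack_of):
    """Return (node, locality_level) for one task.

    replica_hosts: hosts holding a copy of the task's input block.
    free_nodes:    hosts that currently have spare capacity, in scan order.
    rack_of:       mapping host -> rack name.
    """
    # 1. Node-local: run on a host that already stores the data.
    for node in free_nodes:
        if node in replica_hosts:
            return node, "node-local"
    # 2. Rack-local: run on the same rack as one of the replicas.
    replica_racks = {rack_of[h] for h in replica_hosts}
    for node in free_nodes:
        if rack_of[node] in replica_racks:
            return node, "rack-local"
    # 3. Off-switch: any free host; the data has to cross racks.
    if free_nodes:
        return free_nodes[0], "off-switch"
    return None, None


# Hypothetical cluster layout: four hosts on three racks.
rack_of = {"h1": "r1", "h2": "r1", "h3": "r2", "h4": "r3"}
replicas = ["h1", "h3"]  # the input block is replicated on two hosts

print(pick_node(replicas, ["h4", "h3"], rack_of))  # ('h3', 'node-local')
print(pick_node(replicas, ["h4", "h2"], rack_of))  # ('h2', 'rack-local')
print(pick_node(replicas, ["h4"], rack_of))        # ('h4', 'off-switch')
```

Because the block is replicated, either replica host counts as node-local, which is why replication gives the scheduler more good placement options rather than fewer.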
Market Drivers: End-to-end business use-cases
Let’s revisit the historical way the Hadoop ecosystem has been built over time. Since the beginning, the Apache ecosystem has focused on singular storage and compute engines, each addressing a specific problem in the larger big-data space. This is akin to the Unix mantra of “doing one thing and doing it well”. So far, this approach has served the developer community and the user base well. The developer community could zoom in on a single problem (or set of problems) with undivided attention and solve it all the way through. Users could then bring these different tools, which ultimately work well together, to bear on their business use-cases.
Use-cases evolution
During the past few years, though, end-to-end business use-cases have evolved to another level.
The end-to-end business problems are now mostly solved by multiple applications working together. As the platform matured, users have increasingly wanted to focus solely on the business application layers, and are getting impatient to get on with developing their main business logic. However, YARN, and for that matter any other related platform, hasn’t catered to this evolving need, leaving users to unwillingly get involved in the painstaking details of wiring applications together, keeping them up, manually scaling them as the need arises, and so on. Manual plumbing of all these different colored services is tiresome! Further, there is a clear need for seamless aggregate deployment, lifecycle management and application wire-up. This is the gap that needs to be bridged between what these end-to-end business use-cases need from the platform and what the platform offers today. If these features are provided, then th