
Solving Common BI on Hadoop Performance Pitfalls

Introduction

Your company recently made the jump to big data, and now that a few months, or even years, have passed, you think your team has a firm grip on getting data in and out of the environment, right? If you are thinking this, you are probably also discovering that it isn't easy to hook one of your Business Intelligence (BI) tools to this data, query only the data you really want (i.e., a subset), and produce fast, responsive results. These three simple tasks are almost universal to any BI request, yet they can wreak havoc on teams in the big data age: so many services are available, but none of them solves all three.

Oftentimes this leads us elsewhere: purchasing expensive third-party alternatives to manipulate, display, and provide meaningful access to the data for our users and ourselves. And in most cases, what are we left with? The sad truth is that we end up supporting multiple products, or going back to what is familiar, like SQL Server or newer in-memory alternatives such as Microsoft's Analysis Services Tabular models; neither of which we want to do.

In this article, we’ll explore some of the ways to answer the question of how to do BI on big data, and compare the alternatives with Jethro’s approach. I have explored, and even implemented (sad to admit), most of the alternatives mentioned in this piece before big data grew into what is available now. So please enjoy, feel free to provide input or email comments, and share your experiences with me and the rest of the Jethro team so we can advise on similar situations and help you avoid the pitfalls described here.

Core Hadoop

One of the most common, and frankly naïve, ways people approach BI with Hadoop, and one that generally ends in failure, is using core Hadoop to power BI reporting with tools such as Tableau, SSRS, or some other third-party system like Business Objects. When we talk about core Hadoop, we are referring to the following services:

HDFS
MapReduce/MR2
YARN
Hive/HiveServer2
ZooKeeper

Now don’t get me wrong, MR2/YARN and the others listed above are excellent at what they are meant to do, but one of these services is often misrepresented as a multipurpose solution that it is not: Hive. The purpose of Hive is simple: batch processing in and out, and that’s it. Working with a company a few years back, there weren’t many alternatives, and I remember seeing demos with Hive tables and views as the source for a Tableau dashboard. At the time, we were looking for ways to get data from Hive into Tableau more quickly and ultimately settled on SQL Server Analysis Services Tabular (more on this later). I can clearly remember asking the presenter to manipulate filters over a screen share, and then the awkward waiting as one, two, five minutes passed while, unbeknownst to the audience, a MapReduce job ran in the background to fetch the new data.

One of the major problems with this solution is that Tableau, and other tools that rely on live queries to refresh data, cause MapReduce jobs to be launched every time users explore the data or change the initial result set. There are clever ways around this (extracting all of the data up front, etc.), but oftentimes the cost of doing so is not worth it.
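To make this concrete, here is a rough sketch of the kind of HiveQL a live connection ends up issuing whenever a user touches a filter; the table and column names are hypothetical, not from the project described here. With Hive running on the MapReduce engine, each query like this becomes a fresh MapReduce job over the full table.

    -- Hypothetical query a live Tableau connection might generate when a
    -- user changes a date filter; with Hive on MapReduce, this launches a
    -- new MR job that scans the fact table every time.
    SELECT region,
           SUM(sale_amount) AS total_sales
    FROM   sales_fact          -- hypothetical fact table, tens of millions of rows
    WHERE  order_date >= '2016-01-01'
      AND  order_date <  '2016-04-01'
    GROUP  BY region;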

For example, take the connection below to a table with ~75M rows:


[Screenshot: Connecting Tableau to Hive + Spark]

The screenshot illustrates how long the wait was for metadata generation (I killed it after two minutes). One could argue that limiting the import would fix this issue, i.e., with a LIMIT 500,000 I could pull the data through, and that is true. But what do you do when your smallest dataset starts at 25M rows, 50M rows, or greater?
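For reference, that workaround is nothing more than capping the pull, something like the sketch below (the table name is hypothetical). It gets a demo working, but it stops being an option once the smallest slice of data you actually need is tens of millions of rows.

    -- Capping the import so metadata generation and the initial extract
    -- finish quickly; fine for a demo, not for a 25M+ row requirement.
    SELECT *
    FROM   sales_fact      -- hypothetical fact table
    LIMIT  500000;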

You can spend many hours attempting to make this work with partitions, shrinking data into smaller tables, or other downstream tasks (sketched below), but none of it will prove sustainable or provide the speed your customers expect.
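For illustration, those workarounds usually look something like the HiveQL sketch below; the table layout, names, and ORC format are assumptions, not an actual project schema.

    -- Hypothetical partitioned layout: prune scans by date instead of
    -- reading the whole fact table on every dashboard interaction.
    CREATE TABLE sales_fact_part (
        customer_id  BIGINT,
        region       STRING,
        sale_amount  DECIMAL(18,2)
    )
    PARTITIONED BY (order_date STRING)
    STORED AS ORC;

    -- Downstream job that shrinks the data into a smaller reporting table.
    CREATE TABLE sales_daily_agg
    STORED AS ORC
    AS
    SELECT order_date,
           region,
           SUM(sale_amount) AS total_sales
    FROM   sales_fact_part
    GROUP  BY order_date, region;

Each of these helps one specific dashboard, but every new report tends to need its own partition scheme or aggregate table, which is exactly the maintenance treadmill described above.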

Retro Solutioning

Retro Solutioning, which may be a coined term (not sure, to be honest), is what happens when you don’t know enough about the Hadoop services beyond the core features and decide to stick with what’s “safe.” The resulting solution uses technologies your team already has and is comfortable with, and it is usually the second line of attack when the initial plan doesn’t quite work out (see the previous section).

Although there are valid situations that call for using existing technology, and every company and situation is unique, more often than not this approach means additional work and, by nature, additional processes for you to manage. Not only does it negate the purpose of Hadoop and big data, but it wastes your team’s time and resources and is likely to become a band-aid solution instead of a scalable one.

Having been a Microsoft SQL developer for 10+ years before getting into big data, I felt this example, drawn from my own experience, was very relevant here: it involves using Microsoft SQL Server and its Tabular Analysis Services features to make up for less-than-desired performance out of Hive. We’ll start by outlining the hypothetical project, then go into the implemented solution and the results. As you read through it, think about how you’d approach the problem and whether your approach lines up (if it doesn’t, you may have done it right!).

The “Hypothetical” Project

Project: Reporting Dashboards

Original Systems:

Microsoft SQL Server (DB)

Microsoft .NET for dashboards (yikes, I know)

Microsoft Reporting Services

Project Overview:

Take existing reports that have an MS SQL Server database source and upgrade them to Tableau, deprecating any use of .NET as a dashboarding interface and removing any reporting dependency on MS Reporting Services.

Project Requirements:

All reports should load in <10 seconds on initial browse
All reports should respond to filter changes in <10 seconds
The data source should hold at least 1 year of data for users to pull into active reports/dashboards
The solution should handle a user base of 1,000 users, with 200 concurrent

The total data size is 225 million rows.

Seems straightforward enough, right? The answer: Maybe.

Knowing that plugging your reporting tool (Tableau, in this case) directly into HiveServer2 (HS2) is a no-no for BI when a performance SLA is involved, we needed to solve this quickly. Since we had SSAS talent and owned pretty beefy machines, we decided to aggregate where possible and push the data into memory using in-memory Tabular SSAS. Acknowledging that there would be overhead in developing the data exports from Hive using ODBC and SQL Linked Servers, and that data would be duplicated between HDFS and Tabular, we set off on our journey. What hurt us the most was the time it took to extract data from Hive into the in-memory area for SSAS; we were hit with almost the same overhead as if we had just connected Tableau directly to HS2.
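As a rough illustration of that export path, here is a T-SQL sketch of the linked-server-plus-OPENQUERY staging step being described; the linked server name, DSN, and table names are hypothetical, and your Hive ODBC driver setup will differ.

    -- Hypothetical linked server over a Hive ODBC DSN (MSDASQL is the
    -- OLE DB provider for ODBC that SQL Server uses for this).
    EXEC master.dbo.sp_addlinkedserver
         @server     = N'HIVE_LS',        -- hypothetical linked server name
         @srvproduct = N'Hive',
         @provider   = N'MSDASQL',
         @datasrc    = N'HiveODBC_DSN';   -- hypothetical system DSN

    -- Pull the Hive result set over ODBC into a local staging table that
    -- the Tabular SSAS model is then processed from.
    SELECT *
    INTO   Staging.dbo.sales_fact_stage
    FROM   OPENQUERY(HIVE_LS, 'SELECT customer_id, region, sale_amount FROM sales_fact');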

Here is a breakdown of the different systems via SQL Linked Server and ODBC. Note: neither Impala nor Jethro was available to us at the time because we didn’t even consider exploring them!


[Screenshot: query time breakdown by system via SQL Linked Server and ODBC]
A couple of things to note from the above: 1. The query times are for importing data into SSAS via OPENQUERY. 2. Each result set came from an 80M-row fact table unrelated to the one in our hypothetical project.
