Big Data Use Cases: Real-life Case Studies for Apache Hadoop, Spark and Flink

1. Objective

In this tutorial we will look at real-life case studies of Big Data, Hadoop, Apache Spark and Apache Flink. It covers the diverse use cases where industry applies different Big Data tools (such as Hadoop, Spark and Flink) to solve specific problems, so you can learn from real case studies how Big Data problems are solved. Apache Flink use cases focus mainly on real-time analytics, Apache Spark use cases on complex iterative machine learning algorithms, and Apache Hadoop use cases on handling huge volumes of data efficiently.


2. Credit Card Fraud Detection

With millions of people using credit cards nowadays, it has become essential to protect them from fraud. The challenge for credit card companies is to identify whether a requested transaction is fraudulent or not.

A credit card transaction takes barely 2-4 seconds to complete, so companies need an innovative solution that can identify potentially fraudulent transactions within this short window and protect their customers from becoming victims.

An abnormal number of clicks from the same IP address, or a pattern in the access times, is the most obvious and easily identified form of click fraud, yet it is surprising how many fraudsters still use this method, particularly for quick attacks. They may choose to strike over a long weekend, when they figure you may not be watching your log files carefully, clicking on your ad repeatedly so that when you return to work on Tuesday your account is significantly depleted. Part of this fraud may even be unintentional, as when a user simply reloads a page.

Similarly, if you make a transaction in Mumbai today and the very next minute there is a transaction on your card in Singapore, there is a good chance that the second transaction is fraudulent and was not done by you. Companies therefore need to process the data in real time (Data in Motion analytics, or DIM), analyse it against the individual's history within a very short span of time, and decide whether the transaction is actually fraud. They can then accept or decline the transaction based on the severity.

To process such data streams we need a streaming engine like Apache Flink. A streaming engine can consume real-time data streams very efficiently and process them with low latency (virtually no delay). Follow this Flink tutorial to learn more about Apache Flink.
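To make this concrete, here is a minimal sketch of such a job using the Flink DataStream API in Java. It is not the article's implementation: the Transaction type, its fields and the one-minute threshold are illustrative assumptions, and a real job would read from a source such as Kafka and send alerts to a decision service rather than printing them.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class FraudDetectionJob {

    // Hypothetical transaction record: card id, city of purchase, event time in ms.
    public static class Transaction {
        public String cardId;
        public String city;
        public long timestamp;
        public Transaction() {}
        public Transaction(String cardId, String city, long timestamp) {
            this.cardId = cardId;
            this.city = city;
            this.timestamp = timestamp;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // In a real job this would be a Kafka (or similar) source of card transactions.
        DataStream<Transaction> transactions = env.fromElements(
                new Transaction("card-1", "Mumbai", 0L),
                new Transaction("card-1", "Singapore", 30_000L));

        transactions
            .keyBy(tx -> tx.cardId)                 // analyse each card's history independently
            .process(new LocationJumpDetector())
            .print();                               // alerts would normally go to a sink/decision service

        env.execute("Credit Card Fraud Detection (sketch)");
    }

    // Flags a transaction that happens in a different city than the previous one
    // within one minute -- the "Mumbai now, Singapore a minute later" case.
    public static class LocationJumpDetector
            extends KeyedProcessFunction<String, Transaction, String> {

        private transient ValueState<Transaction> lastTx;

        @Override
        public void open(Configuration parameters) {
            lastTx = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("lastTx", Transaction.class));
        }

        @Override
        public void processElement(Transaction tx, Context ctx, Collector<String> out)
                throws Exception {
            Transaction prev = lastTx.value();
            if (prev != null
                    && !prev.city.equals(tx.city)
                    && tx.timestamp - prev.timestamp < 60_000) {
                out.collect("Possible fraud on card " + tx.cardId
                        + ": " + prev.city + " -> " + tx.city + " within a minute");
            }
            lastTx.update(tx);
        }
    }
}
```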

3. Sentiment Analysis

Sentiment analysis provides substance behind social data. A basic task in sentiment analysis is classifying the polarity of a given text at the document, sentence, or feature/aspect level ― whether the expressed opinion in a document, a sentence or an entity feature/aspect is positive, negative, or neutral. Advanced, “beyond polarity” sentiment classification looks, for instance, at emotional states such as “angry,” “sad,” and “happy.”

In sentiment analysis, language is processed to identify and understand consumer feelings and attitudes towards brands or topics in online conversations, i.e. what people think about a particular product or service, whether they are happy with it, and so on.

For example, if a company is launching a new product, it can find out what its customers think about it. Whether they are satisfied with it, or would like some modifications, can be discovered by applying sentiment analysis to Big Data. The company can then act accordingly, modifying or improving the product to increase sales and keep customers happy.

Below is a real example of sentiment analysis:

A large airline started monitoring tweets about its flights to see how customers felt about upgrades, new planes, entertainment, and so on. Nothing special there, except that they then began feeding this information into their customer support platform and resolving issues in real time.

One memorable instance occurred when a customer tweeted negatively about lost luggage before boarding his connecting flight. The airline picked up the tweet and offered him a free first-class upgrade on the way back. They also tracked the luggage, told him where it was and where it would be delivered. Needless to say, he was pretty stoked about it and tweeted like a happy camper throughout the rest of his trip.

With Hadoop, you can mine Twitter, Facebook and other social media conversations for sentiment data about you and your competition, and use it to make targeted, real-time decisions that increase market share. With quick analysis of customer sentiment from social media, a company can take decisions and act immediately; it no longer has to wait for sales reports, which used to take six months or more, to run its business better.
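As a rough illustration of what such sentiment mining can look like, here is a minimal Spark (Java) sketch that scores tweets stored on a Hadoop cluster with a tiny hand-made word list. The HDFS path and the word lists are assumptions made for the example; a production system would use a proper sentiment lexicon or a trained model rather than this simple scoring.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;
import java.util.List;

public class TweetSentiment {
    // Tiny stand-in lexicons; real systems use much richer word lists or models.
    private static final List<String> POSITIVE = Arrays.asList("good", "great", "love", "happy");
    private static final List<String> NEGATIVE = Arrays.asList("bad", "late", "lost", "angry");

    // Score one tweet: +1 per positive word, -1 per negative word.
    private static int score(String tweet) {
        int s = 0;
        for (String w : tweet.toLowerCase().split("\\s+")) {
            if (POSITIVE.contains(w)) s++;
            if (NEGATIVE.contains(w)) s--;
        }
        return s;
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Tweet Sentiment (sketch)");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Hypothetical HDFS directory of one tweet per line.
            JavaRDD<String> tweets = sc.textFile("hdfs:///data/tweets");

            long positive = tweets.filter(t -> score(t) > 0).count();
            long negative = tweets.filter(t -> score(t) < 0).count();
            long neutral  = tweets.count() - positive - negative;

            System.out.printf("positive=%d negative=%d neutral=%d%n",
                    positive, negative, neutral);
        }
    }
}
```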

4. Data Processing (Retail)

Let us now look at an application built for a leading retail client in India. The client received about 100 GB of invoice data daily, in XML format. Generating a report from this data with the conventional method took about 10 hours, and the client had to wait that long for the report.

The conventional method was developed in C, and its long running time made it infeasible; the client was not happy with it. The XML invoice data had to be transformed into a structured format before a report could be generated, which involved validation and verification of the data and the implementation of complex business rules.

In a world where results are expected to be available whenever required, waiting 10 hours was not acceptable. The client therefore approached the Big Data team of one of the vendors with the problem, hoping for a better solution; they would even have been satisfied if the time came down from 10 hours to around 5.

When the Big Data team came back with its solution, the client was amazed: the report that used to take 10 hours could now be produced in just 10 minutes using Big Data and Hadoop. The team processed the incoming data on a cluster of 10 nodes, bringing the processing time down to just 10 minutes. That gives you an idea of the speed and efficiency of Big Data in today's world.
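The article does not show the client's actual code, but a map-only Hadoop MapReduce job along the following lines illustrates the XML-to-structured-records step. The assumption that each invoice sits on one line, the attribute names, and the tab-separated output are all illustrative; the real job would also apply the client's validation and business rules.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InvoiceEtl {

    public static class InvoiceMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {

        // Pull the id and amount attributes out of one <invoice .../> line (assumed layout).
        private static final Pattern INVOICE =
                Pattern.compile("<invoice\\s+id=\"([^\"]+)\"\\s+amount=\"([^\"]+)\"");

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            Matcher m = INVOICE.matcher(value.toString());
            if (!m.find()) {
                return; // validation: drop malformed records in this sketch
            }
            // Emit a flat, tab-separated record ready for reporting.
            context.write(new Text(m.group(1) + "\t" + m.group(2)), NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "Invoice XML to TSV (sketch)");
        job.setJarByClass(InvoiceEtl.class);
        job.setMapperClass(InvoiceMapper.class);
        job.setNumReduceTasks(0);                    // transformation only, no aggregation
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```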

5. Orbitz.com

Orbitz is a leading travel company that uses the latest technologies to transform the way clients around the world plan travel. It operates the customer travel planning sites Orbitz, Ebookers and CheapTickets.

It handles 1.5 million flight searches and 1 million hotel searches daily, and this activity generates approximately 500 GB of log data. The raw logs were kept for only a few days because data warehousing was costly. Storing and analysing such huge volumes with conventional data warehouse infrastructure was becoming ever more expensive and time consuming.

For example, searching for hotels with the conventional approach, developed in Perl/Bash, required the extraction to be done serially. Processing and sorting hotels based on just the last 3 months of data took 2 hours, which is neither acceptable nor feasible today, when customers expect results at the click of a button.

This was again a very big problem, and it needed a solution to keep the company from losing customers. Orbitz needed a more scalable and cost-effective way to store and process this log data.
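A distributed engine removes the serial bottleneck described above. The sketch below, in Spark (Java), counts and ranks hotel searches from raw log files in parallel; the log layout (tab-separated fields with the hotel id in the third column) and the HDFS path are assumptions for illustration, not details from the article.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class HotelSearchRanking {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Hotel Search Ranking (sketch)");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Count searches per hotel across the last three months of logs (hypothetical path).
            JavaPairRDD<String, Integer> counts = sc
                    .textFile("hdfs:///logs/hotel-searches/last-3-months")
                    .mapToPair(line -> new Tuple2<>(line.split("\t")[2], 1))
                    .reduceByKey(Integer::sum);

            // Sort hotels by popularity, most searched first, and print the top 20.
            counts.mapToPair(Tuple2::swap)
                  .sortByKey(false)
                  .take(20)
                  .forEach(t -> System.out.println(t._2() + "\t" + t._1()));
        }
    }
}
```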
