
Pig Tutorial: Apache Pig Architecture & Twitter Case Study

APACHE PIG TUTORIAL

As we mentioned in our Hadoop Ecosystem blog, Apache Pig is an essential part of our Hadoop ecosystem. So now, I would like to take you through this Apache Pig tutorial, which is a part of our Hadoop Tutorial Series, and explain the importance of Apache Pig. In this Apache Pig Tutorial blog, I will talk about:

- Apache Pig vs MapReduce
- Introduction to Apache Pig
- Where to use Apache Pig?
- Twitter Case Study
- Apache Pig Architecture
- Pig Latin Data Model
- Apache Pig Schema

Fig: Apache Pig-Twitter Case Study (discussed later in this blog)

Before starting with the Apache Pig tutorial, I would like you to ask yourself a question: “While MapReduce was already there for data analysis, why did Apache Pig come into the picture?”

The sweet and simple answer to this is:

Approximately 10 lines of Pig code are equal to 200 lines of MapReduce code.

Writing MapReduce jobs in Java is not an easy task for everyone. A look at even a simple MapReduce program in Java makes the complexity clear. Thus, Apache Pig emerged as a boon for programmers who were not good with Java or Python. Even someone who knows Java and is good with MapReduce will often prefer Apache Pig because of how easy it is to work with. Let us take a look now.

Apache Pig Tutorial: Apache Pig vs MapReduce

Programmers face difficulty writing MapReduce tasks, as it requires Java or Python programming knowledge. For them, Apache Pig is a savior.

- Pig Latin is a high-level data flow language, whereas MapReduce is a low-level data processing paradigm.
- Without writing complex Java implementations in MapReduce, programmers can achieve the same results very easily using Pig Latin.
- Apache Pig uses a multi-query approach (i.e. a single Pig Latin query can accomplish multiple MapReduce tasks), which reduces the length of the code by about 20 times. This in turn reduces the development period by almost 16 times.
- Pig provides many built-in operators to support data operations like joins, filters, ordering, sorting, etc., whereas performing the same functions in MapReduce is a humongous task.
- Performing a join in Apache Pig is simple. In MapReduce it is difficult to join data sets, because it requires multiple MapReduce tasks to be executed sequentially to fulfill the job.
- In addition, Pig provides nested data types like tuples, bags, and maps that are missing from MapReduce. I will explain these data types in a while.
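To make the join comparison concrete, here is a minimal Python sketch of the work a hand-written reduce-side join has to do: tag each record with its source in the map phase, group by key in the shuffle, and pair the records up in the reduce phase. The data sets (users, tweets), field names, and function names are made up for illustration, and the code only simulates the phases in memory; it does not use the Hadoop API. The comment at the end shows roughly what the same join could look like as a single Pig Latin statement, assuming relations with those names already exist.

```python
from collections import defaultdict

# Hypothetical sample data: (user_id, name) and (user_id, tweet_text).
users = [(1, "alice"), (2, "bob")]
tweets = [(1, "hello pig"), (1, "hadoop rocks"), (2, "big data")]


def map_phase(users, tweets):
    """Emit (key, (source_tag, value)) so the reducer can tell the inputs apart."""
    for user_id, name in users:
        yield user_id, ("U", name)
    for user_id, text in tweets:
        yield user_id, ("T", text)


def shuffle_phase(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """For each key, pair every user record with every tweet record (inner join)."""
    for key in sorted(grouped):
        names = [value for tag, value in grouped[key] if tag == "U"]
        texts = [value for tag, value in grouped[key] if tag == "T"]
        for name in names:
            for text in texts:
                yield key, name, text


joined = list(reduce_phase(shuffle_phase(map_phase(users, tweets))))

# Roughly equivalent Pig Latin (assuming relations named users and tweets):
#   joined = JOIN users BY user_id, tweets BY user_id;
```

The point of the sketch is the amount of plumbing, not the exact code: in a real MapReduce job each phase would additionally need mapper and reducer classes, input formats, and job configuration.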

Now that we know why Apache Pig came into the picture, you may be curious to know what Apache Pig is. Let us move ahead in this Apache Pig tutorial blog and go through the introduction and features of Apache Pig.

Apache Pig Tutorial: Introduction

Apache Pig is a platform used to analyze large data sets by representing them as data flows. It is designed to provide an abstraction over MapReduce, reducing the complexities of writing a MapReduce program. We can perform data manipulation operations very easily in Hadoop using Apache Pig.

The features of Apache Pig are:

- Pig enables programmers to write complex data transformations without knowing Java.
- Apache Pig has two main components: the Pig Latin language and the Pig Run-time Environment, in which Pig Latin programs are executed.
- For data analysis, Pig gives a simple data flow language known as Pig Latin, which has functionalities similar to SQL like join, filter, limit, etc.
- Developers who work with scripting languages and SQL can leverage Pig Latin. This gives developers ease of programming with Apache Pig. Pig Latin provides various built-in operators like join, sort, filter, etc. to read, write, and process large data sets. Thus it is evident that Pig has a rich set of operators.
- Programmers write scripts using Pig Latin to analyze data, and these scripts are internally converted to Map and Reduce tasks by the Pig MapReduce Engine. Before Pig, writing MapReduce tasks was the only way to process the data stored in HDFS.
- If a programmer wants a custom function that is unavailable in Pig, Pig allows them to write User Defined Functions (UDFs) in one of several supported languages, such as Java, Python, Ruby, Jython, JRuby, etc., and embed them in a Pig script written in Pig Latin. This provides extensibility to Apache Pig.
- Pig can process any kind of data, i.e. structured, semi-structured or unstructured data, coming from various sources. Apache Pig handles all kinds of data.
- Approximately 10 lines of Pig code are equal to 200 lines of MapReduce code.
- It can handle inconsistent schema (in the case of unstructured data).
- Apache Pig extracts the data, performs operations on that data and dumps the data in the required format in HDFS, i.e. ETL (Extract, Transform, Load).
- Apache Pig automatically optimizes the tasks before execution, i.e. automatic optimization.
- It allows programmers and developers to concentrate on the whole operation instead of creating mapper and reducer functions separately.
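The data flow idea behind these features can be sketched in a few lines of Python: each step takes a data set, applies one operator, and hands a new named data set to the next step. This is only an in-memory illustration, not how Pig executes scripts; the web-log records, field layout, and helper function names are assumptions made up for the example. The trailing comment shows approximately how the same flow might be written in Pig Latin, with a placeholder input path.

```python
from collections import defaultdict

# Hypothetical web-log records: (user, url, status_code).
logs = [
    ("alice", "/home", 200),
    ("bob", "/home", 404),
    ("alice", "/search", 200),
    ("carol", "/home", 200),
]


def filter_rows(rows, predicate):
    """Keep only rows matching the predicate (compare: FILTER ... BY ...)."""
    return [row for row in rows if predicate(row)]


def group_rows(rows, key):
    """Collect rows into one list per key value (compare: GROUP ... BY ...)."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return dict(groups)


def count_groups(groups):
    """Count the rows in each group (compare: FOREACH ... GENERATE group, COUNT(...))."""
    return {key: len(rows) for key, rows in groups.items()}


# Each step names its result, so the script reads as a flow of data sets.
ok = filter_rows(logs, lambda row: row[2] == 200)
by_url = group_rows(ok, lambda row: row[1])
hits = count_groups(by_url)

# Approximate Pig Latin for the same flow ('logs' is a placeholder path):
#   logs   = LOAD 'logs' AS (user:chararray, url:chararray, status:int);
#   ok     = FILTER logs BY status == 200;
#   by_url = GROUP ok BY url;
#   hits   = FOREACH by_url GENERATE group, COUNT(ok);
```

Note that the grouped result holds a list of rows per key, which loosely mirrors the way a Pig GROUP produces a bag of tuples for each group.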

Now that we know what Apache Pig is, let us understand where we can use it and which use cases suit it best.

Apache Pig Tutorial: Where to use Apache Pig?

Apache Pig is used for analyzing and performing tasks involving ad-hoc processing. Apache Pig is used:

- Where we need to process huge data sets like web logs, streaming online data, etc.
- Where we need data processing for search platforms (where different types of data need to be processed). For example, Yahoo uses Pig for 40% of its jobs, including news feeds and the search engine.
- Where we need to process time-sensitive data loads. Here, data needs to be extracted and analyzed quickly. E.g. machine learning algorithms require time-sensitive data loads: Twitter needs to quickly extract data on customer activities (i.e. tweets, re-tweets and likes), analyze it to find patterns in customer behavior, and make recommendations immediately.
