Pig Scripting in Big Data - (PART 5)
In this article, we will discuss Apache Pig.
Pig
Pig is a scripting platform designed to process and analyze large datasets, and it runs on Hadoop clusters. Pig is extensible, easy to program, and self-optimizing.
Why Pig?
Before 2006, Hadoop programs were written only as MapReduce jobs in Java. Developers faced challenges such as:
- Writing code was difficult and verbose
- The data flow was rigid
- Common operations (joins, filters, grouping) had to be re-implemented for every job
- Too much effort went into programming fundamentals rather than the analysis itself
Pig was developed to overcome these challenges.
Features of Pig:
- Supports user-defined functions (UDFs) and a rich set of data types.
- Provides step-by-step procedural control.
- Schemas can be assigned dynamically (a short sketch of these features follows this list).
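As a rough illustration of these features, the sketch below registers a hypothetical UDF jar (myudfs.jar with class com.example.MyFunc) and assigns a schema at load time; the input file 'mytxt' is also assumed:
REGISTER 'myudfs.jar';                      -- hypothetical jar containing user-defined functions
DEFINE MyFunc com.example.MyFunc();         -- hypothetical UDF class from that jar
A = LOAD 'mytxt' AS (x:int, y:chararray);   -- schema assigned at load time
B = FOREACH A GENERATE x, MyFunc(y);        -- step-by-step, procedural data flow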
Pig Architecture
- Parser - It checks the syntax of the script and performs type checking.
- Optimizer - It performs activities like split, merge, transform, reorder, etc.
- Compiler - It compiles the optimized code into a series of MapReduce jobs.
- Execution Engine - It executes the MapReduce jobs on Hadoop to produce the desired results. (The EXPLAIN sketch after this list shows the plans these components generate.)
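A quick way to see what these components produce is Pig's EXPLAIN statement, which prints the logical, physical, and MapReduce plans for an alias. A minimal sketch, assuming the same hypothetical 'mytxt' input used later in this article:
A = LOAD 'mytxt' AS (x:int, y:chararray, z:chararray);
B = FILTER A BY x > 0;
EXPLAIN B;   -- prints the logical, physical, and MapReduce plans for B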
Stages of Pig Operations
- Load data and write Pig script:
A = LOAD 'mytxt' AS (x, y, z);
B = FILTER A BY x > 0;
STORE B INTO 'output';
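Without explicit types, Pig treats each loaded field as a bytearray. A typed variant of the same script (a sketch using the same hypothetical 'mytxt' input):
A = LOAD 'mytxt' AS (x:int, y:chararray, z:chararray);  -- explicit field types
B = FILTER A BY x > 0;                                  -- no implicit cast needed for x
STORE B INTO 'output';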
- Pig Operations:
1. Parses and checks the script
2. Optimizes the script
3. Plans execution
4. Submits to Hadoop
5. Monitors job progress
- Execution of the plan:
Results are stored in HDFS or dumped to the screen (both options are sketched below).
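As a small sketch of the two output options, reusing alias B from the script above:
STORE B INTO 'output';   -- writes the results to HDFS (or the local file system in local mode)
DUMP B;                  -- prints the results to the screen instead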
Data Model Supported by Pig
- Atom - A simple atomic value. Example: 'Steve'.
- Tuple - A sequence of fields that can be of any data type. Example: ('Steve', 142)
- Bag - A collection of tuples of potentially varying structures that can contain duplicates. Example: {('Steve'),('Roger',(45,78))}
- Map - An associative array; the key must be a chararray, but the value can be of any type. Example: [name#Steve, age#32]. (The schema sketch after this list shows these types in a LOAD statement.)
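A minimal sketch of how these types appear in a LOAD schema; the file 'people.txt' and its fields are hypothetical:
people = LOAD 'people.txt' AS (
    name:chararray,                                     -- atom
    scores:bag{t:tuple(subject:chararray, mark:int)},   -- bag of tuples
    info:map[]                                          -- map (chararray keys, any value type)
);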
Pig Execution Modes
- Local Mode - Pig runs in a single JVM and uses the local file system of the machine.
- MapReduce Mode - Pig runs on a Hadoop cluster and uses HDFS; this is the default mode. (Launch commands for both modes are sketched below.)
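The mode is typically chosen with the -x flag when launching Pig from the shell:
pig -x local       # local mode: single JVM, local file system
pig -x mapreduce   # MapReduce mode (the default): Hadoop cluster and HDFS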
Conclusion
- Pig is a high-level data-flow scripting platform with two major components: the Pig Latin language and the runtime engine.
- Pig runs in two execution modes: MapReduce and Local.
- Three prerequisites must be met before setting up the environment for Pig Latin: all Hadoop services are running properly, Pig is completely installed and configured, and all required datasets are uploaded to HDFS. (A sketch of how to verify these follows.)
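A quick sketch of how these prerequisites can be checked from the shell; the dataset name and HDFS path are examples:
jps                                  # confirm the Hadoop daemons (NameNode, DataNode, etc.) are running
pig -version                         # confirm Pig is installed and configured
hdfs dfs -put mytxt /user/hadoop/    # upload the example dataset to HDFS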