Hadoop & Mapreduce Tutorials | Pig Latin Commands

Pig Latin Commands

Pig Latin statements are the basic constructs you use to process data with Pig. A Pig Latin statement is an operator that takes a relation as input and produces another relation as output. (This definition applies to all Pig Latin operators except LOAD and STORE, which read data from and write data to the file system.) Statements may include expressions and schemas, can span multiple lines, and must end with a semicolon ( ; ). By default, Pig Latin statements are processed using multi-query execution. A Pig Latin program consists of a collection of statements, each of which can be thought of as an operation or a command. For example, a GROUP operation is a type of statement:

grouped_records = GROUP records BY year;

Pig Latin statements are generally organized as follows:

  • A LOAD statement to read data from the file system. This operator loads data from a file or directory. If a directory name is specified, it loads all the files in the directory into the relation. If Pig is run in local mode, it looks for the files on the local file system; in MapReduce mode, it looks for them on HDFS.
  • A series of “transformation” statements to process the data.
  • A DUMP statement to view results or a STORE statement to save the results. The DUMP operator is similar to the STORE operator, but it displays results on the command prompt instead of storing them in a file system. Like STORE, DUMP triggers execution: the preceding Pig Latin statements actually begin executing only when Pig encounters the DUMP operator. DUMP is intended for interactive execution of statements and for viewing the output immediately.
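
The three stages above might be sketched as follows. This is only an illustration: the input path, the output directory, the field names, and the sentinel value 9999 are assumptions made for the example, not part of any real dataset.

```pig
-- Hypothetical input: tab-separated lines of (year, temperature).
records = LOAD 'input/sample.txt' AS (year:chararray, temperature:int);

-- Transformation statements: drop assumed sentinel readings, then group by year.
valid_records = FILTER records BY temperature != 9999;
grouped_records = GROUP valid_records BY year;
max_temps = FOREACH grouped_records GENERATE group, MAX(valid_records.temperature);

-- Either display the result interactively ...
DUMP max_temps;
-- ... or save it to a (hypothetical) output directory.
STORE max_temps INTO 'output/max_temps';
```

In practice a script would normally end with one of the two final statements, not both: DUMP while exploring interactively, STORE when the result needs to be kept.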

A DUMP or STORE statement is required to generate output. The STORE operator serves two purposes: it writes the results to the file system once the data pipeline has been processed, and it commences execution of the preceding Pig Latin statements. This is an important feature of the language: the logical, physical, and MapReduce plans are created only after the script encounters the STORE operator.

In this example, Pig will validate, but not execute, the LOAD and FOREACH statements, because no DUMP or STORE statement follows them.

A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);

B = FOREACH A GENERATE name;

In this example, Pig will validate and then execute the LOAD, FOREACH, and DUMP statements.

A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);

B = FOREACH A GENERATE name;

DUMP B;

(John)

(Mary)

(Bill)

(Joe)
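
To save the same result instead of printing it, the DUMP statement could be swapped for a STORE. The following is a minimal sketch; the output directory name 'student_names' and the comma delimiter are assumptions chosen for illustration.

```pig
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);

B = FOREACH A GENERATE name;

-- STORE, like DUMP, triggers execution of the statements above.
-- 'student_names' is a hypothetical output directory; on HDFS it must not already exist.
STORE B INTO 'student_names' USING PigStorage(',');
```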

Pig Relations

Pig Latin statements work with relations. A relation can be defined as follows:

  • A relation is a bag (more specifically, an outer bag).
  • A bag is a collection of tuples.
  • A tuple is an ordered set of fields.
  • A field is a piece of data.

A Pig relation is a bag of tuples. A Pig relation is similar to a table in a relational database, where the tuples in the bag correspond to the rows in a table. Unlike a relational table, however, Pig relations don’t require that every tuple contain the same number of fields or that the fields in the same position (column) have the same type.
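
The looser structure of a relation can be seen by loading data without declaring a schema. The sketch below assumes a hypothetical input file named 'data' whose lines do not all contain the same number of fields.

```pig
-- No AS clause, so no schema is declared for the relation.
A = LOAD 'data';

-- With no schema declared, Pig reports that the schema for A is unknown.
DESCRIBE A;

-- Fields are referenced by position; in a tuple that has no second field,
-- $1 is expected to come back as null instead of failing the whole job.
B = FOREACH A GENERATE $0, $1;

DUMP B;
```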

Also note that relations are unordered, which means there is no guarantee that tuples are processed in any particular order. Furthermore, processing may be parallelized, in which case tuples are not processed according to any total ordering.
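
When a particular order matters, it has to be requested explicitly. The following is a minimal sketch that reuses the student relation from the earlier example and relies on the ORDER and LIMIT operators, which are not otherwise covered in this section; the choice of gpa as the sort key is an assumption made for illustration.

```pig
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);

-- Sort explicitly; without ORDER, the order of tuples in A is not guaranteed.
sorted = ORDER A BY gpa DESC;

-- Keep the first three tuples of the sorted relation.
top3 = LIMIT sorted 3;

DUMP top3;
```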
