Pig Latin statements are the basic constructs you use to process data using Pig. A Pig Latin statement is an operator that takes a relation as input and produces another relation as output. (This definition applies to all Pig Latin operators except LOAD and STORE which read data from and write data to the file system.) Pig Latin statements may include expressions and schemas. Pig Latin statements can span multiple lines and must end with a semi-colon ( ; ). By default, Pig Latin statements are processed using multi-query execution. A Pig Latin program consists of a collection of statements. A statement can be thought of as an operation or a command. For example, a GROUP operation is a type of statement
grouped_records = GROUP records BY year;
Pig Latin statements are generally organized as follows:
- A LOAD statement to read data from the file system. This operator loads data from the file or directory. If a directory name is specified, it loads all the files in the directory into the relation. If Pig is run in the local mode, it searches for the directories on the local File System; while in the MapReduce mode, it searches for the files on HDFS.
- A series of “transformation” statements to process the data.
- A DUMP statement to view results or a STORE statement to save the results. The DUMP operator is almost similar to the STORE operator, but it is used specially to display results on the command prompt rather than storing it in a File System like the STORE operator. DUMP behaves in exactly the same way as STORE, where the Pig Latin statements actually begin execution after encountering the DUMP operator. This operator is specifically targeted for the interactive execution of statements and viewing the output in real time.
A DUMP or STORE statement is required to generate output. The STORE operator has dual purposes, one is to write the results into the File System after completion of the data pipeline processing, and another is to actually commence the execution of the preceding Pig Latin statements. This happens to be an important feature of this language, where logical, physical, and MapReduce plans are created after the script encounters the STORE operator.
In this example Pig will validate, but not execute, the LOAD and FOREACH statements.
A = LOAD ‘student’ USING PigStorage() AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name;
In this example, Pig will validate and then execute the LOAD, FOREACH, and DUMP statements.
A = LOAD ‘student’ USING PigStorage() AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name;
DUMP B;
(John)
(Mary)
(Bill)
(Joe)
Pig Relations
Pig Latin statements work with relations. A relation can be defined as follows:
- A relation is a bag (more specifically, an outer bag).
- A bag is a collection of tuples.
- A tuple is an ordered set of fields.
- A field is a piece of data.
A Pig relation is a bag of tuples. A Pig relation is similar to a table in a relational database, where the tuples in the bag correspond to the rows in a table. Unlike a relational table, however, Pig relations don’t require that every tuple contain the same number of fields or that the fields in the same position (column) have the same type.
Also note that relations are unordered which means there is no guarantee that tuples are processed in any particular order. Furthermore, processing may be parallelized in which case tuples are not processed according to any total ordering.
Data Types
Simple Types | Description | Example |
int | Signed 32-bit integer | 10 |
long | Signed 64-bit integer | Data: 10L or 10l Display: 10L |
float | 32-bit floating point | Data: 10.5F or 10.5f or 10.5e2f or 10.5E2F Display: 10.5F or 1050.0F |
double | 64-bit floating point | Data: 10.5 or 10.5e2 or 10.5E2 Display: 10.5 or 1050.0 |
chararray | Character array (string) in Unicode UTF-8 format | hello world |
bytearray | Byte array (blob) | |
boolean | boolean | true/false (case insensitive) |
datetime | datetime | 1970-01-01T00:00:00.000+00:00 |
biginteger | Java BigInteger | 200000000000 |
bigdecimal | Java BigDecimal | 33.456783321323441233442 |
Complex Types | Description | Example |
tuple | An ordered set of fields. | (19,2) |
bag | An collection of tuples. | {(19,2), (18,1)} |
map | A set of key value pairs. | [open#apache] |
Expressions
In Pig Latin, expressions are language constructs used with the FILTER, FOREACH, GROUP, and SPLIT operators as well as the eval functions.
Expressions are written in conventional mathematical infix notation and are adapted to the UTF-8 character set. Depending on the context, expressions can include:
- Any Pig data type (simple data types, complex data types)
- Any Pig operator (arithmetic, comparison, null, boolean, dereference, sign, and cast)
- Any Pig built in function.
- Any user defined function (UDF) written in Java.
In Pig Latin,
- An arithmetic expression could look like this: X = GROUP A BY f2*f3;
- A string expression could look like this, where a and b are both chararrays: X = FOREACH A GENERATE CONCAT(a,b);
- A boolean expression could look like this: X = FILTER A BY (f1==8) OR (NOT (f2+f3 > f1));
Schemas
Schemas enable you to assign names to fields and declare types for fields. Schemas are optional but we encourage you to use them whenever possible; type declarations result in better parse-time error checking and more efficient code execution. Schemas for simple types and complex types can be used anywhere a schema definition is appropriate.
Schemas are defined with the LOAD, STREAM, and FOREACH operators using the AS clause. If you define a schema using the LOAD operator, then it is the load function that enforces the schema.
Known Schema Handling
- You can define a schema that includes both the field name and field type.
- You can define a schema that includes the field name only; in this case, the field type defaults to bytearray.
- You can choose not to define a schema; in this case, the field is un-named and the field type defaults to bytearray.
If you assign a name to a field, you can refer to that field using the name or by positional notation. If you don’t assign a name to a field (the field is un-named) you can only refer to the field using positional notation. If you assign a type to a field, you can subsequently change the type using the cast operators. If you don’t assign a type to a field, the field defaults to bytearray; you can change the default type using the cast operators.
Unknown Schema Handling
- When you JOIN/COGROUP/CROSS multiple relations, if any relation has an unknown schema (or no defined schema, also referred to as a null schema), the schema for the resulting relation is null.
- If you FLATTEN a bag with empty inner schema, the schema for the resulting relation is null.
- If you UNION two relations with incompatible schema, the schema for resulting relation is null.
- If the schema is null, Pig treats all fields as bytearray (in the backend, Pig will determine the real type for the fields dynamically)
If a field’s data type is not specified, Pig will use bytearray to denote an unknown type. If the number of fields is not known, Pig will derive an unknown schema.
/* The field data types are not specified … */
a = load ‘1.txt’ as (a0, b0);
a: {a0: bytearray,b0: bytearray}
/* The number of fields is not known … */
a = load ‘1.txt’;
a: Schema for a unknown
How Pig Handles Schema – As shown above, with a few exceptions Pig can infer the schema of a relationship up front. You can examine the schema of particular relation using DESCRIBE. Pig enforces this computed schema during the actual execution by casting the input data to the expected data type. If the process is successful the results are returned to the user; otherwise, a warning is generated for each record that failed to convert. Note that Pig does not know the actual types of the fields in the input data prior to the execution; rather, Pig determines the data types and performs the right conversions on the fly.
Having a deterministic schema is very powerful; however, sometimes it comes at the cost of performance. Consider the following example:
A = load ‘input’ as (x, y, z);
B = foreach A generate x+y;
If you do DESCRIBE on B, you will see a single column of type double. This is because Pig makes the safest choice and uses the largest numeric type when the schema is not know. In practice, the input data could contain integer values; however, Pig will cast the data to double and make sure that a double result is returned.
If the schema of a relation can’t be inferred, Pig will just use the runtime data as is and propagate it through the pipeline.
Schemas with LOAD and STREAM – With LOAD and STREAM operators, the schema following the AS keyword must be enclosed in parentheses.
In this example the LOAD statement includes a schema definition for simple data types.
A = LOAD ‘data’ AS (f1:int, f2:int);
Schemas with FOREACH – With FOREACH operators, the schema following the AS keyword must be enclosed in parentheses when the FLATTEN operator is used. Otherwise, the schema should not be enclosed in parentheses.
In this example the FOREACH statement includes FLATTEN and a schema for simple data types.
X = FOREACH C GENERATE FLATTEN(B) AS (f1:int, f2:int, f3:int), group;
In this example the FOREACH statement includes a schema for simple expression.
X = FOREACH A GENERATE f1+f2 AS x1:int;
In this example the FOREACH statement includes a schemas for multiple fields.
X = FOREACH A GENERATE f1 as user, f2 as age, f3 as gpa;
Operators
Operator | Description | Example |
Arithmetic Operators | +, -, *, /, %, ?: | X = FOREACH A GENERATE f1, f2, f1%f2; X = FOREACH A GENERATE f2, (f2==1?1:COUNT(B)); |
Boolean Operators | and, or, not | X = FILTER A BY (f1==8) OR (NOT (f2+f3 > f1)); |
Cast Operators | Casting from one datatype to another | B = FOREACH A GENERATE (int)$0 + 1; B = FOREACH A GENERATE $0 + 1, $1 + 1.0 |
Comparison Operators | ==, !=, >, <, >=, <=, matches | X = FILTER A BY (f1 == 8); X = FILTER A BY (f2 == ‘apache’); X = FILTER A BY (f1 matches ‘.*apache.*’); |
Construction Operators | Used to construct tuple (), bag {} and map [] | B = foreach A generate (name, age); B = foreach A generate {(name, age)}, {name, age}; B = foreach A generate [name, gpa]; |
Dereference Operators | dereference tuples (tuple.id or tuple.(id,…)), bags (bag.id or bag.(id,…)) and maps (map#’key’) | X = FOREACH A GENERATE f2.t1,f2.t3 (dereferencing is used to retrieve two fields from tuple f2) |
Disambiguate Operator | ( :: ) used to identify field names after JOIN, COGROUP, CROSS, or FLATTEN operators | A = load ‘data1’ as (x, y); B = load ‘data2’ as (x, y, z); C = join A by x, B by x; D = foreach C generate A::y; |
Flatten Operator | Flatten un-nests tuples as well as bags | consider a relation that has a tuple of the form (a, (b, c)). The expression GENERATE $0, flatten($1), will cause that tuple to become (a, b, c). |
Null Operator | is null, is not null | X = FILTER A BY f1 is not null; |
Sign Operators | + -> has no effect, – -> changes the sign of a positive/negative number | A = LOAD ‘data’ as (x, y, z); B = FOREACH A GENERATE -x, y; |
Relational Operators
Operator | Description | Example |
COGROUP/GROUP | Groups the data in one or more relations. The COGROUP operator groups together tuples that have the same group key (key field) | A = load ‘student’ AS (name:chararray,age:int,gpa:float); B = GROUP A BY age; |
CROSS | Computes the cross product of two or more relations | X = CROSS A,B A = (1, 2, 3) B = (2, 4) DUMP X; (4, 2, 1) (8, 9) (1,2,3,2,4) (1, 3) (1,2,3,8,9) (1,2,3,1,3) (4,2,1,2,4) (4,2,1,8,9) (4,2,1,1,3) |
DEFINE | Assigns an alias to a UDF or streaming command. | DEFINE CMD `perl PigStreaming.pl – nameMap` input(stdin using PigStreaming(‘,’)) output(stdout using PigStreaming(‘,’)); A = LOAD ‘file’; B = STREAM B THROUGH CMD; |
DISTINCT | Removes duplicate tuples in a relation. | X = DISTINCT A; A = (8,3,4) DUMP X; (1,2,3) (1,2,3) (4,3,3) (4,3,3) (4,3,3) (8,3,4) (1,2,3) |
FILTER | Selects tuples from a relation based on some condition. | X = FILTER A BY f3 == 3; A = (1,2,3) DUMP X; (4,5,6) (1,2,3) (7,8,9) (4,3,3) (4,3,3) (8,4,3) (8,4,3) |
FOREACH | Generates transformation of data for each row as specified | X = FOREACH A GENERATE a1, a2; A = (1,2,3) DUMP X; (4,2,5) (1,2) (8,3,6) (4,2) (8,3) |
IMPORT | Import macros defined in a separate file. | /* myscript.pig */ IMPORT ‘my_macro.pig’; |
JOIN | Performs an inner join of two or more relations based on common field values. | X = JOIN A BY a1, B BY b1; DUMP X (1,2,1,3) A = (1,2) B = (1,3) (1,2,1,2) (4,5) (1,2) (4,5,4,7) (4,7) |
LOAD | Loads data from the file system. | A = LOAD ‘myfile.txt’; LOAD ‘myfile.txt’ AS (f1:int, f2:int, f3:int); |
MAPREDUCE | Executes native MapReduce jobs inside a Pig script. | A = LOAD ‘WordcountInput.txt’; B = MAPREDUCE ‘wordcount.jar’ STORE A INTO ‘inputDir’ LOAD ‘outputDir’ AS (word:chararray, count: int) `org.myorg.WordCount inputDir outputDir`; |
ORDERBY | Sorts a relation based on one or more fields. | A = LOAD ‘mydata’ AS (x: int, y: map[]); B = ORDER A BY x; |
SAMPLE | Partitions a relation into two or more relations, selects a random data sample with the stated sample size. | Relation X will contain 1% of the data in relation A. A = LOAD ‘data’ AS (f1:int,f2:int,f3:int); X = SAMPLE A 0.01; |
SPLIT | Partitions a relation into two or more relations based on some expression. | SPLIT input_var INTO output_var IF (field1 is not null), ignored_var IF (field1 is null); |
STORE | Stores or saves results to the file system. | STORE A INTO ‘myoutput’ USING PigStorage (‘*’); 1*2*3 4*2*1 |
STREAM | Sends data to an external script or program | A = LOAD ‘data’; B = STREAM A THROUGH `stream.pl -n 5`; |
UNION | Computes the union of two or more relations. (Does not preserve the order of tuples) | X = UNION A, B; A = (1,2,3) B = (2,4) DUMP X; (4,2,1) (8,9) (1,2,3) (1,3) (4,2,1) (2,4) (8,9) (1,3) |
Functions
Function | Syntax | Description |
AVG | AVG(expression | Computes the average of the numeric values in a single-column bag. |
CONCAT | CONCAT (expression, expression) | Concatenates two expressions of identical type. |
COUNT | COUNT(expression) | Computes the number of elements in a bag, it ignores null. |
COUNT_STAR | COUNT_STAR(expression) | Computes the number of elements in a bag, it includes null. |
DIFF | DIFF (expression, expression) | Compares two fields in a tuple, any tuples that are in one bag but not the other are returned in a bag. |
DIFF | DIFF (expression, expression) | Compares two fields in a tuple, any tuples that are in one bag but not the other are returned in a bag. |
IsEmpty | IsEmpty(expression) | Checks if a bag or map is empty. |
MAX | MAX(expression) | Computes the maximum of the numeric values or chararrays in a single-column bag |
MIN | MIN(expression) | Computes the minimum of the numeric values or chararrays in a single-column bag. |
SIZE | SIZE(expression) | Computes the number of elements based on any Pig data type. SIZE includes NULL values in the size computation |
SUM | SUM(expression) | Computes the sum of the numeric values in a single-column bag. |
TOKENIZE | TOKENIZE(expression [, ‘field_delimiter’]) | Splits a string and outputs a bag of words. |
Load/Store Functions
Function | Syntax | Description |
Handling Compression | A = load ‘myinput.gz’; store A into ‘myoutput.gz’; | PigStorage and TextLoader support gzip and bzip compression for both read (load) and write (store). BinStorage does not support compression. |
BinStorage | A = LOAD ‘data’ USING BinStorage(); | Loads and stores data in machine-readable format. |
JsonLoader, JsonStorage | A = load ‘a.json’ using JsonLoader(); | Load or store JSON data. |
PigDump | STORE X INTO ‘output’ USING PigDump(); | Stores data in UTF-8 format. |
PigStorage | A = LOAD ‘student’ USING PigStorage(‘\t’) AS (name: chararray, age:int, gpa: float); | Loads and stores data as structured text files. |
TextLoader | A = LOAD ‘data’ USING TextLoader(); | Loads unstructured data in UTF-8 format. |
Math Functions
Operator | Description | Example |
ABS | ABS(expression) | Returns the absolute value of an expression. If the result is not negative (x ≥ 0), the result is returned. If the result is negative (x < 0), the negation of the result is returned. |
ACOS | ACOS(expression) | Returns the arc cosine of an expression. |
ASIN | ASIN(expression) | Returns the arc sine of an expression. |
ATAN | ATAN(expression) | Returns the arc tangent of an expression. |
CBRT | CBRT(expression) | Returns the cube root of an expression. |
CEIL | CEIL(expression) | Returns the value of an expression rounded up to the nearest integer. This function never decreases the result value. |
COS | COS(expression) | Returns the trigonometric cosine of an expression. |
COSH | COSH(expression) | Returns the hyperbolic cosine of an expression. |
EXP | EXP(expression) | Returns Euler’s number e raised to the power of x. |
FLOOR | FLOOR(expression) | Returns the value of an expression rounded down to the nearest integer. This function never increases the result value. |
LOG | LOG(expression) | Returns the natural logarithm (base e) of an expression. |
LOG10 | LOG10(expression) | Returns the base 10 logarithm of an expression. |
RANDOM | RANDOM( ) | Returns a pseudo random number (type double) greater than or equal to 0.0 and less than 1.0. |
ROUND | ROUND(expression) | Returns the value of an expression rounded to an integer (if the result type is float) or rounded to a long (if the result type is double). |
SIN | SIN(expression) | Returns the sine of an expression. |
SINH | SINH(expression) | Returns the hyperbolic sine of an expression. |
SQRT | SQRT(expression) | Returns the positive square root of an expression. |
TAN | TAN(expression) | Returns the trignometric tangent of an angle. |
TANH | TANH(expression) | Returns the hyperbolic tangent of an expression. |
String Functions
Operator | Description | Example |
INDEXOF | INDEXOF(string, ‘character’, startIndex) | Returns the index of the first occurrence of a character in a string, searching forward from a start index. |
LAST_INDEX | LAST_INDEX_OF(expression) | Returns the index of the last occurrence of a character in a string, searching backward from a start index. |
LCFIRST | LCFIRST(expression) | Converts the first character in a string to lower case. |
LOWER | LOWER(expression) | Converts all characters in a string to lower case. |
REGEX_EXTRACT | REGEX_EXTRACT (string, regex, index) | Performs regular expression matching and extracts the matched group defined by an index parameter. The function uses Java regular expression form. |
REGEX_EXTRACT_ALL | REGEX_EXTRACT (string, regex) | Performs regular expression matching and extracts all matched groups. The function uses Java regular expression form. |
REPLACE | REPLACE(string, ‘oldChar’, ‘newChar’); | Replaces existing characters in a string with new characters. |
STRSPLIT | STRSPLIT(string, regex, limit) | Splits a string around matches of a given regular expression. |
SUBSTRING | SUBSTRING(string, startIndex, stopIndex) | Returns a substring from a given string. |
TRIM | TRIM(expression) | Returns a copy of a string with leading and trailing white space removed. |
UCFIRST | UCFIRST(expression) | Returns a string with the first character converted to upper case. |
UPPER | UPPER(expression) | Returns a string converted to upper case. |
Tuple, Bag, Map Functions
Operator | Description | Example |
TOTUPLE | TOTUPLE(expression [, expression …]) | Converts one or more expressions to type tuple. |
TOBAG | TOBAG(expression [, expression …]) | Converts one or more expressions to individual tuples which are then placed in a bag. |
TOMAP | TOMAP(key-expression, value-expression [, key-expression, value-expression …]) | Converts key/value expression pairs into a map. Needs an even number of expressions as parameters. The elements must comply with map type rules. |
TOP | TOP(topN,column,relation) | Returns the top-n tuples from a bag of tuples. |
Loading Data
Use the LOAD operator and the load/store functions to read data into Pig (PigStorage is the default load function).
Working with Data
Pig allows you to transform data in many ways. As a starting point, become familiar with these operators:
- Use the FILTER operator to work with tuples or rows of data. Use the FOREACH operator to work with columns of data.
- Use the GROUP operator to group data in a single relation. Use the COGROUP, inner JOIN, and outer JOIN operators to group or join data in two or more relations.
- Use the UNION operator to merge the contents of two or more relations. Use the SPLIT operator to partition the contents of a relation into multiple relations.
Storing Intermediate Results
Pig stores the intermediate data generated between MapReduce jobs in a temporary location on HDFS. This location must already exist on HDFS prior to use. This location can be configured using the pig.temp.dir property. The property’s default value is “/tmp” which is the same as the hardcoded location in Pig 0.7.0 and earlier versions.
Storing Final Results
Use the STORE operator and the load/store functions to write results to the file system (PigStorage is the default store function). During the testing/debugging phase of your implementation, you can use DUMP to display results to your terminal screen. However, in a production environment you always want to use the STORE operator to save your results.
Debugging Pig Latin
Pig Latin provides operators that can help you debug your Pig Latin statements:
- Use the DUMP operator to display results to your terminal screen.
- Use the DESCRIBE operator to review the schema of a relation.
- Use the EXPLAIN operator to view the logical, physical, or map reduce execution plans to compute a relation.
- Use the ILLUSTRATE operator to view the step-by-step execution of a series of statements.
Shortcuts for Debugging Operators
Pig provides shortcuts for the frequently used debugging operators (DUMP, DESCRIBE, EXPLAIN, ILLUSTRATE). These shortcuts can be used in Grunt shell or within pig scripts. Following are the shortcuts supported by pig
- \d alias – shortcut for DUMP operator. If alias is ignored last defined alias will be used.
- \de alias – shortcut for DESCRIBE operator. If alias is ignored last defined alias will be used.
- \e alias – shortcut for EXPLAIN operator. If alias is ignored last defined alias will be used.
- \i alias – shortcut for ILLUSTRATE operator. If alias is ignored last defined alias will be used.
- \q – To quit grunt shell
Pig has a number of command-line options that you can use with it. You can see the full list by entering pig -h. Few of these options are
- -e or -execute — Execute a single command in Pig. For example, pig -e fs -ls will list your home directory.
- -h or -help — List the available command-line options.
- -h properties — List the properties that Pig will use if they are set by the user.
- -P or –propertyFile — Specify a property file that Pig should read.
- -version — Print the version of Pig.
An expression is something that is evaluated to yield a value. Expressions can be used in Pig as a part of a statement containing a relational operator. Pig has a rich variety of expressions, many of which will be familiar from other programming languages.