Installation
Pig's core is written in Java and works across operating systems. Pig's shell, which executes commands from the user, is a bash script and requires a UNIX system. Pig can also be run on Windows using the Cygwin and Perl packages.
Java is mandatory for Pig to run (see the version requirements below). Optionally, the following can be installed on the same machine: Python, JavaScript, Ant, and JUnit. Python and JavaScript are for writing custom UDFs; Ant and JUnit are for builds and unit testing, respectively. Pig can be executed with different versions of Hadoop by setting HADOOP_HOME to point to the directory where Hadoop is installed. If HADOOP_HOME is not set, Pig runs with the embedded version by default (currently Hadoop 1.0.4).
Requirements
Mandatory – Unix and Windows users need the following:
- Hadoop 0.23.X, 1.X or 2.X – http://hadoop.apache.org/common/releases.html (You can run Pig with different versions of Hadoop by setting HADOOP_HOME to point to the directory where you have installed Hadoop. If you do not set HADOOP_HOME, by default Pig will run with the embedded version, currently Hadoop 1.0.4.)
- Java 1.7 – http://java.sun.com/javase/downloads/index.jsp (set JAVA_HOME to the root of your Java installation)
Optional
- Python 2.7 – https://www.python.org (when using Streaming Python UDFs)
- Ant 1.8 – http://ant.apache.org/ (for builds)
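Putting the mandatory settings together, a minimal environment setup might look like the following sketch. The install paths are placeholders; substitute the actual locations of Java and Hadoop on your machine:

```shell
# Placeholder paths -- adjust to where Java and Hadoop actually live.
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk    # root of your Java installation
export HADOOP_HOME=/usr/local/hadoop-2.2.0      # omit to fall back to Pig's embedded Hadoop
export PATH=$JAVA_HOME/bin:$PATH
```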
Download Pig
To get a Pig distribution, do the following:
- Download a recent stable release from one of the Apache Download Mirrors.
- Unpack the downloaded Pig distribution, and then note the following:
- The Pig script file, pig, is located in the bin directory (/pig-n.n.n/bin/pig). The Pig environment variables are described in the Pig script file.
- The Pig properties file, pig.properties, is located in the conf directory (/pig-n.n.n/conf/pig.properties). You can specify an alternate location using the PIG_CONF_DIR environment variable.
- Add /pig-n.n.n/bin to your path. Use export (bash, sh, ksh) or setenv (tcsh, csh). For example:
$ export PATH=/<my-path-to-pig>/pig-n.n.n/bin:$PATH
- Test the Pig installation with this simple command: $ pig -help
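The effect of the PATH change above can be sketched with a stub script standing in for the real bin/pig; the /tmp directory and stub below are illustrative only, not part of a Pig distribution:

```shell
# Create a stand-in for an unpacked pig-n.n.n/bin directory (illustrative path).
mkdir -p /tmp/pig-demo/bin
printf '#!/bin/sh\necho "Apache Pig stub"\n' > /tmp/pig-demo/bin/pig
chmod +x /tmp/pig-demo/bin/pig

# Same export as above, with the stub directory in place of the real one.
export PATH=/tmp/pig-demo/bin:$PATH
command -v pig    # the shell now resolves `pig` to /tmp/pig-demo/bin/pig
```

With the real distribution, the same export makes `pig -help` runnable from any directory.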
Build Pig
To build Pig, do the following:
- Check out the Pig code from SVN: svn co http://svn.apache.org/repos/asf/pig/trunk
- Build the code from the top directory: ant. If the build is successful, you should see the pig.jar file created in that directory.
- Validate the pig.jar by running a unit test: ant test
- If you are using Hadoop 0.23.X or 2.X, add -Dhadoopversion=23 to the ant command line in the previous steps.
Pig Modes
You can run Pig (execute Pig Latin statements and Pig commands) using various modes.
                 | Local Mode | Tez Local Mode | Mapreduce Mode | Tez Mode |
Interactive Mode | yes        | experimental   | yes            | yes      |
Batch Mode       | yes        | experimental   | yes            | yes      |
Execution Modes
Pig has two execution modes or exectypes:
- Local Mode – To run Pig in local mode, you need access to a single machine; all files are installed and run using your local host and file system. Specify local mode using the -x flag (pig -x local).
- Tez Local Mode – To run Pig in Tez local mode, you need access to a single machine. It is similar to local mode, except that internally Pig invokes the Tez runtime engine. Specify Tez local mode using the -x flag (pig -x tez_local). Tez local mode is experimental; some queries error out on bigger data.
- Mapreduce Mode – To run Pig in mapreduce mode, you need access to a Hadoop cluster and HDFS installation. Mapreduce mode is the default mode; you can, but don’t need to, specify it using the -x flag (pig OR pig -x mapreduce).
- Tez Mode – To run Pig in Tez mode, you need access to a Hadoop cluster and HDFS installation. Specify Tez mode using the -x flag (-x tez).
You can run Pig in any mode using the "pig" command (the bin/pig bash script) or the "java" command (java -cp pig.jar …).
Example
This example shows how to run Pig in each mode using the pig command.
/* local mode */
$ pig -x local …
/* Tez local mode */
$ pig -x tez_local …
/* mapreduce mode */
$ pig …
or
$ pig -x mapreduce …
/* Tez mode */
$ pig -x tez …
Interactive Mode
You can run Pig in interactive mode using the Grunt shell. Invoke the Grunt shell using the “pig” command (as shown below) and then enter your Pig Latin statements and Pig commands interactively at the command line.
Example
These Pig Latin statements extract all user IDs from the /etc/passwd file. First, copy the /etc/passwd file to your local working directory. Next, invoke the Grunt shell by typing the "pig" command (in local or mapreduce mode). Then, enter the Pig Latin statements interactively at the grunt prompt (be sure to include the semicolon after each statement). The DUMP operator displays the results on your terminal screen.
grunt> A = load 'passwd' using PigStorage(':');
grunt> B = foreach A generate $0 as id;
grunt> dump B;
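For reference, the same extraction can be sketched with standard UNIX tools: PigStorage(':') splits each line on ':' and $0 selects the first field. The sample input below is made up for illustration:

```shell
# Two made-up passwd-style lines as sample input.
printf 'root:x:0:0:root:/root:/bin/bash\nalice:x:1000:1000::/home/alice:/bin/bash\n' > passwd

# Equivalent of: load with PigStorage(':'), then generate $0 -- take the first ':' field.
cut -d: -f1 passwd    # prints root, then alice
```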
Local Mode
$ pig -x local
… – Connecting to …
grunt>
Tez Local Mode
$ pig -x tez_local
… – Connecting to …
grunt>
Mapreduce Mode
$ pig -x mapreduce
… – Connecting to …
grunt>
or
$ pig
… – Connecting to …
grunt>
Tez Mode
$ pig -x tez
… – Connecting to …
grunt>
Batch Mode
You can run Pig in batch mode using Pig scripts and the "pig" command (in local or mapreduce mode).
Example
The Pig Latin statements in the Pig script (id.pig) extract all user IDs from the /etc/passwd file. First, copy the /etc/passwd file to your local working directory. Next, run the Pig script from the command line (using local or mapreduce mode). The STORE operator will write the results to a file (id.out).
/* id.pig */
A = load 'passwd' using PigStorage(':'); -- load the passwd file
B = foreach A generate $0 as id; -- extract the user IDs
store B into 'id.out'; -- write the results to a file named id.out
Local Mode
$ pig -x local id.pig
Tez Local Mode
$ pig -x tez_local id.pig
Mapreduce Mode
$ pig id.pig
or
$ pig -x mapreduce id.pig
Tez Mode
$ pig -x tez id.pig