UDF and Data Processing Operator
Pig Latin also provides three statements—REGISTER, DEFINE, and IMPORT—that make it possible to incorporate macros and user-defined functions into Pig scripts. REGISTER, Registers a JAR file with the Pig runtime, DEFINE, Creates an alias for a macro, UDF, streaming script, or command specification. IMPORT, Imports macros defined in a separate file into a script.
The DEFINE statement is used to assign an alias to an external executable or a UDF function. Use this statement if you want to have a crisp name for a function that has a lengthy package name.
For a STREAM command, DEFINE plays an important role to transfer the executable to the task nodes of the Hadoop cluster. This is accomplished using the SHIP clause of the DEFINE operator.
UDFs
Pig provides extensive support for user defined functions (UDFs) as a way to specify custom processing. Pig UDFs can currently be implemented in three languages: Java, Python, JavaScript and Ruby.
Registering UDFs
Registering Java UDFs:
—register_java_udf.pig
register ‘your_path_to_piggybank/piggybank.jar’;
divs = load ‘NYSE_dividends’ as (exchange:chararray, symbol:chararray,
date:chararray, dividends:float);
Registering Python UDFs (The Python script must be in your current directory):
–register_python_udf.pig
register ‘production.py’ using jython as bballudfs;
players = load ‘baseball’ as (name:chararray, team:chararray,
pos:bag{t:(p:chararray)}, bat:map[]);
Writing UDFs
Java UDFs:
package myudfs;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
public class UPPER extends EvalFunc
{
public String exec(Tuple input) throws IOException {
if (input == null || input.size() == 0)
return null;
try{
String str = (String)input.get(0);
return str.toUpperCase();
}catch(Exception e){
throw new IOException(“Caught exception processing input row “, e);
}
}
}
Python UDFs
#Square – Square of a number of any data type
@outputSchemaFunction(“squareSchema”) — Defines a script delegate function that defines schema for this function depending upon the input type.
def square(num):
return ((num)*(num))
@schemaFunction(“squareSchema”) –Defines delegate function and is not registered to Pig.
def squareSchema(input):
return input
#Percent- Percentage
@outputSchema(“percent:double”) –Defines schema for a script UDF in a format that Pig understands and is able to parse
def percent(num, total):
return num * 100 / total
Apply for Big Data and Hadoop Developer Certification
https://www.vskills.in/certification/certified-big-data-and-apache-hadoop-developer