Hadoop & Mapreduce Tutorial | Logging

Logs come in all shapes, but as applications and infrastructures grow, the result is a massive amount of distributed data that’s useful to mine. From web and mail servers to kernel and boot logs, modern servers hold a rich set of information. Massive amounts of distributed data are a perfect application for Apache Hadoop, and log files, being time-ordered structured textual data, are a prime example.

You can use log processing to extract a variety of information. One of the most common uses is to extract errors or count the occurrences of some event within a system, such as login failures. You can also extract performance data, such as connections or transactions per second. Another useful application is extracting page requests from a web log (map) and assembling them into site visits (reduce); this analysis can also support detection of unique user visits as well as file-access statistics.
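As a minimal sketch of the event-counting case: the mapper below emits a count of one for every log line that matches a failure marker, and the reducer sums the counts. The class names, the "Failed password" marker (the form used in OpenSSH auth logs), and the assumption of plain line-oriented input are illustrative, not part of any standard API.

package com.test;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Emits ("login_failure", 1) for every log line that matches the marker.
public class LoginFailureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private static final Text FAILURE = new Text("login_failure");

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Hypothetical marker; adjust to match your own log format.
        if (line.toString().contains("Failed password")) {
            context.write(FAILURE, ONE);
        }
    }
}

// Sums the per-line counts into a single total per key.
class LoginFailureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable count : counts) {
            total += count.get();
        }
        context.write(key, new IntWritable(total));
    }
}

Run with the default TextInputFormat, the job’s output is a single line giving the total number of matching events across all the input logs.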

Hadoop uses Apache log4j, via the Apache Commons Logging framework, for its own logging. Edit the conf/log4j.properties file to customize the Hadoop daemons’ logging configuration (log formats and so on).
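As a small sketch of such a customization, assuming the stock file layout (hadoop.root.logger is the property Hadoop’s shipped file uses for the default level and appender; the hdfs package name is just an example of a log name to override):

# Default level and appender for the daemons
hadoop.root.logger=INFO,console

# Override the level for one package, e.g. to quiet a chatty subsystem
log4j.logger.org.apache.hadoop.hdfs=WARN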

Log4j is simple to use: to log an informational message, call logger.info("info message"); to log an error together with an exception, call logger.error("error message", exceptionObject).

Here is example code showing how to use Log4j in a Mapper class; the same approach applies to the Driver and Reducer classes.

package com.test;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.log4j.Logger;

public class MyMapper extends Mapper<LongWritable, Text, Text, Text> {

    // One logger per class; by convention the log name is the class name.
    private static final Logger logger = Logger.getLogger(MyMapper.class);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // To log an exception alongside a message:
        // logger.error("This is an error", e);
        logger.error("This is an error");
        logger.warn("This is a warning");
        logger.info("This is info");
        logger.debug("This is debug");
        logger.trace("This is trace");
    }
}
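Note that messages below the configured level are dropped: with Hadoop’s default level of INFO, the debug and trace calls above produce no output until you lower the level for this class. On a YARN cluster, these messages go to the task’s syslog file rather than your console; you can read them through the job’s web UI or, if log aggregation is enabled, with yarn logs -applicationId <application ID>.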

Hadoop daemons have a web page for changing the log level for any log4j log name, which can be found at /logLevel in the daemon’s web UI. By convention, log names in Hadoop correspond to the names of the classes doing the logging, although there are exceptions to this rule, so you should consult the source code to find log names.

It’s also possible to enable logging for all packages that start with a given prefix. For example, to enable debug logging for all classes related to the resource manager, we would visit its web UI at http://resource-manager-host:8088/logLevel and set the log name org.apache.hadoop.yarn.server.resourcemanager to level DEBUG.
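The same servlet can also be driven from the command line with Hadoop’s daemonlog tool; for example (using the same host and port as above):

# Read the current level for the resource manager's log name
hadoop daemonlog -getlevel resource-manager-host:8088 org.apache.hadoop.yarn.server.resourcemanager

# Set it to DEBUG (takes effect immediately, until the daemon restarts)
hadoop daemonlog -setlevel resource-manager-host:8088 org.apache.hadoop.yarn.server.resourcemanager DEBUG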

Log levels changed in this way are reset when the daemon restarts, which is usually what you want. To make a persistent change to a log level, edit the log4j.properties file in the configuration directory instead. In this case, the line to add is

log4j.logger.org.apache.hadoop.yarn.server.resourcemanager=DEBUG

Apply for Big Data and Hadoop Developer Certification

https://www.vskills.in/certification/certified-big-data-and-apache-hadoop-developer
