Logs come in all shapes, but as applications and infrastructures grow, the result is a massive amount of distributed data that is useful to mine. From web and mail servers to kernel and boot logs, modern servers hold a rich set of information. This kind of data is a perfect application for Apache Hadoop: log files are massive, distributed, time-ordered, structured textual data.
You can use log processing to extract a variety of information. One of its most common uses is to extract errors or count the occurrences of some event within a system (such as login failures). You can also extract some types of performance data, such as connections or transactions per second. Another useful job is reconstructing site visits from a web log, where the map phase extracts individual page requests and the reduce phase assembles them into per-visitor sessions. This analysis can also support detection of unique user visits in addition to file access statistics.
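As an illustration of the error-counting use case, here is a minimal sketch of a MapReduce job that counts login failures per source host. The log format, the "Failed password" marker, and the class names are assumptions for illustration only, not part of any standard.

package com.test;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Counts login failures per source host; assumes syslog-style lines such as
// "... sshd[1234]: Failed password for root from 10.0.0.5 port 22 ssh2".
public class LoginFailureCount {

    public static class FailureMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text host = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            int idx = line.indexOf("Failed password");
            if (idx >= 0) {
                // Emit the source address that follows "from ", if present.
                int from = line.indexOf(" from ", idx);
                if (from >= 0) {
                    String rest = line.substring(from + 6).trim();
                    host.set(rest.split("\\s+")[0]);
                    context.write(host, ONE);
                }
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}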
Hadoop uses Apache log4j, via the Apache Commons Logging framework, for logging. Edit the conf/log4j.properties file to customize the Hadoop daemons' logging configuration (log formats and so on).
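For example, the stock log4j.properties defines a console appender whose pattern layout controls the log format. The following lines are a sketch of what that section typically looks like; appender names and the exact pattern can differ between Hadoop versions, so check your own file:

log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n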
Log4j itself is simple to use: to log a message, call logger.info("info message"); to log an error message along with an exception, call logger.error("error message", exceptionObject).
Here is example code showing how to use log4j in a Mapper class; the same approach applies to the Driver and Reducer classes.
package com.test;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.log4j.Logger;

public class MyMapper extends Mapper<LongWritable, Text, Text, Text> {

    // Log against the class, so messages can be filtered by logger name.
    private static final Logger logger = Logger.getLogger(MyMapper.class);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // To log an exception along with the message: logger.error("This is error", e);
        logger.error("This is error");
        logger.warn("This is warning");
        logger.info("This is info");
        logger.debug("This is debug");
        logger.trace("This is trace");
    }
}
If you are running a Hadoop cluster, the log messages from individual mappers and reducers will be on the nodes that executed them, in each container's log directory.
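If YARN log aggregation is enabled on the cluster, you can gather all of a finished job's container logs (including these mapper and reducer messages) in one place with the yarn logs command; the application ID below is a placeholder:

yarn logs -applicationId <application-id>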
Hadoop daemons have a web page for changing the log level for any log4j log name, which can be found at /logLevel in the daemon’s web UI. By convention, log names in Hadoop correspond to the names of the classes doing the logging, although there are exceptions to this rule, so you should consult the source code to find log names.
It’s also possible to enable logging for all packages that start with a given prefix. For example, to enable debug logging for all classes related to the resource manager, we would visit its web UI at http://resource-manager-host:8088/logLevel and set the log name org.apache.hadoop.yarn.server.resourcemanager to level DEBUG.
Log levels changed in this way are reset when the daemon restarts, which is usually what you want. However, to make a persistent change to a log level, you can simply change the log4j.properties file in the configuration directory. In this case, the line to add is
log4j.logger.org.apache.hadoop.yarn.server.resourcemanager=DEBUG
In Hadoop 2, the yarn.app.mapreduce.am.log.level property sets the log level you need, but, crucially, it must be set in the Hadoop job configuration at submission time. It cannot be set globally on the cluster; the cluster-wide default will always be INFO, as it is hardcoded.
Using container-log4j.properties alone will not work, because YARN overrides the log level value on the command line. See the addLog4jSystemProperties method of org.apache.hadoop.mapreduce.v2.util.MRApps and cross-reference it with org.apache.hadoop.mapreduce.MRJobConfig.
container-log4j.properties will indeed be honored, but it can’t override the level set by this property.
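As a concrete illustration, the level can be supplied in the job configuration when the job is submitted. The following is a minimal sketch, not a complete driver: the class and job names are placeholders, and the mapreduce.map.log.level and mapreduce.reduce.log.level properties for the task JVMs are included for completeness, as an assumption beyond the AM-level property discussed above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SubmitWithDebugLogging {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Log level for the MapReduce application master (read at submission time).
        conf.set("yarn.app.mapreduce.am.log.level", "DEBUG");
        // Log levels for the map and reduce task JVMs.
        conf.set("mapreduce.map.log.level", "DEBUG");
        conf.set("mapreduce.reduce.log.level", "DEBUG");

        Job job = Job.getInstance(conf, "log-level demo");
        // ... set mapper, reducer, input and output paths here ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Equivalently, the same properties can be passed on the command line at submission time (for example via -D options handled by GenericOptionsParser) rather than hardcoded in the driver.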
HDFS can log all filesystem access requests, a feature that some organizations require for auditing purposes. Audit logging is implemented using log4j logging at the INFO level. In the default configuration it is disabled, but it’s easy to enable by adding the following line to hadoop-env.sh:
export HDFS_AUDIT_LOGGER="INFO,RFAAUDIT"
A log line is written to the audit log (hdfs-audit.log) for every HDFS event.