Overview
In this tutorial you will learn how to integrate your application logs with a Flume agent through Log4j, store them in HDFS, and later analyze them with Hive.
Sometimes the events from an application have to be analyzed to learn more about customer behavior for recommendations or to detect fraudulent usage. As the volume of data grows, processing the events on a single machine can take a long time or even become impossible. This is where distributed systems like Hadoop come into play.
Apache Flume can be used to move the data from a source to a sink. One option is to have the application use Log4j to send its log events to a Flume agent, which stores them in HDFS for further analysis. Flume ships with a Log4j appender that can send log events from your application directly to a Flume agent. Let's get started.
Flume Configuration
Let's create a file named log4j-test.conf under the $FLUME_HOME/conf directory:
log4j-test.conf
# Name the components we want to activate on agent1.
agent1.channels = ch1
agent1.sources = avro-source1
agent1.sinks = hdfs-sink1
# Define a memory channel called ch1 on agent1
agent1.channels.ch1.type = memory
# Define an Avro source called avro-source1 on agent1 and tell it
# to bind to 0.0.0.0:41414. Connect it to channel ch1.
agent1.sources.avro-source1.type = avro
agent1.sources.avro-source1.bind = 0.0.0.0
agent1.sources.avro-source1.port = 41414
# Define an HDFS sink that writes the events it receives to HDFS
# and connect it to the other end of the same channel.
agent1.sinks.hdfs-sink1.type = hdfs
agent1.sinks.hdfs-sink1.hdfs.path = hdfs://localhost:9000/user/flume/logevents/
agent1.sinks.hdfs-sink1.hdfs.fileType = DataStream
agent1.sinks.hdfs-sink1.hdfs.writeFormat = Text
# Never roll the current file based on elapsed time (0 = disabled)
agent1.sinks.hdfs-sink1.hdfs.rollInterval = 0
# File size to trigger roll, in bytes
agent1.sinks.hdfs-sink1.hdfs.rollSize = 500
# never roll based on number of events
agent1.sinks.hdfs-sink1.hdfs.rollCount = 0
# Timeout after which inactive files get closed (in seconds)
agent1.sinks.hdfs-sink1.hdfs.idleTimeout = 3600
# Chain the different components together
agent1.sinks.hdfs-sink1.channel = ch1
agent1.sources.avro-source1.channels = ch1
Note: You should adjust this configuration to your own requirements. The configuration above is a simple, general setup for the purposes of this tutorial.
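Before starting the agent, it can also help to make sure the HDFS target path exists and is writable by the user running Flume. A minimal sketch, assuming HDFS is running at hdfs://localhost:9000 and the hdfs command from your Hadoop installation is on the PATH:
# Create the target directory used by the HDFS sink (the sink can usually create it itself if permissions allow)
hdfs dfs -mkdir -p /user/flume/logevents
# Verify it exists
hdfs dfs -ls /user/flume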
Develop Application
Now let's create a simple application that will log its events directly to the Flume agent through the Log4j appender.
The Flume Log4j appender jar and its dependencies are required on your application's classpath. (All of these jars are available inside the Flume distribution's lib directory.)
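As a rough sketch of the classpath setup (the exact jar names and versions depend on your Flume release; flume-ng-log4jappender and flume-ng-sdk are the key artifacts, and the destination path here is just a placeholder):
# Assumed jar names for Flume 1.5.0 -- verify against $FLUME_HOME/lib
cp $FLUME_HOME/lib/flume-ng-log4jappender-*.jar /path/to/your/app/lib/
cp $FLUME_HOME/lib/flume-ng-sdk-*.jar /path/to/your/app/lib/
# The appender also needs its transitive dependencies (Avro, Netty, slf4j, log4j),
# which are likewise shipped under $FLUME_HOME/lib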
Remember that the Flume Log4j appender only forwards events at the INFO, WARN and ERROR levels, not at DEBUG level (this may be a bug, but it looks like a deliberate choice for production use, where DEBUG logging is not applicable). So here's how it works.
Let's say you have a base logger class through which you are logging in your application:
package com.test.base.logging;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public final class FlumeLogger {
    // Shared SLF4J logger; log4j.properties routes this logger's events to the Flume appender
    public static final Logger l = LoggerFactory.getLogger(FlumeLogger.class.getName());
}
And say this is your application class, where logging is used:
package com.test.generate.log;

import com.test.base.logging.FlumeLogger;

public class ApplicationLog {
    public static void main(String[] args) {
        for (int i = 0; i < 10; i++) {
            // DEBUG level won't send any event to Flume
            // FlumeLogger.l.debug("Test msg : " + i);
            FlumeLogger.l.info("Test msg : " + i);
        }
    }
}
In order to make your log events go straight to the Flume agent, you need the following entries in your application's log4j.properties file:
# Remove the Flume appender from rootLogger, otherwise the same log event will be written twice.
#log4j.rootLogger = INFO, flume
log4j.rootLogger = INFO
# Attach the Flume appender to the base logger class (fully qualified). DEBUG level won't work; INFO, WARN and ERROR will.
#log4j.logger.com.test.base.logging.FlumeLogger = DEBUG, flume
log4j.logger.com.test.base.logging.FlumeLogger = INFO, flume
# Define the flume appender
log4j.appender.flume = org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname = localhost
log4j.appender.flume.Port = 41414
log4j.appender.flume.UnsafeMode = false
log4j.appender.flume.layout = org.apache.log4j.PatternLayout
# Use this conversion pattern so that Hive can later parse the fields written by Flume into table columns
log4j.appender.flume.layout.ConversionPattern = %d | %p | %c | %m%n
Now that the application is ready, let's start our Flume agent. First start your HDFS cluster, then start the Flume agent from $FLUME_HOME/bin on the node where your application is running, using the following command:
./flume-ng agent --conf conf --conf-file /home/kuntal/practice/BigData/apache-flume-1.5.0/conf/log4j-test.conf --name agent1 -Dflume.root.logger=INFO,console
Start your application and check the Flume output directory in HDFS, /user/flume/logevents.
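A quick way to confirm that events are landing in HDFS (a sketch; by default the HDFS sink names its files with a FlumeData prefix, but your sink configuration may differ):
# List the files the HDFS sink has written
hdfs dfs -ls /user/flume/logevents
# Peek at the contents of the rolled files
hdfs dfs -cat /user/flume/logevents/FlumeData.*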
Once the events are flowing nicely from Application -> Flume Agent -> HDFS, we are all set to analyse this data using Hive.
Hive Table Creation
The application log format will look like the following:
2014-09-29 12:57:41,222 | INFO | com.test.generate.log.ApplicationLog | Test msg : 0
So now let's create a Hive external table to parse and analyse this log format (timestamp is a keyword in Hive, so the column name is escaped with backticks).
Go to $HIVE_HOME/bin and start Hive (./hive):
CREATE EXTERNAL TABLE logevents (`timestamp` STRING, level STRING, className STRING, message STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'hdfs://localhost:9000/user/flume/logevents';
And now it's time to query some data:
select * from logevents;
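Beyond a plain select, you can start slicing the data. A couple of illustrative queries (my own examples, not part of the original tutorial); note that because the delimiter in the log lines is ' | ' with surrounding spaces, trim() is used to clean up the values:
-- Count events per log level
SELECT trim(level) AS lvl, count(*) AS cnt
FROM logevents
GROUP BY trim(level);

-- Show messages logged by a particular class
SELECT `timestamp`, message
FROM logevents
WHERE trim(className) = 'com.test.generate.log.ApplicationLog';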
Future Considerations
As you can see, this example rolls HDFS files based on size (and closes idle files after an hour), which will only work for short-term or transactional data that needs to be referenced often. The problem here is the number of files: at some point it will be necessary to aggregate some of this data into fewer, larger files, which will make it just a little harder to reference. Just something to think about.
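One simple way to compact the data once it is in Hive is to rewrite it into a managed table in a more efficient format. A sketch under my own assumptions (the table name and the choice of ORC are not part of the original setup):
-- Rewrite the raw log events into fewer, larger files in a columnar format
CREATE TABLE logevents_compacted STORED AS ORC AS
SELECT trim(`timestamp`) AS event_time,
       trim(level)       AS level,
       trim(className)   AS class_name,
       trim(message)     AS message
FROM logevents;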
Once you have the data in Hive, you will need to consider the next level of detail: what to do with all this interesting data. Depending on who will view the data and how it will be used, you will need to set something up for your users.
Conclusion
I hope you have found this tutorial useful and that it has helped to clear up some of the mystery of managing log files with Flume and Hive.
In my upcoming tutorials on real-world log analysis, I will show how to do real-time analysis of Apache server logs with Flume and Hive, along with some customization of Flume and Hive to suit your needs.
You will also learn how to analyse the data with Pig and represent it graphically using gnuplot.
So stay tuned for some big data adventure!