Written By : James Tay (james@2longbeans.net)
Start Date : 10th Nov 2009
Last Revision : 25th May 2010
First, my big disclaimer. This anomaly detection tool was designed to reduce the workload of a human tasked with searching through tons of log files. Each server outputs several log files, and each log file has its own characteristic messages that (probably) repeat. If a human had infinite memory and recall, he would be able to spot the one message that doesn't "look right". This program attempts to do just that.
The basic idea behind this program is to detect anomalies by observing live messages, while comparing them against an archive of old messages for that particular server (and that particular syslog facility). To be flagged as unusual, a message must meet one of two criteria: either it aligns poorly against every message in the archive, or it aligns well but carries a signature we have never seen before.
For messages that fall under the first category, consider the following example:
pam_unix(sshd:session): session opened for user smith by (uid=0)
pam_unix(sshd:session): session opened for user smith by (uid=0)
error: PAM: Authentication failure for jane from dhcp-80.umdnj.edu
In the above example, the 3rd line looks like nothing we've ever seen before. Thus, it has very poor alignment against any message in our archive. For the second scenario, consider the following messages :
error: PAM: Authentication failure for jane from dhcp-45.umdnj.edu
error: PAM: Authentication failure for jane from dhcp-20.umdnj.edu
error: PAM: Authentication failure for jane from dhcp-36.umdnj.edu
error: PAM: Authentication failure for jane from dhcp-19.umdnj.edu
error: PAM: Authentication failure for jane from omega132.rpi.edu
In the above example, the 5th line looks quite similar to the first 4, but we still consider it to be an anomaly because we are used to seeing jane login (incorrectly) from dhcp-XX.umdnj.edu, but never from rpi.edu. Thus, the 5th line ought to be flagged as an anomaly. This program accomplishes that by building signatures of the first 4 lines. It extracts the portions of text common to them, and joins those portions together. Ie,
error: PAM: Authentication failure for jane from dhcp-.umdnj.edu
Notice the changing component has been stripped out. Now, it calculates an MD5 hash of the above string. Ie,
e91172955c17e8cce7a3bab7cca43747
Since the first 4 lines will bear the same MD5 hash, we note this down in our signature database as a key/value pair, where the key is the MD5 hash, and the value is the number of times we've encountered this MD5 hash. Ie,
(e91172955c17e8cce7a3bab7cca43747, 4)
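The signature-building steps above can be sketched as follows. This is a hedged approximation, not swlog's actual code: difflib's SequenceMatcher stands in for swlog's own string aligner, so the exact common-ground string it extracts may differ from what swlog produces.

```python
import hashlib
from collections import defaultdict
from difflib import SequenceMatcher

def signature(a, b):
    # Join the substrings common to both messages, then hash the result.
    sm = SequenceMatcher(None, a, b)
    common = "".join(a[i:i + n] for i, _, n in sm.get_matching_blocks())
    return hashlib.md5(common.encode()).hexdigest()

archive = [
    "error: PAM: Authentication failure for jane from dhcp-45.umdnj.edu",
    "error: PAM: Authentication failure for jane from dhcp-20.umdnj.edu",
    "error: PAM: Authentication failure for jane from dhcp-36.umdnj.edu",
    "error: PAM: Authentication failure for jane from dhcp-19.umdnj.edu",
]

# key = MD5 of the common-ground string, value = times encountered
sig_db = defaultdict(int)
for line in archive:
    # Build each line's signature against its most similar neighbour.
    best = max((other for other in archive if other is not line),
               key=lambda o: SequenceMatcher(None, line, o).ratio())
    sig_db[signature(line, best)] += 1

# All four lines reduce to one signature, counted four times.
```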
When we receive the 5th line (presumably as a live input), this program tries to match it to the most similar line in our archive, and then identifies the portions where the string match occurs. Resulting in,
error: PAM: Authentication failure for jane from .edu
We take the above string and calculate its MD5 hash, which is,
662c49efb9bd9e77b47cce89f401ac3f
It becomes clear that we have never seen this MD5 hash before, and thus conclude that the message is unique, and therefore an anomaly.
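Putting both halves together, the whole check on a live line can be sketched like this. Again, difflib is only a stand-in for swlog's real aligner, and the variable names are illustrative rather than swlog's internals.

```python
import hashlib
from collections import defaultdict
from difflib import SequenceMatcher

def signature(a, b):
    # Hash the text the two messages have in common.
    sm = SequenceMatcher(None, a, b)
    common = "".join(a[i:i + n] for i, _, n in sm.get_matching_blocks())
    return hashlib.md5(common.encode()).hexdigest()

archive = [
    "error: PAM: Authentication failure for jane from dhcp-45.umdnj.edu",
    "error: PAM: Authentication failure for jane from dhcp-20.umdnj.edu",
    "error: PAM: Authentication failure for jane from dhcp-36.umdnj.edu",
    "error: PAM: Authentication failure for jane from dhcp-19.umdnj.edu",
]

# Build the signature database from the archive.
sig_db = defaultdict(int)
for line in archive:
    neighbour = archive[0] if line is not archive[0] else archive[1]
    sig_db[signature(line, neighbour)] += 1

# A live message arrives: align it against its nearest archived line.
live = "error: PAM: Authentication failure for jane from omega132.rpi.edu"
nearest = max(archive, key=lambda a: SequenceMatcher(None, live, a).ratio())
count = sig_db.get(signature(live, nearest), 0)
anomaly = count == 0   # a signature we have never seen before
```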
Consider a site whereby syslog messages are collected from multiple sources on a single server using syslog-ng. The syslog-ng process is configured to log messages from each host into an individual directory, with each file identified by the syslog facility (eg, auth, kern, daemon, local0, etc). Thus syslog-ng.conf would contain :
source src_udp { udp () ; } ;
destination dst_default { file ("/log/$HOST/$YEAR-$MONTH-$DAY.$FACILITY") ; } ;
log { source (src_udp) ; destination (dst_default) ; } ;
Next, we configure our servers "coke" and "pepsi" to deliver syslog messages to our syslog-ng server. After a few days, we would have accumulated several log files, for each host, for each facility, for each day. For example:
% ls -l /log/pepsi
-rw-r--r-- 1 root root 40456 Nov 21 23:08 2009-11-21.daemon
-rw-r--r-- 1 root root 40667 Nov 22 23:04 2009-11-22.daemon
-rw-r--r-- 1 root root 40751 Nov 23 23:13 2009-11-23.daemon
-rw-r--r-- 1 root root 24467 Nov 24 13:24 2009-11-24.daemon
To watch these logs, we declare a host block in swlog's config file :

[host]
  archive_dir        "/log/pepsi"
  archive_source     "[0-9-]*.daemon$"
  archive_maxage     604800
  start_field        6
  log_dir            "/log/pepsi"
  log_source         "[0-9-]*.daemon$"
  sig_db             "/var/lib/swlog/pepsi.daemon.db"
  min_match          95
  rebuild_signatures 86400
  anomaly_log        "/var/log/pepsi.anomaly.log"
At this point, it is important to talk about swlog's config file. All strings must be presented with double-quotes (eg, "foo"). The regular expressions used must be POSIX regular expressions, which aren't the same as the wildcard patterns used in shell expansions. For example, the shell pattern "*.log" matches a file named "messages.log", yet the command "echo 'messages.log' | grep '*.log'" returns no match, because '*' does not mean "any run of characters" in a regular expression.
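The distinction can be demonstrated with Python's fnmatch (shell-style globbing) and re modules. Python's re is not strictly POSIX, but it agrees with POSIX on the simple patterns used here, such as the archive_source pattern from the host block above.

```python
import re
from fnmatch import fnmatch

name = "2009-11-21.daemon"

# Shell glob: '*' on its own matches any run of characters.
assert fnmatch(name, "*.daemon")

# Regular expression: '*' only repeats the preceding item, which is
# why the archive_source pattern puts a character class in front of it
# (and its unescaped '.' matches any single character).
assert re.search(r"[0-9-]*.daemon$", name)
assert not re.search(r"[0-9-]*.daemon$", "messages.log")
```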
When swlog starts up and parses the config file, it performs the following when encountering the above host block :
After parsing and loading up data associated with all host blocks, swlog enters its main loop. During each iteration of the main loop, the following actions occur :
Messages generated by swlog are written to a log file. The following directives in its config file control this behavior, and should be specified at the beginning of the file :
debug_lvl 0
log_file "/var/run/swlog.log"
The debug level is typically set to 0. More messages are printed as the debug level is increased. The debug level can be changed during program runtime by delivering SIGUSR2 to increment it, and SIGUSR1 to decrement it. Delivering SIGINT or SIGTERM results in an orderly shutdown. Delivering SIGHUP prints runtime information and closes the current log_file (useful for rotating logfiles without program restarts). Statistics are printed out for each host block declared in the config file. The following is sample output for the host block used above.
2009-11-24 15:24:39 INFO: sig_db:/var/lib/swlog/pepsi.daemon.db count:787 ave:8 std_dev:44
2009-11-24 15:24:39 INFO: in_file:/log/pepsi/2009-11-24.daemon offset:28120
2009-11-24 15:24:39 INFO: bigbuf_size:571726 lines:6376
2009-11-24 15:24:39 INFO: cur_sigbuild:0
2009-11-24 15:24:39 INFO: sw_aligns:2739683
2009-11-24 15:24:39 INFO: sw_faligns:11440
2009-11-24 15:24:39 INFO: reg_matches:7242238
2009-11-24 15:24:39 INFO: reg_fmatches:11440
2009-11-24 15:24:39 INFO: msg_bytes:5602
2009-11-24 15:24:39 INFO: msg_processed:122
2009-11-24 15:24:39 INFO: msg_anomalies:13
2009-11-24 15:24:39 INFO: sig_gethit:110
2009-11-24 15:24:39 INFO: sig_getmiss:12
When an anomaly is detected, it is written to the anomaly log file specified in the host block. The archived message most similar to the one just picked up is provided as a reference. Consider the following entry :
2009-11-24 14:30:15 match:93% count:0
Message: Nov 24 14:30:15 192.168.20.4 sudo: golharam : TTY=pts/2 ; PWD=/home/golharam ; USER=root ; COMMAND=/bin/ls -alp /var/log/httpd/
Archive: sudo: golharam : TTY=pts/0 ; PWD=/home/golharam ; USER=root ; COMMAND=/bin/ls /var/log/httpd
In the above example, we see a user "golharam" performing a sudo. This message may be 93% similar to one found in the archives, but it is in fact unique (because its count is 0). According to the archives, swlog has seen him execute "/bin/ls /var/log/httpd" before, but we have never seen him execute "/bin/ls -alp /var/log/httpd/". Although this anomaly may not be a cause for concern, it demonstrates that swlog is able to detect subtle variations in syslog messages.
In order to accommodate changes in machine behavior, it is necessary to recognize new repeating events as normal. This involves reloading new data into the memory buffers as well as rebuilding the signature database. To allow messages to be processed during database rebuilds, a separate thread is responsible for detecting when a database is out of date and rebuilding it. This thread is started just before swlog enters the main loop. Its job is to periodically check whether any database has aged beyond the "rebuild_signatures" directive; in the above example, that is 86400 seconds, meaning daily.
Unfortunately, the process of rebuilding a signature database involves tossing out the old buffer of archived messages from RAM. This makes real time analysis impossible for that host during the rebuild. On the other hand, host blocks not involved in the rebuild can continue to run unaffected. Thus, the main thread and the signature rebuilding thread each acquire a per-host-block mutex before working on a host block.
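A minimal Python sketch of that locking scheme (class and method names are illustrative, not swlog's actual internals):

```python
import threading

class HostBlock:
    # One mutex per host block: a rebuild stalls only its own host.
    def __init__(self, name):
        self.name = name
        self.lock = threading.Lock()
        self.processed = 0
        self.rebuilt = 0

    def process_messages(self):
        # Main loop: holds the lock only while analysing new lines.
        with self.lock:
            self.processed += 1

    def rebuild_signatures(self):
        # Rebuild thread: holds the lock for the whole (slow) rebuild,
        # blocking live analysis for this one host block only.
        with self.lock:
            self.rebuilt += 1

hosts = [HostBlock("coke"), HostBlock("pepsi")]

# A single rebuild thread walks every host block...
rebuilder = threading.Thread(
    target=lambda: [h.rebuild_signatures() for h in hosts])
rebuilder.start()

# ...while the main loop keeps servicing each host block.
for h in hosts:
    h.process_messages()
rebuilder.join()
```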
Signature rebuilding is the single most CPU intensive aspect of swlog. If written to use many threads, swlog could easily consume all available CPUs in a machine. Thus, signature rebuilding was deliberately designed to run in not more than 1 thread. That much said, the system administrator may run multiple instances of swlog, each operating off a different config file, which in turn may process multiple server log files in parallel.