Monday, April 14, 2008

Log files




<br /> Log Files<br />



Log files are one of those must have things for any web application. It is just so hard to predict all of the possible ways users are going to interact with the site that gathering post live information about application behaviour is essential. It does however produce quite a lot of data.



On my current project, after a fairly significant release we resolved to check the log files to see if there were any unexpected incidents that required fixes or web content changes. The log files ran to several gigabytes containing entries not only from the application but also from a very noisy subsystems. After a quick look it became evident that it would not be effective just scanning through the log files but that some cleaning or automated analysis was required.



On the initial scan we noticed that there were some Java stack traces being repeated so something that could capture the distinct stack traces and then list the errors that caused them. In this was we could look at the general issues based on priority (number of occurrences).



Our first effort was to write a fairly simple ruby script using a hash map keyed on the text of the stack trace. Each map entry contained an array of error lines from the log files. We kicked the script off after testing with a small log file and went to lunch.



When we came back the script was still running. Sometime later it ran out of memory - not ideal.



It has been sometime since I have written any C++. Most of my work these days involves Java, C# and a little bit of Ruby for work around the codebase so it took a little while for my C++ brain to kick in. A colleague also took up the challenge by writing a solution in Python.



The results were quite startling with the Python script performing almost as well as the C++ at around 500,000 lines per second.



#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
#include <map>
#include <vector>

using namespace std;

class progress {
int count;
public:
progress() {
count = 0;
}
void ping()
{
cerr << "\b" << "|/-\\"[count++%4];
}
};

typedef map<string, vector<string>*> error_map;

void print_errors(error_map& errors)
{
for(error_map::iterator iter = errors.begin(); iter != errors.end(); iter++ ) {
vector<string>* reports = iter->second;
if (reports->size() > 1)
{
string first = (*reports)[0];
cout << "\n\n\n";
cout << reports->size() << " instances of\n";
cout << "FIRST instance " << first << "\n";
if (reports->size() > 1)
cout << "LAST instance " << (*reports)[reports->size() -1] << "\n";

cout << iter->first;
}
}
}

void process_file(char* filename, error_map& errors)
{
progress p;
cerr << "\bProcessing " << filename << "\n";

ifstream file(filename);
string line;
string pending_error;
string pending_stack;
int stack_lines = 0;
int line_number = 1;
bool skipping = false;

while (getline(file, line))
{
if (line_number % 100000 == 0)
p.ping();

line_number++;

if (line[0] == '#' || line.find("Notice") != string::npos)
{
if (pending_stack.size() != 0)
{
// Process stack trace
if (errors[pending_stack] == NULL)
errors[pending_stack] = new vector<string>();

errors[pending_stack]->insert(errors[pending_stack]->end(), pending_error);
pending_stack.clear();
}
// Ignore lines from systems we are not interested in
skipping = line.find("ignore-one") != string::npos
|| line.find("ignore-two") != string::npos
|| line.find("ignore-three") != string::npos
|| line.find("INFO") != string::npos
|| line.find("WARN") != string::npos;

if (!skipping)
pending_error = line;

stack_lines = 0;
}
else
{
if (!skipping && stack_lines < 20)
{
pending_stack.append(line);
pending_stack.append("\n");
}
stack_lines++;
}
}
}


int main (int argc, char * const argv[]) {
error_map* e = new map<string, vector<string>*>();
error_map& errors = *e;

for (int i = 1; i < argc; i++)
{
process_file(argv[i], errors);
}

print_errors(errors);
return 0;
}




The Python solution is a little shorter however



import os, sys

def is_valid(item):
for token in ['ignore-one', 'ignore-two', 'ignore-tree', 'NavigationLink instance']:
if token in item:
return False
return True

directory = sys.argv[1]

errors = {}

for filename in os.listdir(directory):
last_error = ''
last_stack = ''
stack_count = 0

file_path = os.path.join(directory, filename)
print 'Processing', file_path

source = open(file_path)

for line in source:
if line.startswith('#') or ('<Notice>' in line):
if is_valid(last_stack) and is_valid(last_error):
errors.setdefault(last_stack, []).append(last_error)
last_error = line
last_stack = ''
stack_count = 0
else:
stack_count += 1
if stack_count <= 20:
last_stack += line

source.close()

print 'Writing report to grok.txt'

out_file = open('grok.txt', 'w')

for stack, error_list in errors.iteritems():
if (len(error_list) > 1) and (len(stack.strip()) > 0):
out_file.write('Found %d items like: %s' % (len(error_list), error_list[0]))
out_file.write(stack)
out_file.write('\n\n')
out_file.write('---------------------------------------------------------');
out_file.write('\n\n')

out_file.close()


Both solutions limited the number of lines in the stack trace used for the key to 20. This was fine for non-reflected methods.




No comments: