Tuesday, April 29, 2008

Further perf: Ruby, Python, C++ file reading

Following on from the log files article, I decided to do some basic perf checks of Ruby and Python reading text files. The results were a little disappointing - performance was roughly the same, so my Ruby log-file-reading optimisation was complete rot.



Further experimentation required.

ARGV.each do |param|
  cc = 0
  File.new(param, 'r').each_line do |line|
    cc += line.size
  end
  puts "File has #{cc} characters"
end



Really simple script - and probably the most obvious approach - add up the lengths of all the lines in the file. The timings here and below are the real/user/sys figures from the Unix time command, all against the same log file of roughly 1.7GB.


File has 1673435763 characters

real 0m56.035s
user 0m33.873s
sys 0m3.609s





ARGV.each do |param|
  cc = 0
  i = File.open(param, "r")
  begin
    line = i.readline()
    until line.nil?
      cc += line.size
      line = i.readline()
    end
  rescue EOFError
    # readline signals end-of-file by raising EOFError
  ensure
    i.close
  end
  puts "File has #{cc} characters"
end



Based on previous observations, this one uses the readline method from the IO library, but it made no difference to the performance.


File has 1673435763 characters

real 0m55.569s
user 0m35.506s
sys 0m3.451s




import sys
cc = 0

source = open(sys.argv[1])
for line in source:
    cc += len(line)
source.close()
print 'file has ', cc, ' characters'



As a benchmark, a simple Python script - again adding up all the line lengths in the file.


file has 1673435763 characters

real 0m53.462s
user 0m23.147s
sys 0m3.781s




#include <stdio.h>


int main(int argc, char** argv)
{
    int count = 0;
    FILE* f = fopen(argv[1], "r");

    /* compare against EOF - a bare while (getc(f)) stops at the
       first NUL byte in the file instead, undercounting */
    while (getc(f) != EOF)
        count++;

    printf("File has %d characters\n", count);
}



Baseline written in C++, reading one character at a time with getc. The loop has to compare against EOF - a bare while (getc(f)) stops at the first NUL byte in the file instead, which is what produced the slightly lower character count below.


File has 1673392372 characters

real 0m53.167s
user 0m31.473s
sys 0m3.094s




#include <stdio.h>


int main(int argc, char** argv)
{
    int count = 0;
    FILE* f = fopen(argv[1], "r");

    /* count a buffer at a time; fread returns the number of
       bytes actually read, and 0 at end-of-file */
    char buffer[512];
    int read = fread(buffer, 1, 512, f);

    while (read > 0) {
        count += read;
        read = fread(buffer, 1, 512, f);
    }

    printf("File has %d characters\n", count);
}



A (poor) buffered version of the baseline written in C++. Note the user time below: it collapses to about a second and a half while the real time barely moves, which suggests all of these runs are dominated by disk I/O rather than by whichever language is doing the reading - and would explain why the timings all come out roughly the same.


File has 1673435763 characters

real 0m52.425s
user 0m1.526s
sys 0m4.473s
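
Given how little user time the buffered C++ version needs, the obvious further experiment is to read in large chunks from Ruby as well, rather than line by line. A minimal sketch (untimed as yet - the 512KB chunk size is an arbitrary choice):


ARGV.each do |param|
  cc = 0
  File.open(param, 'r') do |io|
    # read 512KB at a time; IO#read returns nil at end-of-file
    while chunk = io.read(512 * 1024)
      cc += chunk.size
    end
  end
  puts "File has #{cc} characters"
end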



Sunday, April 27, 2008

Blogging Code





I quite often find myself blogging about program source code. That code is typically stored in source files, which I then run through a pretty printer (something like source-highlight). Combining everything together means copy and pasting - not the most repeatable of processes - and since the code and the article tend to evolve together, I end up copying and pasting over and over.


So I came up with mashup: a small Ruby program that processes HTML files and handles include directives, inlining either the contents of another file or - as here - the output of a command.


The following source was included with:



<x:include value="source-highlight -o STDOUT ~/projects/mashup/mashup"/>

By running:



mashup blogging-code.html > blogging-code-publish.html


#!/usr/bin/ruby

ARGV.each do |arg|
  contents = File.new(arg).read()

  # <s:include value="..."/> inlines the contents of another file
  contents.sub!(/<s:include\s+value="([^"]*)"\s*\/>/) do |match|
    replacement = File.new($1).read()
    replacement.gsub!(/.*<body>/m, '')
    replacement.gsub!(/<\/body>.*/m, '')
    replacement
  end

  # <x:include value="..."/> inlines the output of a command
  contents.sub!(/<x:include\s+value="([^"]*)"\s*\/>/) do |match|
    replacement = `#{$1}`
    replacement.gsub!(/.*<body>/m, '')
    replacement.gsub!(/<\/body>.*/m, '')
    replacement
  end

  puts contents
end
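

As well as shelling out via x:include, the script will inline the contents of another file directly with the s:include form - for example (filename purely illustrative):


<s:include value="code-fragment.html"/>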



Monday, April 14, 2008

Log files







Log files are one of those must-have things for any web application. It is just so hard to predict all of the possible ways users are going to interact with the site that gathering information about application behaviour once the site is live is essential. It does, however, produce quite a lot of data.



On my current project, after a fairly significant release, we resolved to check the log files to see if there were any unexpected incidents that required fixes or web content changes. The log files ran to several gigabytes, containing entries not only from the application but also from some very noisy subsystems. After a quick look it became evident that simply scanning through the log files would not be effective and that some cleaning or automated analysis was required.



On the initial scan we noticed that some Java stack traces were being repeated, so we wanted something that could capture the distinct stack traces and then list the errors that caused each of them. In this way we could look at the general issues in priority order (number of occurrences).



Our first effort was a fairly simple Ruby script using a hash keyed on the text of the stack trace, with each map entry holding an array of error lines from the log files. We kicked the script off after testing with a small log file and went to lunch.
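
It looked something along these lines - a sketch reconstructed from memory rather than the original script, with the '#' entry-marker convention borrowed from the later versions below:


errors = Hash.new { |hash, key| hash[key] = [] }

ARGV.each do |filename|
  pending_error = nil
  pending_stack = ''

  File.new(filename, 'r').each_line do |line|
    if line =~ /^#/
      # a new log entry - file the completed stack trace against its error line
      errors[pending_stack] << pending_error if pending_error && !pending_stack.empty?
      pending_error = line
      pending_stack = ''
    else
      pending_stack << line
    end
  end
end

errors.each do |stack, reports|
  puts "#{reports.size} instances of #{reports.first}"
  puts stack
end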



When we came back the script was still running. Sometime later it ran out of memory - not ideal.



It has been some time since I have written any C++. Most of my work these days involves Java, C# and a little bit of Ruby for work around the codebase, so it took a little while for my C++ brain to kick in. A colleague also took up the challenge by writing a solution in Python.



The results were quite startling, with the Python script performing almost as well as the C++ one at around 500,000 lines per second.



#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
#include <map>
#include <vector>

using namespace std;

// Spinner so there is some sign of life on stderr during long runs
class progress {
    int count;
public:
    progress() {
        count = 0;
    }
    void ping()
    {
        cerr << "\b" << "|/-\\"[count++ % 4];
    }
};

typedef map<string, vector<string>*> error_map;

void print_errors(error_map& errors)
{
    for (error_map::iterator iter = errors.begin(); iter != errors.end(); iter++) {
        vector<string>* reports = iter->second;
        if (reports->size() > 1)
        {
            string first = (*reports)[0];
            cout << "\n\n\n";
            cout << reports->size() << " instances of\n";
            cout << "FIRST instance " << first << "\n";
            if (reports->size() > 1)
                cout << "LAST instance " << (*reports)[reports->size() - 1] << "\n";

            cout << iter->first;
        }
    }
}

void process_file(char* filename, error_map& errors)
{
    progress p;
    cerr << "\bProcessing " << filename << "\n";

    ifstream file(filename);
    string line;
    string pending_error;
    string pending_stack;
    int stack_lines = 0;
    int line_number = 1;
    bool skipping = false;

    while (getline(file, line))
    {
        if (line_number % 100000 == 0)
            p.ping();

        line_number++;

        // A '#' or "Notice" line starts a new log entry
        if ((!line.empty() && line[0] == '#') || line.find("Notice") != string::npos)
        {
            if (pending_stack.size() != 0)
            {
                // Record the completed stack trace against its error line
                if (errors[pending_stack] == NULL)
                    errors[pending_stack] = new vector<string>();

                errors[pending_stack]->insert(errors[pending_stack]->end(), pending_error);
                pending_stack.clear();
            }
            // Ignore lines from systems we are not interested in
            skipping = line.find("ignore-one") != string::npos
                || line.find("ignore-two") != string::npos
                || line.find("ignore-three") != string::npos
                || line.find("INFO") != string::npos
                || line.find("WARN") != string::npos;

            if (!skipping)
                pending_error = line;

            stack_lines = 0;
        }
        else
        {
            // Accumulate at most 20 lines of the stack trace as the key
            if (!skipping && stack_lines < 20)
            {
                pending_stack.append(line);
                pending_stack.append("\n");
            }
            stack_lines++;
        }
    }
}


int main (int argc, char * const argv[]) {
    error_map* e = new map<string, vector<string>*>();
    error_map& errors = *e;

    for (int i = 1; i < argc; i++)
    {
        process_file(argv[i], errors);
    }

    print_errors(errors);
    return 0;
}




The Python solution is a little shorter, however:



import os, sys

def is_valid(item):
    for token in ['ignore-one', 'ignore-two', 'ignore-three', 'NavigationLink instance']:
        if token in item:
            return False
    return True

directory = sys.argv[1]

errors = {}

for filename in os.listdir(directory):
    last_error = ''
    last_stack = ''
    stack_count = 0

    file_path = os.path.join(directory, filename)
    print 'Processing', file_path

    source = open(file_path)

    for line in source:
        if line.startswith('#') or ('<Notice>' in line):
            if is_valid(last_stack) and is_valid(last_error):
                errors.setdefault(last_stack, []).append(last_error)
            last_error = line
            last_stack = ''
            stack_count = 0
        else:
            stack_count += 1
            if stack_count <= 20:
                last_stack += line

    source.close()

print 'Writing report to grok.txt'

out_file = open('grok.txt', 'w')

for stack, error_list in errors.iteritems():
    if (len(error_list) > 1) and (len(stack.strip()) > 0):
        out_file.write('Found %d items like: %s' % (len(error_list), error_list[0]))
        out_file.write(stack)
        out_file.write('\n\n')
        out_file.write('---------------------------------------------------------')
        out_file.write('\n\n')

out_file.close()
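

The script takes a directory, processes every file in it, and writes the report to grok.txt. Assuming it is saved as grok.py (the name is illustrative), it is run as something like:


python grok.py logs/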


Both solutions limited the number of lines of the stack trace used for the key to 20. This was fine for non-reflected methods - reflective call chains can push the frames that distinguish one trace from another beyond a 20-line window.