Tuesday, April 29, 2008

Further perf: Ruby, Python, C++ file reading

Following on from the log files article, I decided to do some basic perf checks of Ruby and Python reading text files. The results were a little disappointing - performance was roughly the same, so my Ruby log-file-reading optimisation was complete rot.



Further experimentation required.

ARGV.each do |param|
  cc = 0
  File.new(param, 'r').each_line do |line|
    cc += line.size
  end
  puts "File has #{cc} characters"
end



Really simple script - and probably the most obvious approach - add up the lengths of all the lines in the file. The timings here and below are the real/user/sys figures from the Unix time command, all against the same log file of roughly 1.7GB.


File has 1673435763 characters

real 0m56.035s
user 0m33.873s
sys 0m3.609s





ARGV.each do |param|
  cc = 0
  i = File.open(param, "r")
  begin
    line = i.readline()
    until line.nil?
      cc += line.size
      line = i.readline()
    end
  rescue EOFError
    # readline signals end-of-file by raising EOFError
  ensure
    i.close
  end
  puts "File has #{cc} characters"
end



Based on previous observations, this one uses the readline method from the IO library, but it made no difference to the performance.


File has 1673435763 characters

real 0m55.569s
user 0m35.506s
sys 0m3.451s




import sys
cc = 0

source = open(sys.argv[1])
for line in source:
    cc += len(line)
source.close()
print 'file has ', cc, ' characters'



As a benchmark, a simple Python script - again adding up all the line lengths in the file.


file has 1673435763 characters

real 0m53.462s
user 0m23.147s
sys 0m3.781s




#include <stdio.h>


int main(int argc, char** argv)
{
    int count = 0;
    FILE* f = fopen(argv[1], "r");

    /* compare against EOF - a bare while (getc(f)) stops at the
       first NUL byte in the file instead, undercounting */
    while (getc(f) != EOF)
        count++;

    printf("File has %d characters\n", count);
}



Baseline written in C++, reading one character at a time with getc. The loop has to compare against EOF - a bare while (getc(f)) stops at the first NUL byte in the file instead, which is what produced the slightly lower character count below.


File has 1673392372 characters

real 0m53.167s
user 0m31.473s
sys 0m3.094s




#include <stdio.h>


int main(int argc, char** argv)
{
    int count = 0;
    FILE* f = fopen(argv[1], "r");

    /* count a buffer at a time; fread returns the number of
       bytes actually read, and 0 at end-of-file */
    char buffer[512];
    int read = fread(buffer, 1, 512, f);

    while (read > 0) {
        count += read;
        read = fread(buffer, 1, 512, f);
    }

    printf("File has %d characters\n", count);
}



A (poor) buffered version of the baseline written in C++. Note the user time below: it collapses to about a second and a half while the real time barely moves, which suggests all of these runs are dominated by disk I/O rather than by whichever language is doing the reading - and would explain why the timings all come out roughly the same.


File has 1673435763 characters

real 0m52.425s
user 0m1.526s
sys 0m4.473s
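
Given how little user time the buffered C++ version needs, the obvious further experiment is to read in large chunks from Ruby as well, rather than line by line. A minimal sketch (untimed as yet - the 512KB chunk size is an arbitrary choice):


ARGV.each do |param|
  cc = 0
  File.open(param, 'r') do |io|
    # read 512KB at a time; IO#read returns nil at end-of-file
    while chunk = io.read(512 * 1024)
      cc += chunk.size
    end
  end
  puts "File has #{cc} characters"
end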



Sunday, April 27, 2008

Blogging Code





I quite often find myself blogging about program source code. That code is typically stored in source files, which I then run through a pretty printer (something like source-highlight). Combining everything together means copy and pasting - not the most repeatable of processes - and since the code and the article tend to evolve together, I end up copying and pasting over and over.


So I came up with mashup: a small Ruby program that processes HTML files and handles include directives, inlining either the contents of another file or - as here - the output of a command.


The following source was included with:



<x:include value="source-highlight -o STDOUT ~/projects/mashup/mashup"/>

By running:



mashup blogging-code.html > blogging-code-publish.html


#!/usr/bin/ruby

ARGV.each do |arg|
  contents = File.new(arg).read()

  # <s:include value="..."/> inlines the contents of another file
  contents.sub!(/<s:include\s+value="([^"]*)"\s*\/>/) do |match|
    replacement = File.new($1).read()
    replacement.gsub!(/.*<body>/m, '')
    replacement.gsub!(/<\/body>.*/m, '')
    replacement
  end

  # <x:include value="..."/> inlines the output of a command
  contents.sub!(/<x:include\s+value="([^"]*)"\s*\/>/) do |match|
    replacement = `#{$1}`
    replacement.gsub!(/.*<body>/m, '')
    replacement.gsub!(/<\/body>.*/m, '')
    replacement
  end

  puts contents
end
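

As well as shelling out via x:include, the script will inline the contents of another file directly with the s:include form - for example (filename purely illustrative):


<s:include value="code-fragment.html"/>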



Monday, April 14, 2008

Log files







Log files are one of those must-have things for any web application. It is just so hard to predict all of the possible ways users are going to interact with the site that gathering information about application behaviour once the site is live is essential. It does, however, produce quite a lot of data.



On my current project, after a fairly significant release, we resolved to check the log files to see if there were any unexpected incidents that required fixes or web content changes. The log files ran to several gigabytes, containing entries not only from the application but also from some very noisy subsystems. After a quick look it became evident that simply scanning through the log files would not be effective and that some cleaning or automated analysis was required.



On the initial scan we noticed that some Java stack traces were being repeated, so we wanted something that could capture the distinct stack traces and then list the errors that caused each of them. In this way we could look at the general issues in priority order (number of occurrences).



Our first effort was a fairly simple Ruby script using a hash keyed on the text of the stack trace, with each map entry holding an array of error lines from the log files. We kicked the script off after testing with a small log file and went to lunch.
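
It looked something along these lines - a sketch reconstructed from memory rather than the original script, with the '#' entry-marker convention borrowed from the later versions below:


errors = Hash.new { |hash, key| hash[key] = [] }

ARGV.each do |filename|
  pending_error = nil
  pending_stack = ''

  File.new(filename, 'r').each_line do |line|
    if line =~ /^#/
      # a new log entry - file the completed stack trace against its error line
      errors[pending_stack] << pending_error if pending_error && !pending_stack.empty?
      pending_error = line
      pending_stack = ''
    else
      pending_stack << line
    end
  end
end

errors.each do |stack, reports|
  puts "#{reports.size} instances of #{reports.first}"
  puts stack
end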



When we came back the script was still running. Sometime later it ran out of memory - not ideal.



It has been some time since I have written any C++. Most of my work these days involves Java, C# and a little bit of Ruby for work around the codebase, so it took a little while for my C++ brain to kick in. A colleague also took up the challenge by writing a solution in Python.



The results were quite startling, with the Python script performing almost as well as the C++ one at around 500,000 lines per second.



#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
#include <map>
#include <vector>

using namespace std;

// Spinner so there is some sign of life on stderr during long runs
class progress {
    int count;
public:
    progress() {
        count = 0;
    }
    void ping()
    {
        cerr << "\b" << "|/-\\"[count++ % 4];
    }
};

typedef map<string, vector<string>*> error_map;

void print_errors(error_map& errors)
{
    for (error_map::iterator iter = errors.begin(); iter != errors.end(); iter++) {
        vector<string>* reports = iter->second;
        if (reports->size() > 1)
        {
            string first = (*reports)[0];
            cout << "\n\n\n";
            cout << reports->size() << " instances of\n";
            cout << "FIRST instance " << first << "\n";
            if (reports->size() > 1)
                cout << "LAST instance " << (*reports)[reports->size() - 1] << "\n";

            cout << iter->first;
        }
    }
}

void process_file(char* filename, error_map& errors)
{
    progress p;
    cerr << "\bProcessing " << filename << "\n";

    ifstream file(filename);
    string line;
    string pending_error;
    string pending_stack;
    int stack_lines = 0;
    int line_number = 1;
    bool skipping = false;

    while (getline(file, line))
    {
        if (line_number % 100000 == 0)
            p.ping();

        line_number++;

        // A '#' or "Notice" line starts a new log entry
        if ((!line.empty() && line[0] == '#') || line.find("Notice") != string::npos)
        {
            if (pending_stack.size() != 0)
            {
                // Record the completed stack trace against its error line
                if (errors[pending_stack] == NULL)
                    errors[pending_stack] = new vector<string>();

                errors[pending_stack]->insert(errors[pending_stack]->end(), pending_error);
                pending_stack.clear();
            }
            // Ignore lines from systems we are not interested in
            skipping = line.find("ignore-one") != string::npos
                || line.find("ignore-two") != string::npos
                || line.find("ignore-three") != string::npos
                || line.find("INFO") != string::npos
                || line.find("WARN") != string::npos;

            if (!skipping)
                pending_error = line;

            stack_lines = 0;
        }
        else
        {
            // Accumulate at most 20 lines of the stack trace as the key
            if (!skipping && stack_lines < 20)
            {
                pending_stack.append(line);
                pending_stack.append("\n");
            }
            stack_lines++;
        }
    }
}


int main (int argc, char * const argv[]) {
    error_map* e = new map<string, vector<string>*>();
    error_map& errors = *e;

    for (int i = 1; i < argc; i++)
    {
        process_file(argv[i], errors);
    }

    print_errors(errors);
    return 0;
}




The Python solution is a little shorter, however:



import os, sys

def is_valid(item):
    for token in ['ignore-one', 'ignore-two', 'ignore-three', 'NavigationLink instance']:
        if token in item:
            return False
    return True

directory = sys.argv[1]

errors = {}

for filename in os.listdir(directory):
    last_error = ''
    last_stack = ''
    stack_count = 0

    file_path = os.path.join(directory, filename)
    print 'Processing', file_path

    source = open(file_path)

    for line in source:
        if line.startswith('#') or ('<Notice>' in line):
            if is_valid(last_stack) and is_valid(last_error):
                errors.setdefault(last_stack, []).append(last_error)
            last_error = line
            last_stack = ''
            stack_count = 0
        else:
            stack_count += 1
            if stack_count <= 20:
                last_stack += line

    source.close()

print 'Writing report to grok.txt'

out_file = open('grok.txt', 'w')

for stack, error_list in errors.iteritems():
    if (len(error_list) > 1) and (len(stack.strip()) > 0):
        out_file.write('Found %d items like: %s' % (len(error_list), error_list[0]))
        out_file.write(stack)
        out_file.write('\n\n')
        out_file.write('---------------------------------------------------------')
        out_file.write('\n\n')

out_file.close()
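

The script takes a directory, processes every file in it, and writes the report to grok.txt. Assuming it is saved as grok.py (the name is illustrative), it is run as something like:


python grok.py logs/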


Both solutions limited the number of lines of the stack trace used for the key to 20. This was fine for non-reflected methods - reflective call chains can push the frames that distinguish one trace from another beyond a 20-line window.