Crafting Software

Tuesday, May 13, 2008

Keyboard touching

I am definitely with Jeff Atwood on when it is appropriate to touch monitors - NEVER.

Tuesday, April 29, 2008

Further perf ruby, python C++ file reading

Following on from the log files article I decided to do some basic perf checks of ruby and python reading text files. The results were a little disapointing - performance was roughly the same, so my ruby log file reading optimisation was complete rot.

Further experimentation required.

`ARGV.each do \| param \| cc = 0 File.new(param, 'r').each_line do \|line\| cc += line.size end puts "File has #{cc} characters" end` Processing /Users/gcb/work/log-analysis/cc.rb ... created /Users/gcb/work/log-analysis/cc.rb.html	Realy simple script - and probably the most obvious - add up the length of all the lines in the file.	File has 1673435763 characters real 0m56.035s user 0m33.873s sys 0m3.609s
`ARGV.each do \| param \| cc = 0 i = File.open(param, "r") begin line = i.readline() until line.nil? cc += line.size line = i.readline() end rescue Exception => e ensure i.close end puts "File has #{cc} characters" end` Processing /Users/gcb/work/log-analysis/cc1.rb ... created /Users/gcb/work/log-analysis/cc1.rb.html	Based on previoud observations this one uses the realine method from the IO library but did not affect the performance.	File has 1673435763 characters real 0m55.569s user 0m35.506s sys 0m3.451s
`import sys cc = 0 source = open(sys.argv[1]) for line in source: cc += len(line) source.close() print 'file has ', cc, ' characters'` Processing /Users/gcb/work/log-analysis/cc.py ... created /Users/gcb/work/log-analysis/cc.py.html	As a benchmark a simple python scrpt - again adding up all the line lengths in the file.	file has 1673435763 characters real 0m53.462s user 0m23.147s sys 0m3.781s
`#include <stdio.h> int main(int argc, char** argv) { int count = 0; FILE* f = fopen(argv[1], "r"); while (getc(f)) count++; printf("File has %d characters\n", count); }` Processing /Users/gcb/work/log-analysis/cc.cpp ... created /Users/gcb/work/log-analysis/cc.cpp.html	Baseline written in C++	File has 1673392372 characters real 0m53.167s user 0m31.473s sys 0m3.094s
`#include <stdio.h> int main(int argc, char** argv) { int count = 0; FILE* f = fopen(argv[1], "r"); char buffer[512]; int read = fread(buffer, 1, 512, f); while (read > 0) { count += read; read = fread(buffer, 1, 512, f); } printf("File has %d characters\n", count); }` Processing /Users/gcb/work/log-analysis/cc1.cpp ... created /Users/gcb/work/log-analysis/cc1.cpp.html	A (poor) buffered version of the baseline written in C++	File has 1673435763 characters real 0m52.425s user 0m1.526s sys 0m4.473s

Sunday, April 27, 2008

Blogging Code

I quite often find myself blogging about program source code, that code is typically stored in source files which I then run through a pretty printer (something like source-highlight). Combining everyting together means some copy and pasting - not the most repeatable process and quite often the code and article evolve together - so I end up copying and pasting quite often.

So I came up with mashup. A small ruby program to process html files and handle include directives to do inline include of another file and for this purpose the results of a process

The following source was include with


<x:include value="source-highlight -o STDOUT ~/projects/mashup/mashup"/>

By running:


mashup blogging-code.html > blogging-code-publish.html

#!/usr/bin/ruby

ARGV.each do |arg|
  contents = File.new(arg).read()

  contents.sub!(/<s:include\s+value="([^"]*)"\s*\/>/) do |match|
    replacement = File.new($1).read()
    replacement.gsub!(/.*<body>/m, '')
    replacement.gsub!(/<\/body>.*/m, '')
    replacement
  end

  contents.sub!(/<x:include\s+value="([^"]*)"\s*\/>/) do |match|
    replacement = `#{$1}`
    replacement.gsub!(/.*<body>/m, '')
    replacement.gsub!(/<\/body>.*/m, '')
    replacement
  end

  puts contents
end

Monday, April 14, 2008

Log files

<br /> Log Files<br />

Log files are one of those must have things for any web application. It is just so hard to predict all of the possible ways users are going to interact with the site that gathering post live information about application behaviour is essential. It does however produce quite a lot of data.

On my current project, after a fairly significant release we resolved to check the log files to see if there were any unexpected incidents that required fixes or web content changes. The log files ran to several gigabytes containing entries not only from the application but also from a very noisy subsystems. After a quick look it became evident that it would not be effective just scanning through the log files but that some cleaning or automated analysis was required.

On the initial scan we noticed that there were some Java stack traces being repeated so something that could capture the distinct stack traces and then list the errors that caused them. In this was we could look at the general issues based on priority (number of occurrences).

Our first effort was to write a fairly simple ruby script using a hash map keyed on the text of the stack trace. Each map entry contained an array of error lines from the log files. We kicked the script off after testing with a small log file and went to lunch.

When we came back the script was still running. Sometime later it ran out of memory - not ideal.

It has been sometime since I have written any C++. Most of my work these days involves Java, C# and a little bit of Ruby for work around the codebase so it took a little while for my C++ brain to kick in. A colleague also took up the challenge by writing a solution in Python.

The results were quite startling with the Python script performing almost as well as the C++ at around 500,000 lines per second.

#include <iostream>
    #include <fstream>
    #include <sstream>
    #include <string>
    #include <map>
    #include <vector>

    using namespace std;

    class progress {
       int count;
    public:
       progress() {
          count = 0;
       }
       void ping()
       {
          cerr << "\b" << "|/-\\"[count++%4];
       }
    };

    typedef map<string, vector<string>*> error_map;

    void print_errors(error_map& errors)
    {
       for(error_map::iterator iter = errors.begin(); iter != errors.end(); iter++ ) {
          vector<string>* reports = iter->second;
          if (reports->size() > 1)
          {
             string first = (*reports)[0];
             cout << "\n\n\n";
             cout << reports->size() << " instances of\n";
             cout << "FIRST instance " << first << "\n";
             if (reports->size() > 1)
                cout << "LAST  instance " << (*reports)[reports->size() -1] << "\n";

             cout << iter->first;
          }
       }
    }

    void process_file(char* filename, error_map& errors)
    {
       progress p;
       cerr << "\bProcessing " << filename << "\n";

       ifstream file(filename);
       string line;
       string pending_error;
       string pending_stack;
       int stack_lines = 0;
       int line_number = 1;
       bool skipping = false;

       while (getline(file, line))
       {
          if (line_number % 100000 == 0)
             p.ping();

          line_number++;

          if (line[0] == '#' || line.find("Notice") != string::npos)
          {
             if (pending_stack.size() != 0)
             {
                // Process stack trace
                if (errors[pending_stack] == NULL)
                   errors[pending_stack] = new vector<string>();

                errors[pending_stack]->insert(errors[pending_stack]->end(), pending_error);
                pending_stack.clear();
             }
             // Ignore lines from systems we are not interested in
             skipping = line.find("ignore-one") != string::npos
             || line.find("ignore-two") != string::npos
             || line.find("ignore-three") != string::npos
             || line.find("INFO") != string::npos
             || line.find("WARN") != string::npos;

             if (!skipping)
                pending_error = line;

             stack_lines = 0;
          }
          else
          {
             if (!skipping && stack_lines < 20)
             {
                pending_stack.append(line);
                pending_stack.append("\n");
             }
             stack_lines++;
          }
       }
    }


    int main (int argc, char * const argv[]) {
       error_map* e = new map<string, vector<string>*>();
       error_map& errors = *e;

       for (int i = 1; i < argc; i++)
       {
          process_file(argv[i], errors);
       }

       print_errors(errors);  
       return 0;
    }

The Python solution is a little shorter however

import os, sys

    def is_valid(item):
        for token in ['ignore-one', 'ignore-two', 'ignore-tree', 'NavigationLink instance']:
            if token in item:
                return False
        return True

    directory = sys.argv[1]

    errors = {}

    for filename in os.listdir(directory):
        last_error = ''
        last_stack = ''
        stack_count = 0

        file_path = os.path.join(directory, filename)
        print 'Processing', file_path

        source = open(file_path)

        for line in source:
            if line.startswith('#') or ('<Notice>' in line):
                if is_valid(last_stack) and is_valid(last_error):
                    errors.setdefault(last_stack, []).append(last_error)
                last_error = line
                last_stack = ''
                stack_count = 0
            else:
                stack_count += 1
                if stack_count <= 20:
                    last_stack += line

        source.close()

    print 'Writing report to grok.txt'

    out_file = open('grok.txt', 'w')

    for stack, error_list in errors.iteritems():
        if (len(error_list) > 1) and (len(stack.strip()) > 0):
            out_file.write('Found %d items like: %s' % (len(error_list), error_list[0]))
            out_file.write(stack)
            out_file.write('\n\n')
            out_file.write('---------------------------------------------------------');
            out_file.write('\n\n')

    out_file.close()

Both solutions limited the number of lines in the stack trace used for the key to 20. This was fine for non-reflected methods.

Saturday, March 1, 2008

Ruby initialise array and add in one step

I keep forgetting the syntax of this so perhaps this will help.

Given a hash where each value is an array of values this gives a nice concise way of setting things up (even if it is a little obscure).


map = {}

(map[:key] ||= []) << :value

The above results in a hash containing a key value of :key with an array containing :value

Tuesday, February 26, 2008

cruisecontrolrb and java

I have been meaning to setup a cruisecontrol.rb instance for some time. Cruisecontrol.rb is a implementation of continuous integration build server in ruby. It is a simple self-contained download with a really quick initial setup time.

I have just started an open source java project for stack trance analysis called why and wanted a build server so thought I would give cruisecontrol.rb a go.

The download took a couple of minutes. Adding the project (cruise add why --url https://why.svn.sourceforge.net/svnroot/why) took a few more minutes to do the initial checkout.

Started cruise (cruise start) in a command shell and...build failed.

I needed to provide a custom build command in the cruise_config.rb file in projects/why:


Project.configure do |project|
  
  project.build_command = 'b'
end

I already had a shell script called b to run ant and build the project. Result - build failed.

Small scratching of head and a chmod +x b later and - build passed.

From start to finish about 20 mins - most of which was download and initial checkout.

Sunday, February 3, 2008

Tagging builds in subversion

At the end of every build I like to add a tag to the project source repository. Although the build log contains the revision of the source being built and so the build can be re-created, I like the fact that all the information about the source and the build is in one place.


  <target name="tag-build" description="tag the build revision" >
     <svn javahl="false">
        <status path="${basedir}" revisionProperty="svn.revision" />
     </svn>
     <echo message="Tagging revision ${svn.revision} as tag ${label}" />
     <svn username="cruise" password="cruise" javahl="false">
        <copy srcURL="http://repository/path/trunk" revision="${svn.revision}" destUrl="http://repository/path/tags/${label}" message="Cruise: Tagging build ${label}" />
     </svn>
  </target>