I am definitely with Jeff Atwood on when it is appropriate to touch monitors - NEVER.
Tuesday, May 13, 2008
Tuesday, April 29, 2008
Further perf: Ruby, Python and C++ file reading
Following on from the log files article, I decided to do some basic perf checks of Ruby and Python reading text files. The results were a little disappointing - performance was roughly the same, so my Ruby log file reading optimisation was complete rot.
Further experimentation required.
cc.rb | Really simple script - and probably the most obvious - add up the length of all the lines in the file.
cc1.rb | Based on previous observations this one uses the readline method from the IO library, but it did not affect the performance.
cc.py | As a benchmark, a simple Python script - again adding up all the line lengths in the file.
cc.cpp | Baseline written in C++.
cc1.cpp | A (poor) buffered version of the baseline written in C++.
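The check that all of these scripts perform, adding up the length of every line in a file, can be sketched in Python as follows. This is a minimal illustration and not the original cc.py; the function name total_line_length is made up for the example.

```python
def total_line_length(path):
    """Add up the length of every line in the file, newlines included."""
    total = 0
    with open(path) as source:
        # Iterating over the file object yields one line at a time,
        # so the whole file is never held in memory.
        for line in source:
            total += len(line)
    return total
```

Timing a call to this function against a large log file gives a rough lines-per-second figure to compare between languages.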
Sunday, April 27, 2008
Blogging Code
I quite often find myself blogging about program source code. That code is typically stored in source files, which I then run through a pretty printer (something like source-highlight). Combining everything together means some copying and pasting - not the most repeatable process - and because the code and the article usually evolve together, I end up copying and pasting quite often.
So I came up with mashup: a small Ruby program that processes HTML files and handles include directives, inlining either another file or, for this purpose, the output of a process.
The following source was included with
<x:include value="source-highlight -o STDOUT ~/projects/mashup/mashup"/>
By running:
mashup blogging-code.html > blogging-code-publish.html
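#!/usr/bin/ruby
# mashup: expand include directives in HTML files and print the result.

# Keep only what sits between <body> and </body>, if those tags are present.
def body_of(html)
  html.sub(/.*<body>/m, '').sub(/<\/body>.*/m, '')
end

ARGV.each do |arg|
  contents = File.read(arg)
  # <s:include value="path"/> inlines the body of another file.
  # gsub! (rather than sub!) expands every directive, not just the first.
  contents.gsub!(/<s:include\s+value="([^"]*)"\s*\/>/) do
    body_of(File.read($1))
  end
  # <x:include value="command"/> inlines the body of a command's output.
  contents.gsub!(/<x:include\s+value="([^"]*)"\s*\/>/) do
    body_of(`#{$1}`)
  end
  puts contents
end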
Monday, April 14, 2008
Log files
Log files are one of those must-have things for any web application. It is just so hard to predict all of the possible ways users are going to interact with the site that gathering post-live information about application behaviour is essential. It does, however, produce quite a lot of data.
On my current project, after a fairly significant release we resolved to check the log files to see if there were any unexpected incidents that required fixes or web content changes. The log files ran to several gigabytes, containing entries not only from the application but also from some very noisy subsystems. After a quick look it became evident that just scanning through the log files would not be effective and that some cleaning or automated analysis was required.
On the initial scan we noticed that some Java stack traces were being repeated, so we wanted something that could capture the distinct stack traces and then list the errors that caused them. In this way we could look at the general issues in priority order (by number of occurrences).
Our first effort was to write a fairly simple Ruby script using a hash map keyed on the text of the stack trace. Each map entry contained an array of error lines from the log files. We kicked the script off after testing with a small log file and went to lunch.
When we came back the script was still running. Some time later it ran out of memory - not ideal.
So I decided to rewrite the script in C++. It has been some time since I have written any C++. Most of my work these days involves Java, C# and a little bit of Ruby for work around the codebase, so it took a little while for my C++ brain to kick in. A colleague also took up the challenge by writing a solution in Python.
The results were quite startling with the Python script performing almost as well as the C++ at around 500,000 lines per second.
#include <iostream>
#include <fstream>
#include <string>
#include <map>
#include <vector>

using namespace std;

// Console spinner written to stderr to show the program is still working.
class progress {
    int count;
public:
    progress() : count(0) {}
    void ping()
    {
        cerr << "\b" << "|/-\\"[count++ % 4];
    }
};

// Stack trace text -> the log lines that reported it.
typedef map<string, vector<string> > error_map;

void print_errors(const error_map& errors)
{
    for (error_map::const_iterator iter = errors.begin(); iter != errors.end(); ++iter) {
        const vector<string>& reports = iter->second;
        // Only report stack traces that occurred more than once
        if (reports.size() > 1)
        {
            cout << "\n\n\n";
            cout << reports.size() << " instances of\n";
            cout << "FIRST instance " << reports.front() << "\n";
            cout << "LAST instance " << reports.back() << "\n";
            cout << iter->first;
        }
    }
}

// Record the pending stack trace, if any, against the line that reported it.
void flush_pending(error_map& errors, string& pending_stack, const string& pending_error)
{
    if (!pending_stack.empty())
    {
        errors[pending_stack].push_back(pending_error);
        pending_stack.clear();
    }
}

void process_file(const char* filename, error_map& errors)
{
    progress p;
    cerr << "\bProcessing " << filename << "\n";
    ifstream file(filename);
    string line;
    string pending_error;
    string pending_stack;
    int stack_lines = 0;
    int line_number = 1;
    bool skipping = false;
    while (getline(file, line))
    {
        if (line_number % 100000 == 0)
            p.ping();
        line_number++;
        // A line starting with '#' or containing "Notice" begins a new log entry
        if ((!line.empty() && line[0] == '#') || line.find("Notice") != string::npos)
        {
            flush_pending(errors, pending_stack, pending_error);
            // Ignore lines from systems we are not interested in
            skipping = line.find("ignore-one") != string::npos
                || line.find("ignore-two") != string::npos
                || line.find("ignore-three") != string::npos
                || line.find("INFO") != string::npos
                || line.find("WARN") != string::npos;
            if (!skipping)
                pending_error = line;
            stack_lines = 0;
        }
        else
        {
            // Key on at most the first 20 lines of the stack trace
            if (!skipping && stack_lines < 20)
            {
                pending_stack.append(line);
                pending_stack.append("\n");
            }
            stack_lines++;
        }
    }
    // Record a stack trace that runs to the end of the file
    flush_pending(errors, pending_stack, pending_error);
}

int main(int argc, char* const argv[])
{
    error_map errors;
    for (int i = 1; i < argc; i++)
    {
        process_file(argv[i], errors);
    }
    print_errors(errors);
    return 0;
}
The Python solution is a little shorter, however:
import os, sys

def is_valid(item):
    # Reject entries from systems we are not interested in
    for token in ['ignore-one', 'ignore-two', 'ignore-three', 'NavigationLink instance']:
        if token in item:
            return False
    return True

directory = sys.argv[1]
errors = {}
for filename in os.listdir(directory):
    last_error = ''
    last_stack = ''
    stack_count = 0
    file_path = os.path.join(directory, filename)
    print('Processing', file_path)
    source = open(file_path)
    for line in source:
        # A line starting with '#' or containing '<Notice>' begins a new log entry
        if line.startswith('#') or ('<Notice>' in line):
            if is_valid(last_stack) and is_valid(last_error):
                errors.setdefault(last_stack, []).append(last_error)
            last_error = line
            last_stack = ''
            stack_count = 0
        else:
            # Key on at most the first 20 lines of the stack trace
            stack_count += 1
            if stack_count <= 20:
                last_stack += line
    source.close()
    # Record the last entry in the file
    if is_valid(last_stack) and is_valid(last_error):
        errors.setdefault(last_stack, []).append(last_error)

print('Writing report to grok.txt')
out_file = open('grok.txt', 'w')
for stack, error_list in errors.items():
    # Only report non-empty stack traces that occurred more than once
    if (len(error_list) > 1) and (len(stack.strip()) > 0):
        out_file.write('Found %d items like: %s' % (len(error_list), error_list[0]))
        out_file.write(stack)
        out_file.write('\n\n')
        out_file.write('---------------------------------------------------------')
        out_file.write('\n\n')
out_file.close()
Both solutions limited the number of lines in the stack trace used for the key to 20. This was fine for non-reflected methods.
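The effect of capping the key can be sketched in a few lines of Python. This is an illustration only, not code from either solution: MAX_KEY_LINES, stack_key and group_by_stack are hypothetical names chosen for the example.

```python
MAX_KEY_LINES = 20  # same cap as both solutions above


def stack_key(stack_lines, limit=MAX_KEY_LINES):
    """Build the grouping key from at most `limit` leading stack lines."""
    return '\n'.join(stack_lines[:limit])


def group_by_stack(entries):
    """Group (error_line, stack_lines) pairs by their truncated stack key."""
    groups = {}
    for error_line, stack_lines in entries:
        groups.setdefault(stack_key(stack_lines), []).append(error_line)
    return groups
```

Two traces that share their first 20 lines land in the same group even if they differ further down, which is the trade-off the cap makes in exchange for shorter keys.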
Saturday, March 1, 2008
Ruby initialise array and add in one step
Given a hash where each value is an array of values, this gives a nice concise way of setting things up (even if it is a little obscure).
map = {}
(map[:key] ||= []) << :value
The above results in a hash containing the key :key, whose value is an array containing :value.
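For comparison, the same initialise-and-append step can be sketched in Python. This is an aside rather than part of the Ruby idiom, and the variable names are purely illustrative.

```python
from collections import defaultdict

# dict.setdefault returns the existing list for the key, or stores and
# returns the empty list passed in, so the append happens in one step.
mapping = {}
mapping.setdefault('key', []).append('value')

# defaultdict(list) creates the empty list on first access instead.
auto = defaultdict(list)
auto['key'].append('value')
```

Both forms leave a single key whose value is a one-element list, matching the Ruby result.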
Tuesday, February 26, 2008
cruisecontrolrb and java
I have just started an open source Java project for stack trace analysis called why and wanted a build server, so I thought I would give cruisecontrol.rb a go.
The download took a couple of minutes. Adding the project (cruise add why --url https://why.svn.sourceforge.net/svnroot/why) took a few more minutes to do the initial checkout.
Started cruise (cruise start) in a command shell and...build failed.
I needed to provide a custom build command in the cruise_config.rb file in projects/why:
Project.configure do |project|
  project.build_command = 'b'
end
I already had a shell script called b to run ant and build the project. Result - build failed.
Small scratching of head and a chmod +x b later and - build passed.
From start to finish about 20 mins - most of which was download and initial checkout.
Sunday, February 3, 2008
Tagging builds in subversion
At the end of every build I like to add a tag to the project source repository. Although the build log contains the revision of the source being built and so the build can be re-created, I like the fact that all the information about the source and the build is in one place.
<target name="tag-build" description="tag the build revision" >
  <svn javahl="false">
    <status path="${basedir}" revisionProperty="svn.revision" />
  </svn>
  <echo message="Tagging revision ${svn.revision} as tag ${label}" />
  <svn username="cruise" password="cruise" javahl="false">
    <copy srcURL="http://repository/path/trunk" revision="${svn.revision}" destUrl="http://repository/path/tags/${label}" message="Cruise: Tagging build ${label}" />
  </svn>
</target>