Paul's web page

Programming

Here are a few small projects I've been working on in my spare time.

Bayesian mail filter

This is based on an article by Paul Graham that I found through Slashdot; it describes a form of email filtering based on naive Bayesian statistics.

The filter comes in two parts:

makecorpus.pl
The first part examines your sorted mailboxes and calculates, for each word found in an email, the probability that the word came from a particular mailbox.
evaluatemailc.pl and evaluatemaild.pl
The second part examines each incoming email and uses the probabilities calculated earlier to work out the total probability that the email should go into each mailbox. This part actually consists of two programs. evaluatemaild.pl is a daemon with a limited lifetime which caches the probability corpus and evaluates emails. evaluatemailc.pl is a client designed to be called by procmail; it can start the daemon if necessary and communicates with it. The split is necessary because the probability corpus can be fairly large: I found that reading it into memory can take up to 10 seconds on an unloaded machine, which can cause problems on a mailserver if it happens once for every mail received.
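
To make the two parts concrete, here is a minimal sketch of the idea in Python rather than the Perl of the actual scripts. It is not the code from makecorpus.pl or evaluatemaild.pl: the function names, the tokenizer, the add-one smoothing and the way per-word probabilities are multiplied together are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())


def make_corpus(mailboxes, smoothing=1.0):
    """Map each word to {mailbox: P(mailbox | word)}.

    `mailboxes` maps a mailbox name to a list of message bodies.
    The smoothing term (an assumption) keeps every probability non-zero.
    """
    counts = defaultdict(Counter)
    for name, messages in mailboxes.items():
        for message in messages:
            for word in tokenize(message):
                counts[word][name] += 1

    names = list(mailboxes)
    corpus = {}
    for word, per_box in counts.items():
        total = sum(per_box.values()) + smoothing * len(names)
        corpus[word] = {n: (per_box[n] + smoothing) / total for n in names}
    return corpus


def evaluate(corpus, names, message):
    """Return {mailbox: probability} for one message, normalised to sum to 1."""
    log_scores = {n: 0.0 for n in names}
    for word in set(tokenize(message)):
        if word in corpus:  # words never seen before carry no evidence
            for n in names:
                log_scores[n] += math.log(corpus[word][n])

    # Work in log space and subtract the peak to avoid underflow.
    peak = max(log_scores.values())
    weights = {n: math.exp(score - peak) for n, score in log_scores.items()}
    total = sum(weights.values())
    return {n: weight / total for n, weight in weights.items()}
```

In this sketch, make_corpus would be run occasionally over the sorted mailboxes, and evaluate would be called once per incoming message against the cached result, which is the part the daemon keeps in memory.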

The filter also distinguishes between "spam" folders and useful folders, by reducing the probability that messages get sorted into "spam" folders unless the evidence is overwhelming.
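
One way such a bias could work is sketched below in Python. This is a guess at the general shape, not the rule the filter really uses: the choose_mailbox name, the 0.99 threshold and the "inbox" fallback are all hypothetical.

```python
def choose_mailbox(probabilities, spam_folders, spam_threshold=0.99, fallback="inbox"):
    """Pick a destination, accepting a spam folder only on overwhelming evidence.

    `probabilities` maps mailbox name to probability; `spam_folders` is the
    set of mailbox names that count as spam. Both the threshold and the
    fallback name are illustrative assumptions.
    """
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    for name, probability in ranked:
        if name not in spam_folders:
            return name  # the best useful folder wins by default
        if probability >= spam_threshold:
            return name  # spam wins only when it clears the threshold
    return fallback
```

With these assumed settings, a message scored 0.8 spam and 0.2 work would still be delivered to the work folder, while one scored 0.995 spam would be filed as spam.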

Please note that in programming this thing, my nomenclature has changed a little from that in Paul Graham's article: he uses "corpus" to mean the corpus (or body) of emails used to generate probabilities; I've used it to mean the body of generated probabilities. It doesn't seem to me like awful usage of the word, but it's different, which is why I've mentioned it.

Web page management

I use this system to generate the consistent look and feel of my web page. It's really very simple: a template provides the header, the sidebars and the layout of the page. Placeholders in the template are then filled from "content" files by a Perl script. One slightly clever thing is that the "modified" dates are based on the last time the content files were modified, as opposed to the date the script was run.
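
The idea can be sketched in a few lines of Python (the real script is Perl, and this is not it). The render_page function, the $title, $content and $modified placeholder names and the date format are assumptions made for the example.

```python
import os
import time
from string import Template


def render_page(template_text, content_path, title):
    """Fill a page template from one content file.

    The "modified" date comes from the content file's modification time,
    not from the time this function happens to run.
    """
    with open(content_path, encoding="utf-8") as handle:
        content = handle.read()

    mtime = os.path.getmtime(content_path)
    modified = time.strftime("%Y-%m-%d", time.localtime(mtime))

    # safe_substitute leaves any unknown placeholders in the output untouched.
    return Template(template_text).safe_substitute(
        title=title, content=content, modified=modified
    )
```

Because the date is read from the file rather than the clock, regenerating every page does not change the "modified" line on pages whose content has not changed.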

The system works very nicely with make.