So you want to get involved?
Running the code
This project works with Python 2.2 or later, available from Subversion on SourceForge. It will not work on Python 2.1.x or earlier, nor is it ever likely to do so.
If you're running Python 2.2 or 2.2.1, you'll need to separately fetch the latest email package. You can get this from SourceForge (you'll need version 2.4.3 or later; version 3.0 or later is recommended).
I just want to make suggestions
Excellent! Note, though, that this project takes a very results-oriented approach to code changes: if a change doesn't produce an improvement in results from various test corpora, it's not going to get very far.
Note that a lot of "intuitive" approaches and ideas end up making things worse, not better - it seems that stupid beats smart in many or even most cases.
Documentation on many of the things that have already been tried is linked from the documentation page.
So what needs to be done
1.0 was released in July 2004, and was followed up by three bugfix releases starting in November 2004. The current stable release is 1.0.4. This is likely to be the final release in the 1.0.x line.
Since May 2004, work has been carried out on a 1.1 release, which includes many improvements, as well as bug fixes, compared to the 1.0.x branch. The latest alpha release is 1.1a6 (April 2010). With more time or more help, we could get to beta, release candidate and final releases of 1.1. A stable 1.1 release had been hoped for during 2007, but no date is fixed.
The 1.1 line will be frozen for non-bugfix changes from the first beta release. Many of the changes desired by the developers have been implemented, or partly so, but there is still time for further improvement. There is no time limit on implementing bug fixes.
Some key work that is in progress for 1.1, which you could assist with (particularly in testing) includes:
The other big body of work is monitoring the bug reports and feature requests that come in and trying to resolve those.
Collecting training data
One of the tricky problems is collecting a set of data that's "good enough". There are a few collections of spam out on the net. Note, though, that using spam and ham from different sources often leads to the classifier picking up on clues to the source rather than the content - for instance, a different set of hostnames in the Received: headers. This isn't a killer problem - it just means that you need to think harder about what to feed into the classifier.
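One rough way to spot this kind of source clue before training is to list the hostnames that appear in the Received: headers of only one of the two collections. The sketch below is not part of this project's code: the function names (received_hosts, host_counts, source_only_hosts) and the hostname pattern are illustrative assumptions, it uses only the standard-library email package, and it is written for a current Python rather than 2.2.

```python
from collections import Counter
from email import message_from_string
import re

# Assumed pattern: the host named right after "from" or "by" in a Received: header.
HOST_RE = re.compile(r"\b(?:from|by)\s+([A-Za-z0-9.-]+)", re.IGNORECASE)


def received_hosts(raw_message):
    """Return the set of lowercased hostnames found in Received: headers."""
    msg = message_from_string(raw_message)
    hosts = set()
    for value in msg.get_all("Received", []):
        for host in HOST_RE.findall(value):
            hosts.add(host.lower().rstrip("."))
    return hosts


def host_counts(messages):
    """Count, per hostname, how many messages in a collection mention it."""
    counts = Counter()
    for raw in messages:
        counts.update(received_hosts(raw))
    return counts


def source_only_hosts(ham, spam):
    """Return (ham-only hosts, spam-only hosts): likely clues to the source."""
    ham_hosts = set(host_counts(ham))
    spam_hosts = set(host_counts(spam))
    return ham_hosts - spam_hosts, spam_hosts - ham_hosts
```

If the ham-only and spam-only sets are large and every message carries one of those hosts, the classifier can separate the two collections on those headers alone, which is a hint to strip or normalise them or to collect ham and spam from the same mail path.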