Bayesian filtering micro-howto

This guide assumes that you have already configured spam filtering, either from the web interface or the shell interface with the "mail rule" commands.

This guide uses a Bayesian combination of the spam probabilities ("spamicities") of the individual words in a message. With this technique, it is possible to filter spam very effectively while using minimal server resources. All you need is a confirmed spam corpus, a non-spam (ham) corpus and a Bayesian mail filter. SpamAssassin provides a simple built-in Bayesian filter.
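
To make the idea of combining spamicities concrete, here is a minimal sketch of the simple independence-based combination used by early Bayesian spam filters. The function name combine and the sample probabilities are made up for illustration, and SpamAssassin's built-in filter may combine word probabilities differently.

```shell
#!/bin/sh
# Hypothetical helper (not part of SpamAssassin): combine per-word spam
# probabilities p1..pn, assuming the words are independent:
#   P = (p1*...*pn) / (p1*...*pn + (1-p1)*...*(1-pn))
# Each argument must be strictly between 0 and 1.
combine() {
    printf '%s\n' "$@" | LC_ALL=C awk '
        BEGIN { s = 1; h = 1 }
        { s *= $1; h *= 1 - $1 }
        END { printf "%.4f\n", s / (s + h) }'
}

# Two strongly "spammy" words and one "hammy" word:
combine 0.99 0.95 0.20    # prints 0.9979
```

Two words that appear almost only in spam outweigh one word that appears mostly in ham, which is why a few telltale words are enough to classify a message.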

Registering known spam/ham

This is where you can finally put the old spam you have accumulated to good use. Using sa-learn, you can register the contents of a mailbox or a maildir as spam or non-spam. If the spam corpus is particularly large, this might take a long time.

  # Register SPAM from an mbox(5) file:
  $ nice sa-learn --spam --mbox ~/Mail/Mailbox.spam
  # Register HAM from an mbox(5) file:
  $ nice sa-learn --ham --mbox ~/Mail/Mailbox.ham
  
  # Register SPAM from a maildir(5):
  $ nice sa-learn --spam --maildir ~/Mail/Maildir.spam
  # Register HAM from a maildir(5):
  $ nice sa-learn --ham --maildir ~/Mail/Maildir.ham

Feedback

After you have registered your initial batch of messages, you will occasionally want to feed sa-learn more recent spam and ham samples. Bayesian filtering requires such feedback in order to be fully effective. For convenience, you will want to integrate the learning process with your mail reader.

Some spammers try to evade Bayesian filtering by including lists of random words in the message. When you are registering spam with your Bayesian filter, try to remove randomly generated words from the message using an editor beforehand.

Mutt macros

If you are using mutt, you can add the following to your muttrc file so that a single key press teaches the filter that the selected message is spam or ham (legitimate e-mail). Because these macros pass --no-rebuild to sa-learn, the Bayesian filtering database is not updated after each message, so it is convenient to add one more macro that invokes the --rebuild function of sa-learn to update the database when you are done.

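  set wait_key=no
  # H learns the selected message as ham, S as spam; R updates the database.
  # The trailing \n confirms the pipe prompt so the command runs immediately.
  macro index H "|sa-learn --ham --no-rebuild --single\n"
  macro pager H "|sa-learn --ham --no-rebuild --single\n"
  macro index S "|sa-learn --spam --no-rebuild --single\n"
  macro pager S "|sa-learn --spam --no-rebuild --single\n"
  macro index R "|sa-learn --rebuild\n"
  macro pager R "|sa-learn --rebuild\n"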

Because sa-learn is a Perl script, it may take a while to execute. It may be more convenient to use mutt's s key to append each message to a file (say ~/s for spam and ~/h for ham), and then feed the contents of those files to sa-learn at some later time.
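
The deferred approach can be sketched as a small shell script, run by hand or from a scheduled job. This is only a sketch under a few assumptions: the files ~/s and ~/h are in mbox format, the helper name learn_saved and the SA_LEARN variable are hypothetical, and the only sa-learn options used are the ones shown earlier in this guide.

```shell
#!/bin/sh
# Hypothetical helper script, not part of SpamAssassin.
# SA_LEARN may be overridden (for example with a stub for a dry run).
SA_LEARN=${SA_LEARN:-sa-learn}
learned=0

learn_saved() {
    kind=$1    # "spam" or "ham"
    file=$2    # mbox file that mutt's s key appended to
    # Nothing to do if the file is missing or empty.
    [ -s "$file" ] || return 0
    # Learn the whole file, and empty it only if learning succeeded.
    nice "$SA_LEARN" "--$kind" --no-rebuild --mbox "$file" &&
        : > "$file" &&
        learned=1
}

learn_saved spam "$HOME/s"
learn_saved ham "$HOME/h"

# Update the database once, and only if something was learned.
if [ "$learned" -eq 1 ]; then
    "$SA_LEARN" --rebuild
fi
```

The file is emptied only after sa-learn exits successfully, so a failed run leaves the saved messages in place for the next attempt. Messages saved from mutt while the script is running could be lost when the file is emptied, so it is best run while the mail reader is idle.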

