Spam filtering and SpamBayes

A summary of some spam filtering methods and description of the SpamBayes project

In the old days, spam filters could be simple in principle, though you needed to be something of a geek to make use of them. Here's an extract from an old and dusty .procmailrc file:

  LOG="Spamfilter caught: $MATCH$NEWLINE"

Serious email geeks will recognize that it matches a To address that spammers commonly forged in those days, and that it includes a misguided attempt to fool the spammer by causing the delivering mail server to bounce the message.
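For readers who never wrangled procmail: a complete recipe of roughly that shape might have looked like the following. This is a guess at a reconstruction, not the original file; the address is one that spammers famously forged, and the `\/` token is what captures the matched text into $MATCH.

```
  :0
  * ^To:.*\/friend@public\.com
  {
    LOG="Spamfilter caught: $MATCH$NEWLINE"
    # EX_NOUSER: the misguided bounce trick, telling the delivering
    # MTA that the recipient doesn't exist
    EXITCODE=67
    :0
    /dev/null
  }
```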

Spammers evaded simple filters like that one, and more sophisticated defenses were invented. Among those are RBL servers. (The name comes from "realtime blackhole list," but that's not how they're used much these days.) An RBL server keeps a list of mail servers that are known to send or relay spam. When a mail server opens a connection to another mail server and asks it to accept some mail, the receiving server can query an RBL server with the sending server's address. The RBL server replies with the information it has, and the receiving server can then choose whether to accept the incoming mail. There are and have been many RBL servers, run according to all sorts of different policies.
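Mechanically, an RBL lookup is just a DNS query: the receiving server reverses the octets of the sender's IP address, appends the list's zone, and asks for an address record. Getting an answer means the address is listed; "no such domain" means it isn't. A minimal sketch in Python (the zone name here is a placeholder, not a recommendation; by convention most lists carry 127.0.0.2 as a test entry):

```python
import socket

def dnsbl_query_name(ip, zone):
    # Reverse the IPv4 octets and append the list's DNS zone,
    # e.g. 127.0.0.2 -> 2.0.0.127.dnsbl.example.org
    octets = ip.split(".")
    return ".".join(reversed(octets)) + "." + zone

def is_listed(ip, zone="dnsbl.example.org"):
    # An A-record answer means the address is listed;
    # NXDOMAIN (a resolution error) means it isn't.
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False
```

The receiving server makes a call like `is_listed(sender_ip)` during the SMTP conversation and can then reject the mail before accepting the message body.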

RBLs can be a help, but they're not a complete solution. For one thing, it takes time for a spamming server's address to be added; for another, it takes time for a reformed server's address to be removed. RBLs are also inviting targets for spammers to attack. Some have been forced to shut down by sustained attacks that it would be hard to imagine came from anyone but spammers.

The folks who invented SpamAssassin had a great idea. Their program gathers various indicators of a message's spamminess (simple pattern matching like the example above, RBL queries, and a few other techniques) and aggregates them into a score, which is compared with a threshold to decide whether to classify the mail as spam. Tests that indicate a message is likely to be ham (that is, not spam) may lower the score. That works much better. If a legitimate message is sent through a server listed in one of the RBLs that SpamAssassin consults, or a buddy of mine sends me an email that mentions cheap life insurance, the mail is still likely to be delivered, because one spammy indicator generally isn't enough to push the message's score past the threshold. SpamAssassin has another advantage: geeks can add their own tests and adjust the weights given to the existing ones.
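Stripped to its essence, that scoring scheme is a weighted sum of test results compared against a threshold. The rules, weights, and threshold below are made up for illustration; the real SpamAssassin ships hundreds of tests with carefully tuned scores:

```python
# Hypothetical rules: each is a predicate on the raw message text
# plus a weight. Negative weights are hammy indicators.
RULES = [
    (lambda m: "cheap life insurance" in m.lower(), 2.5),
    (lambda m: "100% free" in m.lower(), 3.0),
    (lambda m: "in-reply-to:" in m.lower(), -1.0),  # replies look hammy
]
THRESHOLD = 5.0

def score(message):
    # Sum the weights of every rule that fires.
    return sum(weight for test, weight in RULES if test(message))

def is_spam(message):
    return score(message) >= THRESHOLD
```

Note that a single spammy hit (the buddy mentioning cheap life insurance) scores 2.5, well short of the threshold, so the mail still gets through; it takes several indicators together to tip a message into the spam bucket.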

Despite the excellent idea behind it, SpamAssassin has some limitations. For one, it's meant to run on a mail server; a geeky user could make it work on a desktop machine, but I doubt that would be much fun. For another, SpamAssassin eats gobs of CPU power, which makes it pretty impractical on a mail server that receives many messages per second. But probably its greatest limitation is that spam changes over time while its tests are mostly static. Replies to RBL queries change, of course, and recent versions of SpamAssassin try to learn which words are used in spammy and hammy messages. Still, most of the tests are static, and there's nothing to prevent spammers from downloading SpamAssassin and tuning their messages to get around its tests. SpamAssassin is regularly updated, and many spammers aren't smart enough to tune their messages to get past it. Still, it would be nice to have a solution better than staying one step ahead of the spammers in an arms race.

And there is. A while ago, Paul Graham invented a brilliant algorithm for detecting spam. Broadly speaking, you train the program on examples of ham and spam, and it remembers how often different words appear in each corpus. When you ask it to score a new, unknown message, the program looks at the words in the message and sees how often each has been seen in ham and in spam. So far, so obvious. The brilliant aspect of the algorithm is that it doesn't compute some sort of average over all the words it sees. Instead, it uses only a small, fixed number of words to compute its score: the strongest indicators, in either direction, that it finds in the message. That leaves spammers pretty well stuck. They have to advertise something in their spam, and there are only so many misspellings and circumlocutions they can use. Further, since the algorithm is intended to run on the user's own machine, a spammer can't guess what any particular user's strong ham indicators are. Spammers are pretty well forced to use spammy words, and they can't know what your hammy words are, so they can't include them. Paul Graham has written more about spammers' attempts to get around this sort of filter, and his experience matches mine: they're not succeeding.

Various people have implemented the algorithm, and variations on it, in various languages. Tim Peters implemented it in Python, improved the math and various other aspects, and called it SpamBayes. Other folks have improved it more. Alas, there hasn't yet been much work on integrating it with any popular mail client other than Microsoft Outlook. Being a geek, I've hacked it into my own mail client, and the results have been far better than SpamAssassin's or anything else I can imagine. I get around 300 emails a day, roughly evenly split between spam and ham. About every second day, a spam makes it past SpamBayes into my inbox. In more than six months, I've seen only one legitimate mail filed as spam.

In addition to working well for an individual user, SpamBayes works well on mailing list servers that run narrowly-focused mailing lists. Since all the ham on a list like that will likely have a lot of the same words in it, SpamBayes will have useful clues to work with. I doubt that it would be very effective on a more general mailing list.

By the way, the "Bayes" in SpamBayes and in descriptions of other implementations of similar algorithms refers to details of how the math is done, rather than to the business of keeping track of words and assigning scores based on the ones that appear. I rather suspect that distinction is going to get (or already has gotten) lost in discussions of the technique. There's actually only one small Bayesian aspect of SpamBayes. I hope the Reverend Bayes wouldn't mind.

I've done a bit of hacking on SpamBayes. The first thing I looked at was how to reduce the size of the database of words. A look at a few spams shows that they often contain nonsense words. Perhaps the spammers add them to bloat the databases of filters such as SpamBayes, or perhaps they vary them as they send out copies of a particular spam so as to get past filters that look for multiple identical or near-identical emails. Poking at my SpamBayes database, I found that about half of the words in it had been seen only once. (The term for a word that appears only once in a corpus of text is "hapax legomenon", often shortened to "hapax".) It turned out that removing the hapaxes from a thoroughly-trained database doesn't change SpamBayes's accuracy perceptibly, and cutting the database size in half looked like a pretty good benefit. But there's a hitch: if you train progressively (for example, by training on every day's mail or by training on mistakes), removing hapaxes turns out to be a bad idea. That's because many good clues start out as hapaxes. If you train and then remove hapaxes every day, a word that's a good clue but is used in your email only once a day will never stay in your database. So back to the drawing board for me.
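Pruning hapaxes is trivial if the database maps each word to its ham and spam counts; the representation below is an assumption for illustration, simpler than SpamBayes's actual storage:

```python
def prune_hapaxes(db):
    """Return a copy of the word database without hapaxes.

    db maps word -> (ham_count, spam_count); a hapax is a word whose
    total count across both corpora is exactly one.
    """
    return {w: (h, s) for w, (h, s) in db.items() if h + s > 1}
```

The hitch described above is easy to see here: run this after every incremental training pass and a clue that arrives once per day is deleted each evening, so its count can never climb past one.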

It occurred to me that what's really desirable is to preserve the words that are likely to be used in scoring. After adding to my database a timestamp recording when each word was last used in scoring, and running various experiments, I found that a good predictor of how likely a word is to be used in scoring is how recently it has been used in scoring. Playing around with scoring using only recently-used words, I found that I could have SpamBayes score using only words that had been used for scoring in about the past ten days with hardly any effect on its accuracy. Cutting that number down much further did affect its accuracy. Still, that indicated that I could prune my database down to about 15% of its original size without doing much harm to SpamBayes's accuracy.
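That pruning can be sketched as follows, assuming each record also carries a last-used-in-scoring timestamp. The record layout is hypothetical (stock SpamBayes doesn't store one), but it shows the shape of the experiment:

```python
import time

def prune_stale(db, now=None, max_age_days=10):
    """Keep only words used in scoring within the last max_age_days.

    db maps word -> (ham_count, spam_count, last_used), where last_used
    is a Unix timestamp updated whenever the word contributes to a score.
    """
    if now is None:
        now = time.time()
    cutoff = now - max_age_days * 86400
    return {w: rec for w, rec in db.items() if rec[2] >= cutoff}
```

Unlike hapax removal, this keeps any clue that keeps earning its place, however rare it is in the corpus, which is why it cost so little accuracy.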

That was true for my email, anyhow. But there is some reason to think that my email may not be representative of other folks'. For one, I read postmaster, webmaster, and other administrative addresses for several domains. Spammers seem to love to send their junk to those addresses, so I often get multiple copies of the same spam. For another, I'm on a bunch of narrowly-focused mailing lists on which you might expect distinctive hammy terms to be repeated pretty frequently. Alas, I don't know of anyone else who has done similar work, so I don't know whether the peculiarities of my email matter here. I did that work some months ago and haven't pursued it since, partly because I upgraded to a new laptop with ten times the disk space of my previous one. Another reason I haven't pursued it is that it turns out there's no particular need for a giant database for SpamBayes to work well. Training on a few hundred to a thousand messages is generally quite satisfactory. (Indeed, SpamBayes generally needs to be trained on only a couple of dozen messages before it starts to get pretty accurate.) Small training sets like those produce a database so small by the standard of modern disks that there's no particular need to work at making it smaller. Beyond that, keeping the last-used date in the database increases its size, slows down scoring, and requires that the database be open for writing during scoring, which increases the chance of database corruption in the event of a random crash.

More recently, I've been playing with improving SpamBayes's accuracy by spidering any URLs that appear in a message whose score is in the "unsure" range (for my purposes, greater than 0.2 and less than 0.8) and scoring the pages that are retrieved. (I'm aware that spammers put CGI arguments and other things in their URLs that could conceivably identify me as the recipient; my code goes to some trouble to strip them out.) As far as I'm aware, Richard Jowsey had the idea first, and he mentioned it on the SpamBayes mailing list. It has been objected that automatically retrieving those URLs would make it easy for a spammer to orchestrate a DDoS attack on a legitimate webserver. But a spammer could orchestrate a DDoS attack on a legitimate webserver as things stand, just by providing a link to it labeled "FREE SEX HERE!".
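The sanitizing step can be sketched like this: pull URLs out of an unsure message and drop the query string and fragment, which are where identifying CGI arguments live. The unsure bounds mirror the numbers above; the regular expression and function names are my own sketch, not the actual code:

```python
import re
from urllib.parse import urlsplit, urlunsplit

URL_RE = re.compile(r"https?://[^\s\"'<>]+")

def strip_identifying_parts(url):
    # Drop the query string and fragment, which spammers can use to
    # identify the recipient, keeping only scheme, host, and path.
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

def urls_to_fetch(message_text, score, low=0.2, high=0.8):
    # Only messages in the unsure range trigger spidering at all.
    if not (low < score < high):
        return []
    return [strip_identifying_parts(u) for u in URL_RE.findall(message_text)]
```

The retrieved pages would then be tokenized and scored as a sort of synthesized message, as described below.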

I've found that retrieving the web pages and scoring them is a useful technique. It's not amazingly useful because SpamBayes is sufficiently good that you plain can't improve its accuracy all that much. But every few days a mail will score in the unsure range and sucking down the pages and scoring them as a sort of synthesized message will generally resolve the question correctly. There's a bit of work for me to do yet, though. Some of the words that SpamBayes creates when it sees a message aren't really words at all. Most of these "synthetic tokens" describe certain aspects of the message, such as whether or not it has a message-ID header and whether a particular word appears in the subject. Obviously, most of those aren't appropriate for scoring a web page. They haven't caused an inaccurate result yet, but it's ugly to score something for the wrong reasons even if the result is right. So I'll need to poke through the code and see how easy it is to filter out the inappropriate synthetic tokens before they're used for scoring.

Posted: Tue - November 25, 2003 at 11:04