May 03, 2003

Spam filtering

Spam is becoming more of a nuisance for everyone who uses email. For some people it's starting to hurt the usefulness of their email. Blocking addresses and other header-based filtering techniques don't work very well, because spammers keep changing their addresses, servers, and subjects. One thing they can't really change much is their content. If a spammer wants to say 'Click here for a prize,' there are only so many ways he or she can reword that. A content-based filter that actually looks at all the words in the email would do a much better job of figuring out what is spam and what isn't.

That leads us to Bayesian spam filtering. By counting how often a word appears in spam and in normal email, you can say that a new message containing that word is more likely to belong to whichever kind the word showed up in more often. A message with 'click' and 'here' is more likely to be a spam message than a normal message. There are plenty of software packages out there that do this statistical content-based filtering. An easy-to-use one is Mozilla's mailer, but Mozilla is on its way out, and you'll have to wait for the new standalone mail client. So I'll tell you about the one that I use. POPfile is a cross-platform POP3 proxy that runs in between you and your mail server. When you check your mail, it goes to your server and fetches it, then decides whether each message is spam or not (based on email that you have trained it with). It adds a tag to the email that you can filter by in your mail client. The POPfile homepage has good instructions on how to download it and set things up, but if you need extra advice you can just email me.
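To make the word-counting idea concrete, here is a minimal sketch of a naive Bayes classifier in Python. It is not POPfile's code, just an illustration of the general technique. The class name, the tokenizer, and the add-one smoothing are my own assumptions for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class WordFrequencyClassifier:
    """Hypothetical naive Bayes classifier, for illustration only."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # bucket -> word -> count
        self.message_counts = Counter()          # bucket -> messages trained

    def train(self, bucket, text):
        """Record the words of one message known to belong to `bucket`."""
        self.word_counts[bucket].update(tokenize(text))
        self.message_counts[bucket] += 1

    def classify(self, text):
        """Return the bucket with the highest log-probability score."""
        vocabulary = set()
        for counts in self.word_counts.values():
            vocabulary.update(counts)
        total_messages = sum(self.message_counts.values())

        best_bucket, best_score = None, -math.inf
        for bucket, counts in self.word_counts.items():
            # Start from how common this bucket was in training.
            score = math.log(self.message_counts[bucket] / total_messages)
            total_words = sum(counts.values())
            for word in tokenize(text):
                # Add-one smoothing (an assumption here) so a word never
                # seen in this bucket does not zero out the whole score.
                score += math.log(
                    (counts[word] + 1) / (total_words + len(vocabulary))
                )
            if score > best_score:
                best_bucket, best_score = bucket, score
        return best_bucket


classifier = WordFrequencyClassifier()
classifier.train("spam", "Click here for a prize")
classifier.train("spam", "Click here to claim your free prize now")
classifier.train("normal", "Are we still meeting for lunch tomorrow")
classifier.train("normal", "Here are the notes from the meeting")

print(classifier.classify("click here"))              # spam
print(classifier.classify("lunch meeting tomorrow"))  # normal
```

With only four training messages this is a toy, but it shows why 'click' and 'here' together push a message toward spam: both words turned up more often in the spam examples than in the normal ones.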

The filter works really well for me; it gets about 96% of its classifications correct. Also, you don't have to sort only between spam and normal email. I divided mine up between spam, normal email, and informational messages I get from the university, which I usually don't read but which are interesting on occasion. The program is very customizable and is in continuous development. It'll really help you if you have a spam problem.
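Once every message carries a classification tag, sorting it into folders is simple. Here is a rough sketch in Python of what a mail client's filter rule does. The header name `X-Text-Classification` and the bucket names are assumptions for the example, so check what your own setup actually adds before relying on them.

```python
from collections import defaultdict
from email import message_from_string

# Assumption: the proxy records its decision in this header.
CLASSIFICATION_HEADER = "X-Text-Classification"


def sort_into_folders(raw_messages, default_folder="inbox"):
    """Group message subjects by the bucket named in the header."""
    folders = defaultdict(list)
    for raw in raw_messages:
        message = message_from_string(raw)
        # Untagged mail falls back to the default folder.
        bucket = message.get(CLASSIFICATION_HEADER, default_folder)
        folders[bucket.strip().lower()].append(message["Subject"])
    return dict(folders)


# Made-up sample messages, one per bucket plus one without a tag.
sample_messages = [
    "Subject: You won\nX-Text-Classification: spam\n\nClick here for a prize\n",
    "Subject: Lunch\nX-Text-Classification: normal\n\nStill on for tomorrow?\n",
    "Subject: Campus notice\nX-Text-Classification: university\n\nParking lot closed\n",
    "Subject: No tag\n\nThis one never went through the proxy\n",
]

folders = sort_into_folders(sample_messages)
for name, subjects in folders.items():
    print(name, subjects)
```

In practice you would set this up as filter rules in your mail client rather than run a script, but the logic is the same: read one header, pick a folder.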

Posted by ramk at May 3, 2003 01:23 PM