Improved SPAM filter

From: Lulu of the Lotus-Eaters (quilty@ibm.net)
Date: Thu, 13 Jan 2000 12:56:57 -0500

Probably like most list members, I get WAY too much email SPAM. I've
gotten in the habit of sending responses back to abuse@whatever.com for
many of these... which may or may not have any effect.

I have used FILTER.EXE to weed out a few recognized spam patterns. But
unfortunately, FILTER is fairly crude: it has no regexes, searches only
a few of the header fields (e.g., not Message-ID), and can't compare
header fields to each other (e.g., a Message-ID with a different tail
than From indicates a likely forgery).
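
To make the Message-ID versus From comparison concrete, here is a minimal
Python sketch. The function names (domain_tail, looks_forged) are
hypothetical, and comparing only the last two labels of each host name is
an assumption; it would misjudge domains like example.co.uk. Legitimate
mail sent through lists or forwarding services can also mismatch, so treat
the result as a hint rather than proof.

```python
from email import message_from_string
from email.utils import parseaddr

def domain_tail(host, parts=2):
    """Return the last `parts` labels of a host name, lowercased."""
    labels = host.strip().lower().strip(".").split(".")
    return ".".join(labels[-parts:])

def looks_forged(raw_message):
    """True if the Message-ID host and the From host have different tails."""
    msg = message_from_string(raw_message)
    _, from_addr = parseaddr(msg.get("From", ""))
    msg_id = msg.get("Message-ID", "").strip().strip("<>")
    if "@" not in from_addr or "@" not in msg_id:
        return False  # not enough information to compare
    from_host = from_addr.rsplit("@", 1)[1]
    id_host = msg_id.rsplit("@", 1)[1]
    return domain_tail(from_host) != domain_tail(id_host)
```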

What I'd like to do is create an enhanced filtering program that looks
for some of the things I think are likely SPAM patterns. If I get
around to doing it, I'll certainly share it with the world for free (if
it gets good enough to bother). Does anyone have pointers for getting
started, or has anyone already started something similar?
Unfortunately, the existing FILTER.EXE is closed-source, so it is not
easy to look at exactly what it does.

One thought I had on going about this would be to create a SpamFilter
program that would become part of my fetch_mail.cmd script. The idea is
that where I now have a script including:

c:\tcpip\souper\souper.exe -n -i pop3.ibm.net quilty <password>
zip -0m d:\temp\_newmail.zip areas *.msg

I would modify this to have:

c:\tcpip\souper\souper.exe -n -i pop3.ibm.net quilty <password>
SpamFilter *.msg
zip -0m d:\temp\_newmail.zip areas *.msg

What SpamFilter would do in this scenario would be to annotate the MSG
files with something easy for FILTER.EXE to act on. For example, I
might add a header field to identified messages like:

X-SpamFilter: <<Probable Forgery>>
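
Adding such a field could look roughly like the Python sketch below. It
assumes the text passed in is a single message with LF line endings and a
blank line between headers and body; how Souper packs messages into the
MSG files is not handled here. The function name annotate is hypothetical.

```python
def annotate(raw_message, value, field="X-SpamFilter"):
    """Insert `field: <<value>>` as the last header line of one message."""
    # Headers end at the first blank line; everything after it is the body.
    head, sep, body = raw_message.partition("\n\n")
    tag = f"{field}: <<{value}>>"
    # If there was no blank line, the message is headers only.
    return head.rstrip("\n") + "\n" + tag + (sep or "\n") + body
```

Placing the new field last in the header block leaves the existing
fields in their original order.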

As I imagine it, some sort of SpamFilter RULES files would allow you to
put whatever value you wanted in this X-SpamFilter header whenever
specified conditions were met. From there FILTER.EXE could just search
the whole header for the distinctive value. I figured that using
something like the '<< ...>>' in the field content would make an
accidental occurrence elsewhere in the headers pretty unlikely.
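
One possible shape for such a RULES file, sketched in Python. The
"field :: regex :: label" line format is purely an assumption made up for
illustration, as are the sample rules and the names parse_rules and
first_label. A regex that itself contains " :: " would break this simple
parser.

```python
import re
from email import message_from_string

RULES_TEXT = """\
# field :: regex :: label   (hypothetical format)
Subject :: make money fast :: Probable Spam
Message-ID :: ^<?[^@]*>?$ :: Malformed Message-ID
"""

def parse_rules(text):
    """Turn rule lines into (field, compiled regex, label) tuples."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        field, pattern, label = [part.strip() for part in line.split(" :: ")]
        rules.append((field, re.compile(pattern, re.IGNORECASE), label))
    return rules

def first_label(raw_message, rules):
    """Return the label of the first rule that matches, or None."""
    msg = message_from_string(raw_message)
    for field, regex, label in rules:
        for value in msg.get_all(field, []):
            if regex.search(value):
                return label
    return None
```

The returned label is what would go between the '<< >>' markers in the
X-SpamFilter field; rules are tried in file order, so earlier lines win.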

Does this seem like a good approach? Any other ideas? If/when I do
this project, I will use a VHLL like Python or Perl (REXX?), since they
have a lot of string-handling and regex functionality built in. That
means that one would need the interpreter, of course, but those are
available on every platform that Yarn is, and more.

Yours, Lulu...

--
---[ to our friends at TLAs (spread the word) ]--------------------------
Echelon North Korea Nazi cracking spy smuggle Columbia fissionable Stego
White Water strategic Clinton Delta Force militia TEMPEST Libya Mossad
---[ Postmodern Enterprises <quilty@ibm.net> ]---------------------------