tlug: despam - a report on a spam blocker

To: tlug@example.com
Subject: tlug: despam - a report on a spam blocker
From: "Stephen J. Turnbull" <turnbull@example.com>
Date: Wed, 24 Sep 1997 10:42:59 +0900 (JST)
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=us-ascii
In-Reply-To: <199709191324.WAA12806@example.com>
References: <199709191324.WAA12806@example.com>
Reply-To: tlug@example.com
Sender: owner-tlug

--------------------------------------------------------
tlug note from "Stephen J. Turnbull" <turnbull@example.com>
--------------------------------------------------------
>>>>> "Jason" == Jason Molenda <crash@example.com> writes:

    Jason> I installed a spam blocker here in Tokyo called 'despam'
    Jason> a week ago.  It is a perl script which includes a
    Jason> large database of regular expressions to detect spam mail
    Jason> notes (it looks through the headers or body of mail notes
    Jason> for certain regular expressions).  It has something like
    Jason> 1,500 or 2,000 regexps it checks against.

Yikes!  Presumably a descendant of the system used by the Cancel-Moose
on Usenet?

    Jason> The merit of any of these systems is how well they block
    Jason> the spam.  I kept track of things for 9 days.  Over that
    Jason> period, I was sent 117 spams, 79 of which despam
    Jason> caught (and 38 of which got past it).  Some of these 117
    Jason> spams were duplicates; I counted all of them as individual

I lost my recent mail archives to a disk crash, so I'm just
reconstructing from memory.  But at one point I had 500 or so, and I'm
sure I was doing better than 65%.  I use _one_ procmail regexp on
_headers only_.  It has wrapped around about 5 times on an 80-column
window by now, of course.
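
A minimal sketch of the "one regexp, headers only" idea, written in
Python rather than as a procmail recipe.  The first alternative is the
Message-ID example that appears later in this note; the second is a
made-up placeholder, and the function names are assumptions for
illustration, not anything from procmail or despam.

```python
import re

# One big alternation, applied to headers only.  The second alternative
# is a hypothetical placeholder; the real list is not given in the post.
SPAM_HEADER_RE = re.compile(
    r"cool\.out\.do\.you\.know\.where\.the\.delete\.key\.is"
    r"|bulk-mailer\.invalid",
    re.IGNORECASE,
)


def header_block(raw_message: str) -> str:
    """Return everything before the first blank line (the headers)."""
    normalized = raw_message.replace("\r\n", "\n")
    head, _, _ = normalized.partition("\n\n")
    return head


def looks_like_spam(raw_message: str) -> bool:
    """True if the single regexp matches anywhere in the header block."""
    return SPAM_HEADER_RE.search(header_block(raw_message)) is not None
```

Because the body is never scanned, a digest that merely quotes a spam
in its body is not flagged by this check.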

It's true that most of the ones that get through the filter are MLM
pyramid swindles.   However, I'm pretty sure I know how to catch most
of those although I haven't implemented it yet, and it may require
going out of procmail:  check for a mismatch in the "Received:" chain
(especially if there's an intervening "From:").  Come to think of it,
lots of MTAs now include a "possible spoof" notice in the headers;
filtering on that will catch them in many cases (but it'll also catch
Jim Schweiz when he's fiddling with his mailer config :-).

    Jason> spams.  Two messages were marked as spam, but were not
    Jason> spam.  They were digests (the nikon-digest mailing list)
    Jason> which had spam in them, so I'm not holding that against
    Jason> despam.

despam should check for digests.  A false positive on a whole digest
is not acceptable to me.

    Jason> So I'm pretty happy with the results of despam so far.  One
    Jason> drawback of it is that it does eat some CPU time as it goes
    Jason> through the headers and body of incoming mail notes for all

If I understand your description correctly, _and_ you are already
using procmail, one thing you can do is to keep a list of your regular
individual correspondents and trusted-not-to-spam domains and put them
_ahead_ of the despam call in .procmailrc.  Do the same for digests,
where the cost of the spam may be lower and the probability of a
multiple false positive is high.
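
The ordering can be sketched as follows, in Python instead of
.procmailrc syntax.  The addresses, the domain, and the names
`is_trusted` and `classify` are all hypothetical; `expensive_check`
stands in for the call to despam, whose real interface is not
described here.

```python
from email import message_from_string
from email.utils import parseaddr

# Hypothetical whitelist entries, for illustration only.
TRUSTED_SENDERS = {"friend@example.org", "owner-some-digest@example.net"}
TRUSTED_DOMAINS = {"example.edu"}


def is_trusted(raw_message: str) -> bool:
    """Cheap check on the From: address only."""
    msg = message_from_string(raw_message)
    _, addr = parseaddr(msg.get("From", ""))
    addr = addr.lower()
    domain = addr.rpartition("@")[2]
    return addr in TRUSTED_SENDERS or domain in TRUSTED_DOMAINS


def classify(raw_message: str, expensive_check) -> str:
    """Run the whitelist first; only call the costly filter if it misses."""
    if is_trusted(raw_message):
        return "deliver"
    return "spam" if expensive_check(raw_message) else "deliver"
```

The domain test is an exact match, so subdomains of a trusted domain
would need their own entries in this sketch.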

You can get the multiples sort of for free by keeping a spam-cache of
message IDs (see the procmail docs; I know it's possible, but not how),
and filtering on the cache before using despam.  This may require
altering despam (it would probably have to call formail, the procmail
tool which maintains message ID caches).  This cache would be small
because the multiples would all arrive within 24 hours, most likely,
so you can expire the cache rapidly.
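
A rough sketch of such a cache, as an in-memory Python stand-in rather
than a formail invocation.  The class and function names, the 24-hour
default, and the `expensive_check` callable (standing in for despam)
are assumptions made for the example.

```python
import time

ONE_DAY = 24 * 60 * 60  # seconds


class SpamIdCache:
    """Maps a Message-ID to the time it was first judged to be spam."""

    def __init__(self, max_age=ONE_DAY):
        self.max_age = max_age
        self._seen = {}

    def expire(self, now):
        """Drop entries older than max_age seconds."""
        cutoff = now - self.max_age
        self._seen = {mid: t for mid, t in self._seen.items() if t >= cutoff}

    def __contains__(self, message_id):
        return message_id in self._seen

    def add(self, message_id, now):
        # Keep the first-seen time so repeats do not extend the lifetime.
        self._seen.setdefault(message_id, now)


def filter_message(message_id, expensive_check, cache, now=None):
    """Consult the cache first; run the costly check only on a miss."""
    now = time.time() if now is None else now
    cache.expire(now)
    if message_id in cache:
        return "spam (cached)"
    if expensive_check():
        cache.add(message_id, now)
        return "spam"
    return "deliver"
```

A duplicate that arrives within the expiry window is rejected without
running the expensive check at all.  This only helps when the copies
really do share a Message-ID.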

    Jason> of these regexps.  Another drawback is that the spam block
    Jason> patterns are tied to the releases of despam, so I'm not
    Jason> sure how frequently updated patterns will be released.

Well, that requires analysis of the spams, so it's mostly going to be
useful against pyramid swindles.  There aren't any new ones :-)

You can also make a private spam-blocker like mine: look for something
suspicious in the headers and add it to the spam regexp in
.procmailrc.  In Emacs I use the following procedure:

; mark suspicious domain or address or other feature
; eg "cool.out.do.you.know.where.the.delete.key.is" in Message-ID.
M-w				; save it
C-x C-f "~/.procmailrc" RET	; cheap if the buffer already exists
M-< M-s "abuse/newmail" RET	; I append spam to an abuse inbox
				; use mh-inc and mh-scan to check for
				; non-spam, mv them to a safe place,
				; then delete the spam files
C-a C-b				; back up to end of regexp
"|"
C-u C-y				; yank, leaving point at head
C-x n n M-x "replace-string" RET "." RET "\." RET	; narrow and quote dots
C-x n w C-x C-s			; widen and save

Note that there are no user inputs after the region is defined; this
could easily be turned into a macro or a function (I haven't bothered,
but it would be worth doing if you want to distribute it to
non-hackers).  I don't know how you would do this in Eudora or MS
Exchange ....
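
For illustration, the core of that procedure (quote the dots, then
append the feature as a new alternative) might look like this as a
function.  This is a Python sketch, not the Emacs macro itself;
`add_to_spam_regexp` and `matches` are hypothetical names, and editing
and saving .procmailrc is left out.

```python
import re


def add_to_spam_regexp(regexp: str, feature: str) -> str:
    """Append feature to the alternation, quoting only literal dots.

    This mirrors the replace-string step above; other regexp
    metacharacters in the feature are deliberately left alone.
    """
    quoted = feature.replace(".", r"\.")
    return quoted if not regexp else regexp + "|" + quoted


def matches(regexp: str, header_text: str) -> bool:
    """Check header text against the accumulated regexp."""
    return re.search(regexp, header_text) is not None
```

If a feature can contain characters such as "+" or "[", Python's
`re.escape` would quote those as well, at the cost of differing from
the dots-only procedure above.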

Avoid the temptation to put "prodigy", "aol", "compuserve", and "tlug" 
into your regexp.

The better your regexp is, the less often despam gets called.

HTH

Steve
Next TLUG meeting is Saturday October 11, 1997
