Re: spam filtering
- To: http://www.gnosis.cx/~mertz
- Subject: Re: spam filtering
- From: http://dummy.us.eu.org/robert (Robert)
- Date: Wed, 2 Oct 2002 16:35:20 -0400
- In-Reply-To: <http://www.gnosis.cx/~nB1m9kKkXs2G092ynz>
- Keywords: spam http://www.gnosis.cx/~mertz
> From: http://www.gnosis.cx/~mertz (David Mertz, Ph.D.)
> Date: Wed, 02 Oct 2002 16:00:39 -0400
>
> |I tried downloading your code, but it wouldn't work. What version of
> |python do you use?
>
> This is funny, did you not get my prior message?
No. I'm not sure what happened. I wonder if I should worry...
> ------------------------------------------------------------------------
> To: http://dummy.us.eu.org/robert (Robert)
> Subject: Re: spam filtering
> Date: Tue, 01 Oct 2002 17:59:02 -0400
>
> |bogofilter is still in its infancy. It has potential. (I admit that
> |ifile ends up filtering on some funky "words" -- it ends up that words
> |like "<table" and "<font" are indicators of spam for my 5000 spam
> |messages.)
>
> Actually, '<table' seems like a perfectly good "word". What I'd rather
> discount is something like 'MtYKC46lkd' (I took that from your GPG
> signature; it was surrounded by "+" bytes, which might identify it as a
> word to some lexers). The point of a good lexer is to eliminate "words"
> that will never occur anywhere else, not necessarily to get the things
> that you would look up in a dictionary. Well, maybe also to count as
> single words special strings like URLs.
>
> I don't know what ifile does about this...
It definitely doesn't handle URLs. But it's smart about paring down
words that repeat excessively and words that occur too rarely. (At
least, it seems that way.)
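
To make the lexer idea above concrete, here is a minimal sketch, not taken
from ifile, bogofilter, or the code under discussion. It keeps a URL as one
token, keeps tag starts such as '<table', and drops strings that look like
one-off noise. The names URL_RE, TOKEN_RE, looks_random and lex, and the
"long and mixes upper case, lower case and digits" heuristic, are all
assumptions made for illustration.

```python
import re

# Hypothetical patterns, chosen only for illustration.
URL_RE = re.compile(r"https?://\S+")
TOKEN_RE = re.compile(r"<?[A-Za-z][A-Za-z0-9'-]*")


def looks_random(token):
    """Guess whether a token is a one-off string (e.g. a signature fragment).

    Assumed heuristic: at least 8 characters, mixing upper case,
    lower case and digits.
    """
    if len(token) < 8:
        return False
    has_upper = any(c.isupper() for c in token)
    has_lower = any(c.islower() for c in token)
    has_digit = any(c.isdigit() for c in token)
    return has_upper and has_lower and has_digit


def _words(chunk):
    """Tokenize non-URL text, dropping random-looking tokens."""
    return [t.lower() for t in TOKEN_RE.findall(chunk) if not looks_random(t)]


def lex(text):
    """Return a list of tokens, treating each URL as a single token."""
    tokens = []
    pos = 0
    for match in URL_RE.finditer(text):
        tokens.extend(_words(text[pos:match.start()]))
        tokens.append(match.group())
        pos = match.end()
    tokens.extend(_words(text[pos:]))
    return tokens
```

With this sketch, lex('<table border=1> +MtYKC46lkd+') keeps '<table' and
'border' but discards the signature fragment. A real filter would likely
need a better noise test than this one.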
> but since I wasn't quickly
> able to tell what it did from its web page, it quickly fell off my "need
> to mention it" list.
>
> |That's interesting. You've piqued my interest. I tried downloading your
> |code, but it wouldn't work. What version of python do you use?
>
> I used 2.2. That's needed since I used generators. It would be easy to
> change the code not to do this though.
>
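
As an illustration of the change described above, here is a hedged sketch
of a generator next to a generator-free equivalent. Neither function is
from the code under discussion; word_stream and word_list are hypothetical
names, and the word-splitting task is only an assumed example.

```python
def word_stream(lines):
    """Generator version: yields one word at a time."""
    for line in lines:
        for word in line.split():
            yield word


def word_list(lines):
    """Generator-free version: builds the whole list and returns it."""
    words = []
    for line in lines:
        for word in line.split():
            words.append(word)
    return words
```

The second form gives up lazy evaluation, which should only matter for
inputs too large to hold in memory at once.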
> What do you mean by "wouldn't work", though? If the problem was the
> "yield" keyword, the error should have been awfully straightforward. If
> the problem was something else... well, who knows.
% ./spam-test.py
  File "./spam-test.py", line 15
    product *= p
             ^
SyntaxError: invalid syntax
I think I'm running Python 1.5. (Debian Linux is far behind, unfortunately.)
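
The error is consistent with an old interpreter: augmented assignment
("*=") only arrived in Python 2.0. A minimal sketch of how a line like the
one in the traceback could be spelled so that older parsers accept it; the
function name combined_product and its argument are hypothetical, not
taken from spam-test.py.

```python
def combined_product(probabilities):
    """Multiply the values together without augmented assignment."""
    product = 1.0
    for p in probabilities:
        # Same effect as "product *= p", written out in full.
        product = product * p
    return product
```

This only addresses the line reported in the traceback. Any use of "yield"
elsewhere in the script would still need rewriting for an interpreter
older than 2.2.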
> As you can see, I didn't exactly try hard to make the code polished or
> reusable. I don't think it's bad... but basically I wrote it to test a
> hypothesis rather than to create a general-purpose tool. Not that it
> would be hard to write something more complete...
>
> |Finally, just a note: your message was marked as spam by my filter because
> |there are spaces at the end of your Message-ID: line.
>
> Yuck! This isn't ifile that did this, is it?
No. I use SpamBouncer (http://spambouncer.org) and it has what I believe
to be a bug. I told the author about it.
> (it doesn't sound like a
> Bayesian thing). In any case, that's a really terrible filter
> criterion. Not that I'm sure why my ID had spaces... my mailer doesn't
> normally do that... but maybe I accidentally added something in the
> header area. Still, sure sounds RFC2822 friendly to me (and not even
> something I've ever noticed spammers doing... why would they bother?).
>
> Yours, David...
>
> --
> _/_/_/ THIS MESSAGE WAS BROUGHT TO YOU BY: Postmodern Enterprises _/_/_/
> _/_/ ~~~~~~~~~~~~~~~~~~~~[http://www.gnosis.cx/~mertz]~~~~~~~~~~~~~~~~~~~~~ _/_/
> _/_/ The opinions expressed here must be those of my employer... _/_/
> _/_/_/_/_/_/_/_/_/_/ Surely you don't think that *I* believe them! _/_/