Proving Which Spam Filters work Best 263
pirateninja writes "Dr. Gord Cormack decided to find and prove what the best spam filter is. In his study he looked at the major spam filters (DSPAM, SpamAssassin, etc.) along with those submitted by various academics. The results are quite surprising, with a previously unheard-of spam filter, which uses ideas from various compression algorithms, performing the best overall. He recently presented the results and methodology used in a presentation titled 'Spam Filters, Do they Work? and Can you prove it?'" Note that this is a video of his presentation.
Easier? (Score:3, Insightful)
Harder! (Score:5, Funny)
Re:Harder! (Score:5, Funny)
I won't be able to download my internet until Friday now!
Turn that crap down, and get off of my lawn! Damn kids!
Re:Harder! (Score:5, Insightful)
Your college doesn't like bandwidth-efficient delivery? Flood them with a Slashdot effect on a 500MB file, an extra $500 in bandwidth charges, and maybe they'll change their tune.
Re:Harder! (Score:2, Funny)
By chance, are you nearby?
I have a wonderful set of wikipedia tablets I made and I'm eager to offload them...er I mean... trade them.
It's the updates you see, I've been having a bit of a nightmare trying to keep them all in sync.
Re:Harder! (Score:2)
No, I understand the internet is actually a series of tubes, and there will be hell to pay if they get "full".
Re:Harder! (Score:5, Insightful)
Re:Harder! (Score:2, Insightful)
Re:Harder! (Score:2, Insightful)
Yep, you're right. The best long-term information storage medium ever invented is poetry.
Re:Harder! (Score:2)
Apparently because of this there are vast amounts of Sumerian and related texts awaiting translation (the language was only deciphere
Re:Harder! (Score:3, Funny)
Re:Harder! (Score:3, Informative)
Re:Easier? (Score:2)
In my experience... (Score:5, Informative)
Re:In my experience... (Score:3, Insightful)
Re:In my experience... (Score:3, Insightful)
Well, the spammers have heard of the other methods too and try to subvert them. So give them time and see how it performs if and when it becomes more commonly used and the spammers are trying to beat it.
Re:In my experience... (Score:2)
Re:In my experience... (Score:3, Insightful)
First, spam does not need to make sense to make money. Here's some of my latest received headlines:
and the body text (with an attached image):
-----
malware
USDA databases crop
entente cordial: admission relation contract GB giveaway andd
studios another page:
-------
AND IT STILL MAKES MONEY!!!
spam is funded by idiots. we will never run out of idiots on the net. Thus, spam will
Re:In my experience... (Score:5, Funny)
I haven't tested this one myself, Barrett Filter [barrettrifles.com] but I understand it is 100% effective at reducing spam from known sources. False positives may be a problem, however.
Re:In my experience... (Score:2)
oh, wait, you can't use that anymore. Try "Aw, look, they're starvin' to death! We have to thin the herd!"
Re:In my experience... (Score:5, Insightful)
False positives are a HUGE problem compared to the occasional false negative.
I'd rather have a small trickle of spam emails (I can't believe I'm saying this, but hear me out) than I would risk missing out on that one truly important email.
Re:In my experience... (Score:2, Funny)
Mod as Funny (Score:2)
Re:In my experience... (Score:2)
Re:In my experience... (Score:5, Informative)
http://popfile.sourceforge.net/ [sourceforge.net]
Re:In my experience... (Score:3, Informative)
Re:In my experience... (Score:3, Insightful)
Why not just douse the server in gas... (Score:4, Funny)
Why not just douse the server in gas if you want to see it melt.
Re:Why not just douse the server in gas... (Score:5, Funny)
I'll post more next week after I watch the video.
Re:Why not just douse the server in gas... (Score:2, Informative)
Under present IST policy... (Score:4, Funny)
Torrent (Score:4, Informative)
Re:Torrent (Score:3, Informative)
Combo of SpamAssassin and Spamhaus (Score:2, Interesting)
Re:Combo of SpamAssassin and Spamhaus (Score:2, Insightful)
The key is still: don't give out your address. Once you've done that, you're going to be screwed eventually.
A good DUL helps (Score:2)
Re:Combo of SpamAssassin and Spamhaus (Score:3, Informative)
Re:Combo of SpamAssassin and Spamhaus (Score:2)
And turn off SMTP VRFY.
Disabling SMTP VRFY (or recipient-checking at the SMTP level in general) is pointless. Given a choice between rejecting mail to invalid addresses up front, or having to deal with bounce-back scatter and getting your MX server blacklisted for third-party spam, I'll take the former any day.
And I'd wager anyone who's had to admin a qmail server and decide which (if any) recipient-checking patch to use would feel the same way.
It's far less load on the servers to have a more e
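To illustrate the recipient-checking point above: a minimal Python sketch of answering at RCPT time, so that mail to unknown users is refused during the SMTP session instead of being accepted and bounced later. The function name and the recipient set are hypothetical, and this is not how any particular qmail patch works.

```python
from typing import Set


def rcpt_response(recipient: str, valid_recipients: Set[str]) -> str:
    """Return the SMTP reply for a RCPT TO command.

    `valid_recipients` is an assumed, pre-lowercased set of known mailboxes.
    """
    if recipient.strip().lower() in valid_recipients:
        return "250 2.1.5 OK"
    # Refusing here means the sending server owns the failure; we never
    # generate a bounce to a (probably forged) sender address.
    return "550 5.1.1 User unknown"


mailboxes = {"alice@example.org", "bob@example.org"}
print(rcpt_response("Alice@example.org", mailboxes))
print(rcpt_response("nobody@example.org", mailboxes))
```

The trade-off the thread argues about is visible here: a 550 reply does reveal which addresses exist, but it avoids backscatter entirely.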
Re:Combo of SpamAssassin and Spamhaus (Score:3, Insightful)
Or, as in my case, you could assume that a university you apply to will not send out a giant mass email to all the incoming graduate students inviting them to the graduate orientation. So now I have the email address of every grad student entering the Univ
Re:Combo of SpamAssassin and Spamhaus (Score:3, Funny)
Nah, that's such a half measure. The real solution is to not have an email address at all.
Re:Combo of SpamAssassin and Spamhaus (Score:2)
Amazing! - We've been using that combo for a long time and I get about 5-10 spams AN HOUR coming through the filters (and about the same amount caught). This is all personalized spam sent to one specific email address. That address was used in the past for a few newsgroup postings, a few technical forums and it was listed on a webpage some time ago. No spam sent to it
Fantastic Spam Filters Which Work Best Proving! (Score:5, Funny)
Hey Slashdot, what's up, man! Dude, I read your thing and like totally agree about Best Work Proving Spam Site Work! Dude, that's awesome!
Bro, in the same vein, I was totally checking out this dope ass site [microsoft.com] which you might wanna check out [doubleclick.net] too man. Guys like us that dig Spam Which Proving and Best work Filters will be all over this before long...
OK, man take care until I see you this Friday at the dinner thing, Slashdot!
Cheers,
John
Re: Very Interesting And Generally Really Amusing (Score:5, Funny)
Having trouble pleasing your woman? I've got something Very Interesting And Generally Really Amusing that you could try!!!
Your buddy,
_vAnoymousCoward
Amusingly, POPFile caught you (Score:5, Interesting)
RTFA? (Score:5, Insightful)
Re:RTFA? (Score:5, Funny)
Re:RTFA? (Score:2)
Re:RTFA? (Score:2)
Not surprising... (Score:4, Insightful)
If they aren't used widely, it would either be because they don't work, or they do work but they haven't caught on [yet].
It's like any other fad. As an example, when the original Survivor series came out, it was really popular because it achieved its goal (attracting viewers) in a way that was original. Heck, even I watched the original one. Now that all the networks are doing the reality TV thing, it has become hackneyed, and each successive version of survivor does a worse job of achieving its goal. And I've given up watching TV.
With antispam, new techniques are effective, but as they become more popular and more widely used, spammers will find equally innovative ways of getting around them.
I've noticed that at any given time, there will be a particular style of (non-blank) spam that manages to get through Gmail's filters fairly consistently, but every now and then Gmail adapts its spam filters to block the successful spam type of the season, and eventually a new type will make its way through.
- RG>
Re:Not surprising... (Score:2)
My office went from 2000 spam mails a day to about 10, across 15 employees. Who gives a crap about the 10 emails remaining...
I only wish it could be taken care of further upstream to shut those pricks down. But for the end user, from an admin's perspective, most systems are pretty easy to deal with (particularly in small offices).
Got to go with Brightmail (Score:5, Informative)
I also echo a gripe of other posters. It's nice to have a video, but a 500MB video file is a bit much. A 50KB pie chart or bar graph would have been nice.
Re:Got to go with Brightmail (Score:3, Informative)
And what happened when you retrained those false positives as ham? Did you see future mails of the same/similar type get caught again? I bet you didn't.
I've been using dspam for a very long time for my users, and they love it. They love having zero spam in their mailbox, they love the simplicity of the user interface. They love how it treats users on a per-user basis, not globally (i.e. so
Flaw in the test (Score:5, Informative)
As with most choices like this, factors such as ease of use, speed, and resource efficiency can overshadow selectivity. No system is perfect, so it's perfectly reasonable to go with a system that's pretty good if you already are using it, rather than switching to the latest cool thing.
I have found that using two dissimilar systems in a chain is quite effective.
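A rough Python sketch of the "two dissimilar systems in a chain" idea: a message is treated as spam if any stage flags it. The two toy stages and the miss rates are made up for illustration, and the miss-rate arithmetic only holds under the assumption that the stages make independent mistakes, which real filters may not.

```python
from typing import Callable, Iterable

Filter = Callable[[str], bool]  # returns True when the message looks like spam


def chain(filters: Iterable[Filter]) -> Filter:
    """Build a filter that flags a message as soon as any stage flags it."""
    stages = list(filters)

    def combined(message: str) -> bool:
        return any(stage(message) for stage in stages)

    return combined


def combined_miss_rate(miss_rates: Iterable[float]) -> float:
    """Fraction of spam that slips past every stage, ASSUMING independent errors."""
    result = 1.0
    for rate in miss_rates:
        result *= rate
    return result


# Hypothetical stages: one keys on wording, the other on message structure.
def keyword_stage(message: str) -> bool:
    return "cheap pills" in message.lower()


def link_stage(message: str) -> bool:
    return message.lower().count("http://") > 3


spam_check = chain([keyword_stage, link_stage])
print(spam_check("Cheap pills, today only"))
print(spam_check("Minutes from Tuesday's meeting attached"))
print(combined_miss_rate([0.10, 0.10]))
```

The catch is that false positives accumulate in the same way: legitimate mail is lost if either stage wrongly flags it, so chaining trades a lower miss rate for a somewhat higher false-positive rate.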
Re:Flaw in the test (Score:2)
And that applies to spam filtering techniques as well - it's like antibiotics. For serious stuff, a spread attack is a good idea.
I've found that using RBLs, SpamAssassin, and Bayesian filters prevents 99.5% of spam with essentially no false positives. And that means, by my day-to-day experience with addresses spammed for a full 10 years now, that instead of getting 100 spam and one real mail, I get 1 real mail, and once every couple of days a spam that gets through.
Except for earlier this ye
Re:Flaw in the test (Score:3, Insightful)
Lately, I've been thinking about this problem a lot. The classic method of computer classification systems (Bayes, SVM, whatever) are all base
Re: (Score:2)
Re:Flaw in the test (Score:2)
Theo van Dinter added a rule to catch these to the core rules on Tuesday.
Re:Flaw in the test (Score:3, Insightful)
obscurity (Score:2)
Re:obscurity (Score:2)
I have a successful spamfilter deployed at work. It uses SpamAssassin for the backend filtering, but that part has to do very little.
The bulk of the rejecting is done in the dedicated SMTP engine that receives the mail. There is a lot of information to be deduced from the SMTP transaction itself, which is normally not used by spamfilters.
Close adherence to RFC standards is something that most SMTP servers have achieved quite well, and the tools the spammers use are very bad at it.
I know
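As an illustration of the kind of information available in the SMTP transaction itself, here is a Python sketch that sanity-checks the name a client gives in HELO/EHLO. The poster does not say which checks their engine performs, so the two rules, the function name, and the sample domains are all assumptions; the regular expressions are deliberately loose approximations, not a full implementation of the SMTP grammar.

```python
import re
from typing import List, Set

# Loose pattern for a fully qualified domain name: two or more dot-separated labels.
FQDN = re.compile(
    r"^(?=.{1,253}$)([A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?\.)+[A-Za-z]{2,63}$"
)
# IPv4 address literal in brackets, e.g. [192.0.2.1]
ADDRESS_LITERAL = re.compile(r"^\[(\d{1,3}\.){3}\d{1,3}\]$")


def helo_problems(helo_name: str, our_domains: Set[str]) -> List[str]:
    """List reasons to distrust the name given in HELO/EHLO (empty list = none found)."""
    problems = []
    name = helo_name.strip().rstrip(".").lower()
    if not (FQDN.match(name) or ADDRESS_LITERAL.match(name)):
        problems.append("HELO name is neither an FQDN nor an address literal")
    if name in our_domains:
        # An outside client has no business introducing itself as us.
        problems.append("client claims to be one of our own domains")
    return problems


ours = {"example.com"}
for candidate in ("mail.example.org", "localhost", "192.0.2.1", "example.com"):
    print(candidate, helo_problems(candidate, ours))
```

Whether a failed check should reject outright or merely add to a score is a policy decision; some legitimate but misconfigured servers fail checks like these too.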
I got the 400M download! (Score:4, Funny)
text versions of the material (Score:5, Informative)
The official tests of spamfilters were done in last year's TREC conference, you can read the writeup here [uwaterloo.ca] (or pdf overview [uwaterloo.ca]).
You can duplicate those tests yourself if you download the evaluation toolkit (GPL) [uwaterloo.ca]. It's a modular system where you can add a mail corpus (either one of the public TREC ones, or you can make your own trivially), and add a spamfilter package (there are 10 or so to download from the web, or create your own as per documentation).
There's also a video talk [researchchannel.org] given at Microsoft research which should cover pretty much the same ground, if text mode is slashdotted :).
There's a new scheduled test towards the end of the year at TREC 2006.
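For readers who want a feel for what such an evaluation measures, here is a small Python sketch that computes the two basic error rates from a list of (true label, filter verdict) pairs. It is not the interface of the TREC toolkit; the function name and the data layout are assumptions made for this example.

```python
from typing import Iterable, Tuple


def misclassification_rates(results: Iterable[Tuple[str, str]]) -> Tuple[float, float]:
    """Return (fraction of ham misclassified, fraction of spam misclassified).

    Each item is (gold label, filter verdict), both either "ham" or "spam".
    """
    ham_total = ham_wrong = spam_total = spam_wrong = 0
    for gold, verdict in results:
        if gold == "ham":
            ham_total += 1
            ham_wrong += verdict != "ham"    # false positive: good mail flagged
        else:
            spam_total += 1
            spam_wrong += verdict != "spam"  # false negative: spam delivered
    ham_rate = ham_wrong / ham_total if ham_total else 0.0
    spam_rate = spam_wrong / spam_total if spam_total else 0.0
    return ham_rate, spam_rate


run = [
    ("ham", "ham"), ("ham", "spam"),
    ("spam", "spam"), ("spam", "spam"), ("spam", "ham"), ("spam", "spam"),
]
print(misclassification_rates(run))  # (0.5, 0.25)
```

Reporting the two rates separately matters because, as several posters note, losing one legitimate message usually costs far more than letting one spam through.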
Re:text versions of the material (Score:2)
IMarv
Only one question... (Score:2)
Re:Only one question... (Score:3, Insightful)
If your mail is that important, you should be using couriers instead of email.
Human classification is not zero risk (Score:2)
Re:Only one question... (Score:2)
Eivind.
Re:Only one question... (Score:2)
You could have it filtered out completely only if its suspect rating is high enough, and otherwise just tag it if the rating is below a certain point.
That said... white lists are your friends.
Funny thing though... someone forwarded me some "funny" e-mail and usually they are not that humorous. I was so damned pleased when it was filtered out.
That said, I haven't moved to deletion just yet. I just tag the mail and sort it later. As soon as I'm sufficiently happy with the system highly suspect mails can
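The two-threshold idea above (tag moderately suspect mail, drop only the highly suspect, and let white lists bypass both) can be sketched in Python as follows. The threshold values and names are placeholders chosen for this example, not settings from any particular filter.

```python
def disposition(score: float, tag_at: float = 5.0, drop_at: float = 15.0,
                whitelisted: bool = False) -> str:
    """Decide what to do with a message given its spam score.

    `tag_at` and `drop_at` are hypothetical thresholds; tune them to taste.
    """
    if whitelisted:
        return "deliver"   # white lists override the score entirely
    if score >= drop_at:
        return "drop"      # only very confident verdicts are discarded
    if score >= tag_at:
        return "tag"       # suspect mail is marked and sorted later
    return "deliver"


for s in (1.2, 7.5, 22.0):
    print(s, disposition(s))
print(22.0, disposition(22.0, whitelisted=True))
```

Setting `drop_at` very high at first, then lowering it as confidence in the filter grows, matches the cautious approach the poster describes.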
Re:Only one question... (Score:2)
On the efficiency side it has a hit rate of nearly 100%. I would have said it was 100% a couple of months back, but ju
Ask Slashdot ... (Score:5, Funny)
At the university where I work, they have recently adopted a pesky policy banning the use of bitTorrent.
What can I do to fix [uwaterloo.ca] this ?
Yours faithfully,
Dr. Gord Cormack
Argh! Gratuitous Video! (Score:2, Insightful)
Good job I don't filter web content (Score:2, Funny)
Spam about asian donkeys is a new one on me, though.
MS Anti Spam... (Score:2)
Re:MS Anti Spam... (Score:2)
Usually, anti-spam solutions which give more than 1:100000 are considered worthless. What you're quoting is beyond words.
Re:MS Anti Spam... (Score:2)
No, better than 1:100 - that's what <1% means. It's actually around 1:500.
Usually, anti-spam solutions which give more than 1:100000 are considered worthless
Got links, or is that just your opinion?
Re:MS Anti Spam... (Score:3, Informative)
And thus still 200 times worse than the acceptable rate.
There was a massive flamefest on debian-devel about spam filtering recently, but false positive ratios in that range were something commonly used by most participants in the discussion. I don't have the time to
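The ratios being argued over in this sub-thread are easy to check. The Python sketch below uses the two figures quoted above (1:500 and 1:100000) and exact fractions; the function names are invented for the example, and whether 1:100000 is really the right bar is the poster's claim, not something the arithmetic settles.

```python
from fractions import Fraction


def times_worse(observed: Fraction, target: Fraction) -> Fraction:
    """How many times higher the observed false-positive rate is than the target."""
    return observed / target


def expected_false_positives(rate: Fraction, legit_messages: int) -> Fraction:
    """Expected number of legitimate messages lost out of `legit_messages`."""
    return rate * legit_messages


observed = Fraction(1, 500)      # "around 1:500" from the thread
target = Fraction(1, 100_000)    # the bar the other poster asserts

print(times_worse(observed, target))                 # 200
print(expected_false_positives(observed, 100_000))   # 200
print(expected_false_positives(target, 100_000))     # 1
```

In other words, over 100,000 legitimate messages the difference is roughly 200 lost mails versus 1.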
So what is the previously unheard of spam filter? (Score:2)
No bittorrent... No credibility (Score:5, Insightful)
This guy should spend his time educating the fools at his institution.
Possible Text Version (Score:4, Informative)
Gordon Cormack and Thomas Lynam
Full Text, May 29, 2006 - PDF Format
http://plg.uwaterloo.ca/~gvcormac/spamcormack.html / [uwaterloo.ca]
Re:Possible Text Version (Score:4, Informative)
Fidelis Assis (who has now gone solo after having participated in the CRM114 project) shows great results for his recent solo effort: OSBF-lua [luaforge.net]. Bratko's PPM spam filter [ai.ijs.si] -- the one that did great at TREC -- is not yet packaged as a drop-in filter. Same for my DMC spam filter [www.ceas.cc].
The actual TREC 2005 tests referred to in TFA are here. [uwaterloo.ca]
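To give a feel for how compression can be used to classify mail, here is a toy Python sketch: a message is assigned to whichever corpus lets it be compressed into fewer additional bytes. It uses zlib (DEFLATE) purely as a stand-in; it is not the PPM or DMC filter mentioned above, and the function names and tiny corpora are invented for the example.

```python
import zlib


def compressed_size(text: str) -> int:
    """Number of bytes zlib needs to encode the text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def extra_bytes(corpus: str, message: str) -> int:
    """Additional compressed bytes needed when `message` is appended to `corpus`."""
    return compressed_size(corpus + "\n" + message) - compressed_size(corpus)


def classify(message: str, spam_corpus: str, ham_corpus: str) -> str:
    """Label the message by whichever corpus 'explains' it more cheaply."""
    spam_cost = extra_bytes(spam_corpus, message)
    ham_cost = extra_bytes(ham_corpus, message)
    return "spam" if spam_cost < ham_cost else "ham"


# Toy corpora; a real filter would be trained on thousands of messages.
spam_corpus = "buy cheap pills now " * 50
ham_corpus = "meeting agenda for the project review " * 50

print(classify("buy cheap pills now " * 3, spam_corpus, ham_corpus))
print(classify("meeting agenda for the project review " * 3, spam_corpus, ham_corpus))
```

The appeal of the approach is that it needs no tokenizer or hand-written rules: misspellings and odd punctuation are just more bytes for the model to have seen before. A whole-corpus zlib call like this is far too slow and coarse for production use, since DEFLATE only looks back 32 KB.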
GMail Spam Filter (Score:5, Interesting)
I use greylisting (gld to be specific) which works wonderfully. A couple of customers wanted even better filtering...
First I tried DSPAM, but they refused to train it so the results weren't good. Then I tried SpamAssassin, which also let through a surprising amount of spam - a lot more than my personal account on Gmail.
So I set up accounts on Gmail for them, and forwarded their mail to those accounts (after greylisting - don't want to burden GMail too much!). Gmail lets you set up forwarding, so I simply forwarded all the filtered mail back to a second account on my mailserver for the customer to pick up. Finally I wrote a Python script that logs in to Gmail once a week to prevent the account being closed due to non-use.
A tad involved, but it works like a dream. Yet again Google comes out on top, this time in a market it doesn't even know it's in!
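For anyone unfamiliar with the greylisting mentioned at the top of this comment, the core logic fits in a few lines of Python. This is a generic sketch, not gld's implementation; the class name, the 300-second delay, and the reply strings are assumptions, and a real deployment would also expire old entries and persist the table.

```python
import time
from typing import Dict, Optional, Tuple

Triplet = Tuple[str, str, str]  # (client IP, envelope sender, envelope recipient)


class Greylist:
    """Temporarily reject the first delivery attempt from an unknown triplet."""

    def __init__(self, min_delay: float = 300.0) -> None:
        self.min_delay = min_delay                  # seconds before a retry is accepted
        self.first_seen: Dict[Triplet, float] = {}

    def check(self, ip: str, sender: str, recipient: str,
              now: Optional[float] = None) -> str:
        now = time.time() if now is None else now
        key = (ip, sender.lower(), recipient.lower())
        first = self.first_seen.setdefault(key, now)  # remember the first attempt
        if now - first < self.min_delay:
            return "451 4.7.1 Greylisted, please try again later"
        return "250 OK"


g = Greylist(min_delay=300)
print(g.check("192.0.2.7", "a@example.net", "b@example.org", now=0))
print(g.check("192.0.2.7", "a@example.net", "b@example.org", now=60))
print(g.check("192.0.2.7", "a@example.net", "b@example.org", now=301))
```

The technique relies on legitimate mail servers retrying after a temporary failure while much spamware does not; the cost is a delay on the first message from each new correspondent.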
Re:GMail Spam Filter (Score:2, Interesting)
This is actually something Google could sell. Access to their mail filter. I do realize that they have "corporate email", but that still smacks a lot of GMail and some businesses would rather avoid that. Instead, they could provide a simple access to their spam filter. Yes, requiring all email to be piped through a Google server if they don't want to make the filter available as a binary (presumably updated regularly).
To minimize bandwidth consumption and (partly, at least) allay privacy / corporate secre
So Which One Won? (Score:2, Interesting)
Wouldn't it make sense to put this in the
Did I miss the obvious "and the winner is..." some place?
Cloudmark's SpamNet (Score:3, Interesting)
On the efficiency side it has a hit rate of nearly 100%. I would have said it was 100% a couple of months back, but just recently it's been having a bit of a problem with one stock-pushing spam.
Anyway, that aside it's the best spam filter I've ever seen by a very long way, and I'd highly recommend the service. It costs a few $ a month, but it's probably the best value subscription I have.
I have no connection with the company; I'm just a very satisfied customer who's been using it since the beta some years ago. I have a publicly available email address which I've had for years and which must be on many spam lists. Without Cloudmark it would be unusable; with it, it's no problem at all. I recently installed it for my wife, who was starting to get a lot of spam. There I noticed it took about two weeks to train it not to junk a few mailing list emails she was on, but since then it's been just as reliable as my installation.
Best spam filter. (Score:2)
Feel free to add to it.
Out of Date and Worthless (Score:5, Informative)
He tested spamassassin 2.3 - that's ancient! I'd imagine the other tools are similarly obsolete.
We currently use SA 3.1.4 with a well-trained Bayes database and Razor, Pyzor, and DCC.
Throw in a few custom rules and a selection of rules from http://www.rulesemporium.com/ [rulesemporium.com] and the results are outstanding.
With the new sa-update feature the core rules are updated between point releases, which came in useful this week dealing with the new image spams which seemed to be designed to avoid detection by spamassassin. Thanks Theo.
And the folk on the spamassassin-users mailing list really rock.
Re:Out of Date and Worthless (Score:4, Informative)
TREC 2006 [nist.gov] evaluations are now underway [uwaterloo.ca].
While it is reasonable to conjecture that spam has changed so as to defeat spam filtering techniques, or will change so as to defeat the PPM technique that did well at TREC, the historical evidence does not support this conjecture. In particular:
It is a war (Score:3, Insightful)
By the time that I have downloaded the video the war will have moved on a couple of iterations ...
Way to go compression ! (Score:2, Interesting)
Paul Vixie on botnets and spam (Score:3, Interesting)
The key paragraph:
If you'd like a more topical example, consider "spam". People began altering their e-mail "From:" lines in order to make their addresses harder to guess or aggregate; people began doing pattern matching in order to catch known-bad messages and either sideline or reject them. Many defenders used many small tricks to protect their inboxes. The result has not been that less spam is sent or even that less spam is received, on an aggregate basis. Things are worse now than they've ever been. (I say this as co-founder of MAPS LLC, by which I hope to establish my credentials in the spam field for those of you who do not know me.) Today a small number of highly advanced defenders is spam-immune only because they are a small number and their techniques are not widely effective against the attackers; and a small number of highly advanced attackers can "spam at will" a far larger population than ever before. And the trend is that things are getting worse, and getting worse faster than ever before.
Dspam floats my boat (Score:3, Informative)
What about Greylisting? (Score:2)
IMarv
Slides from the presentation (Score:3, Informative)
Re:Torrents (Score:3, Interesting)
Re:fuck power went out! (Score:2, Funny)
Re:I have one word: (Score:3, Informative)
Re:Spam Ass Asian? (Score:2)
Re:Little known systems will often be most effecti (Score:2)
And if it started getting worse you could move to tassie and get that feeling of irrelevancy back.
Re:Give grey listing a try... (Score:2)
Re:Why do they try? (Score:2, Insightful)
However, it's very often the end user's ISP doing the spam filtering - and this has no direct bearing on the gullibility of the email recipient.