Proving Which Spam Filters work Best

Posted by samzenpus on Thursday August 03, 2006 @12:14AM from the get-rid-of-it dept.

pirateninja writes "Dr. Gord Cormack decided to find and prove what the best spam filter is. In his study he looked at the major spam filters (DSPAM, SpamAssassin, etc.) along with those submitted by various academics. The results are quite surprising, with a previously unheard-of spam filter, which uses ideas from various compression algorithms, performing the best overall. He recently presented the results and methodology used in a presentation titled 'Spam Filters, Do they Work? and Can you prove it?'" Note that this is a video of his presentation.
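
The summary says only that the winning filter borrows ideas from compression algorithms; it does not name the filter or describe its method. As a rough illustration of the general idea (not the filter from the study), one common approach is to ask which corpus a new message compresses better against. The sketch below uses Python's standard `zlib` as a stand-in compressor; the function names are hypothetical.

```python
import zlib


def compressed_size(text: str) -> int:
    """Number of bytes zlib needs for the text at maximum compression."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def extra_bytes(corpus: str, message: str) -> int:
    """How many more compressed bytes the corpus needs once the message is appended."""
    return compressed_size(corpus + "\n" + message) - compressed_size(corpus)


def classify(message: str, spam_corpus: str, ham_corpus: str) -> str:
    """Label the message by whichever corpus absorbs it more cheaply.

    Ties go to "ham", on the assumption that a missed spam is cheaper
    than a lost legitimate message.
    """
    spam_cost = extra_bytes(spam_corpus, message)
    ham_cost = extra_bytes(ham_corpus, message)
    return "spam" if spam_cost < ham_cost else "ham"
```

A message that repeats phrases already present in the spam corpus costs only a few extra bytes there, so it is labelled spam. A real filter would need far larger corpora and a compressor better suited to the task than this toy setup.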

This discussion has been archived. No new comments can be posted.

Easier? (Score:3, Insightful)

by Ec|ipse ( 52 ) writes: on Thursday August 03, 2006 @12:22AM (#15837338)

Isn't there an easier way to display the results, like a chart or something? A 400MB download per file is a bit extreme.

- Harder! (Score:5, Funny)
  
  by Profane MuthaFucka ( 574406 ) writes: <busheatskok@gmail.com> on Thursday August 03, 2006 @12:53AM (#15837466) Homepage Journal
  
  I uuencoded the video file, translated it into Sumerian cuneiform, and pressed it into a billion little clay tablets. They are cooking in my oven right now. Now, the Internet is NOT some kind of truck you can just dump stuff onto, so if you want to get the data you're going to have to come to my house.
  
  - Re:Harder! (Score:5, Funny)
    
    by rts008 ( 812749 ) writes: on Thursday August 03, 2006 @01:05AM (#15837509) Journal
    
    I can't come to your house, you insensitive clod!, teh tubes are clogged with clay tablets!
    
    I won't be able to download my internet until Friday now!
    
    Turn that crap down, and get off of my lawn! Damn kids!
    
    - Re:Harder! (Score:5, Insightful)
      
      by Squalish ( 542159 ) writes: <Squalish AT hotmail DOT com> on Thursday August 03, 2006 @08:30AM (#15838679) Journal
      
      Am I the only one that read the means of presentation as a hilarious attack on a university policy of blocking bittorrent? Given that adding 470MB doesn't really add any usable information to a discussion about spam filters over a piece of text, and all.
      
      Your college doesn't like bandwidth-efficient delivery? Flood them with a Slashdot effect on a 500mb file, an extra $500 in bandwidth charges, and maybe they'll change their tune.
      
  - Re:Harder! (Score:2, Funny)
    
    by Cylix ( 55374 ) writes:
    
    Excellent...
    
    By chance, are you nearby?
    
    I have a wonderful set of wikipedia tablets I made and I'm eager to offload them...er I mean... trade them.
    
    It's the updates you see, I've been having a bit of a nightmare trying to keep them all in sync.
  - Re:Harder! (Score:2)
    
    by ciscoguy01 ( 635963 ) writes:
    
    Now, the Internet is NOT some kind of truck you can just dump stuff onto, so if you want to get the data you're going to have to come to my house.
    
    No, I understand the internet is actually a series of tubes, and there will be hell to pay if they get "full".
  - Re:Harder! (Score:5, Insightful)
    
    by cruachan ( 113813 ) writes: on Thursday August 03, 2006 @05:43AM (#15838190)
    
    Don't knock it, cuneiform on baked clay is the single most successful format for long-term storage ever invented - 3000 years and counting. Heck, most of our modern storage formats can't even manage 30 - tried to read an 8" floppy recently?
    
    - Re:Harder! (Score:2, Insightful)
      
      by Jartan ( 219704 ) writes:
      
      I'm not going to knock it but your statement is very far from the truth. Determining the "most successful" long term storage method invented would require waiting till the year 5xxx something to see if something we've currently invented beats cuneiform. Even then it's pretty hard to prove one way or another since a lot of the cuneiform we have today is being carefully taken care of to prolong its lifetime I'd suspect (though I have no confirmation of that part).
      - Re:Harder! (Score:2, Insightful)
        
        by ozmanjusri ( 601766 ) writes:
        
        I'm not going to knock it but your statement is very far from the truth.
        Yep, you're right. The best long-term information storage media ever invented is poetry.
      - Re:Harder! (Score:2)
        
        by cruachan ( 113813 ) writes:
        
        Presumably the tablets in museums are, but I was listening to an archeologist recently who was saying that the advantage of clay tablets is that they are virtually indestructible unless you purposely take a hammer to them. Compare that to papyrus and parchment which can get preserved a long time, but need special circumstances, such as being buried in a bog, to stop decay.
        
        Apparently because of this there are vast amounts of Sumerian and related texts awaiting translation (the language was only deciphere
        
        Re:Harder! (Score:3, Funny)
        
        by Crayon Kid ( 700279 ) writes:
        
        Bwahaha, I'm moving my blog to clay tablets. They will undoubtedly survive the next Ice Age and the people of year 5000 will be forced to read about my cat, how I hate Emo's and that guy at work who doesn't wash. But first I'll change my blog nick to "Earth Imperial Overlord Supreme", just to fuck with them future dudes.
    - - Re:Harder! (Score:3, Informative)
        
        by cruachan ( 113813 ) writes:
        
        True, but as I said below, there are literally mounds of baked clay tablets because they are so indestructible. Apparently they used to get shovelled into foundations and the like. The estimate I heard was that at current rates it will take scholars several hundred years to translate what we've found already. Compare that to parchment records where the discovery of even a few new scraps is a major event (http://news.bbc.co.uk/1/hi/sci/tech/5235894.stm and particularly http://news.bbc.co.uk/1/hi/world/europe [bbc.co.uk]
- Re:Easier? (Score:2)
  
  by eric76 ( 679787 ) writes:
  
  They could have just e-mailed it to everyone with a gmail account.
In my experience... (Score:5, Informative)

by vivin ( 671928 ) writes: <vivin.paliath@g[ ]l.com ['mai' in gap]> on Thursday August 03, 2006 @12:24AM (#15837340) Homepage Journal

... the ones which have worked best (for me) are Bayesian Spam Filters [wikipedia.org] (A Plan for Spam [paulgraham.com], SpamBayes - a free filter [sourceforge.net]) and CRM114 [sourceforge.net] The Controllable Regex Mutilator (Paul Graham mentions it here [paulgraham.com]). I've always had a very high success rate with these.
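
For readers who have not seen one, the core of a Bayesian filter fits in a few lines. The sketch below is a plain multinomial naive Bayes classifier with add-one smoothing. It is not Paul Graham's exact formula, nor the code inside SpamBayes or CRM114, and the tokenizer and class name are made up for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical tokenizer: lowercase runs of letters, digits and a few symbols.
TOKEN_RE = re.compile(r"[a-z0-9$'!-]+")


def tokenize(text):
    return TOKEN_RE.findall(text.lower())


class NaiveBayesFilter:
    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record the tokens of one message under label 'spam' or 'ham'."""
        self.counts[label].update(tokenize(text))
        self.messages[label] += 1

    def spam_probability(self, text):
        """Estimated probability that the message is spam, between 0 and 1."""
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"]))
        spam_total = sum(self.counts["spam"].values())
        ham_total = sum(self.counts["ham"].values())
        # Start from the (smoothed) prior odds, then add each token's evidence.
        log_odds = math.log((self.messages["spam"] + 1) / (self.messages["ham"] + 1))
        for tok in tokenize(text):
            p_spam = (self.counts["spam"][tok] + 1) / (spam_total + vocab + 1)
            p_ham = (self.counts["ham"][tok] + 1) / (ham_total + vocab + 1)
            log_odds += math.log(p_spam / p_ham)
        # Clamp so the logistic function cannot overflow on long messages.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))
```

Training per user on that user's own spam and ham is what lets filters of this kind adapt to individual mailboxes.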

- Re:In my experience... (Score:3, Insightful)
  
  by coffeeisclassy ( 991791 ) writes:
  
  What's surprising is, while Bayesian spam filters work well in his tests, the one that performs the best was never really heard of before.... I wonder how long it will be before we see something using the methods available, who wants to bet OpenSource will beat closed source to implementing this?
  - Re:In my experience... (Score:3, Insightful)
    
    by 1u3hr ( 530656 ) writes:
    
    What's surprising is, while Bayesian spam filters work well in his tests, the one that performs the best was never really heard of before
    Well, the spammers have heard of the other methods too and try to subvert them. So give them time and see how it performs if and when it becomes more commonly used and the spammers are trying to beat it.
    - Re:In my experience... (Score:2)
      
      by marcello_dl ( 667940 ) writes:
      
      yep but good luck trying to defeat different algorithms and still retaining some sense, let alone a convincing message. Unless people are going to trust a sender named "Honey bee furufuru", which unfortunately is still entirely possible.
      - Re:In my experience... (Score:3, Insightful)
        
        by jank1887 ( 815982 ) writes:
        
        Hello. welcome to the internet.
        First, spam does not need to make sense to make money. Here's some of my latest received headlines:
        
        placing LEDhas
        pJapans mission
        capture Todays architect shared
        6MZ
        and the body text (with an attached image):
        -----
        malware
        USDA databases crop
        entente cordial: admission relation contract GB giveaway andd
        studios another page:
        ... (etc.,etc.)
        -------
        AND IT STILL MAKES MONEY!!!
        spam is funded by idiots. we will never run out of idiots on the net. Thus, spam will
- Re:In my experience... (Score:5, Funny)
  
  by ozmanjusri ( 601766 ) writes: <aussie_bob@hoMOSCOWtmail.com minus city> on Thursday August 03, 2006 @12:30AM (#15837378) Journal
  
  I've always had a very high success rate with these.
  I haven't tested this one myself, Barrett Filter [barrettrifles.com] but I understand it is 100% effective at reducing spam from known sources. False positives may be a problem, however.
  
  - Re:In my experience... (Score:2)
    
    by emag ( 4640 ) writes:
    
    Just repeat after me... "They're comin' right for us!"
    
    oh, wait, you can't use that anymore. Try "Aw, look, they're starvin' to death! We have to thin the herd!"
  - Re:In my experience... (Score:5, Insightful)
    
    by KlaymenDK ( 713149 ) writes: on Thursday August 03, 2006 @05:47AM (#15838197) Journal
    
    "False positives may be a problem, however."
    
    False positives are a HUGE problem compared to the occasional "false negative".
    
    I'd rather have a small trickle of spam emails (I can't believe I'm saying this, but hear me out) than I would risk missing out on that one truly important email.
    
    - Re:In my experience... (Score:2, Funny)
      
      by cerberusss ( 660701 ) writes:
      
      Ummm, you didn't get the joke. Read it again, Sam.
  - Mod as Funny (Score:2)
    
    by vivin ( 671928 ) writes:
    
    INFORMATIVE? Mod the parent FUNNY, please.
- Re:In my experience... (Score:2)
  
  by TorKlingberg ( 599697 ) writes:
  
  SpamAssassin uses Bayesian Filtering as well as other methods.
- Re:In my experience... (Score:5, Informative)
  
  by Red Alastor ( 742410 ) writes: on Thursday August 03, 2006 @02:00AM (#15837652)
  
  I like popfile because it's a bayesian filter that sorts into any arbitrary categories you want, not just spam and ham.
  http://popfile.sourceforge.net/ [sourceforge.net]
  
- Re:In my experience... (Score:3, Informative)
  
  by ajs ( 35943 ) writes:
  In my experience, the commercial offerings (such as mail frontier) aren't too bad. As far as open source stuff, my personal setup of choice is:
  
  Spamhaus SBL/XBL filtering (hard SMTP-time DNSBLing) based on my experience with them and their consistent listing of VIOLATORS, not just anyone who shares a netblock with a spammer (i.e. they may not catch as much as some others, but they don't have the FP rate that others do)
  Greylisting. This is controversial because many people can't tolerate the delay it introdu
- - Re:In my experience... (Score:3, Insightful)
    
    by I!heartU ( 708807 ) writes:
    
    Domain keys... now just get everyone to use it.
Why not just douse the server in gas... (Score:4, Funny)

by shotgunefx ( 239460 ) writes: on Thursday August 03, 2006 @12:25AM (#15837343) Journal

400MB?

Why not just douse the server in gas if you want to see it melt.

- Re:Why not just douse the server in gas... (Score:5, Funny)
  
  by Tsiangkun ( 746511 ) writes: on Thursday August 03, 2006 @12:28AM (#15837368) Homepage
  
  I'm getting 8kb/s downloads from the site, it's just like the good old days !
  
  I'll post more next week after I watch the video.
  
  - Re:Why not just douse the server in gas... (Score:2, Informative)
    
    by coffeeisclassy ( 991791 ) writes:
    
    It's round-robin mirrored across a whole bunch of different servers, so if you're only getting 8kb/s you could try cancelling and downloading again and seeing if it goes faster.
- Under present IST policy... (Score:4, Funny)
  
  by patio11 ( 857072 ) writes: on Thursday August 03, 2006 @12:39AM (#15837407)
  
  ... they are not allowed to douse the servers in gas.
  
- Torrent (Score:4, Informative)
  
  by vivin ( 671928 ) writes: <vivin.paliath@g[ ]l.com ['mai' in gap]> on Thursday August 03, 2006 @06:55AM (#15838360) Homepage Journal
  
  Here [vivin.net] is a torrent I made of the xvid file. It should work (I hope).
  
  - Re:Torrent (Score:3, Informative)
    
    by wayne ( 1579 ) writes:
    
    Your tracker is still 440'ing, so I have put up an alternative tracker [schlitt.net]. As I write this, I only have about 9% of the avi downloaded, so if someone else can seed the complete cormack-spam-xvid.avi file, I would greatly appreciate it.
Combo of SpamAssassin and Spamhaus (Score:2, Interesting)

by hyperion454 ( 766214 ) writes:

At work we've set up a combination of SpamAssassin and Spamhaus. Personally I've gone from about 10 spams per day to about 1 every two weeks.
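
For anyone wondering how a Spamhaus-style blocklist is consulted: DNS blocklists are queried by reversing the octets of the connecting IPv4 address and looking that name up under the list's zone. The sketch below shows the convention only. The default zone name and both function names are assumptions for illustration, and a production setup should follow the list operator's own usage rules.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str = "zen.spamhaus.org") -> str:
    """Build the DNS name to look up, e.g. 192.0.2.99 -> 99.2.0.192.<zone>."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str = "zen.spamhaus.org") -> bool:
    """True if the zone returns an answer in 127.0.0.0/8 for this address.

    Simplification: some operators use special return codes to signal
    query errors, which this sketch does not distinguish from listings.
    """
    try:
        answer = socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return False  # no such name: the address is not listed
    return answer.startswith("127.")
```

A mail server would typically run this check at SMTP time and reject or score the connection before the message body is ever accepted.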
- Re:Combo of SpamAssassin and Spamhaus (Score:2, Insightful)
  
  by b0r1s ( 170449 ) writes:
  
  Bah. We use Spamassassin, multiple DNSBLs, and I still get hundreds per day, most of them to addresses published on websites (unavoidable).
  
  The key is still: don't give out your address. Once you've done that, you're going to be screwed eventually.
  - A good DUL helps (Score:2)
    
    by winkydink ( 650484 ) * writes:
    
    DUL = Dial-Up List... a bit of a misnomer as it commonly refers to all dynamic hosts. My spam went down dramatically after starting to use Trend's DUL (formerly MAPS). Alas, it's a pay service, but it all comes down to your pain threshold. Mine is low relative to my income.
  - Re:Combo of SpamAssassin and Spamhaus (Score:3, Informative)
    
    by emag ( 4640 ) writes:
    
    And turn off SMTP VRFY. Either that, or having windows systems @ my ISP managed to get the address associated with my account on spam lists. This is an address that's *only* used internally by my ISP (I use pobox or my own domain whenever someone asks for an address). Even that wasn't enough to prevent it from getting harvested. :-(
    - Re:Combo of SpamAssassin and Spamhaus (Score:2)
      
      by Etcetera ( 14711 ) writes:
      
      And turn off SMTP VRFY.
      
      SMTP VRFY (or recipient-checking at the SMTP level in general) being disabled is pointless. Given a choice between allowing people to not send mail to invalid addresses or having to deal with bounce-back scatter and getting your MX server blacklisted for third-party spam, I'll take the former any day.
      
      And I'd wager anyone who's had to admin a qmail server and decide which (if any) recipient-checking patch to use would feel the same way.
      
      It's far less load on the servers to have a more e
  - Re:Combo of SpamAssassin and Spamhaus (Score:3, Insightful)
    
    by antifoidulus ( 807088 ) writes:
    
    Heh, even if you are reasonably diligent in protecting your email address, 9 times out of 10 it will still get out (though maybe not as badly). All it takes is one recipient with a compromised windows box and your address can be all over the spammers' lists in no time.
    Or, as in my case, you could assume that a university you apply to will not send out a giant mass email to all the incoming graduate students inviting them to the graduate orientation. So now I have the email address of every grad student entering the Univ
  - Re:Combo of SpamAssassin and Spamhaus (Score:3, Funny)
    
    by jdowland ( 764773 ) writes:
    
    The key is still: don't give out your address. Once you've done that, you're going to be screwed eventually.
    
    Nah, that's such a half measure. The real solution is to not have an email address at all.
- Re:Combo of SpamAssassin and Spamhaus (Score:2)
  
  by xenobyte ( 446878 ) writes:
  
  At work we've set up a combination of SpamAssassin and Spamhaus. Personally I've gone from about 10 spams per day to about 1 every two weeks.
  
  Amazing! - We've been using that combo for a long time and I get about 5-10 spams AN HOUR coming through the filters (and about the same amount caught). This is all personalized spam sent to one specific email address. That address was used in the past for a few newsgroup postings, a few technical forums and it was listed on a webpage some time ago. No spam sent to it
Fantastic Spam Filters Which Work Best Proving! (Score:5, Funny)

by _vSyncBomb ( 50710 ) writes: on Thursday August 03, 2006 @12:30AM (#15837375) Journal

Hey Slashdot, what's up, man! Dude, I read your thing and like totally agree about Best Work Proving Spam Site Work! Dude, that's awesome!

Bro, in the same vein, I was totally checking out this dope ass site [microsoft.com] which you might wanna check out [doubleclick.net] too man. Guys like us that dig Spam Which Proving and Best work Filters will be all over this before long...

OK, man take care until I see you this Friday at the dinner thing, Slashdot!

Cheers,
John

- Re: Very Interesting And Generally Really Amusing (Score:5, Funny)
  
  by Anonymous Coward writes: on Thursday August 03, 2006 @12:55AM (#15837473)
  
  Hey _vSyncBomb,
  
  Having trouble pleasing your woman? I've got something Very Interesting And Generally Really Amusing that you could try!!!
  
  Your buddy,
  _vAnoymousCoward
  
- Amusingly, POPFile caught you (Score:5, Interesting)
  
  by patio11 ( 857072 ) writes: on Thursday August 03, 2006 @04:15AM (#15837987)
  
  I ran your message through a perl script to mail it to me for giggles (I do research on spam filtering at ye olde day job). Regretfully, you didn't make it through. Aside from header garbage, which was a mixed bag (half spam tokens, half "known-good automated email" tokens), you ran into problems with dope, ass, wanna, and... work*. Which is just as well, as I have no desire to speak to anyone who uses those words. * Last 15 occurrences in my mailbox are all of the "Make l0ads of $$$ work @ h0m3!" variety.
  
RTFA? (Score:5, Insightful)

by glowworm ( 880177 ) writes: on Thursday August 03, 2006 @12:39AM (#15837408) Journal

So, how are we supposed to RTFA when the FA is over 470MB and a video file? Why not just a nice simple text summary, Mr Submitter? But nooooo, that would just be too easy!

- Re:RTFA? (Score:5, Funny)
  
  by emag ( 4640 ) writes: <slashdot@nosPAm.gurski.org> on Thursday August 03, 2006 @12:55AM (#15837475) Homepage
  
  "We are sorry that these talks are not available as plain HTML, PDF, or text, however under present IST policy we are not allowed to provide plain HTML, PDF, or text."
  
  - Re:RTFA? (Score:2)
    
    by Enderandrew ( 866215 ) writes:
    
    Yes, but the person submitting the story to Slashdot when preparing their little blurb could have spilled the results.
- Re:RTFA? (Score:2)
  
  by cerberusss ( 660701 ) writes:
  
  Yeah now the tubes are full again.
Not surprising... (Score:4, Insightful)

by RealGrouchy ( 943109 ) writes: on Thursday August 03, 2006 @12:45AM (#15837433)

Although I haven't WTFV (watched the video), it doesn't seem surprising that spam filters which use techniques that aren't used widely would be most successful.

If they aren't used widely, it would either be because they don't work, or they do work but they haven't caught on [yet].

It's like any other fad. As an example, when the original Survivor series came out, it was really popular because it achieved its goal (attracting viewers) in a way that was original. Heck, even I watched the original one. Now that all the networks are doing the reality TV thing, it has become hackneyed, and each successive version of survivor does a worse job of achieving its goal. And I've given up watching TV.

With antispam, new techniques are effective, but as they become more popular and more widely used, spammers will find equally innovative ways of getting around them.

I've noticed that at any given time, there will be a particular style of (non-blank) spam that manages to get through Gmail's filters fairly consistently, but every now and then Gmail adapts its spam filters to block the successful spam type of the season, and eventually a new type will make its way through.

- RG>

- Re:Not surprising... (Score:2)
  
  by Tweekster ( 949766 ) writes:
  
  Spam is easy to take care of, well, 99% of it. The rest isn't a big deal, so who cares.
  
  My office went from 2000 spam mails a day to about 10, across 15 employees. Who gives a crap about the 10 emails remaining...
  
  I only wish it could be taken care of further upstream to shut those pricks down. But for the end user, from an admin's perspective, most systems are pretty easy to deal with (particularly small offices)
Got to go with Brightmail (Score:5, Informative)

by saha ( 615847 ) writes: on Thursday August 03, 2006 @12:46AM (#15837437)

We use Brightmail [brightmail.com] on our campus and our users love it for its very low false positive rate and pretty accurate flagging of SPAM. Another campus uses DSPAM and some people are up in arms at the prospect of losing their Brightmail to switch to DSPAM. Personally, DSPAM isn't nearly as good and has flagged many legitimate messages and sent them to the Junk folder.

I also echo a gripe of other posters. It's nice to have a video, but a 500MB video file is a bit much. A 50KB pie chart or bar graph would have been nice.

- Re:Got to go with Brightmail (Score:3, Informative)
  
  by hacker ( 14635 ) writes:
  
  Personally, DSPAM isn't nearly as good and has flagged many legitimate messages and sent them to the Junk folder.
  
  And what happened when you retrained those false positives as ham? Did you see future mails of the same/similar type get caught again? I bet you didn't.
  I've been using dspam for a very long time for my users, and they love it. They love having zero spam in their mailbox, they love the simplicity of the user interface. They love how it treats users on a per-user basis, not globally (i.e. so
Flaw in the test (Score:5, Informative)

by lheal ( 86013 ) writes: <lheal1999NO@SPAMyahoo.com> on Thursday August 03, 2006 @12:48AM (#15837444) Journal

The spammers actively try to subvert the more popular filters. That gives a lesser-known one a decided advantage, one which will go away as it becomes more popular.

As with most choices like this, factors such as ease of use, speed, and resource efficiency can overshadow selectivity. No system is perfect, so it's perfectly reasonable to go with a system that's pretty good if you already are using it, rather than switching to the latest cool thing.

I have found that using two dissimilar systems in a chain is quite effective.
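
To make the chaining idea concrete, here is a minimal sketch in which a message is treated as spam if any filter in the chain flags it. The two toy filters are hypothetical stand-ins for two dissimilar real systems (say, a blocklist check followed by a statistical filter).

```python
from typing import Callable, Iterable

# A filter takes the message text and returns True when it looks like spam.
Filter = Callable[[str], bool]


def chain(filters: Iterable[Filter]) -> Filter:
    """Combine filters so that a message is flagged if any one of them flags it."""
    filters = list(filters)

    def combined(message: str) -> bool:
        # any() stops at the first filter that flags the message,
        # so later (possibly more expensive) filters see less mail.
        return any(f(message) for f in filters)

    return combined


def keyword_filter(message: str) -> bool:
    """Toy content filter: flags a couple of fixed phrases."""
    lowered = message.lower()
    return any(phrase in lowered for phrase in ("viagra", "work at home"))


def shouting_filter(message: str) -> bool:
    """Toy style filter: flags messages written mostly in capitals."""
    letters = [c for c in message if c.isalpha()]
    if not letters:
        return False
    return sum(c.isupper() for c in letters) / len(letters) > 0.7
```

The trade-off is that false positives add up as well: a legitimate message only has to trip one stage to be lost, so each stage in a chain like this needs a low false positive rate on its own.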

- Re:Flaw in the test (Score:2)
  
  by MadAhab ( 40080 ) writes:
  
  Excellent point.
  
  And that applies to spam filtering techniques as well - it's like anti-biotics. For serious stuff, a spread attack is a good idea.
  
  I've found that using RBLs, SpamAssassin, and Bayesian filters prevents 99.5% of spam with essentially no false positives. And that means, by my day-to-day experience with addresses spammed for a full 10 years now, that instead of getting 100 spam and one real mail, I get 1 real mail, and once every couple of days a spam that gets through.
  
  Except for earlier this ye
  - Re:Flaw in the test (Score:3, Insightful)
    
    by Jeffrey Baker ( 6191 ) writes:
    
    The problem with the spam filters, which you have stated, is that eventually a spammer figures out how to craft a spam which avoids the feature detection systems. Right now there's some zombie network sending around a stock market scam, of which I am getting roughly 300 copies per hour, even though spamassassin correctly classifies virtually all other unwanted mail.
    
    Lately, I've been thinking about this problem a lot. The classic method of computer classification systems (Bayes, SVM, whatever) are all base
    - Re: (Score:2)
      
      by account_deleted ( 4530225 ) writes:
      
      Comment removed based on user account deletion
      - Re:Flaw in the test (Score:2)
        
        by prandal ( 87280 ) writes:
        
        Use SA 3.1.4 and run-sa-update.
        
        Theo van Dinter added a rule to catch these to the core rules on Tuesday.
    - Re:Flaw in the test (Score:3, Insightful)
      
      by perlchild ( 582235 ) writes:
      
      A web of trust will work only until the computer of someone you trust gets subverted. The zombie network you mentioned doesn't happen by itself. Now, the smaller and more technically proficient the web of trust, the less likely it is to be subverted, but it's still vulnerable to someone you trust having their computer hijacked.
obscurity (Score:2)

by TheSHAD0W ( 258774 ) writes:

It may not be coincidence that a little-known filter algorithm produces the best results; many spammers probably test their spew on the more popular filters to try and fool them. If this new filter becomes more popular you may see its reliability decay.
- Re:obscurity (Score:2)
  
  by pe1chl ( 90186 ) writes:
  
  This is very true.
  I have a successful spamfilter deployed at work. It uses SpamAssassin for the backend filtering, but that part has to do very little.
  The bulk of the rejecting is done in the dedicated SMTP engine that receives the mail. There is a lot of information to be deduced from the SMTP transaction itself, which is normally not used by spamfilters.
  Close adherence to RFC standards is something that most SMTP servers have achieved quite well, and the tools the spammers use are very bad at it.
  I know
I got the 400M download! (Score:4, Funny)

by Ossifer ( 703813 ) writes: on Thursday August 03, 2006 @12:55AM (#15837474)

And I printed out every frame so I could scan them. I'll be posting the TIFFs on my website shortly...

text versions of the material (Score:5, Informative)

by martin-boundary ( 547041 ) writes: on Thursday August 03, 2006 @01:13AM (#15837527)

For those who don't relish downloading 400MB worth of video (why can't somebody cut out the audio as a standalone MP3?), the material of the talk is also available in text mode.
The official tests of spamfilters were done at last year's TREC conference; you can read the writeup here [uwaterloo.ca] (or pdf overview [uwaterloo.ca]).
You can duplicate those tests yourself if you download the evaluation toolkit (GPL) [uwaterloo.ca]. It's a modular system where you can add a mail corpus (either one of the public TREC ones, or you can make your own trivially), and add a spamfilter package (there are 10 or so to download from the web, or create your own as per documentation).
There's also a video talk [researchchannel.org] given at Microsoft research which should cover pretty much the same ground, if text mode is slashdotted :).
There's a new scheduled test towards the end of the year at TREC 2006.

- Re:text versions of the material (Score:2)
  
  by IMarvinTPA ( 104941 ) writes:
  
  Which link is the write up?
  
  IMarv
Only one question... (Score:2)

by fm6 ( 162816 ) writes:

Is there any filter that doesn't give false positives? I don't mean "almost none", I mean zero. It isn't a matter of "holding out for perfect". Some of us simply can't afford to have a key email discarded as "spam".
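
One way to frame the question: a filter that outputs a score can be tuned so that it would not have flagged any of the legitimate mail seen so far, at the cost of letting more spam through. The sketch below shows that trade-off on made-up scores. The function names are hypothetical, and zero false positives on a sample is not a guarantee about future mail.

```python
def strictest_threshold(ham_scores):
    """Smallest cut-off that flags none of the known-good mail.

    Messages are flagged only when their score is strictly above the cut-off.
    """
    return max(ham_scores)


def evaluate(threshold, ham_scores, spam_scores):
    """Return (false positive rate, fraction of spam caught) at this cut-off."""
    false_positives = sum(score > threshold for score in ham_scores)
    caught = sum(score > threshold for score in spam_scores)
    return false_positives / len(ham_scores), caught / len(spam_scores)
```

With a cut-off chosen this way, anything scoring below it can still be tagged rather than deleted, so a borderline message ends up in a folder instead of disappearing.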
- Re:Only one question... (Score:3, Insightful)
  
  by Jeffrey Baker ( 6191 ) writes:
  
  There is no classification system with zero real risk, except for delivering all mail to the Inbox. Sorry.
  
  If your mail is that important, you should be using couriers instead of email.
  - Human classification is not zero risk (Score:2)
    
    by patio11 ( 857072 ) writes:
    
    How many spam do you get a day? I get hundreds. Half of them are not in my native language (much like half the mail in my inbox), which means it takes more than a split-second glance to figure out what is going on. I'd guess my accuracy in split-second decisions is probably on the order of 95%, which if I were a spam filter would earn me a D-. Paul Graham, who probably has more typical email habits when compared with the average Slashdotter, says he misses about 3 per 2,000. http://www.paulgraham.com/w [paulgraham.com]
  - Re:Only one question... (Score:2)
    
    by Eivind Eklund ( 5161 ) writes:
    
    Delivering all mail to the inbox has a real risk: Human classification error, which AFAIR tend to run at about 0.1%. This is higher than some automated systems.
    Eivind.
- Re:Only one question... (Score:2)
  
  by Cylix ( 55374 ) writes:
  
  Well,
  
  You could have it only filtered completely if its suspect rating is high enough and then otherwise just tag it if the rating is below a certain point.
  
  That said... white lists are your friends.
  
  Funny thing though... someone forwarded me some "funny" e-mail and usually they are not that humorous. I was so damned pleased when it was filtered out.
  
  That said, I haven't moved to deletion just yet. I just tag the mail and sort it later. As soon as I'm sufficiently happy with the system highly suspect mails can
- Re:Only one question... (Score:2)
  
  by cruachan ( 113813 ) writes:
  
  Cloudmark's safetybar product (http://www.cloudmark.com/ - lousy name, SpamNet which it was before was far better) is just about perfect for me. I get an average of about 20 spam emails a day and it has a false positive result of 0% and has had for months. In fact I've been using the product for several years now and I think the last time I saw a false positive was a couple of years back.
  
  On the efficiency side it has a hit rate of nearly 100%. I would have said it was 100% a couple of months back, but ju
Ask Slashdot ... (Score:5, Funny)

by Anonymous Coward writes: on Thursday August 03, 2006 @01:34AM (#15837587)

Dear Slashdot,
At the university where I work, they have recently adopted a pesky policy banning the use of bitTorrent.
What can I do to fix [uwaterloo.ca] this ?
Yours faithfully,
Dr. Gord Cormack

Argh! Gratuitous Video! (Score:2, Insightful)

by abh ( 22332 ) writes:

A 400mb video file? Is this a joke? WTF is everyone thinking that everything on the web needs to be on video all of a sudden. I just blogged about this today: http://www.anotherblogger.com/2006/08/02/please-no-more-gratuitous-videoblogging/ [anotherblogger.com]
Good job I don't filter web content (Score:2, Funny)

by slayer99 ( 15543 ) writes:

"In his study he looked at the major spam filters ( DSPAM, SpamAssasian"
Spam about asian donkeys is a new one on me, though.
MS Anti Spam... (Score:2)

by pookemon ( 909195 ) writes:

I use the built-in Spam filter in Exchange 2k3 set to level 8. All "filtered" e-mails are archived. I get maybe 3 or 4 a day (on a "bad" day) that make it through. Once a week (or more if I can be bothered) I view the archive and send on any that aren't spam (<1%), and those that are spam get junked. I do this using a little tool I wrote that displays the From, To and subject of all these e-mails. If I can't tell from these fields whether the e-mail is a SPAM or not (and it generally is anyway) th
- Re:MS Anti Spam... (Score:2)
  
  by KiloByte ( 825081 ) writes:
  
  Er, what? A false positive rate of 1:100!?!?
  
  Usually, anti-spam solutions which give more than 1:100000 are considered worthless. What you're quoting is beyond words.
  - Re:MS Anti Spam... (Score:2)
    
    by pookemon ( 909195 ) writes:
    
    A false positive rate of 1:100
    
    No, better than 1:100 - that's what <1% means. It's actually around the 1:500
    
    Usually, anti-spam solutions which give more than 1:100000 are considered worthless
    
    Got links, or is that just your opinion?
    - Re:MS Anti Spam... (Score:3, Informative)
      
      by KiloByte ( 825081 ) writes:
      
      A false positive rate of 1:100
      No, better than 1:100 - that's what <1% means. It's actually around the 1:500
      And thus still 200 times worse than the acceptable rate.
      Usually, anti-spam solutions which give more than 1:100000 are considered worthless
      Got links, or is that just your opinion?
      There was a massive flamefest on debian-devel about spam filtering recently, but false positive ratios in that range were something commonly used by most participants in the discussion. I don't have the time to
So what is the previously unheard of spam filter? (Score:2)

by jefp ( 90879 ) writes:

Anyone care to post a link?
No bittorrent... No credibility (Score:5, Insightful)

by bgog ( 564818 ) * writes: on Thursday August 03, 2006 @02:33AM (#15837752) Journal

Why exactly should we give any weight to anything from an organization so ignorant as to disallow BitTorrent? It takes someone pretty darn ignorant to disallow a protocol because some use it to transport illegal content. Why haven't they banned TCP? It is an evil technology used every day to violate copyright.

This guy should spend his time educating the fools at his institution.

Possible Text Version (Score:4, Informative)

by sciop101 ( 583286 ) writes: on Thursday August 03, 2006 @02:35AM (#15837755)

On-line Supervised Spam Filter Evaluation
Gordon Cormack and Thomas Lynam

Full Text, May 29, 2006 - PDF Format

http://plg.uwaterloo.ca/~gvcormac/spamcormack.html / [uwaterloo.ca]

- - Re:Possible Text Version (Score:4, Informative)
    
    by gvc ( 167165 ) writes: on Thursday August 03, 2006 @09:45AM (#15839146)
    
    Bogofilter works great. Or SpamAssassin but only if you force-feed it its own judgements [uwaterloo.ca]. In both cases you have to correct classification errors.
    Fidelis Assis (who has now gone solo after having participated in the CRM114 project) shows great results for his recent solo effort: OSBF-lua [luaforge.net]. Bratko's PPM spam filter [ai.ijs.si] -- the one that did great at TREC -- is not yet packaged as a drop-in filter. Same for my DMC spam filter [www.ceas.cc].
    The actual TREC 2005 tests referred to in TFA are here. [uwaterloo.ca]
    
GMail Spam Filter (Score:5, Interesting)

by foxylad ( 950520 ) writes: on Thursday August 03, 2006 @02:48AM (#15837790) Homepage

I use greylisting (gld to be specific) which works wonderfully. A couple of customers wanted even better filtering...

First I tried DSPAM, but they refused to train it so the results weren't good. Then I tried SpamAssassin, which also let through a surprising amount of spam - a lot more than my personal account on Gmail.

So I set up accounts on Gmail for them, and forwarded their mail to those accounts (after greylisting - don't want to burden Gmail too much!). Gmail lets you set up forwarding, so I simply forwarded all the filtered mail back to a second account on my mailserver for the customer to pick up. Finally I wrote a Python script that logs in to Gmail once a week to prevent the account being closed due to non-use.

A tad involved, but it works like a dream. Yet again Google comes out on top, this time in a market it doesn't even know it's in!

- Re:GMail Spam Filter (Score:2, Interesting)
  
  by sd.fhasldff ( 833645 ) writes:
  
  This is actually something Google could sell. Access to their mail filter. I do realize that they have "corporate email", but that still smacks a lot of GMail and some businesses would rather avoid that. Instead, they could provide a simple access to their spam filter. Yes, requiring all email to be piped through a Google server if they don't want to make the filter available as a binary (presumably updated regularly).
  
  To minimize bandwidth consumption and (partly, at least) allay privacy / corporate secre
So Which One Won? (Score:2, Interesting)

by ryanisflyboy ( 202507 ) writes:

So which one is the "unheard of spam filter?"

Wouldn't it make sense to put this in the /. submission (or at least a link)?

Did I miss the obvious "and the winner is..." someplace?
Cloudmark's SpamNet (Score:3, Interesting)

by cruachan ( 113813 ) writes: on Thursday August 03, 2006 @03:40AM (#15837906)

I have to push this as it usually gets missed from reviews as it's a hybrid P2P solution and not a straightforward filter, but Cloudmark's safetybar product (http://www.cloudmark.com/) is just about perfect for me. I get an average of about 20 spam emails a day and it has a false positive result of 0% and has had for months. In fact I've been using the product for several years now and I think the last time I saw a false positive was a couple of years back.

On the efficiency side it has a hit rate of nearly 100%. I would have said it was 100% a couple of months back, but just recently it's been having a bit of a problem with one stock-pushing spam.

Anyway, that aside it's the best spam filter I've ever seen by a very long way, and I'd highly recommend the service. It costs a few $ a month, but it's probably the best value subscription I have.

I have no connection with the company; I'm just a very satisfied customer who's been using it since the beta some years ago. I have a publicly available email address which I've had for years and which must be on many spam lists. Without Cloudmark it would be unusable; with it, it's no problem at all. I recently installed it for my wife, who was starting to get a lot of spam. There I noticed it took about two weeks to get it trained not to junk a few mailing list emails she was on, but after that it's been just as reliable as my installation.

Best spam filter. (Score:2)

by Viceice ( 462967 ) writes:

IMHO, the criterion for the best spam filter is very simple. It is the filter that is able to consistently maintain the highest spam to false positive ratio.

Feel free to add to it. :D
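
One way to make the "spam to false positive ratio" criterion above concrete is sketched below in Python. The function names and the filter names and counts in the usage note are hypothetical, invented only for illustration; they are not measurements from the study or from anyone in this thread.

```python
import math


def spam_to_false_positive_ratio(spam_caught: int, false_positives: int) -> float:
    """Spam messages caught per legitimate message wrongly flagged."""
    if false_positives == 0:
        # No false positives at all: treat the ratio as unbounded,
        # unless the filter caught nothing either.
        return math.inf if spam_caught > 0 else 0.0
    return spam_caught / false_positives


def rank_filters(results: dict) -> list:
    """Order filter names from best to worst ratio.

    `results` maps a filter name to (spam_caught, false_positives).
    """
    return sorted(
        results,
        key=lambda name: spam_to_false_positive_ratio(*results[name]),
        reverse=True,
    )
```

Called as `rank_filters({"filter_a": (990, 2), "filter_b": (999, 10)})`, this puts the filter with fewer false positives per caught spam first. Note that a filter which flags almost nothing can score well on this ratio alone, so in practice it would probably need to be read alongside the share of spam that gets through.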
Out of Date and Worthless (Score:5, Informative)

by prandal ( 87280 ) writes: on Thursday August 03, 2006 @04:09AM (#15837975)

This paper's a complete waste of time.

He tested spamassassin 2.3 - that's ancient! I'd imagine the other tools are similarly obsolete.

We currently use SA 3.1.4 with a well-trained Bayes database and Razor, Pyzor, and DCC.

Throw in a few custom rules and a selection of rules from http://www.rulesemporium.com/ [rulesemporium.com] and the results are outstanding.

With the new sa-update feature the core rules are updated between point releases, which came in useful this week dealing with the new image spams which seemed to be designed to avoid detection by spamassassin. Thanks Theo.

And the folk on the spamassassin-users mailing list really rock.

- Re:Out of Date and Worthless (Score:4, Informative)
  
  by gvc ( 167165 ) writes: on Thursday August 03, 2006 @09:18AM (#15838974)
  I assume the paper that you are describing is the 2004 study [uwaterloo.ca]. The paper described in the talk (which was given 6 months ago or so) described results of the TREC 2005 Spam Track [uwaterloo.ca] which took place in November 2005. It included a test of SpamAssassin 3.x, not 2.3.
  TREC 2006 [nist.gov] evaluations are now underway [uwaterloo.ca].
  While it is reasonable to conjecture that spam has changed so as to defeat spam filtering techniques, or will change so as to defeat the PPM technique that did well at TREC, the historical evidence does not support this conjecture. In particular:
  
  The spam filters tested in 2004 give pretty well exactly the same performance on 2005 and 2006 data.
  
  New versions of the filters are a little bit better, but not by leaps and bounds, and also get about the same results over the last 2.5 years of data.
  
  There is no evidence that "Bayesian poisoning" is a viable technique for defeating statistical spam filters in anything but a very artificial laboratory environment where the poisoner has access to the recipient's inbox.
  
  The subject of the paper -- and the talk -- is primarily about testing methodology and the need for controlled scientific investigation. So I hesitate to endorse the simplistic notion of a "winner" of the TREC evaluation. However the technique that did very well [ai.ijs.si] was indeed quite novel, so here's a characterization.
  
  Andrej Bratko used PPM -- a well-known data compression technique -- to compress ham and spam separately. Well, actually he didn't compress them but just built the statistical model necessary to compress them. Then he simply (tentatively) added the unknown message to each model and chose the one that compressed it best. The general technique of using compression has been mentioned here and elsewhere, but Bratko used a much stronger compression scheme and was somewhat clever about it.
  I later reproduced Bratko's results using DMC -- a compression scheme that I invented 20 years ago -- and got some interesting results. We have a journal article in press describing it and also an evaluation paper at CEAS 2006 [www.ceas.cc].
  Bratko A., Cormack G. V., Filipic B., Lynam T. R. and Zupan B., Spam Filtering Using Statistical Data Compression Models [uwaterloo.ca]
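
The procedure described above (model ham and spam separately, tentatively add the unknown message to each model, pick the one that compresses it best) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it swaps in zlib's DEFLATE from the standard library for PPM or DMC, which is a much weaker model, and the function names and the tiny example corpora are made up for the sketch rather than taken from Bratko's filter.

```python
import zlib

# Tiny made-up corpora, for illustration only.
HAM = (
    "Meeting moved to 3pm tomorrow, agenda attached. "
    "Please review the quarterly budget report before the call. "
    "Lunch on Friday?"
)
SPAM = (
    "Buy cheap pills now, limited offer! "
    "Hot stock tip: this penny stock will triple. "
    "Click here to claim your free prize today."
)


def compressed_size(text: str) -> int:
    """Number of bytes zlib needs for the text at maximum compression."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def extra_bytes(corpus: str, message: str) -> int:
    """Extra compressed bytes needed when the message is appended to the corpus."""
    return compressed_size(corpus + "\n" + message) - compressed_size(corpus)


def classify(message: str, ham_corpus: str = HAM, spam_corpus: str = SPAM) -> str:
    """Label the message by whichever corpus 'explains' it more cheaply."""
    ham_cost = extra_bytes(ham_corpus, message)
    spam_cost = extra_bytes(spam_corpus, message)
    return "spam" if spam_cost < ham_cost else "ham"


print(classify("Click here to claim your free prize today."))
```

Recompressing the whole corpus for every message is wasteful; an adaptive model such as PPM or DMC keeps its statistics and only has to score the new message. DEFLATE also only looks back 32 KB, so with realistic corpora this sketch would see just the tail of each one.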
It is a war (Score:3, Insightful)

by Alain Williams ( 2972 ) writes: <addw@phcomp.co.uk> on Thursday August 03, 2006 @04:42AM (#15838039) Homepage

Spam is a war between the spammers and the system administrators/spam filters. The spam filters adopt a new technique; the spammers then work around it; the spam filters advance; ...
By the time that I have downloaded the video the war will have moved on a couple of iterations ...

Way to go compression ! (Score:2, Interesting)

by bytesex ( 112972 ) writes:

It looks like another win for compression algorithms. Not only do they maximize entropy in your data while shortening it, they can also be used successfully to earmark pieces of text as being written in a certain language, or written by a certain author, and now they can be used for spam detection. The usefulness just keeps on coming. Colour me impressed.
Paul Vixie on botnets and spam (Score:3, Interesting)

by dodobh ( 65811 ) writes: on Thursday August 03, 2006 @06:57AM (#15838364) Homepage

See here [vix.com]

The key paragraph:

If you'd like a more topical example, consider "spam". People began altering their e-mail "From:" lines in order to make their addresses harder to guess or aggregate; people began doing pattern matching in order to catch known-bad messages and either sideline or reject them. Many defenders used many small tricks to protect their inboxes. The result has not been that less spam is sent or even that less spam is received, on an aggregate basis. Things are worse now than they've ever been. (I say this as co-founder of MAPS LLC, by which I hope to establish my credentials in the spam field for those of you who do not know me.) Today a small number of highly advanced defenders is spam-immune only because they are a small number and their techniques are not widely effective against the attackers; and a small number of highly advanced attackers can "spam at will" a far larger population than ever before. And the trend is that things are getting worse, and getting worse faster than ever before.

Dspam floats my boat (Score:3, Informative)

by Zzeep ( 682115 ) writes: <kenneth@@@vangrinsven...com> on Thursday August 03, 2006 @07:20AM (#15838419) Homepage

I receive (no kidding) around 600 spam mails per day, versus approximately 30 real e-mails. I've been using dspam for over a year now (with very faithful training), and there is maybe 1 false positive every few weeks (less than 1 in 10,000) and every few days a few (usually "new") spam mails get through, which I of course immediately train, so I never see that kind again. So I am very, very positive about dspam. What I do miss, though, is something like a good and reliable service (better than the RBLs I know) that can block SMTP clients on the fly (like DSL home users and such) to reduce the immense load on our mailservers (I work for an ISP) caused by all the spam (that also has to go through a virus scanner, clamav).

What about Greylisting? (Score:2)

by IMarvinTPA ( 104941 ) writes:

Sadly, the way this was done, there is no way to test how well Greylisting [puremagic.com] would have helped.

IMarv
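
For readers who have not met greylisting: the server temporarily rejects the first delivery attempt from an unknown (client IP, sender, recipient) triplet and accepts a retry after a minimum delay, on the theory that real mail servers retry and many spam senders do not. That decision is made during the SMTP conversation, so a corpus of already-delivered mail cannot replay it. A minimal sketch in Python follows; the class name, the 300-second delay and the reply strings are illustrative assumptions, not the behaviour of any particular greylisting daemon.

```python
import time


class Greylist:
    """Toy greylisting policy keyed on (client IP, sender, recipient)."""

    def __init__(self, min_delay: float = 300.0, clock=time.time):
        self.min_delay = min_delay   # seconds a new triplet must wait (assumed value)
        self.clock = clock           # injectable so the policy can be tested
        self.first_seen = {}         # triplet -> time of first attempt

    def check(self, client_ip: str, sender: str, recipient: str) -> str:
        """Return the SMTP reply to give for this delivery attempt."""
        key = (client_ip, sender.lower(), recipient.lower())
        now = self.clock()
        # Remember when this triplet was first seen; keep the old time on retries.
        first = self.first_seen.setdefault(key, now)
        if now - first >= self.min_delay:
            return "250 OK"
        # Temporary failure: a well-behaved sending server will retry later.
        return "451 4.7.1 Greylisted, please try again later"
```

A real deployment would also expire old entries and whitelist known senders; both are left out here to keep the sketch short.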
Slides from the presentation (Score:3, Informative)

by gvc ( 167165 ) writes: on Thursday August 03, 2006 @10:36AM (#15839587)

Here are the slides from the 400MB video presentation. [uwaterloo.ca]

- Re:Torrents (Score:3, Interesting)
  
  by Pantero Blanco ( 792776 ) writes:
  
  I wonder how hard it would be for Slashdot/OSTG to host a tracker for large, article-related files like this. I don't think it would require a lot of funding to run, and it would certainly help with convention presentation videos.
- Re:fuck power went out! (Score:2, Funny)
  
  by lewp ( 95638 ) writes:
  
  I think it's trying to communicate with us...
- Re:I have one word: (Score:3, Informative)
  
  by Jeffrey Baker ( 6191 ) writes:
  
  I hope you also have another word, because the Postini service is incredibly bad. I had it enabled on my account at acm.org, and the Postini system was generating roughly one false positive for every 10 true positives. I disabled the Postini filtering and started using SpamAssassin. Both the false positive and false negative rates are much improved. Among the traffic that Postini was flagging as spam were the Wikipedia article of the day, my daily email from musicbrainz.org, all messages to the BATN mai
- Re:Spam Ass Asian? (Score:2)
  
  by Shaper_pmp ( 825142 ) writes:
  
  Clearly that's the new fork of SpamAssassin that ensures only Vi4gra, penis-enlargement pills and "meet h0T n4k3d t33n s1uts" invitations get through...
- Re:Little known systems will often be most effecti (Score:2)
  
  by MichaelSmith ( 789609 ) writes:
  
  Reminds me of why I like living in Australia - globally speaking we're relatively irrelevant, making us a relatively small target. Hopefully we'll stay relatively irrelevant, lol :p
  
  And if it started getting worse you could move to tassie and get that feeling of irrelevancy back.
- Re:Give grey listing a try... (Score:2)
  
  by nblender ( 741424 ) writes:
  
  Greylisting was predicted to work for only a short time and that's how it worked out. Greylisting works only against zombies that try to send mail directly to your server via port 25. As more and more ISPs get smart and start blocking outbound port 25 from their dynamic pools, greylisting (and relying on rDNS pattern matching to filter for dynamic pools) is becoming less and less effective. I am a mailing list owner for a large free open source operating system project. This project uses greylisting on its m
- Re:Why do they try? (Score:2, Insightful)
  
  by maubp ( 303462 ) writes:
  
  If an end user is trying to block spam, then yes, they are probably not the sort of person likely to buy your product. At least until spam-blocking becomes more mainstream in email clients (e.g. Mozilla Thunderbird).
  
  However, it's very often the end user's ISP doing the spam filtering - and this has no direct bearing on the gullibility of the email recipient.
