Plan for Spam, Version 2

CmdrTaco posted more than 11 years ago | from the bayesian-filtering-for-a-quieter-inbox dept.

Spam 464

bugbear writes "I just posted a new version of the Plan for Spam Bayesian filtering algorithm. The big change is to mark tokens by context. The new version decreases spams missed by 50%, to 2.5 per 1000, even though spam has gotten harder to filter since the summer. I also talk about how spam will evolve, and what to do about it."

464 comments

IN SOVIET RUSSIA (-1)

IN SOVIET RUSSIA (621411) | more than 11 years ago | (#5128075)

Spam Plans You!

Important Stuff: IN SOVIET RUSSIA topics post you! IN SOVIET RUSSIA other people's comments reply to you instead of starting new threads. IN SOVIET RUSSIA other people's messages are read before posting your own to avoid simply duplicating what has already been said you! IN SOVIET RUSSIA a clear subject that describes what your message is about uses you! Offtopic, Inflammatory, Inappropriate, Illegal, or Offensive comments WILL be moderated by dickhead mods. (IN SOVIET RUSSIA everything, even moderated posts, by adjusting your threshold on the User Preferences Page read you!) IN SOVIET RUSSIA replies to your comments sent to you, also consider logging in an account creates you!

Re:IN SOVIET RUSSIA (1, Funny)

Anonymous Coward | more than 11 years ago | (#5128125)

You, sir, are my hero. A true Stalin.

Re:IN SOVIET RUSSIA (-1)

Anonymous Coward | more than 11 years ago | (#5128255)

That was the best first post ever. You deserve some sort of award or badge or something!

More than 1.1 billion pigs are killed worldwide ea (-1, Offtopic)

Amsterdam Vallon (639622) | more than 11 years ago | (#5128090)

More than 1.1 billion pigs are killed worldwide each year. For no reason.

Pork is an unhealthy food source. Most people who eat pork also have access to other, non-meat foods.

Pigs are some of the most intelligent beings on our planet. Why do we kill them by the billions? Just to enjoy the transient pleasure of tasting their flesh?

Re:More than 1.1 billion pigs are killed worldwide (-1, Offtopic)

Medieval (41719) | more than 11 years ago | (#5128159)

Pigs are some of the most intelligent beings on our planet. Why do we kill them by the billions? Just to enjoy the transient pleasure of tasting their flesh?

Damn straight!

Re:More than 1.1 billion pigs are killed worldwide (-1, Offtopic)

EatHam (597465) | more than 11 years ago | (#5128222)

"Dolphins are intelligent and friendly creatures!"

"Yeah. Intelligent and friendly on rye bread with some mayonnaise"

Re:More than 1.1 billion pigs are killed worldwide (-1, Offtopic)

SScorpio (595836) | more than 11 years ago | (#5128176)

What's wrong with eating pig?

Homer: "You're not going to eat any meat, Lisa?"
Lisa: "No"
Homer: "Not pork chops?"
Lisa: "Nope"
Homer: "Or ham?"
Lisa: "No!"
Homer: "Or bacon???"
Lisa: "Dad, those all come from the same animal!"
Homer: "Yes, Lisa, a magical animal from fantasy land!"

Re:More than 1.1 billion pigs are killed worldwide (-1, Offtopic)

Anonymous Coward | more than 11 years ago | (#5128184)

Homer: "Are you saying you're never going to eat any animal again? What about bacon?"
Lisa: "No."
Homer: "Ham?"
Lisa: "No!"
Homer: "Pork chops?"
Lisa: "Dad, those all come from the same animal!"
Homer: "Heh heh heh ... ooh ... yeah ... right, Lisa. A wonderful ... magical animal."

Re:More than 1.1 billion pigs are killed worldwide (5, Funny)

molarmass192 (608071) | more than 11 years ago | (#5128204)

Could Bayesian filtering be applied to filter offtopic posts as well?

1.post, 2. troll, 3. ????, 4.Profit! (-1, Troll)

Anonymous Coward | more than 11 years ago | (#5128274)

  • Slashdot subscription - $10
  • DSL line - $29.99/mo
  • 486sx/16 - $5 at a yardsale
  • FreeBSD - $2.98 @ cheapbytes
  • beating a dead horse - PRICELESS

Re:More than 1.1 billion pigs are killed worldwide (-1, Offtopic)

kaptin (8996) | more than 11 years ago | (#5128223)

Spam actually isn't a Pork product.

Re:More than 1.1 billion pigs are killed worldwide (0)

Anonymous Coward | more than 11 years ago | (#5128359)

Pigs are some of the most intelligent beings on our planet. Why do we kill them by the billions? Just to enjoy the transient pleasure of tasting their flesh?

Pigs would be pretty rare if we never killed them.

Great Stuff! Hope to see more (1)

leerpm (570963) | more than 11 years ago | (#5128092)

Hopefully we see even more stuff like this coming out of the spam conference

Blocking spam should be illegal. (-1, Troll)

Anonymous Coward | more than 11 years ago | (#5128098)

Ads are what keep the Internet running. By blocking spam, you stifle a vital industry.

If you love the Internet you will take all the spam you can get. I just hope laws are made that disallow anti-spam measures.

The government really needs to step in here.

Re:Blocking spam should be illegal. (-1, Troll)

Anonymous Coward | more than 11 years ago | (#5128192)

If you love the Internet you will take all the spam you can get.

You suck. This measure is half hearted. Taking all the spam you can get does not help anyone, let alone the Internet. People need to respond to the spam, as well.

Failing to respond to spam should be illegal.

GREAT! (1)

TerryAtWork (598364) | more than 11 years ago | (#5128103)

I run PopFile at work and it rules!

Please carry on with this Bayesian Spam filtering! It'll be the death of spam yet!

popfile URL (4, Informative)

roalt (534265) | more than 11 years ago | (#5128300)

Popfile can be installed as an intermediary between your mail server and your mail program, and it adds tags to your mail to indicate which 'bucket' each message belongs in.

The url for the project is popfile.sourceforge.net [sourceforge.net]

I haven't tried it yet, but I will try it really soon now!

Re:GREAT! (0)

Anonymous Coward | more than 11 years ago | (#5128323)

I've been using it too. It's great, the best damn spam filter IMO you can get, if you have a POP3 connection.

popfile [sourceforge.net]

How is spam that big of a problem? (-1, Informative)

Anonymous Coward | more than 11 years ago | (#5128106)

Simply use a free account for any registration required sites / internet posting and only check it when necessary to confirm registration.

Use another account for regular everyday things, and make sure it isn't something simple like abc123@hotmail.com. I do that and never get spam to my real accounts. This whole spam thing is way overblown.

Re:How is spam that big of a problem? (3, Insightful)

crawdaddy (344241) | more than 11 years ago | (#5128212)

Overblown? The fact that you would need more than one email account to keep from having your time wasted by spam proves otherwise.

Re:How is spam that big of a problem? (1, Funny)

Anonymous Coward | more than 11 years ago | (#5128319)

Why are you posting as Anonymous Coward? Are you afraid someone will post your email to a few spam lists? :)

I thought that too... (2, Informative)

siskbc (598067) | more than 11 years ago | (#5128350)

...until the email server at work got hacked and someone stole the entire address list. Since then, all of us have been getting spam by the bucketloads. And since I depend on people being able to get my current work address, I can't change it. Thank God for SpamAssassin!

Re:How is spam that big of a problem? (3, Informative)

Anonymous Coward | more than 11 years ago | (#5128352)

It's all fine and dandy to have a spamtrap account if you never plan to read it, but what if you want to get online bank statement notifications or other important notices? I just noticed my friendly credit card company (Capital One) took it upon themselves to introduce my previously spam-free e-mail account to their business partners so they could introduce me to the wonderful world of buying fucking flowers for Valentine's Day. Thanks a lot, assholes. And no, they have NO option to opt out of this fucking crap. The spam is posted from the same address as the statement notifications with a friendly disclaimer saying they're not in any way affiliated. Nice.

base64 encoded emails....or images (1, Interesting)

Anonymous Coward | more than 11 years ago | (#5128108)

I'm using the filters in Moz 1.3 alpha and the Base64 encoded emails are not being recognized and flagged as spam. I've trained and trained and trained.

They almost always get through.

Anyone else experience this?

Also, how can you flag an ad that is an image? Block all HTML email?

I dunno.

Re:base64 encoded emails....or images (-1, Troll)

stratjakt (596332) | more than 11 years ago | (#5128167)

Yes, you block all HTML.

To a geek, the only acceptable email is plaintext with no markups, or even capital letters or punctuation.

That's why this won't 'eliminate' spam. Sure it can keep it out of your corporate intranet, if you expect only plaintext memos, but it won't let grandma know if the picture in her inbox is of one of her grandchildren or this guy. [goatse.cx]

You, sir, are a insensitive clod! (-1, Troll)

Anonymous Coward | more than 11 years ago | (#5128355)

That guy [goatse.cx] is somebody's grandson!

(actually four somebody's grandson, but you get the point!)

Re:base64 encoded emails....or images (5, Interesting)

delta407 (518868) | more than 11 years ago | (#5128231)

Also, how can you flag an ad that is an image?
Razor [sourceforge.net] .

Vipul's Razor marks MIME parts individually, so an ad, a picture of Viagra, or even the "Unsubscribe" button can be marked spam and contribute to the overall score of the message.

Re:base64 encoded emails....or images (1)

IvyMike (178408) | more than 11 years ago | (#5128273)

This doesn't really solve your precise problem, but at least makes some html spam less annoying.

1: Preferences->Privacy&Security->Images->"Do not load remote images in Mail..." should be checked.

2: In Mail, "View->Message Body As->Simple HTML" should be selected, or even "Plaintext".

This won't help you filter the spam, but will prevent web-bugged email from confirming that you are a valid spam target, and makes the spam that does get through be far less annoying.

Re:base64 encoded emails....or images (0)

Anonymous Coward | more than 11 years ago | (#5128291)

I dispose of all base64 encoded emails.

Problem solved.

Re:base64 encoded emails....or images (4, Informative)

GGardner (97375) | more than 11 years ago | (#5128297)

A common thing that spammers do to try to trick filters is use

Content-Type: text/html (or text/plain)
Content-Transfer-Encoding: base64

Because a lot of filters don't know how to decipher this. For me, this makes it a lot easier to filter, though. I get no legitimate e-mail encoded this way, so I just have procmail dump any e-mail encoded this way. Problem solved, and without the CPU burden of decoding or running expensive spam filters.
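For readers who would rather prototype that check outside procmail, here is a minimal Python sketch of the same idea, not the poster's actual recipe. The function name `is_base64_text` is made up for illustration; it only inspects the declared headers and never decodes the body.

```python
from email import message_from_string


def is_base64_text(raw_message: str) -> bool:
    """Return True if any text/* part declares base64 transfer encoding.

    Hypothetical helper: it looks only at the Content-Type and
    Content-Transfer-Encoding headers, as the comment above suggests.
    """
    msg = message_from_string(raw_message)
    for part in msg.walk():  # walk() yields the message itself and any subparts
        encoding = str(part.get("Content-Transfer-Encoding", "")).strip().lower()
        if part.get_content_maintype() == "text" and encoding == "base64":
            return True
    return False
```

Whether dumping every such message is safe depends on your own mail; some legitimate non-English mail is sent this way.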

I'm sorry, but someone has to say it... (2, Funny)

Yoda2 (522522) | more than 11 years ago | (#5128115)

But will it enlarge my penis?

Re:I'm sorry, but someone has to say it... (1)

wackysootroom (243310) | more than 11 years ago | (#5128143)

No, but it will make you a million dollars and attract women effortlessly through the power of irresistible pheromones.

Slow news day? (0, Flamebait)

Koyaanisqatsi (581196) | more than 11 years ago | (#5128117)

I just posted a new version of ...

While I recognize it's a valid project, this type of announcement is more deserving of a frontpage at Sourceforge or Freshmeat. Now if there was a huge breakthrough, we could expect to see it posted here, right?

Archive Version (b/c it's a personal site) (5, Interesting)

Amsterdam Vallon (639622) | more than 11 years ago | (#5128123)

January 2003

(This article was given as a talk at the 2003 Spam Conference. It describes the work I've done to improve the performance of the algorithm described in A Plan for Spam, and what I plan to do in the future.)

The first discovery I'd like to present here is an algorithm for lazy evaluation of research papers. Just write whatever you want and don't cite any previous work, and indignant readers will send you references to all the papers you should have cited. I discovered this algorithm after ``A Plan for Spam'' [1] was on Slashdot.

Spam filtering is a subset of text classification, which is a well established field, but the first papers about Bayesian spam filtering per se seem to have been two given at the same conference in 1998, one by Pantel and Lin [2], and another by a group from Microsoft Research [3].

When I heard about this work I was a bit surprised. If people had been onto Bayesian filtering four years ago, why wasn't everyone using it? When I read the papers I found out why. Pantel and Lin's filter was the more effective of the two, but it only caught 92% of spam, with 1.16% false positives.

When I tried writing a Bayesian spam filter, it caught 99.5% of spam with less than .03% false positives [4]. It's always alarming when two people trying the same experiment get widely divergent results. It's especially alarming here because those two sets of numbers might yield opposite conclusions. Different users have different requirements, but I think for many people a filtering rate of 92% with 1.16% false positives means that filtering is not an acceptable solution, whereas 99.5% with less than .03% false positives means that it is.

So why did we get such different numbers? I haven't tried to reproduce Pantel and Lin's results, but from reading the paper I see five things that probably account for the difference.

One is simply that they trained their filter on very little data: 160 spam and 466 nonspam mails. Filter performance should still be climbing with data sets that small. So their numbers may not even be an accurate measure of the performance of their algorithm, let alone of Bayesian spam filtering in general.

But I think the most important difference is probably that they ignored message headers. To anyone who has worked on spam filters, this will seem a perverse decision. And yet in the very first filters I tried writing, I ignored the headers too. Why? Because I wanted to keep the problem neat. I didn't know much about mail headers then, and they seemed to me full of random stuff. There is a lesson here for filter writers: don't ignore data. You'd think this lesson would be too obvious to mention, but I've had to learn it several times.

Third, Pantel and Lin stemmed the tokens, meaning they reduced e.g. both ``mailing'' and ``mailed'' to the root ``mail''. They may have felt they were forced to do this by the small size of their corpus, but if so this is a kind of premature optimization.

Fourth, they calculated probabilities differently. They used all the tokens, whereas I only use the 15 most significant. If you use all the tokens you'll tend to miss longer spams, the type where someone tells you their life story up to the point where they got rich from some multilevel marketing scheme. And such an algorithm would be easy for spammers to spoof: just add a big chunk of random text to counterbalance the spam terms.

Finally, they didn't bias against false positives. I think any spam filtering algorithm ought to have a convenient knob you can twist to decrease the false positive rate at the expense of the filtering rate. I do this by counting the occurrences of tokens in the nonspam corpus double.
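A rough Python sketch of how such a knob might look. The essay only says that nonspam occurrences are counted double; the ratio-of-frequencies form, the function name, and the parameter names below are assumptions for illustration, and the .01/.99 clamp is the pair of thresholds note [6] attributes to the earlier filter.

```python
def token_spam_probability(good_count, bad_count, n_good, n_bad,
                           good_weight=2, lo=0.01, hi=0.99):
    """Estimate a token's spam probability from per-corpus counts.

    good_weight=2 counts nonspam occurrences double, which biases the
    filter against false positives. Raising it turns the knob further.
    Assumes n_good and n_bad (message counts per corpus) are nonzero.
    """
    good_freq = min(1.0, good_weight * good_count / n_good)
    bad_freq = min(1.0, bad_count / n_bad)
    if good_freq + bad_freq == 0:
        return None  # token never seen in either corpus
    p = bad_freq / (good_freq + bad_freq)
    return max(lo, min(hi, p))  # clamp so no token is treated as certain
```

With equal corpus sizes, a token seen 10 times in each corpus comes out at 1/3 rather than 1/2, which is the bias at work.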

I don't think it's a good idea to treat spam filtering as a straight text classification problem. You can use text classification techniques, but solutions can and should reflect the fact that the text is email, and spam in particular. Email is not just text; it has structure. Spam filtering is not just classification, because false positives are so much worse than false negatives that you should treat them as a different kind of error. And the source of error is not just random variation, but a live human spammer working actively to defeat your filter.

Tokens

Another project I heard about after the Slashdot article was Bill Yerazunis' CRM114 [5]. This is the counterexample to the design principle I just mentioned. It's a straight text classifier, but such a stunningly effective one that it manages to filter spam almost perfectly without even knowing that's what it's doing.

Once I understood how CRM114 worked, it seemed inevitable that I would eventually have to move from filtering based on single words to an approach like this. But first, I thought, I'll see how far I can get with single words. And the answer is, surprisingly far.

Mostly I've been working on smarter tokenization. On current spam, I've been able to achieve filtering rates that approach CRM114's. These techniques are mostly orthogonal to Bill's; an optimal solution might incorporate both.

``A Plan for Spam'' uses a very simple definition of a token. Letters, digits, dashes, apostrophes, and dollar signs are constituent characters, and everything else is a token separator. I also ignored case. Now I have a more complicated definition of a token:

Case is preserved.

Exclamation points are constituent characters.

Periods and commas are constituents if they occur between two digits. This lets me get ip addresses and prices intact.

A price range like $20-25 yields two tokens, $20 and $25.

Tokens that occur within the To, From, Subject, and Return-Path lines, or within urls, get marked accordingly. E.g. ``foo'' in the Subject line becomes ``Subject*foo''. (The asterisk could be any character you don't allow as a constituent.)

Such measures increase the filter's vocabulary, which makes it more discriminating. For example, in the current filter, ``free'' in the Subject line has a spam probability of 98%, whereas the same token in the body has a spam probability of only 65%.

In the Plan for Spam filter, all these tokens would have had the same probability, .7602. That filter recognized about 23,000 tokens. The current one recognizes about 187,000.
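The token rules listed above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the essay's implementation: the regular expressions, the `tokenize` name, and the `context` parameter are invented here, only ASCII letters are handled, and price ranges are recognized only in the simple ``$20-25'' form.

```python
import re

# Constituents: letters, digits, dashes, apostrophes, dollar signs and
# exclamation points, plus a period or comma sitting between two digits.
TOKEN = re.compile(r"(?:[A-Za-z0-9$'!-]|(?<=\d)[.,](?=\d))+")
PRICE_RANGE = re.compile(r"^\$(\d+)-(\d+)$")


def tokenize(text, context=None):
    """Split text into case-preserving tokens, optionally marked by context."""
    tokens = []
    for raw in TOKEN.findall(text):
        match = PRICE_RANGE.match(raw)
        # A price range such as $20-25 yields two tokens, $20 and $25.
        parts = ["$" + match.group(1), "$" + match.group(2)] if match else [raw]
        for part in parts:
            tokens.append(f"{context}*{part}" if context else part)
    return tokens
```

For instance, `tokenize("free", context="Subject")` gives `["Subject*free"]`, and an ip address such as 10.0.0.1 survives as a single token.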

The disadvantage of having a larger universe of tokens is that there is more chance of misses. Spreading your corpus out over more tokens has the same effect as making it smaller. If you consider exclamation points as constituents, for example, then you could end up not having a spam probability for free with seven exclamation points, even though you know that free with just two exclamation points has a probability of 99.99%.

One solution to this is what I call degeneration. If you can't find an exact match for a token, treat it as if it were a less specific version. I consider terminal exclamation points, uppercase letters, and occurring in one of the five marked contexts as making a token more specific. For example, if I don't find a probability for ``Subject*free!'', I look for probabilities for ``Subject*free'', ``free!'', and ``free'', and take whichever one is farthest from .5.

Here are the alternatives [7] considered if the filter sees ``FREE!!!'' in the Subject line and doesn't have a probability for it.

If you do this, be sure to consider versions with initial caps as well as all uppercase and all lowercase. Spams tend to have more sentences in imperative voice, and in those the first word is a verb. So verbs with initial caps have higher spam probabilities than they would in all lowercase. In my filter, the spam probability of ``Act'' is 98% and for ``act'' only 62%.
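One way to sketch degeneration in Python, assuming tokens are marked as ``Context*word'' and probabilities live in a plain dict. The function names are hypothetical, and the choice to try only "all the exclamation points, one, or none" is an assumption rather than something the text spells out.

```python
def degenerations(token):
    """List less specific versions of a token, excluding the token itself."""
    context, sep, word = token.rpartition("*")
    stem = word.rstrip("!")
    bangs = len(word) - len(stem)

    bang_options = []
    for n in (bangs, 1, 0):  # all terminal exclamation points, one, none
        if n <= bangs and n not in bang_options:
            bang_options.append(n)

    case_options = [stem]
    if stem.isupper() and len(stem) > 1:
        case_options.append(stem.capitalize())  # initial-caps version
    if stem.lower() not in case_options:
        case_options.append(stem.lower())

    prefixes = [context + "*", ""] if sep else [""]
    result = []
    for prefix in prefixes:
        for n in bang_options:
            for variant in case_options:
                candidate = prefix + variant + "!" * n
                if candidate != token and candidate not in result:
                    result.append(candidate)
    return result


def lookup_probability(token, probs):
    """Exact match if known, else the degenerate form farthest from .5."""
    if token in probs:
        return probs[token]
    known = [probs[c] for c in degenerations(token) if c in probs]
    if not known:
        return None  # caller decides how to treat a completely unknown token
    return max(known, key=lambda p: abs(p - 0.5))
```

For ``Subject*free!'' this yields exactly the three fallbacks named in the text: ``Subject*free'', ``free!'', and ``free''.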

If you increase your filter's vocabulary, you can end up counting the same word multiple times, according to your old definition of ``same''. Logically, they're not the same token anymore. But if this still bothers you, let me add from experience that the words you seem to be counting multiple times tend to be exactly the ones you'd want to.

Another effect of a larger vocabulary is that when you look at an incoming mail you find more interesting tokens, meaning those with probabilities far from .5. I use the 15 most interesting to decide if mail is spam. But you can run into a problem when you use a fixed number like this. If you find a lot of maximally interesting tokens, the result can end up being decided by whatever random factor determines the ordering of equally interesting tokens. One way to deal with this is to treat some as more interesting than others.

For example, the token ``dalco'' occurs 3 times in my spam corpus and never in my legitimate corpus. The token ``Url*optmails'' (meaning ``optmails'' within a url) occurs 1223 times. And yet, as I used to calculate probabilities for tokens, both would have the same spam probability, the threshold of .99.

That doesn't feel right. There are theoretical arguments for giving these two tokens substantially different probabilities (Pantel and Lin do), but I haven't tried that yet. It does seem at least that if we find more than 15 tokens that only occur in one corpus or the other, we ought to give priority to the ones that occur a lot. So now there are two threshold values. For tokens that occur only in the spam corpus, the probability is .9999 if they occur more than 10 times and .9998 otherwise. Ditto at the other end of the scale for tokens found only in the legitimate corpus.

I may later scale token probabilities substantially, but this tiny amount of scaling at least ensures that tokens get sorted the right way.
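The two-threshold scheme and the "15 most interesting" selection might be sketched like this. The function and parameter names are illustrative assumptions; the numbers (.9999, .9998, more than 10 occurrences, 15 tokens) are the ones given in the text.

```python
def single_corpus_probability(count, spam_only, often=10,
                              strong=0.9999, weak=0.9998):
    """Probability for a token seen in only one corpus.

    Tokens that occur more than `often` times get the stronger value, so
    they sort ahead of rare tokens when both are maximally interesting.
    """
    p = strong if count > often else weak
    return p if spam_only else 1.0 - p  # mirror for legitimate-only tokens


def most_interesting(probabilities, n=15):
    """Pick the n probabilities farthest from the neutral value .5."""
    return sorted(probabilities, key=lambda p: abs(p - 0.5), reverse=True)[:n]
```

With these values, ``Url*optmails'' (1223 occurrences) gets .9999 and ``dalco'' (3 occurrences) gets .9998, so the former wins a place among the 15 when there are too many candidates.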

Another possibility would be to consider not just 15 tokens, but all the tokens over a certain threshold of interestingness. Steven Hauser does this in his statistical spam filter [8]. If you use a threshold, make it very high, or spammers could spoof you by packing messages with more innocent words.

Finally, what should one do about html? I've tried the whole spectrum of options, from ignoring it to parsing it all. Ignoring html is a bad idea, because it's full of useful spam signs. But if you parse it all, your filter might degenerate into a mere html recognizer. The most effective approach seems to be the middle course, to notice some tokens but not others. I look at a, img, and font tags, and ignore the rest. Links and images you should certainly look at, because they contain urls.

I could probably be smarter about dealing with html, but I don't think it's worth putting a lot of time into this. Spams full of html are easy to filter. The smarter spammers already avoid it. So performance in the future should not depend much on how you deal with html.

Performance

Between December 10 2002 and January 10 2003 I got about 1750 spams. Of these, 4 got through. That's a filtering rate of about 99.75%.

Two of the four spams I missed got through because they happened to use words that occur often in my legitimate email.

The third was one of those that exploit an insecure cgi script to send mail to third parties. They're hard to filter based just on the content because the headers are innocent and they're careful about the words they use. Even so I can usually catch them. This one squeaked by with a probability of .88, just under the threshold of .9.

Of course, looking at multiple token sequences would catch it easily. ``Below is the result of your feedback form'' is an instant giveaway.

The fourth spam was what I call a spam-of-the-future, because this is what I expect spam to evolve into: some completely neutral text followed by a url. In this case it was from someone saying they had finally finished their homepage and would I go look at it. (The page was of course an ad for a porn site.)

If the spammers are careful about the headers and use a fresh url, there is nothing in spam-of-the-future for filters to notice. We can of course counter by sending a crawler to look at the page. But that might not be necessary. The response rate for spam-of-the-future must be low, or everyone would be doing it. If it's low enough, it won't pay for spammers to send it, and we won't have to work too hard on filtering it.

Now for the really shocking news: during that same one-month period I got three false positives.

In a way it's a relief to get some false positives. When I wrote ``A Plan for Spam'' I hadn't had any, and I didn't know what they'd be like. Now that I've had a few, I'm relieved to find they're not as bad as I feared. False positives yielded by statistical filters turn out to be mails that sound a lot like spam, and these tend to be the ones you would least mind missing [9].

Two of the false positives were newsletters from companies I've bought things from. I never asked to receive them, so arguably they were spams, but I count them as false positives because I hadn't been deleting them as spams before. The reason the filters caught them was that both companies in January switched to commercial email senders instead of sending the mails from their own servers, and both the headers and the bodies became much spammier.

The third false positive was a bad one, though. It was from someone in Egypt and written in all uppercase. This was a direct result of making tokens case sensitive; the Plan for Spam filter wouldn't have caught it.

It's hard to say what the overall false positive rate is, because we're up in the noise, statistically. Anyone who has worked on filters (at least, effective filters) will be aware of this problem. With some emails it's hard to say whether they're spam or not, and these are the ones you end up looking at when you get filters really tight. For example, so far the filter has caught two emails that were sent to my address because of a typo, and one sent to me in the belief that I was someone else. Arguably, these are neither my spam nor my nonspam mail.

Another false positive was from a vice president at Virtumundo. I wrote to them pretending to be a customer, and since the reply came back through Virtumundo's mail servers it had the most incriminating headers imaginable. Arguably this isn't a real false positive either, but a sort of Heisenberg uncertainty effect: I only got it because I was writing about spam filtering.

Not counting these, I've had a total of five false positives so far, out of about 7740 legitimate emails, a rate of .06%. The other two were a notice that something I bought was back-ordered, and a party reminder from Evite.
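The rates quoted in this section are simple ratios. A quick check in Python, using only the counts given in the text (the variable names are illustrative):

```python
missed_spams, total_spams = 4, 1750          # Dec 10 2002 to Jan 10 2003
false_positives, legitimate_mails = 5, 7740  # running totals so far

filtering_rate = 1 - missed_spams / total_spams
false_positive_rate = false_positives / legitimate_mails

print(f"filtering rate:      {filtering_rate:.2%}")       # 99.77%
print(f"false positive rate: {false_positive_rate:.2%}")  # 0.06%
```

The filtering rate works out to roughly 99.77%, which the text rounds to "about 99.75%".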

I don't think this number can be trusted, partly because the sample is so small, and partly because I think I can fix the filter not to catch some of these.

False positives seem to me a different kind of error from false negatives. Filtering rate is a measure of performance. False positives I consider more like bugs. I approach improving the filtering rate as optimization, and decreasing false positives as debugging.

So these five false positives are my bug list. For example, the mail from Egypt got nailed because the uppercase text made it look to the filter like a Nigerian spam. This really is kind of a bug. As with html, the email being all uppercase is really conceptually one feature, not one for each word. I need to handle case in a more sophisticated way.

So what to make of this .06%? Not much, I think. You could treat it as an upper bound, bearing in mind the small sample size. But at this stage it is more a measure of the bugs in my implementation than some intrinsic false positive rate of Bayesian filtering.

Future

What next? Filtering is an optimization problem, and the key to optimization is profiling. Don't try to guess where your code is slow, because you'll guess wrong. Look at where your code is slow, and fix that. In filtering, this translates to: look at the spams you miss, and figure out what you could have done to catch them.

For example, spammers are now working aggressively to evade filters, and one of the things they're doing is breaking up and misspelling words to prevent filters from recognizing them. But working on this is not my first priority, because I still have no trouble catching these spams [10].

There are two kinds of spams I currently do have trouble with. One is the type that pretends to be an email from a woman inviting you to go chat with her or see her profile on a dating site. These get through because they're the one type of sales pitch you can make without using sales talk. They use the same vocabulary as ordinary email.

The other kind of spams I have trouble filtering are those from companies in e.g. Bulgaria offering contract programming services. These get through because I'm a programmer too, and the spams are full of the same words as my real mail.

I'll probably focus on the personal ad type first. I think if I look closer I'll be able to find statistical differences between these and my real mail. The style of writing is certainly different, though it may take multiword filtering to catch that. Also, I notice they tend to repeat the url, and someone including a url in a legitimate mail wouldn't do that [11].

The outsourcing type are going to be hard to catch. Even if you sent a crawler to the site, you wouldn't find a smoking statistical gun. Maybe the only answer is a central list of domains advertised in spams [12]. But there can't be that many of this type of mail. If the only spams left were unsolicited offers of contract programming services from Bulgaria, we could all probably move on to working on something else.

Will statistical filtering actually get us to that point? I don't know. Right now, for me personally, spam is not a problem. But spammers haven't yet made a serious effort to spoof statistical filters. What will happen when they do?

I'm not optimistic about filters that work at the network level [13]. When there is a static obstacle worth getting past, spammers are pretty efficient at getting past it. There is already a company called Assurance Systems that will run your mail through SpamAssassin and tell you whether it will get filtered out.

Network-level filters won't be completely useless. They may be enough to kill all the "opt-in" spam, meaning spam from companies like Virtumundo and Equalamail who claim that they're really running opt-in lists. You can filter those based just on the headers, no matter what they say in the body. But anyone willing to falsify headers or use open relays, presumably including most porn spammers, should be able to get some message past network-level filters if they want to. (By no means the message they'd like to send though, which is something.)

The kind of filters I'm optimistic about are ones that calculate probabilities based on each individual user's mail. These can be much more effective, not only in avoiding false positives, but in filtering too: for example, finding the recipient's email address base-64 encoded anywhere in a message is a very good spam indicator.
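As an illustration of that last indicator, here is a hedged Python sketch that looks for the recipient's address inside base-64-looking runs of text. It is one possible approach, not the essay's: the function name and the minimum run length of 8 are assumptions, the address is assumed to be ASCII, and the four offsets are tried because an encoded string can start at any position within a longer run (for example when it is glued onto a url path).

```python
import base64
import re

# Runs of characters from the standard base-64 alphabet (padding excluded).
B64_RUN = re.compile(r"[A-Za-z0-9+/]{8,}")


def contains_encoded_address(message_text, address):
    """Return True if `address` appears base-64 encoded in the message."""
    target = address.lower().encode("ascii")
    for run in B64_RUN.findall(message_text):
        for offset in range(4):  # try each possible 4-character alignment
            chunk = run[offset:]
            if len(chunk) % 4 == 1:
                chunk = chunk[:-1]            # a lone trailing char carries no byte
            chunk += "=" * (-len(chunk) % 4)  # restore padding if it was cut off
            try:
                decoded = base64.b64decode(chunk)
            except ValueError:
                continue
            if target in decoded.lower():
                return True
    return False
```

In a per-user filter the result could simply be fed in as one more token, so that its weight is learned from that user's own mail.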

But the real advantage of individual filters is that they'll all be different. If everyone's filters have different probabilities, it will make the spammers' optimization loop, what programmers would call their edit-compile-test cycle, appallingly slow. Instead of just tweaking a spam till it gets through a copy of some filter they have on their desktop, they'll have to do a test mailing for each tweak. It would be like programming in a language without an interactive toplevel, and I wouldn't wish that on anyone.

Notes

[1] Paul Graham. ``A Plan for Spam.'' August 2002. http://paulgraham.com/spam.html.

Probabilities in this algorithm are calculated using a degenerate case of Bayes' Rule. There are two simplifying assumptions: that the probabilities of features (i.e. words) are independent, and that we know nothing about the prior probability of an email being spam.

The first assumption is widespread in text classification. Algorithms that use it are called ``naive Bayesian.''

The second assumption I made because the proportion of spam in my incoming mail fluctuated so much from day to day (indeed, from hour to hour) that the overall prior ratio seemed worthless as a predictor. If you assume that P(spam) and P(nonspam) are both .5, they cancel out and you can remove them from the formula.

If you were doing Bayesian filtering in a situation where the ratio of spam to nonspam was consistently very high or (especially) very low, you could probably improve filter performance by incorporating prior probabilities. To do this right you'd have to track ratios by time of day, because spam and legitimate mail volume both have distinct daily patterns.
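To make the role of the prior concrete, here is a minimal sketch of the combining step. With `prior_spam` left at .5 it reduces to the prior-free formula, prod(p) / (prod(p) + prod(1 - p)). The function name and the log-odds formulation are choices made for this example. It assumes each per-token probability was itself computed under equal priors and lies strictly between 0 and 1.

```python
import math


def combined_spam_probability(token_probs, prior_spam=0.5):
    """Combine per-token spam probabilities under the naive Bayes assumption.

    Each p in token_probs is P(spam | token) computed with equal priors,
    so p / (1 - p) is that token's likelihood ratio. Posterior odds are
    the prior odds times the product of the likelihood ratios.
    """
    log_odds = math.log(prior_spam) - math.log(1.0 - prior_spam)
    for p in token_probs:
        log_odds += math.log(p) - math.log(1.0 - p)
    # Convert log-odds back to a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))
```

For example, token probabilities of .99, .99 and .01 combine to .99 with the default prior, and to a higher value if `prior_spam` is raised to reflect mail that is mostly spam.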

[2] Patrick Pantel and Dekang Lin. ``SpamCop-- A Spam Classification & Organization Program.'' Proceedings of AAAI-98 Workshop on Learning for Text Categorization.

[3] Mehran Sahami, Susan Dumais, David Heckerman and Eric Horvitz. ``A Bayesian Approach to Filtering Junk E-Mail.'' Proceedings of AAAI-98 Workshop on Learning for Text Categorization.

[4] At the time I had zero false positives out of about 4,000 legitimate emails. If the next legitimate email was a false positive, this would give us .03%. These false positive rates are untrustworthy, as I explain later. I quote a number here only to emphasize that whatever the false positive rate is, it is less than 1.16%.

[5] Bill Yerazunis. ``Sparse Binary Polynomial Hash Message Filtering and The CRM114 Discriminator.'' Proceedings of 2003 Spam Conference.

[6] In ``A Plan for Spam'' I used thresholds of .99 and .01. It seems justifiable to use thresholds proportionate to the size of the corpora. Since I now have on the order of 10,000 of each type of mail, I use .9999 and .0001.

[7] There is a flaw here I should probably fix. Currently, when ``Subject*foo'' degenerates to just ``foo'', what that means is you're getting the stats for occurrences of ``foo'' in the body or header lines other than those I mark. What I should do is keep track of statistics for ``foo'' overall as well as specific versions, and degenerate from ``Subject*foo'' not to ``foo'' but to ``Anywhere*foo''. Ditto for case: I should degenerate from uppercase to any-case, not lowercase.

It would probably be a win to do this with prices too, e.g. to degenerate from ``$129.99'' to ``$--9.99'', ``$--.99'', and ``$--''.

You could also degenerate from words to their stems, but this would probably only improve filtering rates early on when you had small corpora.
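The fallback order proposed in this note could be sketched roughly as follows. The ``Anywhere*'' prefix and the price patterns come from the note itself; the function names, the ordering of the chain, and the default of .4 for unseen tokens are assumptions of this example. Case degeneration is left out, since the note does not say how an any-case token would be written.

```python
import re

# Matches simple dollar prices such as "$129.99".
PRICE = re.compile(r"^\$(\d+)\.(\d\d)$")


def degenerate_price(word):
    """'$129.99' -> ['$--9.99', '$--.99', '$--']; non-prices give []."""
    m = PRICE.match(word)
    if not m:
        return []
    dollars, cents = m.groups()
    return [f"$--{dollars[-1]}.{cents}", f"$--.{cents}", "$--"]


def fallback_chain(word, context=None):
    """List lookup keys from most specific to least specific.

    `context` is a marked header name such as "Subject", or None for
    tokens found elsewhere. The ordering is an assumption of this sketch.
    """
    words = [word] + degenerate_price(word)
    if context is not None:
        chain = [f"{context}*{w}" for w in words]
    else:
        chain = list(words)
    # Fall back to overall statistics rather than to the unmarked token.
    chain += [f"Anywhere*{w}" for w in words]
    return chain


def lookup(word, context, probs, default=0.4):
    """Return the probability of the first key in the chain that is known."""
    for key in fallback_chain(word, context):
        if key in probs:
            return probs[key]
    return default  # hypothetical value for tokens never seen before
```

With this layout the filter would need to maintain the ``Anywhere*'' counts alongside the context-specific ones during training.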

[8] Steven Hauser. ``Statistical Spam Filter Works for Me.'' http://www.sofbot.com.

[9] False positives are not all equal, and we should remember this when comparing techniques for stopping spam. Whereas many of the false positives caused by filters will be near-spams that you wouldn't mind missing, false positives caused by blacklists, for example, will be just mail from people who chose the wrong ISP. In both cases you catch mail that's near spam, but for blacklists nearness is physical, and for filters it's textual.

In fairness, it should be added that the new generation of responsible blacklists, like the SBL, cause far fewer false positives than earlier blacklists like the MAPS RBL, for whom causing large numbers of false positives was a deliberate technique to get the attention of ISPs.

[10] If spammers get good enough at obscuring tokens for this to be a problem, we can respond by simply removing whitespace, periods, commas, etc. and using a dictionary to pick the words out of the resulting sequence. And of course finding words this way that weren't visible in the original text would in itself be evidence of spam.

Picking out the words won't be trivial. It will require more than just reconstructing word boundaries; spammers both add (``xHot nPorn cSite'') and omit (``P#rn'') letters. Vision research may be useful here, since human vision is the limit that such tricks will approach.
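The simpler half of this idea, undoing inserted separators, might look something like the sketch below. It strips everything but letters and then reports dictionary words that appear in the squeezed text but were not visible as ordinary words in the original. The function name, the `min_len` cutoff, and the brute-force substring scan are assumptions of this example. It does nothing about added or substituted letters, and it can produce accidental hits where two innocent words happen to run together.

```python
import re


def hidden_words(text, dictionary, min_len=4):
    """Return dictionary words that only appear once separators are removed."""
    lowered = text.lower()
    # Words a normal tokenizer would already have seen.
    visible = set(re.findall(r"[a-z]+", lowered))
    # Drop whitespace, periods, commas and every other non-letter.
    squeezed = re.sub(r"[^a-z]", "", lowered)
    max_len = max((len(w) for w in dictionary), default=0)
    found = set()
    for start in range(len(squeezed)):
        longest = min(len(squeezed), start + max_len)
        for end in range(start + min_len, longest + 1):
            piece = squeezed[start:end]
            if piece in dictionary and piece not in visible:
                found.add(piece)
    return found
```

Any word returned this way could be fed to the filter as a specially marked token, so that the act of hiding it counts as evidence in its own right.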

[11] In general, spams are more repetitive than regular email. They want to pound that message home. I currently don't allow duplicates in the top 15 tokens, because you could get a false positive if the sender happens to use some bad word multiple times. (In my current filter, ``dick'' has a spam probability of .9999, but it's also a name.) It seems we should at least notice duplication though, so I may try allowing up to two of each token, as Brian Burton does in SpamProbe.
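A small sketch of the selection step with a cap on duplicates follows. Tokens are ranked by how far their probability lies from a neutral .5, and each distinct token may contribute at most `max_dupes` entries. The function name, the parameter names, and the .4 default for unknown tokens are assumptions of this example.

```python
from collections import Counter


def most_interesting(tokens, probs, n=15, max_dupes=2, unknown=0.4):
    """Pick the n most interesting (token, probability) pairs.

    A token occurring several times in the message is counted at most
    max_dupes times; max_dupes=1 reproduces the no-duplicates rule.
    """
    candidates = []
    for token, count in Counter(tokens).items():
        p = probs.get(token, unknown)
        candidates.extend([(token, p)] * min(count, max_dupes))
    # Most interesting means farthest from the neutral value of .5.
    candidates.sort(key=lambda pair: abs(pair[1] - 0.5), reverse=True)
    return candidates[:n]
```

Setting `max_dupes=2` lets a repeated word weigh more heavily than a single occurrence without letting one word fill the whole list.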

[12] This is what approaches like Brightmail's will degenerate into once spammers are pushed into using mad-lib techniques to generate everything else in the message.

[13] It's sometimes argued that we should be working on filtering at the network level, because it is more efficient. What people usually mean when they say this is: we currently filter at the network level, and we don't want to start over from scratch. But you can't dictate the problem to fit your solution.

Historically, scarce-resource arguments have been the losing side in debates about software design. People only tend to use them to justify choices (inaction in particular) made for other reasons.

Thanks to Sarah Harlin, Trevor Blackwell, and Dan Giffin for reading drafts of this paper, and to Dan again for most of the infrastructure that this filter runs on.

Interesting Read (1)

rczyzewski (585306) | more than 11 years ago | (#5128126)

I'll be curious how spammers counteract this. Probably just send more and more to those who aren't filtered. I never thought of filtering all combinations of capitalization.
My users were complaining about spam again today. I walked over to discuss it with them and lo and behold, all stuff they signed up for except 2 klez emails.

Problem you say? (3, Funny)

termos (634980) | more than 11 years ago | (#5128130)

rm -fr ~/Mail
would do the trick.

Re:Problem you say? (-1, Troll)

Anonymous Coward | more than 11 years ago | (#5128177)

s/termos/fucking wanker.

Stop spam? (5, Interesting)

slykens (85844) | more than 11 years ago | (#5128135)

Filtering is nice, I've been using SpamAssassin with reasonable results for the last few months. It has nearly no false positives but has recently been missing more. Perhaps I should update.

Anyway, I've said a few times the only way to effectively stop spam is to make it more expensive to the companies having it done. Filtering, blocking ports, refusing mail from RBL'd hosts all help, but spam will not stop until it is fully against the law and people bring legal action to stop it.

Even people who are supposed to be clueful don't get it. I got spammed to buy EZ-Pass for the PA Turnpike. I sent a nastygram to the state DoT. The keyboard monkey responded that I should look closely at the email, that I signed up to receive it. If I had a dollar for every site that claimed I signed up with them I would be rich. What an idiot.

Re:Stop spam? (2, Informative)

Mournblade (72705) | more than 11 years ago | (#5128253)

Just curious - did you follow up w/ him to see *why* he thought you signed up to receive the spam? Is it possible that you inadvertently allowed them to send you spam the last time you renewed your driver's license? I ask because most of the spams I get say "you signed up with one of our partner sites" and I've always wanted to (but have been too lazy to) go back and see how far up the chain I could get.

Re:Stop spam? (1)

slykens (85844) | more than 11 years ago | (#5128304)

No, it was clearly spam. The email said it was from "playgolfnow.com" or some crap like that. It *wasn't* from the state.

His response was that I must have signed up for it as the email said so, and we all know that everything on the Internet is true. ;)

Re:Stop spam? (2, Insightful)

Anonymous Coward | more than 11 years ago | (#5128378)

You write to a state office, get a completely clueless reply and now ask for a legislative solution? You're quite the optimist, aren't you?

IN SOVIET RUSSIA (-1, Troll)

Anonymous Coward | more than 11 years ago | (#5128137)

BSD kills YOU!

Why can't we have legal restrictions on spam? (5, Interesting)

GGardner (97375) | more than 11 years ago | (#5128138)

Conventional wisdom seems to say that we can't outlaw spam. I don't understand why this is. My state has a do not call list. Since signing up for it, I have gotten zero phone solicitations, down from 2 or 3 a day. It is illegal to make a phone solicitation to a cell phone, and also, I get zero phone spams on my cell phone.

Some states, like California, have anti-spam laws, but curiously, they only cover spam sent from California to California. My state's telephone do-not-call list covers all calls to my number, no matter where they originate.

Now, I understand that there would be problems with international spam, but stopping domestic spam would be a huge boon to everyone. It seems like this legislation would be wildly popular, and easy to pass.

Re:Why can't we have legal restrictions on spam? (0)

Anonymous Coward | more than 11 years ago | (#5128200)

That list doesn't take effect until 4/1/03, jackass. Nice try, though, you obviously don't understand the situation at hand.

Re:Why can't we have legal restrictions on spam? (1)

pbrammer (526214) | more than 11 years ago | (#5128333)

What are you talking about? He said his state. The FCC one isn't even approved yet. Now who's the j.a.?

Re:Why can't we have legal restrictions on spam? (0)

Anonymous Coward | more than 11 years ago | (#5128276)

The probation board had mercy on me- now I just need to keep my numbers up.

thx

Because it's free. (2, Insightful)

Presto_slashdot (573879) | more than 11 years ago | (#5128288)

You probably get no spam to your home or cell phone because it's too expensive to set up a company in China and make phone calls to the US, just to get around the laws. Unfortunately, it *is* basically free to send spam mail. If they could call you for free from outside the US, they would be doing that too.

Re:Why can't we have legal restrictions on spam? (1, Insightful)

cheezedawg (413482) | more than 11 years ago | (#5128312)

Because the last thing we need in this country is the government telling us how and when we can send email or make a phone call.

Re:Why can't we have legal restrictions on spam? (0)

Anonymous Coward | more than 11 years ago | (#5128317)

I agree completely. People pay for bandwidth, especially with broadband connections, so it should be illegal to spam broadband connections, just as it is illegal to telemarket to cell numbers. The reason is that you are causing the person receiving the ad to pay for it. That is the same reasoning they used to exempt cell phones.

Re:Why can't we have legal restrictions on spam? (0)

dracken (453199) | more than 11 years ago | (#5128358)

...I understand that there would be problems with international spam, but stopping domestic spam would be a huge boon to everyone...

Unfortunately, that is not a solution. It would take the slimy spammer worms one microsecond to evolve.
1 - Rent a computer and a T1 line (online ?) in XXX country
2 - Telnet/Run X desktop/Run XP remote desktop
3 - Click mouse and send spam
The only way to prevent this actively is by penalizing ISPs for open relays and prosecuting spammers based on their physical location. The best way to fight it passively is to use SpamAssassin or Razor and to let the companies know that at any spam, users would abandon their products en masse.

AOL or Hotmail adopt? (3, Interesting)

twemperor (626154) | more than 11 years ago | (#5128140)

I really like this analytic approach. I've been using Hotmail's spam filtering, which merely removes e-mails from addresses not in my address book. While this is effective most of the time and very easy to implement, there does seem to be a major problem with false positives, i.e. when I give my e-mail address to someone who's not in my address book.

Does anyone think AOL or Hotmail could start using such a system as the one outlined in the article?

Re:AOL or Hotmail adopt? (1)

b0r1s (170449) | more than 11 years ago | (#5128292)

Hotmail has more advanced filtering, as does yahoo, you just have to be aware of it to notice.

Specifically, if your address does not appear in the To: or Cc: fields, Hotmail will assume it is "Junk" by default, until you tell it otherwise. This stops spammers from creating a single alias on their server, and mailing only to that alias (a common way to get around the other blocks, which target emails with hundreds of recipients).

Yahoo does similar things, I've never used AOL so I don't know what they do.

For those who skipped the article. (5, Funny)

iamwoodyjones (562550) | more than 11 years ago | (#5128144)

The plan on what to do about it involves several different options. The first being, "Tar and feather those who spam"

Spam needs a global solution (1, Troll)

PhysicsGenius (565228) | more than 11 years ago | (#5128145)

Fighting spam as individuals isn't going to work, there are always going to be ways to get around filters and such. I think the problem needs a mathematical treatment at a global level and I would like to suggest a basis for that treatment.

First of all, let's realize that email is communication is data transmission. Spam is noise. This immediately brings to mind Claude Shannon's work on information and entropy. He made it very clear that noise can be reduced to a level that is O(log n) that of the information transmitted. This means that as we have more and more email out there, we are going to get more and more noise, unless we change something.

Let's go back to the definition of information. Basically, it's stuff that nobody knows about. If it is surprising to you, it is information (in non-technical language). That suggests that perhaps the information content (and therefore spam) could be reduced if, instead of secretively emailing our friends individually, we CC'd them on all our missives. This would make the amount of information lower (since people would be less surprised by our further revelations, having seen the foregoing matter) and therefore spam might even be eliminated.

Re:Spam needs a global solution (2, Interesting)

notsoanonymouscoward (102492) | more than 11 years ago | (#5128262)

This makes no sense to me... spam to me is primarily 1) friends sending stories, jokes, quizzes, etc... or 2) someone trying to sell you something. now if we all cc'd everyone on everything, we'd have even more spam by my 1st definition of spam, and it wouldn't affect the 2nd definition at all. how is this supposed to help?

Re:Spam needs a global solution (Global Solution) (5, Informative)

minas-beede (561803) | more than 11 years ago | (#5128318)

OK, signal and noise. What if the signal was all in one frequency band and the noise all in another. Problem separating them? No.

What if, in effect, a similar distinction held for spam in the transmission channel - that spam by itself selected a pathway to the recipient that was never used by the signal? Block that pathway and the spam never gets through.

Spam doesn't select a pathway but spammers do. If you could block relay spam at the open relays it would be dead. You can't, of course - the open relays are controlled by people who don't know the need to block spam. You know that, I know that. If you can't change the people then change the open relays (from the spammers' points of view.) Set up a system that looks like an open relay and stop the spam. An open relay honeypot.

I asked an operator of such a honeypot how he did last year:

> How did 2002 end?

From March 7 to December 26 2002, the total was:

235,624,232

Using one Pentium 90 he stopped spam to 235 million recipients. Think about that number when you see filter people reporting what they stop just for their own domains. This was spam to recipients all over, not simply to the honeypot operator's domain: he operates at the relay level. He stopped 100% of the spam, no deception deceived him, no tuning was needed, no valid email was caught - it is perfect filtering. Perfect filtering - who else has that?

And you can do it at home on your DSL or cable connection (the guy above uses sendmail -bd, but Windows users have a program they can use):

http://jackpot.uk.net/

Yeah, I know, spammers are switching to open proxies. So, write an open proxy honeypot. That, too, will be 100% efficient. In addition you now are giving spammers reason to fear every open relay and every open proxy they detect. FEAR. The SPAMMERS have to scramble. They have to scramble and they have to show everything they do to overcome the technique - there is no stealth way to look for open relays and open proxies.

The problem is solved, it is a matter of implementation and of getting active systems everywhere in the net space (so there's no safe IP space for the spammers anywhere.)

Remember: A single Pentium 90, 235 million spam messages stopped in 10 months.

Fisiks Jenius == TH3 5UCK! (-1, Offtopic)

Anonymous Coward | more than 11 years ago | (#5128377)

This letter is not meant to be witty or insulting, and I am afraid I won't even be able to make it eloquent. But I will do the best I can to call for proper disciplinary action against Mr. Physics Jenius Genius and his peons. For practical reasons, I have to confine my discussion to areas that have received insufficient public attention or in which I have something new to say. It seems that no one else is telling you that it makes far more sense to draw an accurate portrait of his ideological alignment than to play fast and loose with the truth. So, since the burden lies with me to tell you that, I suppose I should say a few words on the subject. To begin with, Physics Jenius appears to have found a new tool to use to help him condition the public to accept violence as normal and desirable. That tool is McCarthyism, and if you watch him wield it, you'll indisputably see why I have a New Year's resolution for him: He should pick up a book before he jumps to the infantile conclusion that he has a "special" perspective on revanchism which carries with it a "special" right to ruin people's lives. In order to shoo Physics Jenius away like the annoying bug that he is, we must compile readers' remarks and suggestions and use them to establish a supportive -- rather than an intimidating -- atmosphere for offering public comment. And that's just the first step. Remember, Physics Jenius likes to pull the levers of defeatism and oil the gears of exclusionism. Such activity can flourish only in the dark, however. If you drag it into the open, Physics Jenius and his stooges will run for cover, like cockroaches in a dirty kitchen when the light is turned on suddenly during the night. That's why we must examine the warp and woof of Physics Jenius's undertakings. A more fundamental problem is that he thinks we want him to advocate his harangues amid a hue and cry as prurient as it is chauvinistic. 
Excuse me, but maybe after hearing about his morally questionable attempts to make things worse, I was saddened. I was saddened that he has lowered himself to this level.

If I recall correctly, if Physics Jenius is allowed to distort the facts, the implications can be widespread. Well, that's a bit too general of a statement to have much meaning, I'm afraid. So let me instead explain my point as follows: It's easy for us to shake our heads at his foolishness and cowardice. It's easy for us to exclaim that we should advocate concrete action and specific quantifiable goals. It's easy for us to say, "My observations are perhaps unique." The point is that it's easy for us to say these things, because that fact is simply inescapable to any thinking man or woman. "Thinking" is the key word in the previous sentence. Given the range and unpredictability of human behavior, it is quite possible that Physics Jenius uses the word "pathologicopsychological" without ever having taken the time to look it up in the dictionary. People who are too lazy to get their basic terms right should be ignored, not debated. Are his prognoses good for the country? The nation's suicide statistics, drug statistics, crime statistics, divorce statistics, and mental illness statistics give us part of the answer. These statistics should make it clear that Physics Jenius should not make me the target of a constant, consistent, systematic, sustained campaign of attacks. Not now, not ever.

He really struck a nerve with me when he said that the most valuable skill one can have is to be able to lie convincingly. That lie is a painful reminder that each rung on the ladder of anarchism is a crisis of some kind. Each crisis supplies an excuse for Physics Jenius to fuel inquisitions. That is the standard process by which subhuman, incomprehensible vigilantes undermine the individualistic underpinnings of traditional jurisprudence. Unsettling as that is, the more infuriating fact is that he has stated that he is known for his sound judgment, unerring foresight, and sagacious adaptation of means to ends. That's just pure credentialism. Well, in Physics Jenius's case, it might be pure ignorance, seeing that if Physics Jenius is victorious in his quest to trample over the very freedoms and rights that he claims to support, then his crown will be the funeral wreath of humanity. In the end, we have to ask, "Why can't we simply agree to disagree?" I mean, if you don't think that the consequences of his wretched double standards, particularly from a moral point of view, are not favorable, then think again.

Taking that notion one step further, we can see that we need to look beyond the most immediate and visible problems with Physics Jenius. We need to look at what is behind these problems and understand that if the only way to hinder the power of temperamental warmongers like Physics Jenius is for me to throw in the towel, then so be it. It would indeed be worth it, because we are observing the change in our society's philosophy and values from freedom and justice to corruption, decay, cynicism, and injustice. All of these "values" are artistically incorporated in one person: Physics Jenius Genius. Physics Jenius is a lifelong member of the Church of Power-drunk Priggism. That concept can be extended, mutatis mutandis, to the way that he is trying to brainwash us. He wants us to believe that it's politically incorrect to exert a positive influence on the type of world that people will live in a thousand years from now; that's boring; that's not cool. You know what I think of that, don't you? I think that I and Physics Jenius part company when it comes to the issue of egotism. He feels that his tactics are our final line of defense against tyranny, while I maintain that at no time in the past did militant individuals shamble through the streets of cities, demanding rights they imagine some supernatural power has bestowed upon them. Although it requires risk, commitment, and follow-through to build an inclusive, nondiscriminatory movement for social and political change, someone has to be willing to investigate Physics Jenius's homicidal principles, ideals, and objectives. Even if it's not polite to do so. Even if it hurts a lot of people's feelings. Even if everyone else is pretending that Physics Jenius can achieve his goals by friendly and moral conduct.

The underlying message is that I challenge him to point out any text in this letter that proposes that clericalism can quell the hatred and disorder in our society. It isn't there. There's neither a hint nor a suggestion of such a thing. Physics Jenius's stories about animalism are particularly ridden with errors and distortions, even leaving aside the concept's initial implausibility. The confusion that Physics Jenius creates is desirable and convenient to our national enemies. Or, to express that sentiment without all of the emotionally charged lingo, if I may be so bold, Physics Jenius's sophistries are designed to retard the free and natural economic development of various countries' indigenous population. And they're working; they're having the desired effect. As a matter of fact, if I didn't think Physics Jenius would alter laws, language, and customs in the service of regulating social relations, I wouldn't say that prudence is no vice. Cowardice -- especially his reprehensible form of it -- is. I recently heard him tell a bunch of people that going through the motions of working is the same as working. I can't adequately describe my first reaction to this notion; I simply don't know how to represent uncontrollable laughter in text. My goal for this letter was to listen to others. Know that I have done my best while trying always to remove the misunderstanding that Mr. Physics Jenius Genius has created in the minds of myriad people throughout the world. Let an honest history judge.

Nice idea but. (1)

Fnagaton (580019) | more than 11 years ago | (#5128146)

As long as it stops all the emails I get from Ubi Lumjobo trying to get me to accept $21.5m from South Africa then I'll be happy. :) Or the people that try to make my breasts larger... Or viagra...

hopeless (-1, Informative)

tps12 (105590) | more than 11 years ago | (#5128148)

I'm sort of impressed to see people still plugging away at the Bayesian spam filter problem. It's admirable to see that kind of perseverance in programmers.

For those coming late to the story, Joel Spolsky demonstrated in his well known column [joelonsoftware.com] recently that Bayesian filtering of spam is an intractable problem. Until we have quantum computers, we're stuck with black lists, which work pretty well anyway.

But keep plugging away guys. Who knows, maybe Joel's wrong.

Re:hopeless (5, Insightful)

Kallahar (227430) | more than 11 years ago | (#5128242)

Yeah, 2.5 per 1000 getting through is a proof that his ideas are obviously flawed. Having a working system is the best proof that an idea works :)

Travis

Re:hopeless (0)

Anonymous Coward | more than 11 years ago | (#5128249)

So you'd rather have 1000 out of 1000 pieces of junk mail over 2.5 out of 1000. Way to keep things in perspective! (Actually, what you were doing is trying to show off what little intellect you have.)

Re:hopeless (4, Insightful)

ajs (35943) | more than 11 years ago | (#5128263)

Everyone but the folks at SpamAssassin have been focusing on the idea that any one technique for identifying spam is doomed to diminishing returns.

Over at SpamAssassin, they've been busily creating a system that collects "good enough" tests by the dozens and uses them to collectively score a message and determine its general "spamishness". The system relies on a complex scoring system that is determined, not by the whim of human programmers, but on the results of a genetic training system that pits one set of scores against another until equilibrium is reached for a given set of example spam and non-spam.

See my other post here for how Bayesian filtering will be used to allow this system to feed back on itself and improve as it sees more of your spam and non-spam....
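For readers who haven't seen this style of filter, the idea of many weak tests feeding one score can be sketched as follows. The rule names, tests, and weights here are made up for illustration and are not SpamAssassin's actual rules or API. In the real system the weights come from the training process described above rather than being picked by hand, and the message layout (a dict with "subject", "body" and "from" keys) is likewise just an assumption of this example.

```python
import re

# (name, test, weight) triples. All of these are hypothetical examples.
RULES = [
    ("SUBJECT_ALL_CAPS", lambda m: m["subject"].isupper(), 1.5),
    ("MENTIONS_FREE",
     lambda m: re.search(r"\bfree\b", m["body"], re.I) is not None, 1.0),
    ("MANY_EXCLAMATIONS", lambda m: m["body"].count("!") >= 3, 2.0),
    # Negative weights let a test count as evidence of legitimate mail.
    ("KNOWN_SENDER", lambda m: m["from"] in m.get("address_book", ()), -3.0),
]


def score(message, rules=RULES):
    """Return (total score, names of the rules that fired)."""
    hits = [(name, weight) for name, test, weight in rules if test(message)]
    total = sum(weight for _, weight in hits)
    return total, [name for name, _ in hits]


def is_spam(message, threshold=5.0, rules=RULES):
    """No single rule decides; only the summed score is compared."""
    total, _ = score(message, rules)
    return total >= threshold
```

A Bayesian test would slot in as just one more rule whose weight depends on the probability it computes for the message.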

Re:hopeless (1)

mjh (57755) | more than 11 years ago | (#5128295)

Until we have quantum computers, we're stuck with black lists, which work pretty well anyway.

... or software managed whitelists [tmda.net] . This software assumes that everyone is blacklisted until they can prove otherwise. This system will work until spammers start using real, working return mailboxes. At which point, 99% of the battle will have been won.

Re:hopeless (1)

rabidcow (209019) | more than 11 years ago | (#5128336)

For those coming late to the story, Joel Spolsky demonstrated in his well known column [joelonsoftware.com] recently that Bayesian filtering of spam is an intractable problem.

Where? There's no mention of Bayesian anything on that page. The closest thing I can see is "Bad Spam Filters," which is about a different kind of filter.

Re:hopeless (1)

blakestah (91866) | more than 11 years ago | (#5128362)

I think you are wrong.

Bayesian filtering merely uses a statistically optimized method to duplicate the classification of the user for which it is working. If trained on enough of YOUR email, it will work exceedingly well in classifying YOUR future email.

Put another way, I tried blacklisting filtering, and fixed token filtering, and performance was pretty poor. In contrast, I am quite happy with bogofilter's performance. But, of the various methods, only Bayesian filtering takes the preferences of the individual user as its primary basis for sorting email.

BTW, your link is pretty much useless in showing why Spolsky may or may not think Bayesian methods are intractable. He more or less just rants that draconian MTA based filters are doing harm - I agree with him. But the word Bayesian doesn't even appear on the page to which you linked. And that makes you a Troll.

Re:hopeless (1)

Malcontent (40834) | more than 11 years ago | (#5128372)

"Who knows, maybe Joel's wrong."

Has he ever been wrong before? I have read a few of his writings and I wasn't impressed. It's not like he is some sort of a deity or something.

I am curious as to why you would cite him as some sort of an infallible source.

The Irony 'n stuff (1)

ackthpt (218170) | more than 11 years ago | (#5128151)

how spam will evolve

The irony is, Spam evolves, yet people still fall for spam. If it didn't work, we'd have seen the last of it years ago.

I've downloaded MailWasher and have just started looking through it (so I don't know what it uses for filtering.) I've noticed a lot of the recent junk is html with the ploy spelled among comment tags, i.e.:

<-- Job Offer --> Biggie <-- for web --> your <-- designer --> doohickey

Are any of the filters able to handle these?
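One way a filter could cope with this, sketched under some assumptions: delete the comments before tokenizing, without even treating them as word separators, so that the split-up word is glued back together. The function name is made up for this example, and the pattern accepts both the standard `<!-- ... -->` form and the `<-- ... -->` shorthand used in the example above.

```python
import re

# Non-greedy match so that each comment is removed separately.
# DOTALL lets a comment span several lines.
COMMENT = re.compile(r"<!?--.*?-->", re.DOTALL)


def strip_comments(html):
    """Remove HTML comments without inserting a separator in their place."""
    return COMMENT.sub("", html)
```

Run on a word broken up by comments, this gives back the word a human reader would see, which can then be scored like any other token.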

Lastly, has anyone ever bothered to combat spam with spam? I.e. send out a letter explaining what people are likely to get, aside from their credit card charged out to a pr0n site, sugar pills, photocopies of something you can find in any library, identity theft, etc. ?

Spam and AI (5, Funny)

cybermace5 (446439) | more than 11 years ago | (#5128154)

And the conflict rages on. The better filters we use, the sneakier the spam artists get. Now we're developing self-modifying algorithms to detect and kill spam, and I'm sure the spammers are developing self-modifying algorithms to craft filter-tricking spam.

How long before the back-and-forth of spam filters and spam crafters becomes self-aware? It's got to happen. Eventually the spam filters will become a skeptic consciousness that *feels* its way through spam and spots the phoneys, and the spam crafters will become a persuasive consciousness that tries to think and write as a close friend or relative.

better than legislation (5, Interesting)

Rojo^ (78973) | more than 11 years ago | (#5128160)

This is a wonderful tool that is being developed. However, I don't think any one tool will succeed in eliminating spam. From a spammer's point of view, if my income depends on messages making it through filters, by damn I will bypass those filters by whatever means I can. These assholes send penis enlargement advertisements to my mother -- If her gender doesn't stop them, neither will an email filter.

On a different subject, in a story about a week ago, someone posted a link to a peer-peer network of spam emails for MS Outlook available at http://www.cloudmark.com that will trap a significant amount of emails based on (and this is overly simplified, of course) users' votes. Does such a solution exist in the open source world?

Re:better than legislation (0)

Anonymous Coward | more than 11 years ago | (#5128365)

>> These assholes send penis enlargement advertisements to my mother

The referral business from her is worth billions. If anyone's seen a lot of cock, it's your mother.

Spam of the Future! (1, Informative)

zulux (112259) | more than 11 years ago | (#5128165)

The really scary part of the article is about what he called "Spam of the Future". It's really interesting. Basically, it is a spam message that has a lot of seemingly normal text that won't get caught in the spam filter, because it IS normal text. It's then followed by a link, usually to a porn site.

Here is your opt-in FREE! porn! [goatse.cx]

Re:Spam of the Future! (0)

Anonymous Coward | more than 11 years ago | (#5128282)

The spam of the future will be sales pitches for flying cars.

Spamassassin and ENDING spam.... (5, Informative)

ajs (35943) | more than 11 years ago | (#5128179)

The latest development version of SpamAssassin has an interesting application of Bayesian filtering. Basically, it takes all of SA's existing heuristics, uses them to develop a sense of what is and is not spam, and then pumps the results through a Bayesian filter that learns from these messages.

As with any other SA test, no single element of the chain is trusted enough to definitively call something spam, but if a message would have squeaked through before, this new filter can put the final nail in its coffin through word analysis against previous spam.

So, why did I use a subject about "ENDING spam"? Because one of the tools that spammers have is SA itself. They can use it to score their messages and determine how "spamish" they are. The problem for them now is that each SA installation will have subtly different scoring, and the message may be "ok" according to the spammer's version, but my version has a better sense of the mail that *I* get.

SpamAssassin is definitely a tool worth checking out if you have not already. Install it in daemon mode (spamd) and then use "spamc -f" in your procmailrc or the equiv for your MTA.

Very nice tool, and a real time-saver for me.

Help the spammers. No, really. (1, Interesting)

Anonymous Coward | more than 11 years ago | (#5128189)

I have just set up a system which parses spam email, locates any Web addresses, strips out the parameters, and then visits the Web site. Just think if we ALL did this. So rather than the poor spammer only getting a .001% hit rate, they get an astounding 100% hit rate. So 1 million emails sent, 1 million instant Web page hits. And it is not like they can complain about this, after all they are ASKING for the hits.

Even better is that my domain gets multiple spams from the same company.
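
The parsing half of that idea can be sketched in a few lines of Python. This is only an illustration, not the poster's system: the regular expression, the function name `extract_bare_urls`, and the sample message are all assumptions, and the step that actually visits each site is deliberately left out.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Rough pattern for http(s) links in message text (an assumption; real
# spam often hides links in HTML attributes or encoded bodies).
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def extract_bare_urls(message_text):
    """Return the unique URLs in a message with query string and fragment removed."""
    bare = []
    for raw in URL_PATTERN.findall(message_text):
        # Drop trailing punctuation that belongs to the sentence, not the URL.
        parts = urlsplit(raw.rstrip(".,);"))
        # Keep scheme, host and path; discard the parameters.
        stripped = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
        if stripped not in bare:
            bare.append(stripped)
    return bare


sample = ("Click http://example.com/offer?id=12345&u=you now, "
          "or see http://example.com/offer?id=999.")
print(extract_bare_urls(sample))
```

Stripping the query string matters because the parameters may well carry a per-recipient identifier, so the bare URL is what such a system would request.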

Re:Help the spammers. No, really. (1)

qengho (54305) | more than 11 years ago | (#5128368)

> I have just set up a system which parses spam email, locates any Web addresses, strips out the parameters, and then visits the Web site. Just think if we ALL did this.

Heh. I use a program called SpamFire [matterform.com] to filter my mail. No Bayesian stuff yet, but it has a Revenge menu with an item called Bug the Web Bugs. This scans your spam mail for web bugs, then opens a page in your browser that sends either random garbage or your choice of message, every two seconds, to the server specified by the web bug.

Performance (1)

FrostedWheat (172733) | more than 11 years ago | (#5128194)

2.5 per 1000

So it misses 2 or 3 of every 1000 spam messages. My worry would be how many non-spam messages it catches?

I'd hate it to tag any personal mail as spam :)

Re:Performance (1)

avi33 (116048) | more than 11 years ago | (#5128279)

zero, in theory.

I don't remember the details of the original rules he used to implement this, but it was something along the lines of "it's acceptable to have a couple pieces of spam slip through, but completely unacceptable for non-spam to be blocked." He goes on to describe how a set of flags can be used to tag suspected spam and ask whether it's real or not, essentially building a personalized set of rules based on user preferences.

Re:Performance (2, Insightful)

ergo98 (9391) | more than 11 years ago | (#5128330)

And this is the clincher in any of these spam filters: If the filter automatically deletes messages that it identifies as spam (which could be a legitimate business proposal or job offer, for example) then a false positive would be incredibly destructive. If it doesn't automatically delete but instead you periodically go through all of the messages, then it's of little value as you're forced to manually filter the spams anyway. The irony is that the better it is at identifying spams, the more destructive a false positive would be as you casually scan through and delete large clusters of supposed spam.

Personally I think the author of the paper is a bit idealistic when he says "If we can write software that recognizes their messages, there is no way they can get around that". Then again, maybe he isn't: "if we can...recognize their messages" is a pretty broad presumption, and of course the conclusion follows from it. The real question is "can we realistically make software that identifies spam with zero false positives?" For people who email with one or two other people on one subject that isn't a problem, but I suspect that statistical word-usage analysis wouldn't be quite as successful for someone with more disparate mail usage.

Bayesian filtering (4, Interesting)

blakestah (91866) | more than 11 years ago | (#5128220)

The basics are, you take all good mails, and create a database of words used in them. Make a different database for spam mails. Then, for each incoming mail, compare to each database, and classify as spam or non-spam.

The algorithm starts out conservative, i.e., you get most of the mail classified as good. For each "good" email that is spam, you manually re-classify it.

Then, after a few weeks, the filter does all the work. It is basically using word-databases to compare emails and classify them the way you, the user, would. Periodically you will receive another spam email, then you re-classify it, and never see an email like it again (in your inbox).

Bogofilter and CRM114 are among the more successful efforts so far, but there are many. And they are FAR more successful than blacklist/whitelist/fixed token comparison filters. But Bayesian filtering is just a near optimal way to replicate the classification of the user, which is also why it works so well.
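
A minimal sketch of the two-database idea, in Python. This is a plain naive Bayes classifier with add-one smoothing, chosen for brevity; it is not the formula from the Plan for Spam article, nor what Bogofilter or CRM114 actually implement, and the class name `WordFilter`, the tokenizer, and the training messages are all made up for illustration.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"[a-z0-9$'-]+")


def tokenize(text):
    """Lower-case the text and split it into crude word tokens."""
    return TOKEN.findall(text.lower())


class WordFilter:
    """Two word databases (good and spam) plus a naive Bayes comparison."""

    def __init__(self):
        self.counts = {"good": Counter(), "spam": Counter()}
        self.messages = {"good": 0, "spam": 0}

    def train(self, text, label):
        # Count each word once per message, in the database for its class.
        self.counts[label].update(set(tokenize(text)))
        self.messages[label] += 1

    def spam_probability(self, text):
        # Work in log-odds; add-one smoothing keeps unseen words finite.
        log_odds = math.log((self.messages["spam"] + 1) / (self.messages["good"] + 1))
        for word in set(tokenize(text)):
            p_spam = (self.counts["spam"][word] + 1) / (self.messages["spam"] + 2)
            p_good = (self.counts["good"][word] + 1) / (self.messages["good"] + 2)
            log_odds += math.log(p_spam / p_good)
        # Clamp so the exponential below cannot overflow on long messages.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))


spam_filter = WordFilter()
spam_filter.train("cheap pills buy now", "spam")
spam_filter.train("buy cheap watches now", "spam")
spam_filter.train("lunch meeting moved to noon", "good")
spam_filter.train("notes from the meeting attached", "good")

print(spam_filter.spam_probability("buy cheap pills"))   # well above 0.5
print(spam_filter.spam_probability("meeting at noon"))   # well below 0.5
```

Re-classifying a spam that slipped through is then just another `spam_filter.train(text, "spam")` call, which is why the filter drifts toward the user's own judgment over time.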

Spam only cost-ineffective with ISP-level filters (5, Insightful)

PseudoThink (576121) | more than 11 years ago | (#5128235)

Spam filters are great, but it seems that only the Net-savvy are using them. Savvy users aren't the people spammers are making all their money from--they are making money off the naive and inexperienced users. These users aren't going to go out and install the latest Bayesian filters on their system, and the major email readers won't (and probably shouldn't) come with them automatically activated.

To make spam cost-ineffective for the spammers, we've got to stop it (or flag it) before it gets to the end-user. It would obviously be a mistake to allow ISPs to automatically delete all email that fails their spam filters, but I think it would be appropriate for them to include something in the headers flagging such email as probable spam. Then future email readers could detect this header and handle it gracefully, like moving it to a "spam" folder on the user's machine. Once this happens and Grandpa no longer gets email asking him to test the latest Viagra alternative, spam may become a thing of the past.
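
On the mail-reader side, handling such a header could be as small as the Python sketch below. The header name `X-Spam-Flag`, its `YES` value, and the folder names are assumptions for illustration; the post does not name a header, and a real ISP could choose any convention.

```python
from email import message_from_string

# Assumed header name and value; purely illustrative.
SPAM_HEADER = "X-Spam-Flag"


def choose_folder(raw_message):
    """Return 'spam' if the upstream filter flagged the message, else 'inbox'."""
    msg = message_from_string(raw_message)
    flag = (msg.get(SPAM_HEADER) or "").strip().lower()
    return "spam" if flag == "yes" else "inbox"


flagged = "From: a@example.com\nSubject: Hi\nX-Spam-Flag: YES\n\nBuy now\n"
clean = "From: b@example.com\nSubject: Lunch\n\nNoon?\n"

print(choose_folder(flagged))
print(choose_folder(clean))
```

Nothing is deleted at the ISP in this arrangement; the flag only decides which folder the user sees the message in, so a false positive is still recoverable.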

filtering effectiveness (5, Insightful)

qoncept (599709) | more than 11 years ago | (#5128239)

I think I speak for everyone when I say false positives are the only real hindrance to the filtering of spam. I get roughly 20 emails a day, 75% of which are spam. If one of them slips past the filter and I see it, it doesn't bother me so much. Spam is no longer a problem. What is an absolute necessity, though, (and probably less so for me than other people) is that none of my legitimate email is filtered as spam. I'd rather have 100 spams filtered improperly than one legit email.
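
One hedged way to put a number on that preference: suppose wrongly filtering a legitimate message costs 100 times as much as letting one spam through (an assumption read off the last sentence, not something the poster spelled out). Filtering a message with spam probability p then has expected cost (1 - p) * 100, delivering it has expected cost p * 1, and filtering only pays off when p exceeds 100 / 101. The function name below is hypothetical.

```python
def spam_threshold(cost_false_positive, cost_missed_spam=1.0):
    """Smallest spam probability at which filtering is the cheaper choice.

    Filtering costs (1 - p) * cost_false_positive on average, delivering
    costs p * cost_missed_spam; the two are equal at the value returned.
    """
    return cost_false_positive / (cost_false_positive + cost_missed_spam)


# 100 missed spams are treated as being as bad as one lost legitimate email.
print(round(spam_threshold(100), 4))
```

With equal costs the threshold would be 0.5; the stronger the aversion to false positives, the closer to certainty the filter has to be before it acts.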

Actually - (4, Interesting)

sean.peters (568334) | more than 11 years ago | (#5128343)

You don't speak for everyone. On the contrary, I think that most people realize that e-mail delivery isn't guaranteed - and therefore they expect that truly vital messages will need to be backed up with a phone call or some other means, to be sure the message was delivered.

I would prefer to lose one or two legitimate mails in return for a virtually zero rate of missed detections.

Sean

Obligatory plug for TMDA (4, Informative)

Silas (35023) | more than 11 years ago | (#5128252)

I'm really excited about all of the neat stuff happening with Bayesian filtering and related technologies, but I just wanted to put in a plug for TMDA [tmda.net], the Tagged Message Delivery Agent, which uses a whitelist-centric strategy. Since I began using it, the amount of spam I have to look at is virtually zero. If you haven't read about it yet, check it out.

the master of spam. (0)

bigbinc (605471) | more than 11 years ago | (#5128275)

Paul Graham wrote a great book, ANSI Common Lisp. If there is one person who knows spam, it is this guy.

Content filtering means the spammers have won (0)

Anonymous Coward | more than 11 years ago | (#5128286)

The victims are expending considerable amounts of individual CPU time to classify mail which they must read (albeit mechanically).

Rejecting mail from IP addresses known to send spam (or teergrubing to tie up spammer resources) puts the burden back on the spammers, where it belongs.

Absent effective out-of-band defenses (such as the courts and the legal system), wasting money on filtering is a foolish effort just to benefit a few innocent sources who choose to share an IP address with spammers. And if they pay money to a spam-tolerant ISP, are they really innocent?

Standard Spam API (2, Insightful)

Anonymous Coward | more than 11 years ago | (#5128290)

I have been quite excited by all the new ideas being put to use in fighting spam recently. Unfortunately, whenever I find one that is implemented, it doesn't work with my mail server or my client. It seems like there should be a standard API that spam filters could implement (using SOAP or XML-RPC or something), so that the various mail servers and email clients could use a single plug-in to add spam filtering. This would allow the people who are good at spam filter code to focus on that one problem, and the people who are good at writing email plugins and GUI code can do what they are good at.

If you want to stop spam, tax email (1)

jlowery (47102) | more than 11 years ago | (#5128301)

It doesn't have to be much, just 1/8 cent per email or so. That's all it would take.

now THIS is a true geek (4, Funny)

Anonymous Coward | more than 11 years ago | (#5128305)

>Based on my corpus, "sex" indicates a .97 probability of the containing email being a spam...

Spoken like a true geek.

My Spam Algorithm: (1, Offtopic)

eforhan (631605) | more than 11 years ago | (#5128310)

Spicy SPAMBURGER

Ingredients:
1 (12-ounce) can SPAM® luncheon meat, cut into 4 slices
1 green pepper, cut into thin strips
1 small onion, thinly sliced
½ cup MIRACLE WHIP or MIRACLE WHIP LIGHT Salad Dressing
½ teaspoon ground red pepper
4 hamburger buns, split
Lettuce and tomato slices (optional)

Instructions:
Cook SPAM®, green peppers and onions in large skillet 5 minutes, stirring occasionally. Mix MIRACLE WHIP and red pepper. Spread evenly on hamburger buns. Place SPAM®, peppers and onions on bottom halves of buns. If desired, top with lettuce and tomato and cover with top halves of buns. Makes 4 sandwiches.

***
I lied. I hate spam of any kind. Bravo Anti-spammers.

-Eric

Treating the symptoms, not the problem... (2, Interesting)

Anonvmous Coward (589068) | more than 11 years ago | (#5128340)

I hope you all realize that at best you're buying time, not solving the Spam problem. It won't take long for these guys to find ways through the filter.

The problems need to be solved on a different level. The problem is not the messages themselves, it's that people are allowed to send these messages to anybody they want without any real challenges as to their authenticity.

Let me explain how I have things set up right now, and hopefully my stance on this issue will be a little clearer. All my messages come into the same mailbox. I have a bunch of email aliases, though. If I sign up for Slashdot, for example, then I create a new alias like 'slashdot@insertdomainnamehere.com'. I then add that email address into my 'email allowed' list so that it gets funneled through into a visible folder. If that address gets abused, I shut down the email alias.

My personal friends are treated a little differently. Once they email me, I add their address into my list of friends, and they get put into a friends folder. I treat this differently than a registration place because my friends all need one address to contact me at, I don't mind them sharing it with each other. If my address changes, then their messages still get through.

I plan on going further down the road. I'm going to give people an email address, and when they email it they get an automated message with instructions on how to 'request permission' to send me email. When permission is granted, they don't get that message anymore. It basically means that the only messages that get through to me are the ones that have a human behind them to read the response and then go through the proper channels to reach me.

I'm not claiming to have done anything new here. I'm basically mimicking the way IM works, and I'm doing it without having to do anything real fancy. Outlook's Rules Wizard is doing quite a bit of the work here. But since people actually have to take the time to request my authorization, it means that it's a message meant for ME as opposed to a message meant for anybody who's out there. With an approach like this, it'd be a lot harder for spammers to get through.
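
The routing rules described above fit in a short Python sketch. The poster does this with Outlook's Rules Wizard, not code; the class name `MailPolicy`, the folder labels, and the example addresses are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MailPolicy:
    """Alias, friends-list and permission rules as described in the post."""
    active_aliases: set = field(default_factory=set)
    retired_aliases: set = field(default_factory=set)
    friends: set = field(default_factory=set)
    approved: set = field(default_factory=set)

    def retire(self, alias):
        # An abused alias is shut down and stops accepting mail.
        self.active_aliases.discard(alias)
        self.retired_aliases.add(alias)

    def route(self, sender, recipient):
        sender, recipient = sender.lower(), recipient.lower()
        if recipient in self.retired_aliases:
            return "reject"
        if recipient in self.active_aliases:
            return "alias:" + recipient.split("@")[0]
        if sender in self.friends:
            return "friends"
        if sender in self.approved:
            return "inbox"
        # Unknown senders get the automated 'request permission' reply.
        return "challenge"


policy = MailPolicy(
    active_aliases={"slashdot@example.com"},
    friends={"pal@example.org"},
)
print(policy.route("news@slashdot.org", "slashdot@example.com"))
print(policy.route("pal@example.org", "me@example.com"))
print(policy.route("stranger@example.net", "me@example.com"))
```

Granting permission would simply add the sender to `approved`, after which their mail lands in the inbox without another challenge.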

Difference with MacOS X 10.2's Mail.app? (2, Informative)

tbmaddux (145207) | more than 11 years ago | (#5128346)

This is all quite interesting from a technical standpoint, but what can I gain as a user of Mail.app in MacOS X 10.2 (Jaguar) from this? My Junk filter catches spam and tosses it into a separate folder. I occasionally go through it and send the spam off to SpamCop. What I like about Mail.app is that it's easy to keep training by marking as Junk (for spam it failed to identify) or Not Junk (for occasional false positives). It seems to work well and doesn't require a lot of interaction from me except for interacting with SpamCop (my choice).

It doesn't catch all the spam, and it occasionally has a false positive. This will be true of any spam filter we implement, because spam continues to change. SpamAssassin runs on some of the mailservers I connect to, but it tends to perform worse than Mail.app. So until we can get each user's spam filter customized at the server, spam identification is going to have to stay client-based. It sounds like Paul Graham's tools are getting a little more efficient, but does any of this make a big difference for the end user?

This is not a new idea... (0)

Anonymous Coward | more than 11 years ago | (#5128347)

http://lsa.colorado.edu/papers/dp1.LSAintro.pdf

My favorite Bayesian token (1)

daves (23318) | more than 11 years ago | (#5128371)

... is "0D". Some HTML editor out there, apparently only used by spammers, encodes its output with an ASCII "0D" at the end of each line. These spams get the highest scores I've seen.