More on Bayesian Spam Filtering

michael posted more than 11 years ago | from the snake-eyes dept.

Spam 251

michaeld writes "The "Bayesian" technique for spam filtering recently publicized in Paul Graham's essay A Plan for Spam doesn't actually seem to have anything Bayesian about it, according to Gary Robinson (an expert on collaborative filtering). It is based on a non-Bayesian probabilistic approach. It works well enough, because technology often doesn't have to be 100% perfect to do something that really needs to be done. The problem interested Robinson, and he posted his thoughts on fixing the problems in the Graham approach, including adding an actual Bayesian element to the calculations."

best method .. is ..to .. (0)

Anonymous Coward | more than 11 years ago | (#4276134)

block all, and then let only what you KNOW you need in. it's the only method that will ever work right.

Spam spam spam (1)

Dynamoo (527749) | more than 11 years ago | (#4276136)

Well I guess spam comes in different size tins sometimes, and with different labels so you can tell the spam apart. I like Hot and Spicy Spam. Mmmm.

Of course, the 1% of non-spam that accidentally gets filtered out is just collateral damage (except it's normally something really important like a tin of processed peas or something).

I'm going to sit down now and take some more HGH.

Re:Spam spam spam (1)

docbrown42 (535974) | more than 11 years ago | (#4276371)

Well I guess spam comes in different size tins sometimes, and with different labels so you can tell the spam apart. I like Hot and Spicy Spam. Mmmm.

Bloody Vikings!


docbrown.net [docbrown.net] NEW!
Graphic Design, Web Design, Role-Playing Games...all the good stuff

But what about pr0n spam? (0)

Anonymous Coward | more than 11 years ago | (#4276142)

Can said "filter" filter out non-pr0n spam, while keeping the sweet sweet pr0n spam?

Re:But what about pr0n spam? (0)

Anonymous Coward | more than 11 years ago | (#4276506)

In other words what we need is a sort of "facial recognition system."

Why filter spam? (0)

Anonymous Coward | more than 11 years ago | (#4276154)

Spammers have to make money, too. Is it so hard to click on a link or two a day to help put food into the mouths of the man's children? Who are you, Scrooge? Help the man feed his kids this Thanksgiving.

Re:Why filter spam? (0)

Anonymous Coward | more than 11 years ago | (#4276524)

Nothing wrong with people making money.
There is plenty wrong with stealing it.

"Anti-spammers" are not starving anyone.
They are, however, opposing theft.

spam (1)

sstory (538486) | more than 11 years ago | (#4276163)

Someone came up with this idea recently, and I like it, so I've been repeating it. Instead of outlawing spam, which I would love if it worked (but it won't), require spammers to indicate the nature of the email (anonymous, commercial) with a word or such in the subject line, which individual recipients can then filter according to their desires. It would not be as free-speech-limiting as banning spam, and spam would die out due to ineffectiveness once most everyone filtered it.

Re:spam (0)

Anonymous Coward | more than 11 years ago | (#4276213)

You're mighty fucking optimistic if you think people are going to tag their mail so that you don't have to read it. May as well sit around and hope they all just decide to stop spamming.

optimistic (2)

McFly777 (23881) | more than 11 years ago | (#4276508)

I think the original poster's point would be to make commercial e-mail illegal unless properly tagged. That way an untagged spam could be handed over to the FBI and treated like wire-fraud or something.

Big problem would be prosecuting the spammer. Either they would all move overseas or the court would be so backlogged as to become ineffective.

Re:spam (2)

Cyno01 (573917) | more than 11 years ago | (#4276214)

Yeah, they're already supposed to indicate if it's pr0n spam by specifying it in the subject (hot sluts inside, or whatever), but they usually don't, and there isn't a really good way to enforce this. 9 times out of 10 when I get an e-mail from Joey (I know 3) and the subject is hey, or hows it going or something, the actual e-mail is pr0n.

I still think passive euthanasia is the best way. (2, Flamebait)

tcc (140386) | more than 11 years ago | (#4276215)

Why is such a simple problem that pisses off 99.9% of the population so hard to manage on a global scale? I mean, EVERYONE is pissed off at getting spammed, everyone would LOVE legislation to sodomize local spammers with a baseball bat. Overseas is a different problem, but country/continent-wide spam is 1/2 of my problem and can easily be taken care of with proper legislation. For once a restrictive legislation would get 99% support... you don't see that every day. Like I mentioned before, I don't get our politicians, they say they work for us, they try to find clever ways to tax us, remove control that we used to have and all, but something on which they would get unprecedented support, they are simply sitting on the issue...

Until politicians get fed up and people actually get SUED for spamming (for once you could have a good reason to sue real bad guys), nothing will change.

Yes I know in SOME states it's beginning, so for local spam in a few years from now I think legislation will make its way and we'll be able to look in our mailbox and stop having TD Waterhouse spamming when you already have an account with them, etc.

The other problem now is overseas spamming, especially coming from China/Taiwan. I mean.. I don't read Chinese, I don't plan on buying that #.#" something overseas, so why do they spam us like that? I never get it, but I'd be all for passive euthanasia (i.e. ban their IP at router level) and if this is bad for business or relations or whatever, well MAYBE they will do something about it.

Here where I work, it's simple: one spam, I ban a whole class straight off the servers. If one day I get a call because someone couldn't reach us (if they really need to reach us, we have a phone anyways!) I'll be sure to tell him why. Too bad this is not happening at the backbone level, because some people would get their act together fast and apply legislation globally.

Re:I still think passive euthanasia is the best wa (0)

Anonymous Coward | more than 11 years ago | (#4276384)

Politics in the US is not about the will of the people; it is about the will of the corporations that have the money for lobbying their agenda. The politicians will continue to ignore the people unless the resistance from the people crosses a certain threshold (in this case, when people are bothered enough by spam to ignore other issues that the politician in question might be working on).

Re:I still think passive euthanasia is the best wa (3, Informative)

ivan256 (17499) | more than 11 years ago | (#4276405)

For once a restrictive legislation would get 99% support... you don't see that every day. Like I mentioned before, I don't get our politicians, they say they work for us, they try to find clever ways to tax us, remove control that we used to have and all, but something on which they would get unprecedented support, they are simply sitting on the issue...

Perhaps the problem is that the law would gain them fewer votes than a few hundred thousand dollars in campaign financing would. A large portion of the population isn't online, and a large portion of those who are don't care about spam, so your politician doesn't care either.

Since this is such a trivial technical problem to solve, it's not really a big deal either way. I daily reduce 800 spam messages to five or six that make it through to my inbox just using procmail scoring, and I haven't had a false positive in years. I spend five minutes updating my procmailrc every six months to keep it effective. I suppose that I could use an automated system to generate my score file similar to what Paul Graham described, but when I only spend ten minutes a year updating my rules, it's going to be a lot of years before writing all that code pays for itself. No need for sweeping legislation.
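
For readers who haven't seen score-based filtering, here is a rough Python sketch of the general idea: each rule that matches adds (or subtracts) a weight, and the total is compared to a threshold. This is not procmail syntax and not the poster's rule set; the patterns, weights, and threshold below are all invented for illustration.

```python
import re

# Hypothetical rules: (pattern, weight). Positive weights push toward spam,
# negative weights push toward legitimate mail. None of these come from a
# real procmailrc.
RULES = [
    (re.compile(r"\bfree\b", re.IGNORECASE), 2.0),
    (re.compile(r"click here", re.IGNORECASE), 3.0),
    (re.compile(r"unsubscribe", re.IGNORECASE), 1.5),
    (re.compile(r"^Subject: Re:", re.MULTILINE), -3.0),
]

SPAM_THRESHOLD = 4.0  # assumed cut-off, tuned by hand in practice


def score_message(raw_message, rules=RULES):
    """Sum the weights of every rule whose pattern occurs in the message."""
    total = 0.0
    for pattern, weight in rules:
        if pattern.search(raw_message):
            total += weight
    return total


def is_spam(raw_message, threshold=SPAM_THRESHOLD):
    """Treat the message as spam once its score reaches the threshold."""
    return score_message(raw_message) >= threshold
```

The hand-maintained part is the rule list; the automated approach discussed in the story would instead derive the weights from a corpus of sorted mail.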

Re:I still think passive euthanasia is the best wa (1)

ch-chuck (9622) | more than 11 years ago | (#4276411)

Ah, all they have to do is say something about restricting free speech and all the angry ballbats go limp. Spamassassin: works for me.

anti-spam laws (2)

McFly777 (23881) | more than 11 years ago | (#4276420)

While in many respects I agree that "There oughta be a Law" against spam, there are some problems with that approach. Not the least is that generally a social solution is much better (or at least has fewer side effects) than any law that a government will enact.

Laws have the distinct problem of either going too far (false positive) or being too weak and thereby legitimizing the spam that would manage to work through the loopholes. Taken to the extreme that seems to commonly occur in the US legal system, I can envision spammers suing ISPs for blacklisting their "legit per US act ####" spam.

I would much rather use statistical methods such as those being discussed. These, combined with "whitelist" methods, seem to work very well by all accounts.

Re:I still think passive euthanasia is the best wa (1)

_Spirit (23983) | more than 11 years ago | (#4276542)

ban their IP at router level

Oi, remind me to start running when you consider *active* euthanasia

Tutorial on Bayesian Inference (5, Informative)

rbrito (37104) | more than 11 years ago | (#4276216)

The timing of this article seems impeccable, since I am myself trying to learn about Bayesian Statistics.

I am a Computer Science student [ime.usp.br] studying Computational Biology [ime.usp.br] (more specifically, Sequence Alignments) and while I have a bit of background on Classical Statistics, I was (and still am) completely ignorant about Bayesian Statistics.

It is only now that I'm trying to learn about Hidden Markov Models and their applications to Sequence Alignment that I finally decided to learn the basic hypotheses of Bayesian Statistics and how they differ from the hypotheses made by Classical Statistics.

While searching for introductory material on Bayesian Statistics, I found this course page [arizona.edu] which has some nice introductory notes, including Bayesian Statistics.

I hope that other people find this resource as useful as I did.

Post your results here (5, Interesting)

Jeffrey Baker (6191) | more than 11 years ago | (#4276220)

I'd like to hear the results of anyone who has implemented one of these probabilistic filtering systems. I implemented a modified version of Paul Graham's system and so far it kicks ass. So far it has trapped over 600 spams without any false positives. I receive almost 100 spams a day and over the last week I have generally only had to delete one or two by hand. The rest go directly to jail.

I'd like to hear about modifications to this system. I removed Graham's doubling of "good" word frequencies, and I trained my filter using digrams. I also tried all the various methods supplied by the program "rainbow", with good results, but the implementation was too slow and klunky to place in the middle of my email delivery system. What are other possible modifications?
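
To make those two modifications concrete, here is a minimal Python sketch of a Graham-style scorer trained on digrams (adjacent word pairs), with the "good" doubling exposed as a parameter. It is not the poster's Perl code. The constants (0.4 for unseen tokens, the 0.01/0.99 clamp, the 15 most interesting tokens) follow my reading of Graham's essay and should be checked against it; the function names and whitespace tokenizer are assumptions, and Graham's minimum-occurrence cut-off is left out for brevity.

```python
from collections import Counter


def digrams(text):
    """Split text into lowercase words and return adjacent word pairs."""
    words = text.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


def train(messages):
    """Count how often each digram occurs across a list of messages."""
    counts = Counter()
    for message in messages:
        counts.update(digrams(message))
    return counts


def token_prob(token, good, bad, ngood, nbad, good_weight=1):
    """Estimate how spammy one token is.

    good_weight=2 reproduces the doubling of "good" counts;
    good_weight=1 drops it, as in the modification described above.
    ngood and nbad are the (nonzero) numbers of training messages.
    """
    g = good_weight * good.get(token, 0)
    b = bad.get(token, 0)
    if g + b == 0:
        return 0.4  # assumed default for tokens never seen in training
    g_freq = min(1.0, g / ngood)
    b_freq = min(1.0, b / nbad)
    return max(0.01, min(0.99, b_freq / (g_freq + b_freq)))


def combine(probs):
    """Combine per-token probabilities: prod(p) / (prod(p) + prod(1 - p))."""
    spam, ham = 1.0, 1.0
    for p in probs:
        spam *= p
        ham *= 1.0 - p
    return spam / (spam + ham)


def score(text, good, bad, ngood, nbad, top=15, good_weight=1):
    """Score a message using its `top` most extreme digram probabilities."""
    probs = [token_prob(t, good, bad, ngood, nbad, good_weight)
             for t in set(digrams(text))]
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    return combine(probs[:top])
```

A message would then be filed as spam when `score(...)` exceeds some cut-off such as 0.9; that cut-off is also an assumption to tune against your own mail.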

Re:Post your results here (3, Insightful)

ajm (9538) | more than 11 years ago | (#4276275)

Just out of interest what's your code written in and would you consider posting it?

Re:Post your results here (5, Interesting)

Jeffrey Baker (6191) | more than 11 years ago | (#4276353)

I hacked it together in Perl, to make use of the Berkeley DB interfaces and the MIME parsing modules. Took about 30 minutes. I'm working on a C library that could be linked into mutt or pine or whatever, but I'm finding the available MIME code in C cumbersome.

You can grab the source here [saturn5.com], but it is specific to the exact way that my mail gets delivered (via offlineimap into maildirs).

Re:Post your results here (3, Interesting)

kwerle (39371) | more than 11 years ago | (#4276282)

I implemented Paul's system without the changes you mentioned, and am seeing >95% success (and climbing). 0 false positives. I will be submitting it to sourceforge this week.

Re:Post your results here (2)

Jeffrey Baker (6191) | more than 11 years ago | (#4276336)

So have you been retraining the system as you get more spam, or did you train it initially and leave it that way? How large is your training set?

Details! My training set was 300 spams and 3500 not-spams. With digrams, my filter traps 618 out of 621 spams in my spam folder, which is 99.5%

Re:Post your results here (3, Interesting)

kwerle (39371) | more than 11 years ago | (#4276593)

So have you been retraining the system as you get more spam

I continue to train.

or did you train it initially and leave it that way. How large is your training set?

I started off with a base.

Details! My training set was 300 spams and 3500 not-spams.

I started with a little more than 300 spam, and around 1000 valid messages.
My count is now:
Good messages read: 1194
Bad messages read: 644

That's because I only train on deleted mail, and I don't tend to delete my mailing lists except for once a month or 2...

With digrams, my filter traps 618 out of 621 spams in my spam folder, which is 99.5%

Against my start set, I nailed about 97%, including refiling 2 false positives from my old anti-spam system as being not spam. I've noticed that the system is really good at nailing stuff it already knows about, but the learning curve is a little steep for 'new spam types'. Still, I'm pretty happy with it.

Re:Post your results here (1)

saskboy (600063) | more than 11 years ago | (#4276337)

I don't understand the filtering software that I can download to implement this system. Anyone have a link?

Re:Post your results here (1)

aero6dof (415422) | more than 11 years ago | (#4276554)

How about hybridizing it with spamassassin to help mark email for varying levels of analysis? You could use spamassassin with a low threshold to do an initial low resource pass and then do a high resource pass with your system. Alternately, you could try to generate spamassassin rules from your database to help with that first pass filter.
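
A two-pass setup like this could be sketched as follows. This is a generic illustration in Python, not SpamAssassin's API: `cheap_score` and `expensive_score` are hypothetical callables that return a value between 0 and 1, and the cut-offs are assumed. Whichever filter is actually faster should go in the first slot.

```python
def cascade(message, cheap_score, expensive_score, low=0.1, high=0.9):
    """Classify with a cheap filter first; consult the expensive filter
    only when the cheap score falls in the uncertain middle band."""
    first = cheap_score(message)
    if first <= low:
        return "ham"
    if first >= high:
        return "spam"
    # Uncertain: pay for the slower, more thorough pass.
    return "spam" if expensive_score(message) >= 0.5 else "ham"
```

The point of the design is that most mail is decided by the first pass, so the slow filter's cost is paid only on the borderline cases.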

Re:Post your results here (2)

Jeffrey Baker (6191) | more than 11 years ago | (#4276630)

What you said is actually inverted, since my system is way faster than spamassassin. I can do 15 mails/second with my Perl code. My half-implemented C code does over 600/second. Spamassassin seems to take about 5 seconds per mail on the same hardware.

The proof of the pudding... (5, Interesting)

ajm (9538) | more than 11 years ago | (#4276223)

...is in the eating. I think the same applies to spam. Paul showed, to his satisfaction, that the technique he used worked for his samples. Gary proposes some changes that would improve the filter's accuracy, but does not test these theories.

We will now have many slashdot posts saying "I've not tested this but I think A (or B, or C, or X)"

Here's where the scientific method comes into its own. Anyone who cares enough can actually test and post their results. I'd be interested in seeing what they look like. I don't have a database of spam to test against (and please don't volunteer to sign me up for some :) but it would be interesting to see whether what looks convincing in theory pays off in practice.

Re:The proof of the pudding... (3, Insightful)

shadow303 (446306) | more than 11 years ago | (#4276328)

From what I can observe from the writeup, Gary appears to be one of the "experts" that I refer to as "theory whores". Hard problems need to be tested, but some people seem to think that they can arrive at good results from an unproven theory. Anybody who has actually tested difficult problems to any extent could tell you that things don't always go as planned. An improvement which might work in theory sometimes results in disaster due to minor points that the theory does not take into account.
Also, it bothered me that he objected to Paul's work biasing one side. It was almost like he thought it was a bug, but there was a good reason for biasing (reduce false positives). So my advice for Gary is, until you actually implement your idea, don't go trying to say that it is better than somebody else's method.

Re:The proof of the pudding... (2)

ajm (9538) | more than 11 years ago | (#4276375)

It's the feeling I got as well. The changes Gary suggests might be good or they might be bad. What I liked about Paul's write up was that he determined practically that it worked for him. Sure the theory may not be perfect but, to use a broad analogy, you don't need general relativity to work out how fast a ball rolls down a slope, Newtonian theory works fine. It may not be worth it to get the extra .1% of accuracy, especially if, as you point out, it increases false positives. Only testing will tell.

Re:The proof of the pudding... (1)

NecrosisLabs (125672) | more than 11 years ago | (#4276361)

You, sir, have made my day; if I have to hear one more chucklehead say "The proof is in the pudding" I will not be held accountable for my actions.

poor Hotmail users are still in the cold... (4, Funny)

saskboy (600063) | more than 11 years ago | (#4276225)

I have some tricks for Hotmail users who cannot benefit from the technique above:
Filter any message without the @ in the address.
Filter Britney, Boobs, Penis, Inches, WIN, ___ ..... and your own email address userid.
Now you only have about 40 spams a day to deal with instead of 100.
Uncheck your information from being in the MSN directory too.

Enjoy :-)

Re: poor Hotmail users are still in the cold... (2)

IIRCAFAIKIANAL (572786) | more than 11 years ago | (#4276414)

I really feel for any plastic surgeons named Britney that focus on penile and breast enlargement surgery. They can't filter squat.

Re: poor Hotmail users are still in the cold... (1)

saskboy (600063) | more than 11 years ago | (#4276486)

I saw a /. sig the other day that said a person who gets spam must feel something like a Fat balding woman who needs bigger breasts and a longer penis to screw Britney.

Terrible Spam Filters (3, Informative)

DonkeyJimmy (599788) | more than 11 years ago | (#4276255)

It's good that work is being done to make a good weighted spam filter.

It's funny how bad the standard Microsoft spam filter is (the one present in Outlook). It's simply a word lookup, where if the word is present the message is marked as spam. It looks for things like "for free?". You can see the full list here [iirusa.com], near the bottom. It's a little old, but not outdated (I think you can upgrade your spam filters, but I tested these, and the ones I tested work).
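
A plain word-lookup filter of the kind described is easy to sketch, and the sketch shows why it misfires. The Python below is an illustration, not Outlook's implementation; only "for free?" comes from the comment above, and the other phrases are made up.

```python
# Only "for free?" is taken from the comment; the rest are invented examples.
BLOCKED_PHRASES = ["for free?", "act now", "100% guaranteed"]


def lookup_filter(message, phrases=BLOCKED_PHRASES):
    """Flag a message as spam if any blocked phrase appears anywhere in it."""
    text = message.lower()
    return any(phrase in text for phrase in phrases)
```

Because one hit is enough and nothing can count in a message's favor, a personal note asking "can I borrow your car for free?" is flagged, while spam that avoids the listed phrases passes untouched. A weighted filter lets other evidence in the message outvote a single unlucky phrase.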

The adult filter isn't any better.

Naive Bayesian Learning (2, Interesting)

Anonymous Coward | more than 11 years ago | (#4276276)

Finally it is worth mentioning that if you really want to go a 100% Bayesian route, try Charles Elkan's paper, "Naive Bayesian Learning". That approach should work very well but is a good deal more complicated than the approach described above.
Here is the article [citeseer.nj.nec.com]

Let's see (5, Funny)

sam_handelman (519767) | more than 11 years ago | (#4276287)

P (This is spam) = P (This is Spam | It will enlarge my penis) * P (It will enlarge my penis)

Now, given that I have prior knowledge that:
P (It will enlarge my penis)

is very low,

and given that, having never encountered anything which enlarges my penis in any permanent way, I have no knowledge of
P (This is Spam | It will enlarge my penis)

and we have the product of one probability which I know is low, and another of which I have no posterior knowledge, so we conclude that P (It is Spam) is also low, and that I must have requested more information on their new penile enlargement technique.

So, that message goes into the keepers.


P (It is Spam) = P (It is Spam | Frank is getting married) * P (Frank is getting married)

So, I know Frank is getting married, since he sent me this e-mail I'm considering filtering as Spam, and whether or not it is spam is pretty much independent of whether or not Frank is getting married, so.... it's Spam. Away it goes.

P.S. I've deliberately made a hash of this for a joke. The actual rule is:

P (A & B) = P (A | B) * P (B)
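
For anyone wondering how that product rule leads to Bayes' theorem, the step is short. Writing the joint probability both ways and dividing by P(B) gives the familiar form. In the block below, reading A as "the message is spam" and B as "the message contains a given word" is an illustrative choice, not something from the post.

```latex
\begin{align}
P(A \wedge B) &= P(A \mid B)\,P(B) = P(B \mid A)\,P(A) \\
P(A \mid B)   &= \frac{P(B \mid A)\,P(A)}{P(B)} \\
P(B)          &= P(B \mid A)\,P(A) + P(B \mid \neg A)\,P(\neg A)
\end{align}
```

The last line is what makes the rule usable in practice: P(B) is rarely known directly, but it can be rebuilt from how often the word shows up in spam and in non-spam.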

Whatever Jaguar (Mac OS X 10.2) uses works! (1, Interesting)

Anonymous Coward | more than 11 years ago | (#4276293)

Is this what the new Mail.app in Mac OS X 10.2 uses?

I, myself, am not sure but the new Mail.app is smart and it does learn. After a week of "learning" it has correctly determined messages as spam more than 99 out of 100 times.

filtering not the answer - maybe this is (5, Insightful)

frovingslosh (582462) | more than 11 years ago | (#4276299)

Sadly, unless you are an ISP or other mail service provider, filtering does nothing. The spammers work in volume. They count on hitting everyone to reach that .1% that will respond. That response is what they are after and what they get paid for. You likely know better than to ever deal with anyone who spams you or to ever respond to their spam. Filtering your own e-mail has absolutely no effect on the spammer, you were not going to respond anyway. By the time you filter they have already wasted your bandwidth, and perhaps mailbox capacity and even forwarding limits from a forwarding service. Your filtering is useless, puny human!

Here is a suggestion for something that might make an impact on spammers: If I open my firewall, I see several attempts a day from people trying to get into my mail server. Of course, I don't have a mail server, but spammers are always looking for open relay points they can spam from. My suggestion: Give them a nice open relay server they can send mail to. Of course, you don't want to piss off your service provider by sending spam, and your upstream speed might limit you to less than you can receive, so rather than run a full mail server let's modify some mail server code to just accept mail and send it to the bit bucket. Maybe we can even misconfigure existing code to do this with no programming changes.

No valid user will be affected, assuming you don't otherwise run a mail server. All that bandwidth you pay for can be used to receive e-mail from spammers before it ever goes out. Eventually their customers will see the response go from .1% to 0% and their business will dry up. This will impact spammers, blocking your own spam after it's been delivered will not.

This need not even impact your own bandwidth. You can run the server when you are done using your system (Might make a nice screen saver - a black screen that just shows how many spammed addresses were prevented from getting spammed). Or you can impose limits on bandwidth at a firewall or router, or even restrict hours of access.

If we set up enough different false open relay servers I think we could have a real impact on the spammers.

Double whammy (0)

Anonymous Coward | more than 11 years ago | (#4276386)

Hehe, sounds like fun. Maybe I can then capture all the e-mail addresses that get run through my fake mail server, and sell the list back to the spammers.

Re:filtering not the answer - maybe SPOOFSERVERS (1)

saskboy (600063) | more than 11 years ago | (#4276395)

Hey, that is a really cool idea, I wonder if it can really work. It is a new idea to me, so if anyone knows if this is a joke, or a possibility, please let us know?

Then we need someone to develop some open source code that creates a dead end mail server on whoever installs the program. They should be able to set how much spam their server eats in a night, rated to bandwidth usage. I'd run it as a screensaver.

Re:filtering not the answer - maybe SPOOFSERVERS (4, Insightful)

netringer (319831) | more than 11 years ago | (#4276458)

I'm fairly sure a false relay won't work. Just like snail mail list sellers, the spammers salt their victim lists with their own valid addresses that they can check to see if the message is getting out.

BUT, an early spam filter at an ISP worked just like that. The design parameters were 1) that spam filtering require no more resources than actual delivery of the message, and 2) the filter give no indication to the spammer that the message was not going to be delivered. This gives the spammer no feedback and forces THEM to waste CPU cycles which will slow them down.

Mmmm, I wouldn't try it (2)

mblase (200735) | more than 11 years ago | (#4276428)

This need not even impact your own bandwidth.

Last week (I can't find the article yet), Slashdot had a link to a column by someone who was (in his opinion) unjustly blacklisted for hosting an easily-accessible mail server. The moment his name hit that blacklist, he became a target for what may as well be every spammer on the planet. Even though he didn't actually have an open relay (just an easily-guessed password), the incoming traffic from so many e-mail spammers effectively brought his server to its knees. Changing his domain name and IP address was the only cure.

Building a "honeypot" mail server for spammers is appealing, but could be more trouble than it's worth, especially since it's more or less irreversible. I'd advise against it.

Re:Mmmm, I wouldn't try it (2)

Deagol (323173) | more than 11 years ago | (#4276576)

I've always wondered what it would take to modify a sendmail or postfix configuration to become a "mail sink". Sure, there are tarpits that slow spammers down, but why not make a server that acts and smells like an open relay, but simply dumps the mail to /dev/null and tells the sender they were delivered? Maybe bandwidth might be an issue, but it may be more effective than a tar pit.
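
The protocol side of such a sink is small. Below is a hedged Python sketch of just the conversation logic: it answers the basic SMTP commands with success codes, throws away everything sent after DATA, and still reports the message as accepted. It is not a sendmail or postfix configuration and it is not wired to a socket; the class name, the `sink.example` hostname, and the reply texts are invented, while the numeric reply codes are the standard SMTP ones.

```python
class SinkSession:
    """One SMTP conversation that accepts everything and keeps nothing."""

    def __init__(self):
        self.in_data = False   # True while the client is sending a message body
        self.discarded = 0     # number of messages accepted and thrown away

    def greeting(self):
        """Banner to send when a client connects."""
        return "220 sink.example ESMTP"

    def handle(self, line):
        """Return the reply for one client line, or None if no reply is due."""
        if self.in_data:
            if line == ".":
                # End of message: claim success, keep nothing.
                self.in_data = False
                self.discarded += 1
                return "250 OK: queued"
            return None  # body line silently dropped

        verb = line.split(" ", 1)[0].upper()
        if verb in ("HELO", "EHLO"):
            return "250 sink.example"
        if verb in ("MAIL", "RCPT", "RSET", "NOOP"):
            return "250 OK"
        if verb == "DATA":
            self.in_data = True
            return "354 End data with <CR><LF>.<CR><LF>"
        if verb == "QUIT":
            return "221 Bye"
        return "500 Command not recognized"
```

A real deployment would need a listener that strips line endings and feeds each line to `handle`, plus the bandwidth limits mentioned elsewhere in this thread. Note that this sketch does nothing about spammers who test a relay with a message to an address they control.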

A human watching over his spam software might notice if the target relay is delivering at a rate of 1 message per day and find another. If, however, he sees that the "server" is ripping through deliveries at a massive rate, he might stay with that server and all of his spam will vanish into the bit bucket.

Re:Mmmm, I wouldn't try it (2, Offtopic)

frovingslosh (582462) | more than 11 years ago | (#4276594)

Actually, he did have an open relay, he just wants to hide behind a lame claim that it wasn't open because the spammer had to lie to use it! Imagine that, a spammer lying. He was a lawyer, and we know they never lie ;-). IMHO he got a lot less punishment than he deserved.

And what was the reported problem he cried about? Not an overload on his network, that was not his complaint. But his domain name being blacklisted. With good reason, IMHO. He was running a server that spammers used, and could even see this when the people he invited to test his system got right in. He then claimed they misused his system because they gave a false name and suggested he should sue them!

Maybe this guy was just too stupid to block a port on an incoming firewall to keep the outside mail server users out. It seems unlikely though, particularly if he had the ability to set up a mail server (supposedly for the use of his own local network). It sounded more to me like there was a good chance he knew exactly what he was doing and wanted to set up a server for spamming, and was blowing smoke when he got black holed.

Getting black holed will not be a problem for a dummy server that never actually sends mail (the black hole people are not out there port scanning like the spammers are). Even if your dummy mail server were to be blacklisted, so what? That in no way would affect your normal e-mail that you send through your service provider.

Why not? (1)

Greedo (304385) | more than 11 years ago | (#4276663)

Why not try it? The problem the guy had last week was that he did this on his home box that he used for other stuff (specifically, some mail-related stuff).

So when he was blacklisted, his legitimate work was affected.

There is nothing inherently wrong with running a honeypot mail-server. Just do it somewhere that isn't going to screw you when it shows up in ORBZ.

(In fact, you could set up one server that acted as a honey-pot, and publish all the IPs of the spammers who try and connect to it. Other servers could use those IPs to block access at a lower level, without the risk of running their own honey-pots.)

Re:filtering not the answer - maybe this is (1)

RealAlaskan (576404) | more than 11 years ago | (#4276495)

... rather than run a full mail server let's modify some mail server code to just accept mail and send it to the bit bucket.

That's clever. One possible problem might be that spammers would quickly learn to test your relay with a message, just to make sure that it didn't all go to /dev/null. I suppose that we'd have to set things up so that single emails got forwarded, and bulk emails didn't, just to avoid that. Now we really are running an open relay, though it isn't much good to spammers.

Another problem is that spammers might start automatically sending the same spams to the same lists via several different open relays. Thus, we might increase the volume of spam, at least in the early stages.

I agree that the ultimate solution to the spam problem isn't going to come from filtering at the email client. It's a social problem, and needs a social solution. Filtering by ISPs (on by default, but easily circumventable by knowledgeable users, maybe?) would help, and so would us telling our less-knowledgeable friends and relations NOT TO BUY FROM SPAMMERS!

Re:filtering not the answer - maybe this is (1)

GigsVT (208848) | more than 11 years ago | (#4276557)

would us telling our less-knowledgeable friends and relations NOT TO BUY FROM SPAMMERS!

I can hear them now:
"I didn't buy from a spammer, it said 'THIS IS NOT SPAM!!!' and it had legit looking unsubscribe info."

You might do better to send out a spam, then murder all the buyers once you get their address. Intellectual cleansing.

Re:filtering not the answer - maybe this is (2)

frovingslosh (582462) | more than 11 years ago | (#4276667)

You might do better to send out a spam, then murder all the buyers once you get their address. Intellectual cleansing.

I've often wondered why we don't see a few spammer's heads on pikes to greatly reduce this problem, but there is a lot to be said for your solution too. Just don't do it on the day some good soul gets fed up with spammers and comes after you! ;-)

Re:filtering not the answer - maybe this is (3, Insightful)

stienman (51024) | more than 11 years ago | (#4276525)

Interesting idea, but easy to verify. Send one thousand emails, and include a verifiable email in it. Check the email a few hours later - if it's not there, then don't use the relay.


Re:filtering not the answer - Spam Honeypot! (1)

Ma$$acre (537893) | more than 11 years ago | (#4276534)

A Honeypot for spammers? Sounds like an idea whose time has come.

The problems are many, but the outcome would be fantastic. Create a Mail-dev/null program which looks like a "real" system and make it hackable. Keep the same doors the spammers would normally use. Make said program freely available to anyone and everyone. Make it that much more difficult for Spammers to find a working program to hack.

Neural Net Spam Filtering (3, Interesting)

ShakaUVM (157947) | more than 11 years ago | (#4276323)

At UCSD, Bob Boyer and I wrote a neural net spam filter. Neural Nets, as everyone knows, are not really like biological brains, but really just statistical engines similar to the approach the guy above claimed to do.

Our approach worked pretty well (95-97% accuracy), and we had to deal with the same issues that the above "Bayesian" approach did, i.e., weighting the neurons so that false positives occur much less frequently than false negatives, etc. We built it using data on spam collected from the UCI machine learning repository.

It ties in with procmail. I'm not really a windows guy, so if anyone knows how to put a filter between an IMAP server and Microsoft Outlook/Netscape Communicator, I'd be interested in hearing how it's done.

The README for it is at: http://www-cse.ucsd.edu/~wkerney/spamfilter.README
And you can download it at:
http://www-cse.ucsd.edu/~wkerney/spamfilter.tar.gz

-Bill Kerney
wkerney at ucsd.edu

What happen to the IBM/Redhat article (-1, Offtopic)

Anonymous Coward | more than 11 years ago | (#4276332)

WTF? Where did the redhat/ibm article go

SpamAssassin - duh (3, Interesting)

Gothmolly (148874) | more than 11 years ago | (#4276346)

SpamAssassin [taint.org] works great for me. It eats about 90% of my spam, you just hack up a little procmail file for it, and you're done.

With so many people using SpamAssassin these days, I can't see how this is a timely or newsworthy item. More like from the been-there-done-that-dept..

Re:SpamAssassin - duh (3, Insightful)

Eric Seppanen (79060) | more than 11 years ago | (#4276496)

Reasons why I don't use SpamAssassin:
  1. It tends to rely on blocklists, many of which have demonstrated unfair practices in the past.
  2. The more SpamAssassin is used, the more spammers will specifically avoid doing things SpamAssassin checks for.
  3. It's a gigantic heap of perl, the Write-Only (tm) language. I hate the fact that every perl program demands I mess up the package manager on my system by blindly downloading a half-dozen new modules. And it's slow!
  4. Bogofilter [sourceforge.net] is better. duh.

How do you pronounce "Bayesian" anyways? (2)

mblase (200735) | more than 11 years ago | (#4276367)

While I love everything there is to love about open source (code and ideas), I kind of worry when I read how well all these new Bayesian/Grahamian filtering techniques work.

Not being a coder or statistician myself, I'm left wondering if the spammers can exploit it for a workaround. Is there something "built in" to these filtering techniques that can be used by spammers to effectively circumvent them?

Re:How do you pronounce "Bayesian" anyways? (2, Interesting)

KieranElby (315360) | more than 11 years ago | (#4276497)

> Is there something "built in" to these filtering techniques that can be used by spammers to effectively circumvent them?

Yes and no.

To defeat a bayesian filter, the spammer needs to make his email contain similar words, and combinations of words, to your genuine email, while at the same time making sure that the words used are different to those in known spam.

So saying 'click here to make $$$' won't work any more, since most of your regular emails don't contain the word combinations 'click here' and 'make $$$', whereas known spam emails will.

However, we're already beginning to see spammers making their emails less obviously spam.

For example, the spammer may use an email along the lines of:

"How's things?

Have you seen yet?

Don't forget to mail me those documents.

A Spammer"

Even a bayesian filter will struggle to distinguish that from:

"Have you seen the story on slashdot yet?

Don't forget those reports.

Your Boss"

Re:How do you pronounce "Bayesian" anyways? (1)

nelsonal (549144) | more than 11 years ago | (#4276537)

I think it's similar to Beige combined with an That's a short a, I don't know how to represent it here. Others are likely to correct me, but the only way to circumvent them is to make your spams different from other spams and similar to your normal mail. There isn't a single way to get on a good list, since there is no single good list, only attributes that make it more or less likely to be a Spam.

Re:How do you pronounce "Bayesian" anyways? (2)

PurpleBob (63566) | more than 11 years ago | (#4276590)

I bet they could get around it by picking a few random words from a dictionary and adding it to the end of the spam. If one of them were an obscure word that you've received in one or two legitimate e-mails, the filter would decide "Hey, I've never gotten a spam with the word 'yarborough' in it before, so it must be real".

Re:How do you pronounce "Bayesian" anyways? (0)

Anonymous Coward | more than 11 years ago | (#4276636)

Did you read any of the articles, or even the posts around you?
New words are assigned a rather neutral value, which means they won't be included in the 15 most interesting words list which is used to determine whether or not it's spam.
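To make that point concrete, here is a minimal sketch of the selection step, assuming a per-word spam probability table has already been built. The function name and the table are hypothetical; the 0.4 default for unseen words and the limit of 15 are the figures quoted elsewhere in this thread.

```python
def most_interesting(tokens, spam_prob, unknown=0.4, limit=15):
    """Pick the tokens whose spam probability is farthest from neutral 0.5.

    spam_prob maps word -> estimated probability that a mail containing
    it is spam (hypothetical table). Unseen words get `unknown`.
    """
    probs = {t: spam_prob.get(t, unknown) for t in set(tokens)}
    # Sort by distance from 0.5, largest first; break ties alphabetically
    # so the result is deterministic.
    ranked = sorted(probs.items(), key=lambda kv: (-abs(kv[1] - 0.5), kv[0]))
    return ranked[:limit]
```

A never-seen word like 'yarborough' sits only 0.1 away from neutral, so it makes the list only when the mail has fewer than 15 words with stronger evidence either way.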

Well... (2, Informative)

ccarter (15555) | more than 11 years ago | (#4276372)

I hate to give any kind of credit to M$ but they patented the idea of using Bayesian analysis for spam filtering circa 1995. They even had it in one of their betas. However the filters were tagging some of those fricking Blue Mountain greeting cards as spam (imagine that!) so Blue Mountain sued them on anti-competitive grounds and M$ pulled it. Blue Mountain wanted to have the spam filters universally pass Blue Mountain content but MS refused that on the grounds that if a user considers it spam then it is in fact spam to them (Hurray for the "bad guys"!). The lawsuit has been settled/dropped/died for reasons I don't know.

Anyway I hear that the next version of MSN will have a Bayesian filter and that it will be introduced in an upcoming version of Outlook Express (no idea about Exchange and Outlook).

BTW I believe internally MS uses this technique for spam control and that they don't seem to have any spam problems.

Why just spam? (1)

KieranElby (315360) | more than 11 years ago | (#4276396)

Sure, spam is a big problem, but right now only 10-20% of my emails are spam, and most are easily identifiable by subject.

On the other hand, I get hundreds of emails every few days covering a range of topics, which need to be manually sorted into folders.

What I'd like to see, and I suspect I'm not alone here, is similar software that can sort email into any number of categories, not just spam and non-spam.

For example, if I have an email folder called 'fishing', containing emails from fishing buddies, then next time I get an email containing references to 'casting', 'trout' and 'it was *this* long', it should be sorted into that folder automatically.

I'd be curious to know if there's any existing software to do this, and if not, I'd be tempted to have a go at knocking something up to do this.

One tricky bit would be how to integrate it with the email client. I'd imagine that users wouldn't want to switch away from Outlook/Mozilla/Mutt/Whatever merely for this feature, so it would have to be client-agnostic.

I'm thinking that implementing a simple IMAP server would be the easiest option since this allows for server-side folder management. It would then be case of maintaining word counts (Bayesian or otherwise) for each folder, and classifying mail accordingly.
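The per-folder word count idea can be sketched as a small multinomial naive Bayes classifier. This is only an illustration under stated assumptions: `FolderClassifier` is a hypothetical name, tokenizing is a plain whitespace split, and add-one smoothing is a choice made here so that unseen words do not zero out a folder. The IMAP side is left out entirely.

```python
import math
from collections import Counter, defaultdict


class FolderClassifier:
    """Keeps word counts per folder and files new mail by best match."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # folder -> word -> count
        self.mail_counts = Counter()             # folder -> mails trained

    def train(self, folder, text):
        self.word_counts[folder].update(text.lower().split())
        self.mail_counts[folder] += 1

    def classify(self, text):
        """Return the most likely folder, or None if nothing was trained."""
        words = text.lower().split()
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        total_mails = sum(self.mail_counts.values())
        best_folder, best_score = None, -math.inf
        for folder, counts in self.word_counts.items():
            total_words = sum(counts.values())
            # Log prior: share of training mail filed in this folder.
            score = math.log(self.mail_counts[folder] / total_mails)
            for w in words:
                # Add-one smoothing over the shared vocabulary.
                score += math.log((counts[w] + 1) / (total_words + len(vocab)))
            if score > best_score:
                best_folder, best_score = folder, score
        return best_folder
```

Working in log space avoids underflow on long messages. Spam versus non-spam is just the two-folder case of the same structure.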

Anyone else had any thoughts along these lines?

Re:Why just spam? (3, Informative)

McFly777 (23881) | more than 11 years ago | (#4276575)

Easy. Just re-run the spam filter on your 'cleaned' mail using a ruleset generated by splitting the mail into topical vs. everything else.

Re:Why just spam? (1)

GigsVT (208848) | more than 11 years ago | (#4276603)

What I'd like to see, and I suspect I'm not alone here, is similar software that can sort email into any number of categories, not just spam and non-spam.

You must run Windows. Try a modern OS sometime. This has been a standard feature for years.

Re:Why just spam? (1)

nelsonal (549144) | more than 11 years ago | (#4276621)

I use a series of rules in Outlook to do something similar. I don't know if other email programs support this feature to the same level. But in Outlook you can create rules to move email to a folder, or delete it, reply to it, etc. based on sender, words in the subject, body, or nearly any other attribute. Mine is to sort email into company folders, as I work for a pension fund, and receive 50+ research emails a day. It also keeps my inbox empty, since the company folders are off the exchange server. Most of the people deleted anything unread (usually several hundred) more than a week old before I showed them how to use rules to sort the stuff. Email me if you want more info about them, they are under the tools menu. Not quite as good as a Bayesian solution, but pretty good nonetheless.

Re:Why just spam? (2, Informative)

shrikel (535309) | more than 11 years ago | (#4276649)

Have you tried MS Outlook? It's got extensive rule-based sorting capability. It doesn't work for IMAP, and you mentioned IMAP later in your message, but it's not clear that that's all you're dealing with.

Brain exploded (2, Funny)

operagost (62405) | more than 11 years ago | (#4276397)

Note to statisticians: the product of the probabilities is monotonic with the Fisher inverse chi-square combined probability technique from meta-analysis. The null hypothesis is that the probabilities are independent and uniformly distributed.
Ouch! My brain is hurting, Doc!
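For anyone whose brain is also hurting, the technique named in that note can be sketched in a few lines. This assumes the textbook form of Fisher's method: with n independent, uniformly distributed probabilities, -2 times the sum of their logs follows a chi-square distribution with 2n degrees of freedom. The function names are hypothetical, and the closed form used for the tail probability only holds for even degrees of freedom, which is all this method needs.

```python
import math


def chi2_sf_even(x, dof):
    """Upper tail of the chi-square distribution for even dof.

    Closed form: exp(-x/2) * sum_{i < dof/2} (x/2)**i / i!
    """
    if dof <= 0 or dof % 2:
        raise ValueError("dof must be a positive even integer")
    half = x / 2.0
    term = math.exp(-half)
    total = term
    for i in range(1, dof // 2):
        term *= half / i
        total += term
    return min(total, 1.0)


def fisher_combine(probs):
    """Combine probabilities in (0, 1] into a single tail probability."""
    statistic = -2.0 * sum(math.log(p) for p in probs)
    return chi2_sf_even(statistic, 2 * len(probs))
```

Since the statistic depends on the inputs only through their product, a smaller product always gives a smaller combined value for a fixed number of inputs, which is the "monotonic" part of the note.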

Why is spam still a problem? (0)

Anonymous Coward | more than 11 years ago | (#4276432)

I don't get it. Simply allow incoming email only from user names you know. Period.
Why is this hard to understand?

Re:Why is spam still a problem? (0)

Anonymous Coward | more than 11 years ago | (#4276607)

Easily understood.
Fairly easily implemented.
Largely unused...
...since it's a restriction that requires someone already know you or you know them, but not all non-spam e-mail is between known parties. And adding the barrier of a confirmation step for an average person - when most are not expecting it - gets a "Why should I waste my time with this isolationist bozo?" response.

spam is already keeping up? (0)

Anonymous Coward | more than 11 years ago | (#4276434)

I've noticed in the past 2-3 weeks that the look of the spam I've received is a lot more like regular mail.

Your home refinance loan is approved!

To get your approved amount go here [slashdot.org].

To be excluded from further notices go here [slashdot.org].

carpet 5gate 1932zIgl2


It's still identifiable as spam with a probability filter, but it's not that far removed from a legitimate mail an AOL dork might send or receive. (not that I care about them getting spammed!)

Bayesian Filtering Works (1)

CleverFox (85783) | more than 11 years ago | (#4276439)

I have implemented Paul Graham's algorithm at my corporation, and it is blocking 90-97% of our spam each day. It is "good stuff". Combine that with Razor v2 and some other filtering I do, and nary a spam gets thru.

authorization based email box (1)

erikdotla (609033) | more than 11 years ago | (#4276471)

I realized one day that filtering spam out by content is a futile exercise. I use a simple method that has worked perfectly: If the FROM address of an incoming message is not in my contact list, the message is Trashed. Before emptying the trash, I'll glance through it to be sure that I didn't receive a legitimate message from someone not in my list. Since I've used this, not one spam has ever appeared in my Inbox. This is important since I use mobile devices and other strange ways to access my email that would be very sensitive to spam overload.

Fact is, 99.999% of email I receive is either 1.) From people already on my contact list, or 2.) People who inform me they're going to send an email. Before I give out my address, I inform them that I need to know their address first, and add it to my contact list. If someone gets my email from someone other than me, or otherwise didn't talk to me first, I probably don't want their email anyway. And if it's important, they'll get in touch with me.

I'm using Outlook for this solution and use a rule that moves all the messages out of the Inbox that don't meet this criterion. I plan to switch to Evolution soon under Mandrake and I'm sure I can program a similar function.

It's much easier to spot 1 message from a legitimate sender out of 100 spams (takes only a few seconds in fact) than it takes to manually delete spams or constantly fiddle with filters. Each day, I'll glance at the list of 100-200 spams that have collected in my trash box, and within a few seconds, I can spot if someone I know has sent me something who isn't in my list. From that point forward, they're in my contact list, and it never happens again.

At some point I plan to set up an auto-reply system that gives people a URL that they can visit to "ask for permission" to send me email. Spammers won't use it. I haven't bothered yet because I'll need to carefully design this to prevent my address from being "confirmed" by spammers as a result of this message, but I have ideas for that (send from a null account, use a picture of my email address in the message, with instructions on how to ask permission.) At that point, I can safely instant-trash all unrecognized senders.

I'd love some feedback on this method. It's worked great for me, though admittedly it won't work for those who receive many emails from new contacts, such as someone who publishes (eek!) their address on a site for inviting new messages.
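The contact-list rule itself is small enough to sketch outside of Outlook. This is a minimal illustration using the Python standard library, not the actual rule described above; `route` and the folder names are hypothetical, and `contacts` is assumed to be a set of lowercased addresses.

```python
from email.utils import parseaddr


def route(from_header, contacts):
    """Return "Inbox" if the sender is a known contact, else "Trash".

    from_header is the raw From: header value; contacts is a set of
    lowercased addresses (an assumption of this sketch).
    """
    _, address = parseaddr(from_header)
    return "Inbox" if address.lower() in contacts else "Trash"
```

Note that the From header is trivially forged, so a spam claiming to come from one of your contacts would pass this check.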

Re:authorization based email box (1)

Hayzeus (596826) | more than 11 years ago | (#4276631)

The main problem with this approach is that it's a little awkward for those of us who frequently receive (non-spam) email from strangers. I get a lot of these.

There are a number of people who use your method, but automate it, which is a better way to go but still a bit awkward. Incoming emails not on the reply list generate a reply requesting the original sender to go to a web page, which allows them to enter themselves on the contact list automatically. Conceivably, the URL can contain a "web bug" that merely requires the sender to visit the link to have the add happen automatically. (Of course, some SPAM filters will block email containing web bugs...)

The best results I've seen personally involve spamassassin, which cuts my incoming volume from about 70spams/day to 1 spam every 3 or 4 days. Highly recommended for perl/procmail-capable platforms.

All of this of course, only addresses the problem at the level of the individual user. The larger problem is not likely to be solved by any means short of legislation.

keyword matching isnt the answer (2, Interesting)

mack knife (96580) | more than 11 years ago | (#4276489)

sites like yahoo, hotmail, etc are in a unique position to rid their users of spam.

i don't see why they can't implement some system that scans incoming mail for its users' mailboxes, maybe does a checksum for each message or something, and if it finds that a number of its users are receiving exactly (or nearly exactly) the same message, assume it's spam. nuke the messages, and any new incoming ones.

yeah, if such a system only scans a small number of mailboxes, it may filter out mailing list posts and so on. but it gets more and more reliable the higher number of mailboxes it tracks.

this avoids searching for certain keywords and eliminates false positives. after all, how well would these keyword searching methods do if i were to quote a spam message in an email to a friend?
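The checksum idea can be sketched as follows. This is an illustration only, covering the exact-match case: `BulkDetector`, `fingerprint` and the threshold of 3 are hypothetical, and the normalization (lowercasing, collapsing whitespace) is an assumption.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(body):
    """Checksum of the body after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", body.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class BulkDetector:
    """Flags a message once enough distinct mailboxes have received it."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.recipients = defaultdict(set)  # fingerprint -> mailboxes seen

    def seen(self, mailbox, body):
        """Record a delivery; return True if the body now looks like bulk mail."""
        key = fingerprint(body)
        self.recipients[key].add(mailbox)
        return len(self.recipients[key]) >= self.threshold
```

A plain checksum only catches identical bodies. Spam that appends a few random tokens to each copy would get a different fingerprint every time, so the "nearly exactly" case needs some fuzzier matching than this.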

SPAM (0)

Anonymous Coward | more than 11 years ago | (#4276530)

What you call SPAM I call creative marketing, besides someone has to get this economy going?

Bayesian vs not isn't really the point (4, Insightful)

XDG (39932) | more than 11 years ago | (#4276531)

Gary is both right in some respects and irrelevant in others. Here's the key line in his article that deflates it a bit:
It is untested as of now. It is based purely on theoretical reasoning. If anyone wants to try and it test it in comparison to other techniques, I'd be very interested in hearing the outcome.
On the other hand Paul Graham has actually tested his model and it works. I've worked it up in perl and tested it on my own data set and it works there, too. Paul acknowledges that he's being a bit fast and dirty, but the proof is in the pudding. The rest is just academic quibbling over the fine points.

I'm not sure why this particular article needed to be posted, as it's just one of several alternative approaches and an untested one at that. On Paul's page, he also lists several published academic papers with other alternatives -- all actually tested, of course.

Gary is basically right in questioning the use of the word "Bayesian". Paul's approach is more about weighing "evidence" as given by the appearance of certain words, rather than in figuring out the probability of spam assuming a "prior". See Paul's explanation [paulgraham.com], but if you check the article he references at the end, you'll note that the method Paul uses is only one of several methods to solve an underspecified problem. It's a reasonable guess, not necessarily the only guess.

Looking at another article [lanl.gov] Paul references, given the word independence assumption, the more formal Naive Bayesian approach calculates as follows:
p(spam|word1,...,wordn) = [ p(spam)*p(word1|spam)*...*p(wordn|spam) ] / [ p(spam)*p(word1|spam)*...*p(wordn|spam) + p(!spam)*p(word1|!spam)*...*p(wordn|!spam)]

This is similar to Paul's approach except for including a "prior" assumption of p(spam) -- the expected probability of any email being spam, calculated from the historically observed frequency of spam. By leaving it out, Paul implicitly assumes that 50% of mail is spam -- that's his "prior" estimate of the spam rate. Given the other adjustments he makes to his sample, that appears to be acceptable in practice. (Paul overweights the spam prior, but also overweights the effects of "good" words.)
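The formula above translates almost directly into code. A minimal sketch, assuming the per-word likelihoods have already been estimated; the function name and the input layout (a list of pairs) are hypothetical:

```python
def naive_bayes_spam(word_probs, prior=0.5):
    """Posterior probability of spam given the observed words.

    word_probs is a list of (p(word|spam), p(word|!spam)) pairs and
    prior is p(spam). Assumes the words are independent and that at
    least one of the two products is non-zero.
    """
    spam = prior
    ham = 1.0 - prior
    for p_word_spam, p_word_ham in word_probs:
        spam *= p_word_spam
        ham *= p_word_ham
    return spam / (spam + ham)
```

With prior=0.5 the p(spam) and p(!spam) factors cancel, which is the sense in which leaving the prior out amounts to assuming half of all mail is spam. For example, with likelihood pairs (0.8, 0.1) and (0.3, 0.6), a 50% prior gives about 0.8, while a 10% prior drops the same evidence to roughly 0.31.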

I'd personally prefer to overweight the "good" e-mails entirely rather than just put a "good-multiplier" on them like Paul does, but that's just quibbling over small bits.

As to the bit that Gary raises about Paul assuming a spam probability for an unknown word -- Paul originally said .2, then revised to .4, but really should have put it at .5 or just excluded it from all calculations. A new word has no robustness as a predictor (which is why Paul dropped words that didn't appear five times anyway). In practice, a new word at .4 isn't going to be among the 15 most interesting words to make the calculation from, anyway.


not 100% - not good enough (1)

kid_wonder (21480) | more than 11 years ago | (#4276536)

because it is frequently the case that technology doesn't have to be 100% perfect in order to do something that really needs to be done

Right. Try that one again after your non-100% effective filter starts filtering out business e-mails. Then where'll ya be? nowhere.

AI people have absolutely no common sense. It's been proven by my neural net.

