Working Bayesian Mail Filter

CmdrTaco posted more than 11 years ago | from the stuff-to-play-with dept.

Spam 313

zonker writes "A real, working honest to god Bayesian spam filter. I've been waiting for something like this for a while (since I first read Paul Graham's research paper on this very topic a few weeks ago). Well here's POPFile, a small but extremely effective Perl script that runs on just about any system Perl does. After just a little training I was able to get very effective filtering out of it. From what I understand the new email client that comes with OS X Jaguar has a feature similar to this, but I don't know if it is true Bayesian. Hopefully this kind of feature will become more prevalent in client software as I see the Google results for it are growing."

I LOVE (-1, Offtopic)

Anonymous Coward | more than 11 years ago | (#4588998)


Fist Sport! (-1, Offtopic)

Anonymous Coward | more than 11 years ago | (#4589001)

I get plenty of Master Bayesian Spam already thank you very much.

Whas that? (2, Interesting)

cos(0) (455098) | more than 11 years ago | (#4589004)

Would anyone care to explain what a "Bayesian" mail filter is?

Re:Whas that? (-1, Offtopic)

Anonymous Coward | more than 11 years ago | (#4589034)

Slashdot []

Re:Whas that? (1)

Raul654 (453029) | more than 11 years ago | (#4589035)

You took the words right out of my mouth (and I want them back!)

Re:Whas that? (4, Informative)

DalTech (575476) | more than 11 years ago | (#4589064)

Bayesian statistics is a body of statistical theory and methods useful in the solution of theoretical and applied problems in science, industry and government.

Bayes Explained (1, Informative)

brw215 (601732) | more than 11 years ago | (#4589088)

A naive Bayes classifier is an algorithm based on Bayes' theorem in mathematics. It is based on the following theorem:

Pr(h|D) = Pr(D|h) * Pr(h)

where Pr is probability, h is the hypothesis and D is the data. In this case it would be

Pr("SPAM"|Email) = Pr(Email|"SPAM") * proportion of spam.

The trick is how to estimate the second term. This is a very popular machine learning algorithm due to its simplicity and elegance. For more info, check out this link Bayes []

Re:Bayes Explained (1)

Stonehand (71085) | more than 11 years ago | (#4589123)

Don't forget the P(D) term.

Re:Bayes Explained (5, Informative)

johnynek (36948) | more than 11 years ago | (#4589191)

That's /. for you. You guys have modded up to 5 a post that is wrong in both of the equations it posts.

It should be:

Pr(h|D) = Pr(D|h) * Pr(h) / Pr(D)


Pr("SPAM"|Email) = Pr(Email|"SPAM") * (proportion of spam) / (probability of getting this particular Email)
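
To make the corrected formula concrete, here is a small numeric sketch in Python. The function name and all of the numbers (40% of mail being spam, a word that shows up in 30% of spam and 1% of legitimate mail) are made up for illustration. Pr(D) is expanded with the law of total probability over "spam" and "not spam".

```python
def posterior(prior_h, p_d_given_h, p_d_given_not_h):
    """Bayes' theorem: Pr(h|D) = Pr(D|h) * Pr(h) / Pr(D)."""
    # Pr(D) = Pr(D|h) * Pr(h) + Pr(D|not h) * Pr(not h)
    evidence = p_d_given_h * prior_h + p_d_given_not_h * (1.0 - prior_h)
    return p_d_given_h * prior_h / evidence


# Hypothetical figures, not measured from any real mailbox.
p_spam = 0.40             # proportion of spam
p_word_given_spam = 0.30  # Pr(word appears | spam)
p_word_given_ham = 0.01   # Pr(word appears | not spam)

print(round(posterior(p_spam, p_word_given_spam, p_word_given_ham), 3))  # 0.952
```

Leaving out the division by Pr(D), as in the first version of the formula, gives 0.12 here, which is not a probability of spam at all; it only becomes one after normalizing.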

What about decimal radix? (-1, Offtopic)

Anonymous Coward | more than 11 years ago | (#4589198)

Is your link infested with decimal? Or does it have the much better hexadecimal? Decimal causes cancer indeed.

Re:Bayes Explained (2)

capt.Hij (318203) | more than 11 years ago | (#4589259)

Great, now the spammers will hire mathematicians to figure out how to best defeat the common algorithms used to calculate Pr(D|h). It is the same old story. In a war over information only the mathematicians win.

Re:Whas that? (5, Funny)

Evil Adrian (253301) | more than 11 years ago | (#4589092)

If you had just clicked the POPFile [] link, you would see the explanation.

Initiative is your friend.

Hyperlinks are your friend.

Don't be afraid, just click.

Re:Whas that? (5, Informative)

dvk (118711) | more than 11 years ago | (#4589094)

From what I understand, it is a mail filter which determines what to filter out based on a statistics-based machine learning system called "Bayesian Learning".

A couple of URLs quickly found on Google: section-7.html [] ssets/images/week09.pdf []

Also, any decent AI/machine learning textbook ought to cover the topic.


Those terrorists! (0, Offtopic)

edb (87448) | more than 11 years ago | (#4589007)

This is a clear and present threat to our society. Good thing the FBI acted quickly!

Re:Those terrorists! (1)

edb (87448) | more than 11 years ago | (#4589027)

Gaack! reply connected to wrong parent article!

Never mind...
- Emily Litella

True "Bayesian" and do I care? (1, Flamebait)

Dog and Pony (521538) | more than 11 years ago | (#4589012)

From what I understand the new email client that comes with OS X Jaguar has a feature similar to this, but I don't know if it is true Bayesian.

Who cares? Whatever works best should be used, not the one with the coolest name or whitepaper, right?

Re:True "Bayesian" and do I care? (0)

Evil Adrian (253301) | more than 11 years ago | (#4589029)

I guess if I was using a Mac, that would be a valid statement.

Re:True "Bayesian" and do I care? (0)

Anonymous Coward | more than 11 years ago | (#4589040)

ifile has been around for a LONG time and it uses "Naive Bayesian". It's functioned well enough for me.

What about decimal? (-1, Offtopic)

Anonymous Coward | more than 11 years ago | (#4589045)

Do you care that it uses the hated decimal system (if it does)? Decimal is bad. Right?

what about decimal ? (0)

Anonymous Coward | more than 11 years ago | (#4589016)

Does it give hexadecimal output (like for messages blocked)? I hate decimal.

Re:what about decimal ? (0)

Anonymous Coward | more than 11 years ago | (#4589167)

good point. submit a patch immediately.

Re:what about decimal ? (0)

Anonymous Coward | more than 11 years ago | (#4589186)

I'm curious. When you developed your decimal v. hexadecimal comment generator script, did you mostly use base 16 or base 10?

(5, Informative)

supton (90168) | more than 11 years ago | (#4589019)

Saw this a few weeks back... [] Spam filter in Python using Naive Bayes.

Honest to whom? (0, Offtopic)

matth (22742) | more than 11 years ago | (#4589036)

Honest to god? or God? Just which god/God is it honest to? Capital or lowercase G?

Re:Honest to whom? (0)

Anonymous Coward | more than 11 years ago | (#4589050)

Whom() -- deprecated, see who()

Re:Honest to whom? (1)

jez9999 (618189) | more than 11 years ago | (#4589197)

Whom's a perfectly valid word today, you moron. It's like saying 'don't say television when we can say tube'.

Sure it's promising (4, Insightful)

bigberk (547360) | more than 11 years ago | (#4589042)

And I'm going to check it out right now :) But one long-standing fear I have with such solutions is spammers adapting to new environments (changing the wording used, making the emails look more professional). Sure, they're dumb shits but they're still humans with brains.

The decimal issue (-1)

Anonymous Coward | more than 11 years ago | (#4589056)

What about the use of decimal? Will you still use it if it uses the decimal system (which geeks hate)? Geeks like hexadecimal and binary.

Why hex and binary? (0, Troll)

Anonymous Coward | more than 11 years ago | (#4589179)

I have five fingers on each hand, so I prefer decimal.

If I had four fingers on each hand, I'd prefer octal.

If I had one finger on each hand, I'd prefer binary, but I think I could manage without using my fingers

If we had eight fingers on each hand, we'd prefer hex, but then it wouldn't be hex, because we'd have used a different numerical system, that'd be base 16, but with 16 numbers instead of 10 numbers and 6 characters.

My conclusion: You're stupid, ignorant and not a geek.

Re:Sure it's promising (5, Informative)

outlier (64928) | more than 11 years ago | (#4589136)

While spammers will undoubtedly continue to refine the content of their messages, one of the strengths of using a Bayesian filter like this is that it uses the user's own spam and non-spam (ham) as the basis for its calculations. This means that messages are categorized not only by whether they contain spammy words, but also by whether they contain the hammy words from your own messages. So, even if spammers could refrain from using words like "free", "mortgage", "sluts" and "spam", they probably wouldn't use the words that discriminate your own ham from others' (e.g., if you are a computer scientist, your mail may include hammy words like "algorithm", "compile", "project" or "stargate" that would help distinguish ham from spam). The challenge to the spammer would then be to target you with spam that looks like *your* ham (which is probably different from the ham of others).
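
As a rough illustration of "spammy" versus "hammy" words, here is a minimal Python sketch that scores each token from a user's own two piles of mail. The tokenizer, the add-one smoothing and the function names are assumptions made for this example; real filters such as POPFile or bogofilter differ in the details.

```python
import re
from collections import Counter


def tokenize(text):
    # Very crude tokenizer: lowercase runs of letters, digits, $, ' and -.
    return re.findall(r"[a-z0-9$'-]+", text.lower())


def token_spam_probabilities(spam_msgs, ham_msgs):
    """Return {token: estimated Pr(spam | token)} from the user's own mail."""
    # Count in how many messages of each pile a token appears at least once.
    spam_counts = Counter(t for m in spam_msgs for t in set(tokenize(m)))
    ham_counts = Counter(t for m in ham_msgs for t in set(tokenize(m)))
    probs = {}
    for tok in set(spam_counts) | set(ham_counts):
        # Add-one smoothing so an unseen token never gets probability 0 or 1.
        p_tok_spam = (spam_counts[tok] + 1) / (len(spam_msgs) + 2)
        p_tok_ham = (ham_counts[tok] + 1) / (len(ham_msgs) + 2)
        probs[tok] = p_tok_spam / (p_tok_spam + p_tok_ham)
    return probs


# Toy corpora; a real user would have hundreds of messages in each pile.
spam = ["free mortgage offer", "free pills"]
ham = ["compile the project", "algorithm for the project"]
probs = token_spam_probabilities(spam, ham)
print(probs["free"], probs["project"])
```

With these toy piles, "free" scores well above 0.5 and "project" well below it, which is the effect described above: the scores depend on this particular user's ham, not on a global word list.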

Future systems (assuming faster processors and more HD space) could include semantic analysis (e.g., Latent Semantic Analysis) to do an even better job and go beyond the word level.

Re:Sure it's promising (4, Informative)

rgmoore (133276) | more than 11 years ago | (#4589262)

Another important point is that there are some things that they can't hide, at least not in their current working model. If they're trying to sell you something, they have to describe what that thing is and where you can get it, and those descriptions are unlikely to be in any legitimate email. If they want to advertise a web site, they have to include its URL in the message, and the filter can catch that. If they advertise a physical address or phone number, the system can catch those, too. If they don't repeat the message, it means that there's inherently less spam, because I'm only seeing each ad once.

It's also not possible to disguise everything in their headers, so things like their posting host (either the one they pay for legitimately or any open relay they're taking advantage of) will wind up being a pointer to who they are. They certainly can't change anything about the headers that's added downstream of their posting host, so as long as they keep using the same one it's likely that there will be characteristic stamps there that the spammers absolutely can't change. I know that analysis of the headers is part of bogofilter [] , another Bayesian filter that I've been using to good effect.

Re:Sure it's promising (2, Interesting)

bmwm3nut (556681) | more than 11 years ago | (#4589138)

that's the beauty of this approach. the filter learns all the time (or at least you can set it up that way). so if spammers get smart, it doesn't take long until the filter adjusts. what i'd love to see is this filter built into a mail client where you have two buttons for delete. one, just to delete the mail, the other to delete it and mark it as spam. when you press that button the filter would scan the email and update its rules.

Re:Sure it's promising (4, Informative)

rgmoore (133276) | more than 11 years ago | (#4589313)

Bogofilter [] comes close to this. It has an operating mode where each file that it filters is automatically added to the appropriate corpus, either of spam or non-spam. Since it's correct the vast majority of the time, that means that there's very little for the user to do. When it is wrong, you just take the messages that it miscategorized and feed them back into the system with the notation that they were originally marked incorrectly, and it backs out the changes to the wrong category and adds them to the correct category.

I'm using bogofilter with Evolution [] , and it works very well. I just have two extra folders, one for false negatives and one for false positives. When I notice mail that's been flagged incorrectly, I drag it into the appropriate folder and run a script that tells bogofilter to correct its mistake. Then I either flush the mail (if it was spam marked as non-spam) or process it normally (if it was non-spam marked as spam). I've only been using it for about two weeks and it already has a nearly zero false positive rate (i.e. incorrectly flagged as spam) and a usefully low false negative rate (i.e. incorrectly flagged as legitimate).
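
The "back out and re-add" bookkeeping described above can be sketched roughly like this in Python. The `Corpus` class and its method names are hypothetical; this is not bogofilter's actual interface or storage format, only an illustration of the idea.

```python
from collections import Counter


class Corpus:
    """Hypothetical per-category token counts for a spam / non-spam filter."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def register(self, tokens, label):
        # Add a message's tokens to the chosen category.
        self.counts[label].update(tokens)
        self.messages[label] += 1

    def unregister(self, tokens, label):
        # Back out a message that was previously registered under `label`.
        self.counts[label].subtract(tokens)
        self.counts[label] += Counter()  # drops zero and negative counts
        self.messages[label] -= 1

    def correct(self, tokens, wrong_label, right_label):
        # Undo the automatic (wrong) registration, then file it correctly.
        self.unregister(tokens, wrong_label)
        self.register(tokens, right_label)


corpus = Corpus()
tokens = ["meeting", "notes", "attached"]
corpus.register(tokens, "spam")        # filter guessed wrong and auto-registered
corpus.correct(tokens, "spam", "ham")  # user drags it to the false-positive folder
print(corpus.messages)
```

The point of backing out first is that a misfiled message otherwise keeps pulling its words toward the wrong category even after the correction.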

Re:Sure it's promising (2)

Theodore Logan (139352) | more than 11 years ago | (#4589145)

Well, as the man says in the article:

The Achilles heel of the spammers is their message. They can circumvent any other barrier you set up. They have so far, at least. But they have to deliver their message, whatever it is. If we can write software that recognizes their messages, there is no way they can get around that.

And I think that in this he is correct, almost even provably correct. That's theory, however. In practice no system, short of "real" AI, will be good enough to always recognize spam with a zero false positive rate. It may eventually be good enough, but it won't be perfect. Natural language is just too hard to parse in this way.

But don't despair. If it flunks, there's always spammotel [] and their likes.

Re:Sure it's promising (1)

wheany (460585) | more than 11 years ago | (#4589201)

I've already seen a couple of spams that had a bunch of nonsense and a picture attachment. I didn't open the picture, but it could've had an address...

Re:Sure it's promising (1)

chrsbrwn (14235) | more than 11 years ago | (#4589216)

Note that Bayesian mail filters use a probabilistic analysis of the word distribution in the email you feed it during the training process in order to classify email as spam, or nonspam (and in the case of popfile, any other category you want to create).

As long as the spam you receive remains sufficiently different from the nonspam email you receive, Bayesian filters should still flag it properly. To put it another way, it doesn't test for the presence of specific words to categorize as spam (like SpamAssassin does); instead, it uses the probability database built up during the training process to determine how similar the prospective email is to either the spam you have received and trained upon or to your regular email that you have trained upon. Thus it is much less susceptible to spammers changing their wording in order to defeat the filter.
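
A minimal sketch of the training-then-classifying idea described above, in Python, with arbitrary user-defined categories. Everything here (the class name, whitespace tokenizing, Laplace smoothing, log-probabilities) is an assumption chosen for a short example and is not taken from POPFile's source.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> messages trained
        self.vocab = set()

    def train(self, text, category):
        words = text.lower().split()
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, -math.inf
        for cat in self.doc_counts:
            # Log prior: how much of the training mail fell in this category.
            score = math.log(self.doc_counts[cat] / total_docs)
            total_words = sum(self.word_counts[cat].values())
            for w in words:
                # Laplace smoothing keeps unseen words from zeroing the product.
                count = self.word_counts[cat][w] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best, best_score = cat, score
        return best


nb = NaiveBayes()
nb.train("free mortgage offer click now", "spam")
nb.train("free pills click here", "spam")
nb.train("project deadline compile the build", "work")
nb.train("algorithm review for the project", "work")
nb.train("dinner on friday with family", "personal")
print(nb.classify("free offer click"))     # spam
print(nb.classify("compile the project"))  # work
```

Summing logarithms instead of multiplying raw probabilities avoids underflow on long messages; the ranking of categories is the same either way.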

Professional Looking Spam May Be Impossible (4, Insightful)

Bob9113 (14996) | more than 11 years ago | (#4589237)

This may be self-regulating. Consider the Chinese room: if something is capable of perfectly emulating recognition of Chinese, then it can be said to recognize Chinese. Likewise, if a spammer becomes sufficiently skilled at writing undetectable prose, he or she will have reached a skill level at which he or she can pursue more profitable writing ventures. The margins in spam are pretty small. Those spams are being written by morons because morons are cheap.

Re:Sure it's promising (2)

Brendan Byrd (105387) | more than 11 years ago | (#4589288)

Is there an application of this theory to SpamAssassin? Right now, it's more or less human-edited words and phrases, but applying a real Bayesian method to it would increase its accuracy. I've also considered making a filter that would change the scores of the different SA rules to reduce the false positives, but this would be a long project.

Server-side solutions? (3, Interesting)

Quixote (154172) | more than 11 years ago | (#4589046)

Any server-side solutions (MTA==qmail, MDA==procmail) using this (Naive-Bayesian) technique out there?

Re:Server-side solutions? (2)

rehannan (98364) | more than 11 years ago | (#4589102)

I've been using PopTray [] (a POP3 email checker for Windows). You have the option of defining "rules" which allow you to delete emails server-side.

It's not a "smart filter" but it works fine for me.

Re:Server-side solutions? (1)

Saint Aardvark (159009) | more than 11 years ago | (#4589223)

Oh man, that looks perfect...sorry, see my post below re: something for Windows users. We use SpamAssassin, and I'd love something that would let people filter by the score. OE doesn't let you filter on any header, but this does...sweet. Thanks for the tip.

Re:Server-side solutions? (4, Interesting)

cmeans (81143) | more than 11 years ago | (#4589174)

James [] is a 100% Java Email server (SMTP, POP3, NNTP, and IMAP soon) that supports mail-server extensions via the Mailets API [] . I developed a Java implementation of the Bayesian rules discussed, so that they could be used in any configuration, but also provided a mailet wrapped implementation so that the filtering (or flagging) could be done at the server side.

Oops, screwed up the URL... (2)

cmeans (81143) | more than 11 years ago | (#4589263)

Apache Jakarta James is at [] .

Re:Server-side solutions? (1)

ranmachan (320399) | more than 11 years ago | (#4589178)

apt-get install bogofilter :-)

Re:Server-side solutions? (4, Interesting)

koreth (409849) | more than 11 years ago | (#4589239)

I've been using SpamProbe [] (which gets invoked from procmail) with excellent results.

New open source business-model? (-1, Troll)

Anonymous Coward | more than 11 years ago | (#4589047)

1: Write free software.
2: ?
3: Filter your mail.
4: Profit!

Re:New open source business-model? (-1, Offtopic)

Anonymous Coward | more than 11 years ago | (#4589049)

You are such a faggot.

Mozilla in Process of adding Bayesian filter (5, Interesting)

AT (21754) | more than 11 years ago | (#4589055)

The mozilla mail client is getting a Bayesian mail filter, too. See . Unfortunately, it probably won't show up until after version 1.2 is released.

Re:Mozilla in Process of adding Bayesian filter (2)

Jugalator (259273) | more than 11 years ago | (#4589125)

And it seems likely the SpamBayes project [] will work as the foundation for their mail filter.

There are a few other applications [] that use this code as well, such as an Outlook 2000 add-in.

That Google search... (4, Insightful)

Jugalator (259273) | more than 11 years ago | (#4589061)

Try searching for "bayesian email filter" instead of just "bayes email filter" (as in the news post). You'll get better results and more hits since Google doesn't match "*bayes*" (as one would think) when searching for "bayes", but only the actual word "bayes".

the decimal angle... (0)

Anonymous Coward | more than 11 years ago | (#4589077)

What about the use of decimal in these sites? Can I filter out sites that use the decimal cancer? Geeks hate decimal.

Bayesian? Wow!!! I'm sooo excited. (Irony!) (5, Interesting)

(551216) | more than 11 years ago | (#4589062)

A true Bayesian filter, wow. Let's face it, statistical classifiers based on Bayes' formula are not really state of the art. They make false assumptions about the data (independence of features).
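
For reference, the independence assumption mentioned above can be written out as follows. The notation (a class c such as "spam", and the words w_1 ... w_n of a message) is introduced here for illustration and does not come from the post.

```latex
\[
\Pr(c \mid w_1, \dots, w_n)
  = \frac{\Pr(c)\,\Pr(w_1, \dots, w_n \mid c)}{\Pr(w_1, \dots, w_n)}
  \approx \frac{\Pr(c)\,\prod_{i=1}^{n} \Pr(w_i \mid c)}{\Pr(w_1, \dots, w_n)}
\]
```

The first equality is just Bayes' formula. The approximation treats each word as independent of the others given the class, which real text generally violates (words such as "mortgage" and "rates" tend to occur together).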

More intelligent classification algorithms can solve non-linear problems far better. Check out Kernel Machines [] and, somewhat older, Maximum Entropy models.

Enough nerd talk for today :-)

does that link have decimal radix? (-1)

Anonymous Coward | more than 11 years ago | (#4589139)

I don't like decimal. Can you give me a hexadecimal link? Or are you too stupid to understand hexadecimal? I think that likely.

Re:Bayesian? Wow!!! I'm sooo excited. (Irony!) (1)

wheany (460585) | more than 11 years ago | (#4589224)

They make false assumptions about the data (independence of features).
If it works (and people say it does), who cares?

Re:Bayesian? Wow!!! I'm sooo excited. (Irony!) (1, Informative)

Anonymous Coward | more than 11 years ago | (#4589244)

I've been doing research into email filtering using AI, and SVMs/kernel machines seem to work well (statistically, they're correct more than the other methods), but they require massive tuning.

On the other hand, Naive Bayes is usually easier to implement, easier to tune, and only trails by a few percentage points.

One of the more promising Bayes units is autoclass, offered by Cheeseman (et al.) - a public domain classifier that's been around for years and years, and seems to perform quite nicely.

Forget Bayes (5, Funny)

Evil Adrian (253301) | more than 11 years ago | (#4589070)

We need the Stalin Mail Filter (TM) -- it detects spam, hunts down the spammer, and exiles them to Siberia.

What do you think about decimal ? (-1)

Anonymous Coward | more than 11 years ago | (#4589097)

What about decimal? Isn't that worse than spam? Use hexadecimal.

Re:What do you think about decimal ? (-1)

Anonymous Coward | more than 11 years ago | (#4589175)

Insightful. It's so insightful it makes me want to incite a full riot and possibly do a few leisurely pelvic thrusts (the pounds per square inch of which are measured in hexadecimal due to the increased precision thereof). Thank you.

*BUT* it's a Perl script... (2, Redundant)

pilot1 (610480) | more than 11 years ago | (#4589074)

Sure, it's great that someone made one, but it's a Perl script. We might be able to use Perl, but most "normal" people have never even heard of Perl, let alone know how to run Perl scripts. It would be great if someone ported this to an .exe file or something that everyone could run. It'll probably happen eventually.

Does it use decimal radix ? (0)

Anonymous Coward | more than 11 years ago | (#4589110)

Perl can use hexadecimal. Is there decimal in the source? Then it is evil. Decimal is evil to geeks. Decimal is the Microsoft of radices.

Re:*BUT* it's a Perl script... (2, Funny)

Niksie3 (222515) | more than 11 years ago | (#4589144)

sure... an .exe file everyone could run... have you had your pills today? a perl script runs on many more platforms than any .exe file.

Re:*BUT* it's a Perl script... (1)

pilot1 (610480) | more than 11 years ago | (#4589226)

Last time I checked, half the world couldn't care less whether it ran on *NIX and Macs. Sure, we geeks can run a Perl script, but most people can't. Most people also have Windows, so it makes sense to port it to something that almost everyone can run, not just geeks.

Re:*BUT* it's a Perl script... (1)

Fastolfe (1470) | more than 11 years ago | (#4589290)

It's not uncommon for new technologies to be implemented with the languages and on the platforms used by those that frequently implement new technologies: geeks.

I read another comment that Mozilla is already trying to implement something similar.

Don't worry, these things will eventually end up suitable for the masses. In the mean time, it's suitable for geeks. Most geeks know what Perl is and how to set up an environment that Perl scripts can run in. Other geeks may choose to port it to a language or platform more familiar to them. I believe something similar is already out there for Python.

This is OpenSource, after all, not a commercial product. If you don't like it, don't use it.

Normal people.. (1)

egarland (120202) | more than 11 years ago | (#4589215)

have the ability to learn new things.

Re:Normal people.. (1)

Ozymandias_KoK (48811) | more than 11 years ago | (#4589311)

Heh...your point is debatable. :)

Then again maybe I am confusing normal with average. But they should be the same, dammit!

Re:*BUT* it's a Perl script... (2)

Elias Israel (182882) | more than 11 years ago | (#4589276)

This is a very good point.

Truth is, to really tackle the problem of spam, a solution is needed that doesn't require the user to be a software engineer.

Plus, another problem with rolling out a Bayesian filter for a large collection of users is that each individual user needs their very own filter database. The statistical analysis of my mail would be nearly useless for anyone else.

OK, cards on the table: I am working on a new solution that will be useful for the general public and overcomes these problems.

Those who care to learn more can sign up to be notified when it becomes available.

Check out []

perlcc (3, Insightful)

Camel Pilot (78781) | more than 11 years ago | (#4589295)

I just received the November edition of the TPJ [] which included a fine article "perlcc & Compiling Perl Script".

In short, the filter script could be compiled to C and built into a native binary for a variety of platforms, eliminating the need for a Perl interpreter.

I don't get any spam (3, Funny)

Istealmymusic (573079) | more than 11 years ago | (#4589076)

Can someone explain why this filter would be useful to me?

your brain (0)

Anonymous Coward | more than 11 years ago | (#4589148)

you don't seem to use your brain either asking such questions, why would it be useful to you anyway?

Re:I don't get any spam (4, Funny)

moosesocks (264553) | more than 11 years ago | (#4589271)

Just post your email address, and we'll be happy to tell you.

bogofilter (4, Informative)

stype (179072) | more than 11 years ago | (#4589084)

This isn't exactly the first bayesian mail filter out there. I've been using ESR's bogofilter [] for weeks now, and I must say it works better than I could have ever imagined. Bogofilter however is simply for sorting out spam, while it appears this filter can sort out other things. But honestly, I can setup some simple filters to separate personal emails from work emails, so I'm not entirely sure the extra stuff is that useful.

What about decimal radix? (-1, Offtopic)

Anonymous Coward | more than 11 years ago | (#4589162)

Decimal is bad. Use hexadecimal.

Re:bogofilter (2)

Theodore Logan (139352) | more than 11 years ago | (#4589241)

You quite obviously haven't checked out bogofilter's README [] . Let me quote:
  • This package implements a fast Bayesian spam filter along the lines suggested by Paul Graham in his article "A Plan For Spam".

'Nuff said.

The best spam filter. (0)

Anonymous Coward | more than 11 years ago | (#4589090)

If you don't want spam then DON'T USE AOL OR HOTMAIL!

Keep your email private and only give it to friends and family. Set up a spamcop account to report any spam that does get through, and never 'remove' an email!

I've never received a single spam in my blueyonder email account and rightly so.

IMAP (2, Insightful)

Evil Adrian (253301) | more than 11 years ago | (#4589109)

Does anyone know of any spam solutions for IMAP? Everything I've seen out there is POP3, but goddammit I like my IMAP folders!!! (Not to mention that the server on which my e-mail resides gets backed up nightly...)

Um. No. (1)

3-State Bit (225583) | more than 11 years ago | (#4589129)

Bad "spam"-like messages are bad. Good spamlike messages are not bad. A good spam-like message I consciously opted in to receive is indistinguishable from a welcome business proposal or newsletter.

Does this system know what businesses I've given my credit card to? Because EVERY ONE of those businesses has a right to e-mail me, so long as there is a clear opt-out link at the bottom of their e-mail.

If I trust a company enough to give it my credit card number, and I like it enough to do business with it, IT HAS A RIGHT TO SEND ME E-MAIL TO INFORM ME OF ITS PRODUCTS, as long as I choose to let it. Good businesses won't abuse the privilege, and I won't end up clicking the opt-out link.

The only thing this system is good for is filtering SOME penile-enlargement shady fly-by-night header-spoofing, open-relay-using shady shamster.

Oh, but that's the ONLY thing that the article defines as SPAM:

Let's take a quick look inside the mind of someone who responds to a spam [sic]. This person is either astonishingly credulous or deeply in denial about their sexual interests. In either case, repulsive or idiotic as the spam seems to us, it is exciting to them.

So this is not spam-filtering software; rather, it's software to filter pornographic messages that fit a certain low-level sales pitch. Lovely.


Re:Um. No. (2)

judd (3212) | more than 11 years ago | (#4589196)

I think you have failed to understand how the filter works.

It is "trained" on a corpus of spam, which is compared to a corpus of known good messages. The important part is that YOU, the user, supply the spam corpus and the good messages. Thus in your case, as long as your "good spamlike messages" are in your "known good pile", similar new ones from the same source will not be tagged as spam. This is where the statistical approach shines over simple keyword matching.

Go on, read about how it works. You might learn something.

Re:Um. No. (2)

jjo (62046) | more than 11 years ago | (#4589203)

Well, if all spam is indistinguishable from the legitimate spamlike messages you want to see, then no filter will help you.

However, it seems more likely that a large proportion of spam is distinguishable from mail you want to see. It's quite plausible that you don't want to see messages about nympho sluts, or penis enlargement, or breast enlargement (or at least not all three), and that a naive Bayesian filter could easily distinguish these and other spams from mail you do want to see.

Re:Um. No. (1)

Fastolfe (1470) | more than 11 years ago | (#4589234)

Please read the article. Classification of messages is done by you. If you routinely receive both solicited and unsolicited pitches, it might have a hard time differentiating, sure, but keep in mind that spam filtering is just one form of classification that can be performed here.

If you choose to set up a spam classification, and routinely file penis enlargement ads, the system will quickly learn that e-mails with words common to penis enlargement ads are generally going to always be classified as spam, and will file it as such. Other pieces of e-mail that share content with "legitimate" ads may be misfiled in your "legitimate pitches" folder.

You can set this up however you want it. It learns by remembering the words in messages you manually classify, so you are not taking their definition of "spam". You are setting up a classification that you call "spam" and it's keeping track of the types of things you put in there. It will then apply that to future messages.

product of marketrons (2, Interesting)

hfastedge (542013) | more than 11 years ago | (#4589133)

I don't know if it is true Bayesian

You know, on this issue, you really depress me. You are clearly not of the academic nature, so your stance toward something that's probably way above your head really frustrates part of me.

As long as you're not developing the idea, it shouldn't matter how it works as long as it works.

I read the original article here as you did too. After all the mumbo jumbo about learning, I picked out one effective tip from the article on filtering my email: filter out HTML.

With 1 line of regex I eliminate 95% of my spam:
match and throw it out.
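
The regex itself is not shown in the post above. As a rough sketch of the same "filter out HTML" idea, here is one way to flag HTML mail with Python's standard `email` package. This is an assumed alternative, not the poster's one-liner, and the function name and sample messages are made up for illustration.

```python
from email import message_from_string


def contains_html(raw_message):
    """Return True if the message, or any MIME part of it, is text/html."""
    msg = message_from_string(raw_message)
    # walk() yields the message itself and then every nested MIME part.
    return any(part.get_content_type() == "text/html" for part in msg.walk())


html_mail = (
    "From: a@example.com\n"
    "Subject: hi\n"
    "Content-Type: text/html; charset=utf-8\n"
    "\n"
    "<html><body>Buy now</body></html>\n"
)
plain_mail = "From: b@example.com\nSubject: hello\n\nJust text.\n"

print(contains_html(html_mail), contains_html(plain_mail))  # True False
```

A blunt rule like this will also catch legitimate mail from clients that send HTML by default, so it is probably better treated as one signal than as an automatic delete.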

Re:product of marketrons (1)

jez9999 (618189) | more than 11 years ago | (#4589278)

This may be great if you communicate 100% of the time with people using Unix systems. Unfortunately, quite a few rather stupid e-mail clients (Microsoft Outlook Express, Microsoft Outlook, Microsoft Word, notice a trend?) have HTML e-mail enabled by default, and your average user isn't about to turn it off. So if you're talking to average users, filtering HTML mail is not a good idea.

As effective as a well trained secretary (1, Insightful)

Gribflex (177733) | more than 11 years ago | (#4589154)

As I understand it, the Bayesian mail filtering system works by:
a) you receiving mail
b) designating where it should go
c) the filter tries to understand your reasoning
d) in the future, before step 1 occurs, the filter tries to interpret whether or not you want the mail based upon statistical analysis of what you have done in the past

Whereas current mail filtering techniques work by culling your mail on exact specifications (they don't try to interpret; if the filter doesn't know, it does nothing).

I quite like the idea of my mail filtering software becoming intelligent over time, however I can see a potential for email traffic being lost using this method. The Bayesian mail filter is essentially as effective as a (hopefully well trained) secretary. When you first get your secretary, she brings you everything. Then she starts culling the most obvious junk mail. Then she would start examining the normal letters... are they important? Relevant? Is this the person who should be dealing with it?

After time, you have your secretary very well trained, and she culls out everything which is not of immediate importance. In real life, this leads to the following problems:

a) you receive mail from an unknown source which could be important (some guy's discovered a new way to _________) but who isn't credible by your standards. His mail gets tossed aside, or redirected to someone else who probably doesn't care.

b) you receive mail from a trusted source at a bad address. i.e. your son is in Zimbabwe (sp?) on vacation. He sends you a letter postmarked from Zimbabwe, on museum letter head (couldn't find anything else handy). Knowing that you do not have dealings in Zimbabwe, and that this is most likely someone asking for charity, your secretary trashes it.

We've all heard stories of the first example, and it's not too hard to imagine the second. My worry is that, just like a good secretary, my mail filtering software will begin to filter for me. I will lose some control and, for the convenience of not having to hit the delete key a few extra times, I may miss potentially important email.

Chance is never a good thing to bring into your business.

Re:As effective as a well trained secretary (2, Insightful)

bmwm3nut (556681) | more than 11 years ago | (#4589266)

but unlike your secretary not showing you things, you can just set up the filter to put the spam in a spam folder. you can then periodically look at it and see if there are any false positives. or you can tell the filter to delete things that are 95% likely to be spam, but put things that are still most likely spam in a special folder. that's what's great about learning algorithms: they can always adapt to what you want (if you teach them enough).
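The two-tier idea above can be sketched in a few lines. This is only an illustration: the 95% cut-off is the figure from the comment, while the 0.5 quarantine cut-off, the function name, and the folder names are assumptions, and the spam score is taken as already computed by whatever filter is in use.

```python
def route(spam_probability, delete_above=0.95, quarantine_above=0.5):
    """Decide what to do with a message given its spam score (0.0 to 1.0).

    delete_above comes from the 95% figure in the comment above;
    quarantine_above is an assumed value for "still most likely spam".
    """
    if spam_probability >= delete_above:
        return "delete"
    if spam_probability >= quarantine_above:
        # Kept for periodic review, so false positives can be rescued.
        return "spam-folder"
    return "inbox"
```

Anyone worried about lost mail can pass delete_above=1.01 so that nothing is ever deleted outright and everything suspicious just lands in the review folder.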

But secretaries use decimal . (-1)

Anonymous Coward | more than 11 years ago | (#4589275)

And nerds hate decimal. So, use hexadecimal, not decimal. Computers are good, they use hexadecimal.

Re:As effective as a well trained secretary (0)

Anonymous Coward | more than 11 years ago | (#4589277)

b) you receive mail from a trusted source at a bad address. i.e. your son is in Zimbabwe (sp?) on vacation. He sends you a letter postmarked from Zimbabwe, on museum letter head (couldn't find anything else handy). Knowing that you do not have dealings in Zimbabwe, and that this is most likely someone asking for charity, your secretary trashes it.

If you think your secretary would not know your son was in Zimbabwe you have never had a secretary.

If you ever get one you will experience that she/he quickly knows more about you than your wife does.

Re:As effective as a well trained secretary (1)

jez9999 (618189) | more than 11 years ago | (#4589303)

You just echoed exactly what I was thinking.

The problem I have with ANY e-mail filter is that there's always the chance that a genuine useful e-mail will accidentally be trashed. I'm not saying this just for the sake of identifying a flaw; it's just the way I am that I would always feel twitchy about any e-mail going into a 'trash' folder without me looking at it to confirm it. And if I'm looking at it, I might as well not filter it at all.

And yes, that is how you spell Zimbabwe :-)

Not integrated solution (2, Insightful)

unfortunateson (527551) | more than 11 years ago | (#4589172)

What will make this thing work is if it is integrated with the e-mail client.

With this tool, you unfortunately have to manually add a message of a certain classification (work, pr0n, spam, family...) to the program through the Perl script -- very awkward.

A tool like this needs to run as a daemon and 'notice' when a message is added to a folder. Unfortunately, with different formats for e-mail folders, it's a much tougher job.

As it stands, with something like Outlook, I'd have to export each message individually, then run the Perl script. I can probably add a macro to do that (with its own pains -- you add a VBA macro to Outlook and it gripes every time you start up), and possibly even one that responds to filing in a folder.... hmm... maybe I will try this out.

You know what I'd kill for? (3, Interesting)

Saint Aardvark (159009) | more than 11 years ago | (#4589180)

A version of this for Outlook Express.

I work on the helpdesk of a small ISP; I also take care of the spam filtering, and answer abuse@. We recently added SpamAssassin, and God does it rock. (The big spike you see is me getting MRTG to graph what SA catches now; it's 6-10 times better than what we used to catch.)

But I still get complaints from our customers about spam that gets through. Just the other day a crapload got through because it was relatively subdued spam (no webbugs, NO LINE OF YELLING, etc); unfortunately, it also advertised pictures of young boys having sex. It's hard to explain why it's very, very hard to filter for this sort of thing, especially when I'm going through the talk for the nth time this week. (I need a good analogy that non-geeks can understand; I'm still looking.)

The good folks at DeerSoft have a version of SpamAssassin for Outlook, and are promising one for OE Real Soon Now. But I would loooooooooooooooooooooooove a good spam program -- this or SA or something else -- that I could point our customers to. Download, double-click, say yes, and bam it's installed. I can figure out how to install this on a Unix box; I could probably, eventually figure out how to do it on a Windows box; there's no way the customers could do it.

Or am I missing good, free spam filtering for Windows? Can anyone point me in the right direction?

Slightly OT: There has got to be a huge market for setting up spam filtering for small businesses. My idea: Tell 'em that if they provide the box -- an old Pentium or 486 will do -- I'll set up spam filtering and a firewall on it, set up some maintenance tools (whitelist this, firewall that). They get great mail service, I get $x00.

Re:You know what I'd kill for? (0)

Anonymous Coward | more than 11 years ago | (#4589284)

> A version of this for Outlook Express.

Available. Check out the PopFile manual at

It explains how to configure it for Outlook Express.

Re:You know what I'd kill for? (0)

Anonymous Coward | more than 11 years ago | (#4589292)

what i'd like is a program that identifies spam and bounces an error message back to the sender, "recipient unknown" or whatever. that way, the address at some point in time gets deleted from the database.
i know, the reply-to address is often forged, but if this could be done, you could not only block spam, but also reduce it.

CRAP (0, Offtopic)

gnillort (617577) | more than 11 years ago | (#4589206) is a bad site!
don't go there!

Uhmm.. like bogofilter? (3, Informative)

Jamuraa (3055) | more than 11 years ago | (#4589231)

Bogofilter has been out since August, and does this Bayesian spam filtering in C, which will probably run a bit faster than the Perl or Python versions simply because it's compiled. I've never run it myself, but people on the Debian lists variously say it works better than, or not as well as, SpamAssassin.

Risk management (1)

hansroy (575558) | more than 11 years ago | (#4589238)

Finally, paying attention in those statistics & risk management courses pays off!

Python Port Needed! (-1, Flamebait)

Anonymous Coward | more than 11 years ago | (#4589248)

I will not touch a Perl program because Larry Wall has become a delusional fruit loop.

Statistics are cool. (1)

Fuzzums (250400) | more than 11 years ago | (#4589252)

I wrote a simple script to recognize languages by their letter frequencies.
this method isn't very strong, but with a fair amount of input the results get better. it even recognised the difference between dutch and a dutch dialect. the problem was that the alphabet only had 26 characters, so i came up with the idea of using letter pairs.

when i read the article it was really funny. the methods he uses are almost the same as my method. and when i read about using word pairs: LOL.

this will be a very cool spam filter. i love it already.
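A rough sketch of the letter-pair idea, for the curious. This is not the script mentioned above (which isn't shown here); the function names, the cosine-similarity comparison, and the sample texts in the usage note are all assumptions made for illustration.

```python
import math
from collections import Counter


def bigram_profile(text):
    # Keep letters and spaces only, then count every adjacent pair.
    cleaned = "".join(c for c in text.lower() if c.isalpha() or c == " ")
    pairs = Counter(cleaned[i:i + 2] for i in range(len(cleaned) - 1))
    total = sum(pairs.values()) or 1
    return {pair: n / total for pair, n in pairs.items()}


def cosine_similarity(p, q):
    # Compare two frequency profiles; 1.0 means the same distribution.
    dot = sum(p[k] * q[k] for k in p.keys() & q.keys())
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0


def guess_language(text, samples):
    # samples maps a language name to some reference text in that language.
    profile = bigram_profile(text)
    return max(
        samples,
        key=lambda lang: cosine_similarity(profile, bigram_profile(samples[lang])),
    )
```

Feed it a dict such as {"english": "...", "dutch": "..."} with a paragraph or two of reference text per language. As noted above, short inputs give weak results and it gets better with more text.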

Theft (-1, Offtopic)

Anonymous Coward | more than 11 years ago | (#4589254)

Ahhh, the exciting life in Toledo...

Ximian Evolution? (1)

Namtar (618076) | more than 11 years ago | (#4589258)

This looks really good. Anyone out there know how/if it can be used with Ximian Evolution?

Spam will be spam (1)

dazdaz (77833) | more than 11 years ago | (#4589298)

I get tired of copying and pasting spam emails into spamcop from the same ISPs. I use The Bat! quite a lot, any suggestions?