Gmail Spam Filter Testing
An anonymous reader writes "What can you do with 1000MB of e-mail space on your Gmail account? One guy, by the name of Aaron Pratt (prattboy@gmail.com), has decided to test the spam filters of Google's Gmail service by having his Gmail account blasted with every kind of spam imaginable. He is testing to see how well Gmail's spam filters can sort out the spam from legitimate email (yes, he does get personal emails from people). As of May 25th, he was at about 30% of his Gmail account's 1GB capacity. You can track his progress on his website, http://gmail.prattboy.net (Google cache of this site: cache: gmail.prattboy.net). Here is also an article talking about Aaron's efforts from webpronews.com"
first spam? (Score:5, Funny)
Re:first spam? (Score:4, Funny)
More focus on false positives. (Score:5, Insightful)
The consequences of blocking a non-spam email (a parent not hearing from their kid, the customer that would have saved your startup) are so much worse than those of a spam getting in that I wish the spam filter reviews would focus on them.
Re:More focus on false positives. (Score:4, Informative)
A false positive is not one of spam getting past the filter, it's one of non-spam getting blocked.
I.e. the filter says it's spam, and it isn't - in the same way that a false-positive medical test says you have a virus even when you don't.
Re:More focus on false positives. (Score:3, Informative)
False negative = condition you are testing for comes up negative, when it should be positive.
Put in the context of a spam filter, it depends on whether you are testing for spam or for legitimate emails. If you are testing for spam (if spam then...), a false positive would be an email that is not spam getting sent to the spam folder or deleted. A false negative would be spam that lands in your inbox.
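To make the terminology above concrete, here is a minimal Python sketch. It assumes spam is the condition being tested for (the "positive" class), as in the post; the function name and the returned labels are made up for illustration.

```python
def classify_outcome(is_spam: bool, flagged_as_spam: bool) -> str:
    """Name the outcome when 'spam' is the condition being tested for."""
    if flagged_as_spam and is_spam:
        return "true positive"    # spam correctly sent to the spam folder
    if flagged_as_spam and not is_spam:
        return "false positive"   # legitimate mail wrongly sent to spam or deleted
    if not flagged_as_spam and is_spam:
        return "false negative"   # spam that lands in the inbox
    return "true negative"        # legitimate mail delivered normally


# A legitimate email that ends up in the spam folder:
print(classify_outcome(is_spam=False, flagged_as_spam=True))
```

If you tested for legitimate mail instead, the two "false" labels would swap, which is the point the post makes.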
The Filter is great! (Score:5, Funny)
pre-emptive strike theory (Score:2)
Re:pre-emptive strike theory (Score:5, Funny)
Re:The Filter is great! (Score:5, Funny)
They just forgot the mutex surrounding the two snprintfs... so this user probably got 139 messages in the time it takes to execute snprintf, all spam.
Which is.... about right.
Re:The Filter is great! (Score:5, Informative)
One of the best things Google/GMail could do (Score:5, Interesting)
Re:One of the best things Google/GMail could do (Score:5, Informative)
Re:One of the best things Google/GMail could do (Score:5, Funny)
Re:One of the best things Google/GMail could do (Score:5, Interesting)
The only reason I could think of someone sending those around is to bog up Bayesian filters with random crap, possibly lowering their effectiveness.
Any spammers/spam-experts feel like enlightening us?
Re:One of the best things Google/GMail could do (Score:4, Interesting)
So either it's some kind of probe to find working addresses, or a filter clogger. Or maybe both.
For a few of the random emails I would later start getting "real" spam. Not a majority though.
Re:One of the best things Google/GMail could do (Score:5, Informative)
Re:One of the best things Google/GMail could do (Score:4, Interesting)
Re:One of the best things Google/GMail could do (Score:5, Insightful)
Except that won't work, as anyone that understands Bayesian filtering will tell you. In the case of every message with "random words" I've checked recently, the random words actually increased the spam score of that message. Why? Because it seems the random words aren't so random and either the same spammer is using the same "random words" over and over or various spammers are using sets of the same words. Over time most of the "random words" they use actually become great indicators of spam since my real email doesn't typically contain the random words they use.
In one recent analysis, 10 random words were inserted by the spammer. He got lucky and 1 of those words actually had a very low score for my Bayesian corpus. Unfortunately (for him), the other 9 words had scores of 99.99%! His use of random words literally nuked any possibility of him getting through my filter.
Anyway, random words will not help spammers get through Bayesian filters. But it seems that many people (both spammers and non-spammers) think it will. But, hey, that's good for me: as long as "random words" is seen by spammers as a viable solution to Bayesian filters, my Bayesian filter will continue to work and will not have to deal with any innovative way to get around the filter (if any exists).
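The effect described above can be sketched with a common naive-Bayes style combining rule. This is only an illustration, not the poster's actual filter: the function name is hypothetical, and the 0.01 score for the one "lucky" word is an assumed value (the post only says it was very low).

```python
from math import prod


def combined_spam_probability(word_scores):
    """Combine per-word spam probabilities, naive-Bayes style.

    Each score is the estimated probability that a message containing
    that word is spam, taken from the user's own corpus.
    """
    spam = prod(word_scores)                   # evidence for spam
    ham = prod(1.0 - p for p in word_scores)   # evidence for legitimate mail
    return spam / (spam + ham)


# One "lucky" random word (assumed score 0.01) and nine words scoring 99.99%.
scores = [0.01] + [0.9999] * 9
print(combined_spam_probability(scores))
```

Under this rule the nine high-scoring words swamp the single low-scoring one, which matches the poster's observation that the "random" words hurt the spammer rather than helped.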
Re:One of the best things Google/GMail could do (Score:4, Insightful)
New spin on the "word salad" strategy (Score:5, Interesting)
Right, and my Thunderbird Bayesian filter catches all of those word salad approaches. But they've come up with a new one - what I call the "encyclopedia attack."
What they do is copy an encyclopedia entry and put it at the bottom of their spam. The thing is usually a few paragraphs long, so that textually it dominates the message. The subjects are fairly random, and are occasionally educational ;)
The problem is that the text of this doesn't trip the "too many strange words" flag that's used for word salads. My Thunderbird filter is really having trouble with these. Anyone else having trouble with these spams?
Re:One of the best things Google/GMail could do (Score:2, Informative)
I get those in Eudora and they don't seem to do much, my friends with Outlook however... not so lucky.
Re:One of the best things Google/GMail could do (Score:3, Informative)
Re:One of the best things Google/GMail could do (Score:3, Informative)
Re:One of the best things Google/GMail could do (Score:5, Insightful)
heh heh...absolutely.
100 known good addresses are worth 10,000 "who the fuck knows" addresses.
>>It's cheaper to just send mail to everyone
no it's not.
let's pretend you are a spammer, and you want to send out spam.
If you target 1 billion questionable addresses, each time a client has a new campaign, then that's 1 billion pieces you have to deliver. every time.
what if you have 1000 clients? that's 1000 billion deliveries.
do you see where this is going? if you don't KNOW WHAT A VALID EMAIL ADDRESS IS, YOU HAVE TO GUESS.
but what if the first time you send out just a "test" to those billion addresses, and then subtract the ones that bounce?
You are left with 50,000 known good addresses.
that's gold. You now have 1/20,000th of the load, and you are now serving your clients quicker, with a helluva lot less load. you are only using an open relay for 1/20,000th of the time.
overall a footprint that is smaller by a factor of 20,000.
you tell me. does it make sense to blindly blast out email?
Re:One of the best things Google/GMail could do (Score:3, Informative)
>no it's not.
It doesn't matter how cheap it is when 80% of spam supposedly comes from infected zombie computers. (I'm too lazy to actually LINK to the recent story on this.)
Re:One of the best things Google/GMail could do (Score:2, Insightful)
Google also has enough computer power to generate some sort of Bayesian filter to catch the most common spam system wide, and even a personalized filter on each account to catch the rest.
Spam is always personalized (Score:5, Informative)
This doesn't mean it wouldn't be possible to create a system which would automatically detect individual spam messages based on tagging known spam, you just have to be smarter about the detection than just plain MD5ing the email body.
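A small Python sketch of why a plain MD5 of the body fails on personalized spam, and of one possible "smarter" comparison. Word-shingle overlap (Jaccard similarity) is my own choice of technique for illustration; nothing in the thread says this is what Gmail does, and the template text and function names are invented.

```python
import hashlib


def md5_of(body: str) -> str:
    """Exact fingerprint: changes completely if a single character changes."""
    return hashlib.md5(body.encode("utf-8")).hexdigest()


def shingles(body: str, k: int = 3) -> set:
    """Set of overlapping k-word sequences in the message."""
    words = body.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: str, b: str) -> float:
    """Share of shingles the two messages have in common (0.0 to 1.0)."""
    sa, sb = shingles(a), shingles(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


template = ("Dear {name}, you have been selected to receive "
            "a special mortgage offer today only")
mail_a = template.format(name="Alice")
mail_b = template.format(name="Bob")

print(md5_of(mail_a) == md5_of(mail_b))   # the exact hashes do not match
print(round(jaccard(mail_a, mail_b), 2))  # but most shingles are shared
```

A real system would need a threshold for "similar enough" and a way to compare against many known spams efficiently; both are left out here.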
Re:Spam is always personalized (Score:2, Interesting)
Re:Spam is always personalized (Score:5, Informative)
Not necessarily.
Lempel-Ziv based algorithms, like the one used by gzip, build a compression dictionary on the fly. Any "personalization" added to the message will affect the dictionary to varying degrees from then onward. If it's near the beginning, the personalization would greatly skew the selected dictionary identifiers. Though probably this would have little effect on the actual compression of the data, it would radically change the representation of the compressed image. The farther this personalization is from the start of the data to be compressed, the less effect it will have.
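If you want to see the effect for yourself, here is a small experiment using Python's zlib module, which implements the same DEFLATE algorithm gzip uses. The message text, the names and the helper functions are made up for illustration, and no particular result is claimed; how much the output changes depends on the input.

```python
import zlib


def compress(text: str) -> bytes:
    """DEFLATE-compress a message body."""
    return zlib.compress(text.encode("utf-8"))


def differing_positions(a: bytes, b: bytes) -> int:
    """Count byte positions that differ, plus any difference in length."""
    shared = min(len(a), len(b))
    return sum(1 for i in range(shared) if a[i] != b[i]) + abs(len(a) - len(b))


body = ("Limited time offer on replica watches. "
        "Click the link below to order now. ") * 5

# Personalization at the start of the message...
early_a = compress("Dear Alice, " + body)
early_b = compress("Dear Bobby, " + body)
# ...versus personalization at the very end.
late_a = compress(body + "Regards to Alice")
late_b = compress(body + "Regards to Bobby")

print("sizes:", len(early_a), len(early_b), len(late_a), len(late_b))
print("bytes differing (personalized at start):", differing_positions(early_a, early_b))
print("bytes differing (personalized at end):  ", differing_positions(late_a, late_b))
```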
Image-Based Spam and Checksums (Score:3, Interesting)
What about vetting at least the image-based spam for checksumming? Scan the e-mail for image links (or images included inline). If there's a link, check it against the known list of spam links. If it's in the list, mark the message as spam. Spammers will quickly figure that trick out, though, so step two would be for Google to follow those links, and retrieve the images. Run a checksum of the image file itself; if there are a lot (say, a thousand) messages including the same image, tag it as spam. Thi
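The counting step of this idea could look roughly like the Python sketch below. It is only an illustration: the class name is hypothetical, SHA-256 is my choice (the post just says "checksum"), the default threshold of a thousand comes from the post, and actually fetching linked images is left out.

```python
import hashlib
from collections import Counter


class ImageSpamTracker:
    """Count how many messages carry an identical image file."""

    def __init__(self, threshold: int = 1000):
        self.threshold = threshold
        self.counts = Counter()

    def record(self, image_bytes: bytes) -> bool:
        """Register one sighting; return True once the image looks like bulk mail."""
        digest = hashlib.sha256(image_bytes).hexdigest()
        self.counts[digest] += 1
        return self.counts[digest] >= self.threshold


# Usage with a tiny threshold so the effect is visible:
tracker = ImageSpamTracker(threshold=3)
results = [tracker.record(b"same-image-bytes") for _ in range(3)]
print(results)
```

Like the MD5 approach discussed earlier in the thread, an exact checksum only matches byte-identical files, so a spammer who alters a single pixel per message would slip past it.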
Re:One of the best things Google/GMail could do (Score:2)
Re:One of the best things Google/GMail could do (Score:2)
Yes they do, this is just one of the articles discussing this, here. [nytimes.com]
They have a much higher ratio of PhDs than Microsoft, or just about anyone short of a hospital. They also give their employees the freedom of spending 20% of their time working on any unrelated subject they choose, apparently in the hopes that the outcome of this research will benefit Google, or at least will make the better PhDs, with more than one iron in the fire, WANT to work for them.
Re:One of the best things Google/GMail could do (Score:5, Funny)
They have a much higher ratio of PhDs than Microsoft, or just about anyone short of a hospital.
Remind me not to go to your hospital. I want MDs treating me, not people who can give me a dissertation on ancient Sumeria or something. (MDs who also know about ancient Sumeria excepted.)
He gave out his e-mail address... (Score:5, Funny)
Oh... right.
Re:He gave out his e-mail address... (Score:5, Funny)
Re:He gave out his e-mail address... (Score:5, Interesting)
Well, I guess I need a booster shot, so here it is: slashdot@hates.ms. Spam away...
Re:He gave out his e-mail address... (Score:3, Funny)
whining? (Score:5, Insightful)
Now you're complaining that your free, 1GB-limit, access-from-anywhere email service could be mailbombed? Live with it. If Google "decides" anything more about our emails, we put on our tinfoil hats and scream. If we broadcast a bogus email address, obtained from gmail for clearly sinister purposes, and it gets mailbombed, we whine that Google doesn't "protect" us. What's the story, or are we all just schizophrenic?
Don't want that "vulnerability"? Don't use Gmail!
Re:whining? (Score:5, Insightful)
I don't think it's about protection, just practicality. Google offers a SPAM filter; the little pratt tested it and found it wanting.
I think it's more of a problem for Google than the end users. The whole Gmail "get a gigabyte of memory free" business model is predicated on most people using only a small fraction of that gigabyte but feeling good about the capacity being there. If I open up a gmail account, get p*ss*d off with the spam and go elsewhere without closing the account, the 1GB will fill up with spam in a couple of months, and Google will end up storing terabytes of spam for customers who no longer use the service.
Re:whining? (Score:2)
Re:whining? (Score:3, Interesting)
Re:whining? (Score:4, Informative)
Re:whining? (Score:3, Interesting)
Of course, google should improve and filter out the occasional crap I get too. And also offer 1 TB.
Re:whining? (Score:5, Informative)
Why?
Google uses commodity IDE drives. Those retail for about fifty cents a gigabyte. Google's not paying retail.
I read a quote from a Googleperson that by the time the drive is installed in a system, powered, cooled, backed up and administered Google is paying two dollars for a gigabyte.
Good point about the problem of abandoned accounts, which won't bring Google any ad revenue. Wouldn't be surprised if they start euthanizing inactive accounts.
Re:whining? (Score:5, Insightful)
That is his JOB, to point out shortcomings of the system. He is a tester, and he is doing it for FREE. Google doesn't want testers who get 3 emails a day, they want people to test the living shit out of the service and point out what is wrong with it. Everyone knows Google will try to fix all the bugs, so all the press, good or bad, is still good press.
If Google barfs when handling 999 messages in 4 minutes during testing, imagine when several million people have gmail accounts. Fortunately, now Google has an event to look at to see what the problem is. When you are trying to harden a system, YOU MUST BREAK IT OVER AND OVER AGAIN, to see where it is weak. This is what is happening.
My impression is that the techs at Google are spending a significant amount of time saying "oh shit, never thought of that, cool." which is the ENTIRE REASON FOR TESTING. They can't think of every situation by themselves. This is also the entire concept behind "open software is more secure". Google's gmail is going to have bugs at this stage and lots of them, period. Google knows this, hell, everyone knows this (this is why it's in testing, and not open to the public yet, duh)
It's not whining, it's stating the facts, which Google obviously WANTS him to gather, as a TESTER. Seems to me that he is going beyond the call of duty to test their servers, since he is spending a fair amount of his own time.
Re:whining? (Score:3, Funny)
Slashdot operated under that philosophy for the first 2-3 years of its existence. ;-)
Re:whining? (Score:2)
Yeah, we all have multiple personality disorder. Luckily we also have multiple bodies, so we dole these personalities at around 1 per body.
You're complaining about the lack of consistent thought from a crowd of random web surfers...
If this guy has used 30% of his capacity... (Score:3, Insightful)
Re:If this guy has used 30% of his capacity... (Score:5, Funny)
God damned metric system.
gmail still beta (Score:2, Insightful)
Re:gmail still beta (Score:5, Funny)
>spam filtering techniques a little premature?
What part of Beta TEST escapes you here?
Re:gmail still beta (Score:2)
The fact that this guy posted, on his website, for the net-world to see is just his way of giving the net-world an update. On a personal note, I think it was nice of him to do such. Especially since he will have to kill that e-mail account after giving it to
Not a fair test (Score:5, Insightful)
Re:Not a fair test (Score:4, Insightful)
Re:Not a fair test (Score:4, Informative)
Try sending mail to them from a host that has no PTR record associated with it, or with a malformed EHLO, or something similar.
You don't need to be "really sure" mail is spam; I'm talking about doing things like standards compliance checking, which will result in mail being rejected at delivery time.
Is this just random theorizing, or does GMail really fail to deliver some emails it thinks are spam?
There's no reason to get insulting. RFC 2821 has a number of requirements for delivery of mail that many services ignore.
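As one example of the kind of delivery-time compliance check being discussed, RFC 2821 expects the EHLO argument to be a fully-qualified domain name or, failing that, an address literal in square brackets. The Python sketch below is a rough syntax check only: the function name is hypothetical, the rules are simplified, the PTR lookup is omitted because it needs network access, and nothing here shows what any particular mail service actually rejects.

```python
import ipaddress
import re

# One DNS label: letters, digits and hyphens, not starting or ending with a hyphen.
_LABEL = re.compile(r"^[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?$")


def ehlo_argument_ok(arg: str) -> bool:
    """Rough check that an EHLO argument is an FQDN or an address literal."""
    if arg.startswith("[") and arg.endswith("]"):
        literal = arg[1:-1]
        if literal.lower().startswith("ipv6:"):
            literal = literal[5:]
        try:
            ipaddress.ip_address(literal)
            return True
        except ValueError:
            return False
    labels = arg.rstrip(".").split(".")
    if len(labels) < 2 or labels[-1].isdigit():
        return False  # single label, or a bare IP address without brackets
    return all(_LABEL.match(label) for label in labels)


for candidate in ["mail.example.com", "[192.0.2.1]", "localhost", "192.0.2.1"]:
    print(candidate, ehlo_argument_ok(candidate))
```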
Should be interesting, what filters? (Score:5, Interesting)
I'll help (Score:5, Funny)
News... (Score:5, Funny)
Since we are talking about spam and obtaining more spam, I don't know if I should read the site the article is on as "web pro news dot com" or "web pron ews dot com"...
I guess I'll figure it out sometime.
Not that impressive (Score:5, Informative)
I am starting to second guess whether I should transfer everything to my Gmail account.
Re:Not that impressive (Score:5, Insightful)
Re:Not that impressive (Score:5, Funny)
What was that address again?
Re:Not that impressive (Score:2, Informative)
Mine is kredal@gmail.com, if you're interested. (:
I redirected an old address (Score:3, Interesting)
No, I'm not keeping proper statistics. =b
Re:Not that impressive (Score:4, Funny)
Please only email me if you're barely legal and running a webcam. Thank you.
Re:Not that impressive (Score:5, Informative)
(For example, after two weeks of using SpamAssassin, it decided that every e-mail sent to me was spam. I no longer received anything in my Inbox; everything was transferred to the Spambox. It took me another two weeks of tweaking SpamAssassin's kill rate down to about 50% accuracy, and now I actually receive all my emails.)
Re:Not that impressive (Score:2)
Lately, I've received legitimate e-mail from someone else with a Yahoo Mail account. Spam Assassin marks it as spam.
However, my mail client, Mozilla Firebird, does not mark it as spam... So it stays in my Inbox (even if it contains the Spam Assassin header and modified title).
Actually, the Firebird Spam filt
Re:Not that impressive (Score:3, Informative)
Rubbish - I've used thunderbird for many months now, with an account that gets quite a bit of spam. I have yet to see thunderbird make a wrong guess at what's spam and what's not. If anything, thunderbird is more likely to go the other way - allowing spam through - than deleting real email.
Re:Not that impressive (Score:2)
Should gmail be filtering all emails (Score:2, Funny)
I just want.... (Score:2, Funny)
What is the big deal? (Score:2, Insightful)
Is this the AventureMail guy? (Score:5, Interesting)
My own gmail testing (Score:5, Informative)
It was from " Mr Jubril Udeh Manager of Credit and Accounts Department of North Atlantic Securities Sarls Lome-Togo Republic."
Now, the funny part is not that the mail made it through, but that google also decided to show me contextual ads on that account. Currently, the ads are:
- Payroll Cards a Poor Substitute for Checking Account
- Tips for Tackling Check Fraud
- Sophos hoax description: Ethiopian airline letter
- FAP non-US Investment FAQs
In the past the mail has also shown me ads on how to open an off-shore bank account. I'm glad google is willing to help me with the $10.5 million dollars that I'm about to receive!
Re: (Score:2, Interesting)
Hmmm.. weird stats... (Score:2, Interesting)
His last week stats are:
Something is off... Unless his spams contain attachments, this says that each of his emails was 17 MB in size.
I mean 17.73708.. This is /. after all. :)
Re:Hmmm.. weird stats... (Score:5, Informative)
3778 messages / 213 MB = 17.74 messages / MB
213 MB / 3778 messages = 0.0564 MB / message
So that's pretty reasonable.
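The same arithmetic in Python, using the two figures quoted above (3778 messages, 213 MB). The conversion to kilobytes assumes 1 MB = 1024 KB.

```python
messages = 3778
megabytes = 213

messages_per_mb = messages / megabytes   # how many messages fit in one MB
mb_per_message = megabytes / messages    # average size of one message in MB
kb_per_message = mb_per_message * 1024   # the same average, in KB

print(f"{messages_per_mb:.2f} messages per MB")
print(f"{mb_per_message:.4f} MB per message (about {kb_per_message:.0f} KB)")
```

Dividing the wrong way round is what produces the "17 MB per email" reading; the average message is really a few tens of kilobytes.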
About spam and blocking (Score:5, Interesting)
-A
*just for those who didn't know, the above domain names and email accounts are random, any resemblance to an actual domain or email account is purely coincidental, and if you choose to do so, you should sue
Re:About spam and blocking (Score:2, Interesting)
So this allows you to block some domains, if you'd like.
Re:About spam and blocking (Score:3, Insightful)
1gb Relieves Spam Concerns (Score:5, Interesting)
_____________________
Seun Osewa, Abeokuta Nigeria [seunosewa.com]
If you get mailbombed... (Score:2)
Select all the messages that it displays as able to be included that you've already archived (one click).
Select "Move to trash".
Viola.
Viola (Score:5, Funny)
gmail spelling (Score:5, Funny)
How about having Slashdot editors/Hemos test the gmail spell checker too?
This won't work for me... (Score:2, Funny)
How is he compiling stats? (Score:2, Interesting)
I guess because his stats are about 2-3 weeks behind, it would indicate that things are leaning towards the manual procedure...
0% Spam (Score:5, Interesting)
This guy solicited it.
Filtering could use some help (Score:2, Funny)
Lack of updates? (Score:5, Interesting)
KevG
It's going to get a lot better... (Score:5, Interesting)
Cache? (Score:5, Funny)
Re:Cache? (Score:5, Funny)
Wow (Score:5, Funny)
Paid yahoo is better (Score:3, Insightful)
People that don't pay for Yahoo don't seem to get such good spam filtering, though.
Google can definitely do better.
Calculations? (Score:3, Insightful)
I think somebody needs to recalculate exactly how much bandwidth goes to waste because of this SPAM plague. The cost in global comms traffic must be staggering!
Now my friends are spamming me (Score:3, Funny)
Dumb question about SPAM filters.. (Score:4, Interesting)
1) Several intentionally mis-spelled words
2) Lots of text in white (so it's invisible or nearly invisible)
3) Message in
Could you add filters that look for, say, more than 10% of the words mis-spelled, text font nearly equal to background color, or no actual text in message? These would take effect in addition to the existing Bayes filter.
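The three checks proposed above could be sketched in Python roughly as follows. Everything here is an assumption for illustration: the function names, the tiny dictionary, the 10% limit taken from the question, and the colour tolerance of 16 per RGB channel. A real filter would also have to parse the HTML to find the text and background colours.

```python
def misspelled_ratio(words, dictionary):
    """Fraction of words that are not found in the dictionary."""
    if not words:
        return 0.0
    unknown = sum(1 for w in words if w.lower() not in dictionary)
    return unknown / len(words)


def colors_nearly_equal(rgb_a, rgb_b, tolerance=16):
    """True if every RGB channel differs by at most `tolerance`."""
    return all(abs(a - b) <= tolerance for a, b in zip(rgb_a, rgb_b))


def heuristic_flags(words, dictionary, text_rgb, background_rgb):
    """Return the names of the heuristics that this message trips."""
    flags = []
    if not words:
        flags.append("no text")
    elif misspelled_ratio(words, dictionary) > 0.10:
        flags.append("too many misspellings")
    if colors_nearly_equal(text_rgb, background_rgb):
        flags.append("text nearly matches background")
    return flags


dictionary = {"cheap", "pills", "buy", "now"}
words = ["ch3ap", "pi11s", "buy", "now"]
print(heuristic_flags(words, dictionary, (255, 255, 255), (250, 250, 250)))
```

Each flag could then feed into the overall score alongside the Bayes filter rather than rejecting mail outright, since legitimate mail can also contain typos or unusual colours.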
Aventuremail not as tolerant (Score:4, Interesting)
Yale Story (Score:3, Interesting)
Re:There's Epic Imagery Here Somewhere (Score:2)
Re:How to never get spam (Score:5, Funny)
Just to get you started, I'll give you a quick hint: virtually every internet discussion on spam includes some high and mighty moron that claims that by not giving out his email address, he never gets spam.
The problem is that for every one of those, there are plenty more who follow the same precautions and yet get plenty of spam to those accounts for a variety of reasons. Clearly, your solution is not the answer to "how to never get spam."
A good rule for using the internet is to read a few discussions before you post. This way, you will be less likely to post something that makes you look naive. So sit back, relax, and enjoy a steaming hot cup of STFU while you read and learn!