Asynchronous Programming for Spam Elimination
ttul writes "Stas Bekman (formerly the maintainer of mod_perl) has been quietly building an asynchronous programming framework to build high performance network applications in Perl. His recent Perl.com article describes how he has used the Event::Lib module (that lives on top of the popular libevent library) to write a traffic-shaping email proxy to get rid of spam. Asynchronous programming is challenging at the best of times. Read on to find out how to do it the easy way in Perl."
Not ultimately a solution (Score:5, Insightful)
the core assumption, and the only thing that makes this work, is that botnet spam software will _always_ just give up after 30 seconds; if this throttling technique ever became commonplace, spammers would just write their own asynchronous mailer -- it's not THAT hard. windows has the same kind of async networking support (either through the winsock API and/or IO completion ports, or what have you) and i'm sure the spam/botnet software authors have no qualms about holding open a couple thousand sockets on the rooted windows machine (times a few hundred thousand machines.) furthermore, i bet there are some shitty legitimate MTAs that would just give up too, causing actual mail to get discarded
(that, and they shoulda used twisted [twistedmatrix.com] or something)
ok, ok, maybe this sounds overly critical. it's a clever, thinking-out-of-the-box idea, but certainly not the panacea we're looking for to stop spam.
-fren
Even easier.... (Score:2)
They can just make the SPAM program multithreaded and start a new thread for each new connection (each using *synchronous* IO).
There's no interprocess communication involved, so it should be trivial.
Re:Not ultimately a solution (Score:4, Insightful)
They all rely on the "we only have to be better than the neighbour's mailserver" principle. Until everyone starts doing it, these things work, and then new methods get invented to combat spam. Not that surprising, but saying no to this approach is basically silly. There is NO good way to eliminate spam, because stupid people exist. So people hack around the problem.
Re: (Score:2)
Yes, it does in fact work (Score:5, Informative)
You make some very good points -- and these are all concerns we had when we set out to build this software.
Fortunately for the world, these concerns have turned out to be unwarranted. Furthermore, our experience in actually deploying this technology has been far more breathtaking than we had imagined -- both in terms of spam mitigation and improvements in scalability.
> the core assumption, and the only thing that makes this work, is that botnet spam software will _always_ just
> give up after 30 seconds;
I have a theory that spammers will always be impatient. I believe this theory for several reasons:
1. Spam campaigns are now recognized by anti-spam companies in minutes or hours. New campaigns therefore have a very short life expectancy and have to be completed as fast as possible. If mail can't get delivered fast, it's time to move on to a new domain to get it moving again. With collaborative filters like Cloudmark recognizing campaigns in less than 60 seconds, spammers obviously have to move traffic fast.
2. Botnets are not unlimited in their size or bandwidth capacity. Typically, botnets these days are between 1,000 and 10,000 hosts. Any larger and the command and control channels are very quickly noticed and shut down by service providers. Botnets cost money too -- $250/hour for a 10K botnet is typical.
3. Spammers' raison d'être is to send lots of mail and hope that a small percentage of recipients buy something. The only way to make the business profitable is to send huge amounts of mail. If all zombie traffic in the world were magically being slowed down, spamming would no longer be profitable and spammers would tend to focus more on things like highly targeted phishing instead. Not surprisingly, we're already starting to see this.
4. Because #3 isn't going to happen any time soon, and in light of the technical constraints (1 and 2), spammers have no choice but to abort their connections within a very short time frame. It's just the nature of the economic beast. Hanging on is just for posterity. It doesn't make economic sense.
5. It works. And it's very very scalable. By slowing down traffic and multiplexing what remains, mail server load drops by 90%. In big installations, that means no more being paged in the middle of the night because your cluster of 4-way Xeons with 8GB of RAM is borked by a distributed spam burst.
Oh -- and of course you can't just slow everything down. It's important to be very selective so as not to delay everything.
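To make "slowing down traffic and multiplexing what remains" concrete, here is a minimal sketch in Python's asyncio. It is not the Event::Lib/Perl code from the article. The names `delay_per_byte`, `drip` and `demo`, the 0-to-1 reputation scale, and the delay values are all assumptions made up for illustration. The point is only that one process can dribble bytes to a suspicious sender while a trusted sender is served at full speed, because a sleeping connection costs the event loop almost nothing.

```python
import asyncio


def delay_per_byte(reputation: float, max_delay: float = 0.2) -> float:
    """Map a sender reputation in [0, 1] (1 = trusted) to a per-byte delay in seconds.

    The linear mapping and the 0.2 s ceiling are illustrative assumptions.
    """
    reputation = min(1.0, max(0.0, reputation))
    return (1.0 - reputation) * max_delay


async def drip(data: bytes, send, delay: float) -> None:
    """Hand `data` to `send` one byte at a time, sleeping between bytes.

    `send` is a hypothetical callback standing in for a socket write.
    While this coroutine sleeps, the event loop services other connections.
    """
    for i in range(len(data)):
        send(data[i:i + 1])
        if delay > 0:
            await asyncio.sleep(delay)


async def demo() -> list:
    """Serve one suspect and one trusted sender concurrently in a single task loop."""
    log = []
    banner = b"220 ok\r\n"
    await asyncio.gather(
        drip(banner, lambda b: log.append(("suspect", b)),
             delay_per_byte(0.0, max_delay=0.01)),
        drip(banner, lambda b: log.append(("trusted", b)),
             delay_per_byte(1.0)),
    )
    return log


if __name__ == "__main__":
    for who, chunk in asyncio.run(demo()):
        print(who, chunk)
```

In this sketch the trusted sender receives its whole banner while the suspect is still waiting for its second byte, which is the "be very selective" part: only the low-reputation connection is delayed.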
> if this throttling technique ever became commonplace, spammers would just write their
> own asynchronous mailer -- it's not THAT hard...
Actually, it is that hard. Even Stas got a headache working on this project.
But even if it was easy, it would be pointless for a spammer to launch more than one connection per zombie. If a sender is marked as suspicious, the sender's concurrency is severely limited. One connection per zombie, at 5 bytes per second -- that's just not economic.
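A back-of-the-envelope version of that economic argument, as a small Python sketch. The botnet size (10,000 hosts), rental price ($250/hour) and throttle rate (5 bytes per second, one connection per zombie) are the figures quoted in this comment; the 5,000-byte message size and the function names are assumptions added for illustration.

```python
def seconds_per_message(message_bytes: float, bytes_per_second: float) -> float:
    """Time to push one message through a single throttled connection."""
    return message_bytes / bytes_per_second


def cost_per_message(botnet_hosts: int, dollars_per_hour: float,
                     bytes_per_second: float, message_bytes: float,
                     connections_per_host: int = 1) -> float:
    """Rental cost per delivered message when every host is throttled."""
    bytes_per_host_hour = bytes_per_second * 3600 * connections_per_host
    messages_per_hour = botnet_hosts * bytes_per_host_hour / message_bytes
    return dollars_per_hour / messages_per_hour


if __name__ == "__main__":
    # Assumed 5,000-byte message; other numbers are taken from the comment.
    print(seconds_per_message(5000, 5))
    print(cost_per_message(10_000, 250.0, 5, 5000))
```

Under these assumptions a single message takes 1,000 seconds (almost 17 minutes) per connection, the whole botnet delivers about 36,000 messages per hour, and each one costs roughly $0.007 in rental fees alone. Whether that is "not economic" depends on response rates, which this sketch does not model.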
> furthermore, i bet there are some shitty legitimate MTAs that would just give up too, causing actual
> mail to get discarded
Let's just say the gap between the patience of spammers and the patience of legitimate MTAs is very large indeed. And by carefully fingerprinting and assessing sender reputation, this problem can be minimized to the point where it is a far smaller problem than content filter false positives.
I also want to point out that this technology does not make email suck by slowing it down. It in fact speeds up delivery of legitimate mail in most cases because the load is so reduced on the rest of the infrastructure.
Just talk to our customers. One of them was running four 4-way Xeon boxes with 8GB of RAM each -- all this to service the spam filtering needs of just 10,000 end users. He told us he hadn't slept a full night in months because of load-based outages. Since installing the software Stas built, the only alert he's received is a notification that the load level dropped below the panic threshold!
Re:Yes, it does in fact work (Score:4, Informative)
Re: (Score:2)
Agreed. My ISP *finally* added greylisting this year. This is the account I use on my domain registrations, so the email address shows up in whois. It therefore gets an insane amount of spam. After testing out the greylisting for a couple of weeks, I saw no perceptible difference in the amount of spam I was receiving.
When you greylist, you're basically using SMTP rules to tell the sender "try again later". As this became more common
Re: (Score:1)
That penny stock spam is the most successful I've seen. More than half of it gets past gmail's filters and into my in box, and then more than half of that gets past my own filters. It's just about the only spam that makes it through, but I get several of those a week. (I also checked the stocks, and not a single one has risen significantly, despite the spam's assurances ;-)
Re: (Score:1)
I don't even use greylisting anymore because it gets in the way of troubleshooting mail problems and now has a negligible effect on spam.
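For readers who have not run it: the greylisting discussed in this subthread is usually implemented as a table keyed on the (client IP, envelope sender, envelope recipient) triplet. The sketch below is a minimal Python illustration of that idea, not any particular product's code. The class name, the 300-second minimum delay and the reply texts are assumptions; 451 is a standard SMTP temporary-failure reply code.

```python
import time


class Greylist:
    """Temporarily reject the first delivery attempt for each unseen triplet."""

    def __init__(self, min_delay: float = 300.0, clock=time.monotonic):
        self.min_delay = min_delay   # seconds a new triplet must wait (assumed value)
        self.clock = clock           # injectable clock, so the logic can be tested
        self.first_seen = {}         # triplet -> time of first attempt

    def check(self, client_ip: str, mail_from: str, rcpt_to: str):
        """Return an (SMTP code, text) pair for this delivery attempt."""
        key = (client_ip, mail_from.lower(), rcpt_to.lower())
        now = self.clock()
        first = self.first_seen.setdefault(key, now)
        if now - first < self.min_delay:
            # Temporary failure: a well-behaved MTA queues the mail and retries.
            return 451, "4.7.1 Greylisted, please try again later"
        return 250, "2.1.5 OK"
```

A sender that retries after the delay gets through, which is exactly why the technique loses its effect once spam software learns to retry, as the posters here report.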
Re: (Score:2)
Re: (Score:1)
spamd gave us our initial inspiration. I talked with Bob Beck at the Cansecwest security conference [cansecwest.com] after he presented on spamd and was -- to put it mildly -- blown away.
It's important to understand that spamd does not actually deliver mail. It just responds r e a l l y s l o w l y and then returns a 400-series code to force the sender to try again. After the first time, a packet filter rule is added that redirects that sender to a real MTA, which receives the mes
Asynchronous programming ... (Score:5, Funny)
oblig. checklist :) (Score:5, Funny)
(X) technical ( ) legislative ( ) market-based ( ) vigilante
approach to fighting spam. Your idea will not work. Here is why it won't work. (One or more of the following may apply to your particular idea, and it may have other flaws which used to vary from state to state before a bad federal law was passed.)
( ) Spammers can easily use it to harvest email addresses
(X) Mailing lists and other legitimate email uses would be affected
( ) No one will be able to find the guy or collect the money
( ) It is defenseless against brute force attacks
(X) It will stop spam for two weeks and then we'll be stuck with it
( ) Users of email will not put up with it
( ) Microsoft will not put up with it
( ) The police will not put up with it
( ) Requires too much cooperation from spammers
(X) Requires immediate total cooperation from everybody at once
( ) Many email users cannot afford to lose business or alienate potential employers
( ) Spammers don't care about invalid addresses in their lists
( ) Anyone could anonymously destroy anyone else's career or business
Specifically, your plan fails to account for
( ) Laws expressly prohibiting it
( ) Lack of centrally controlling authority for email
( ) Open relays in foreign countries
( ) Ease of searching tiny alphanumeric address space of all email addresses
( ) Asshats
( ) Jurisdictional problems
( ) Unpopularity of weird new taxes
( ) Public reluctance to accept weird new forms of money
(X) Huge existing software investment in SMTP
( ) Susceptibility of protocols other than SMTP to attack
( ) Willingness of users to install OS patches received by email
(X) Armies of worm riddled broadband-connected Windows boxes
(X) Eternal arms race involved in all filtering approaches
( ) Extreme profitability of spam
( ) Joe jobs and/or identity theft
( ) Technically illiterate politicians
( ) Extreme stupidity on the part of people who do business with spammers
( ) Dishonesty on the part of spammers themselves
( ) Bandwidth costs that are unaffected by client filtering
( ) Outlook
and the following philosophical objections may also apply:
( ) Ideas similar to yours are easy to come up with, yet none have ever been shown practical
( ) Any scheme based on opt-out is unacceptable
( ) SMTP headers should not be the subject of legislation
( ) Blacklists suck
( ) Whitelists suck
( ) We should be able to talk about Viagra without being censored
( ) Countermeasures should not involve wire fraud or credit card fraud
(X) Countermeasures should not involve sabotage of public networks
( ) Countermeasures must work if phased in gradually
( ) Sending email should be free
( ) Why should we have to trust you and your servers?
( ) Incompatibility with open source or open source licenses
( ) Feel-good measures do nothing to solve the problem
( ) Temporary/one-time email addresses are cumbersome
( ) I don't want the government reading my email
(X) Killing them that way is not slow and painful enough
Furthermore, this is what I think about you:
(X) Sorry dude, but I don't think it would work.
( ) This is a stupid idea, and you're a stupid person for suggesting it.
( ) Nice try, asshole! I'm going to find out where you live and burn your house down!
Re: (Score:3, Interesting)
Please provide details (Score:2)
Re: (Score:2)
(X) Mailing lists and other legitimate email uses would be affected
And your point is? If I have to give up 'mailing lists', or (far more likely) force mailing lists to change so that they are NOT so similar to spam that they get caught by anti-spam stuff, that is not a real issue. We do NOT owe mailing lists the right to exist; if they can't change to deal with the reality of a spam-free world, tough luck. The effect on other legitimate email uses would b
Challenging at the best of times? (Score:3, Funny)
As far as I can tell... (Score:1)
must we rename everything every time that someone "discovers" it?
AJaX (Score:2, Informative)
Except "asynchronous programming" is already a well-known term among many web developers:
Asynchronous Programming with
JavaScript, HTML DOM,
and
XMLHttpRequest
Re: (Score:2)
Yes, because, that way, you get publicity. If you just quietly sat and implemented it, it would be every bit as great, but nobody would hear about it.
As if PERL wasn't hard enough to read... (Score:3, Funny)
Re: (Score:2)
Re: (Score:2, Informative)
We looked at using the new Perl threads, but Perl 5.8 threads suffer from a few severe limitations.
1. When you create a new thread, a complete copy of the interpreter is made. The new thread makes use of this new interpreter instance and cannot communicate with the original thread except via the threads::shared module or some traditional IPC mechanism. In short, they're no better than forking a new process and in many ways, they are far worse than this.
2. P
Re: (Score:2, Interesting)
Re: (Score:1)
It would have been too difficult to make POE rock performance-wise in addition to ensuring that POE used an efficient event library like libevent.
And in this kind of application, you need awesome performance. We profiled the app with strace for weeks to get rid of unnecessary system calls.
Re:As if PERL wasn't hard enough to read... (Score:4, Informative)
Er, how ? Because they don't really use threads ? Sure, they're fast and lightweight...but since they don't use the underlying OS's threads implementation (ie, kernel-compatible threads), they're only marginally useful on multiCPU and/or multicore systems.
What's your basis for that statement ? Have you tested the latest versions of the threads [cpan.org] and threads::shared [cpan.org] modules ? Some significant effort has been applied in the past year to improve stability, as well as reduce footprint...you might want to give it a look...
Perhaps if your org can get some funding, you might throw some money at the TPF to get iCOW implemented ? Which should vastly improve thread startup and reduce footprint. threads::shared remains a bit of a challenge, but that issue can be addressed by some carefully crafted XS (which I'm told Stas is pretty good at ;^).
Re: (Score:1)
It works for basic, light things, but if anything complex is used, like mod_perl, it segfaults all over, and if it doesn't, it takes dozens of seconds to start a new thread on a heavily loaded machine (due to the lack of CoW, as you've mentioned, but even then I doubt it would be much help, since it would still need to copy a lot of data).
And yes, someone needs to work on fixing those and a TPF grant would be very helpful.
Re: (Score:1)
Perl threads are also very easy to understand.
Simply put, nothing is shared between threads.
If you want to share data between perl threads you must explicitly say so:
use threads; use threads::shared;   # load both so the :shared attribute takes effect
my $foo : shared = 1;
Though if you're stuck with Perl version 5.6.0, don't use threads.
"... the easy way in Perl." (Score:2)
Re: (Score:1)
Re: (Score:1)
Consider this an opportunity to learn how to write maintainable code.
Re: (Score:1)
Re: (Score:1)
There's more than one way to do it, in Perl, so choose the most maintainable. Problem solved.
Before you counter "But I have to maintain code written by monkeys, and it's hard to read," consider not hiring monkeys to write code you care about. Not even Haskell or Java or Ruby prevents monkeys from writing bad code. The problem is, they're monkeys, not that they're using the wrong language.
Re: (Score:2)
Unfortunately what is not so easy is understanding what you did six months ago.
If a programmer cannot go back into code he wrote six months ago and figure out what is going on, the blame rests with the programmer. The language is irrelevant.
Re: (Score:1, Insightful)
Clever, but... (Score:2, Interesting)
Delays aside, I just can't buy into network-layer rate limiting when it comes to email. The metric for anti-s
Re: (Score:1)
> The inherent delays for just about every message would be particularly painful for business email users, but
> even residential ISP customers are constantly opening tickets when they observe a delay (I work closely
> with several large ISPs, which is how I know).
That would be a problem if every single message was slowed down, but it's not. The system uses sender reputation and behaviour to ensure that only malicious senders are slowed down. Our cus
Re: (Score:1)
I don't recall any mention of that in the article, but I guess it may have been a bit outside the scope. Either way, I didn't realize that - makes sense.
Freq. distribution of mail transfer agents (Score:1)
Most if not all mail transfer agents no longer operate as open relays by default, a problem which used to be the main contribution to spam. People blamed the complexity of Sendmail for that and other problems, so many distros moved to other mail transfer agents for their default. A few years ago Sendmail was still about 65% of the mail servers.
What is Sendmail's current market share, and how common are others like Exim, qmail, and Postfix?
Re: (Score:1)
Re: (Score:1)
"High performance" , "perl" , sorry? (Score:2)
Don't make me laugh. Something this CPU and I/O intensive should be written in C/C++ or even assembler at a push, not a scripting language. Seems to me this project has been written in perl for the sake of writing it in perl, not because it confers any advantages over doing it in a lower level language.
Re: (Score:2, Insightful)
Remember kids, if your process is IO-bound, you want the fastest possible code ever to make sleeping on those system calls as efficient as possible!
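The point behind the sarcasm is Amdahl's law: speeding up the CPU-bound part of a program helps only in proportion to the time actually spent there. A minimal Python sketch follows; the 10% CPU fraction and the 10x language speedup are illustrative assumptions, not measurements of the proxy being discussed.

```python
def overall_speedup(cpu_fraction: float, cpu_speedup: float) -> float:
    """Amdahl's law: whole-program speedup when only the CPU-bound
    fraction of the run time gets `cpu_speedup` times faster."""
    if not 0.0 <= cpu_fraction <= 1.0:
        raise ValueError("cpu_fraction must be between 0 and 1")
    if cpu_speedup <= 0:
        raise ValueError("cpu_speedup must be positive")
    return 1.0 / ((1.0 - cpu_fraction) + cpu_fraction / cpu_speedup)


if __name__ == "__main__":
    # Assumed: 10% of wall-clock time is CPU work, and a C rewrite
    # makes that part 10x faster.
    print(overall_speedup(0.10, 10.0))
```

With those assumed numbers the rewrite makes the whole program about 1.1x faster, since the remaining 90% is spent waiting on the network regardless of language.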
Re: (Score:2)
You must be really amazing to be able to determine that a given application can't possibly be usable when written in language X and would be much better in language Y without any data or firsthand experience using the application.
Sometimes, things work just fine even though they'd be 20ms faster if written in C/C++.
- Roach
Re: (Score:2)
process that data fast.
Re: (Score:2)
links to major stock exchanges such as LSE, NYSE, Euronext etc
for the last 3 years, what would I know.
Go and play with your little Perl toy pal , and leave the real
coding to those of us who have a clue.
Re: (Score:3, Insightful)
At least your comment is the silliest I have ever seen. What will a mail filter/forwarder do 90% of its time? NOTHING, being blocked listening on a socket. It really does not matter if the listening process is written in assembler (granted, which is very portable from sparc to i386 to PowerPC) and just waits "
Re: (Score:2)
That is file io, process management and text processing.
It makes no sense to pull out a set of benchmarks where some "nerds" wrote mandelbrot programs and n-body gravity simulations to prove that C is faster. Of course a portable assembler language, using native data structures (arrays!) is faster than PERL, no one doubts that. But my parent was of the opinion that PERL is so slow that it is suicide to
Greylisting is the answer (Score:1)