
Efficient RSS Throttling

jamie (78724) writes | more than 9 years ago

Programming 3

Dan Sandler has an article from a few days ago about RSS throttling, where he discusses the solution of having the server keep track of which clients have hit RSS feeds recently, so it knows when a client crosses the line and needs to be banned.

This is exactly what we do on Slashdot, of course. Every hit, whether to a dynamically generated Perl page or to a static .shtml or .rss page, triggers an Apache PerlCleanupHandler that inserts a row into the 'accesslog' table in our MySQL database.

(By putting it in the cleanup phase, we ensure it doesn't affect page delivery times at all; it just means a few more milliseconds that the httpd child is occupied instead of being available to deliver pages, but the only resource it's taking up is RAM.)
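Under mod_perl 1.x with DBI, such a handler is only a few lines. What follows is a sketch, not our production code; the package name, table columns, and connection details are all illustrative:

    # A PerlCleanupHandler sketch: log the request after the response
    # has already been sent, so page delivery times are unaffected.
    package Slash::AccessLogSketch;
    use strict;
    use Apache::Constants qw(OK);
    use DBI;

    sub handler {
        my $r = shift;
        my $dbh = DBI->connect('DBI:mysql:database=slash', 'user', 'pass',
            { RaiseError => 0, PrintError => 0 })
            or return OK;   # logging must never break the request
        $dbh->do('INSERT INTO accesslog (ts, host_addr, uri) VALUES (NOW(), ?, ?)',
            undef, $r->connection->remote_ip, $r->uri);
        $dbh->disconnect;
        return OK;
    }
    1;

That would be wired up with a "PerlCleanupHandler Slash::AccessLogSketch" line in httpd.conf; in practice you'd also want Apache::DBI so each child caches its database connection instead of reconnecting on every hit.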

Dan writes:

I'm uncomfortable with this solution because it's hard to make it scale. First, you have to hit a database (of some kind) to cross-reference the client IP address with its last fetch time. Maybe that's not a big deal; after all, you're hitting the database to read your website data too. But then you have to write to the database in order to record the new fetch time (if the RSS feed has changed), and database writes are slow.

I'll grant that our accesslog traffic is pretty I/O intensive. But if you were only talking about logging RSS hits and nothing else, it'd be a piece of cake. The table just needs three columns (timestamp, IP address, numeric autoincrement primary key). You expire old entries by deleting off one end of the table while you insert into the other. That way inserts never block, even under MyISAM (though I'd recommend InnoDB).

You only need to keep about an hour of the table around anyway, so it stays really small. How many RSS hits can you get in an hour? A hundred thousand? That's peanuts, especially since each row is fixed size. Crunch that IP address down to a 32-bit int before writing it and each row is 12 bytes, give or take. Throw in the indexes and the whole table is a few megabytes. Even a slow disk should be able to keep up -- but if you're concerned about performance, heck, throw it in RAM.
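Concretely, the RSS-only table and its upkeep might look like this (a sketch, assuming a connected DBI handle in $dbh; the rss_hits name is made up, and MySQL's INET_ATON does the 32-bit crunching):

    # Three columns, fixed-size rows: insert into one end,
    # delete off the other.
    $dbh->do('CREATE TABLE rss_hits (
        id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        ts TIMESTAMP NOT NULL,
        ip INT UNSIGNED NOT NULL,
        KEY (ts)
    ) ENGINE=InnoDB');

    # On each RSS hit, crunch the address down to a 32-bit int:
    $dbh->do('INSERT INTO rss_hits (ts, ip) VALUES (NOW(), INET_ATON(?))',
        undef, $remote_ip);

    # Periodically expire the old end; only the last hour matters:
    $dbh->do('DELETE FROM rss_hits WHERE ts < NOW() - INTERVAL 1 HOUR');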

To catch bandwidth hogs, you create a secondary table that doesn't have so much churn. It has an extra column for the count of RSS hits, so if some miscreant nails your webserver 1,000 times in a minute, the secondary table only gets one row. You periodically (every minute or two) check the max id on the primary table, then

INSERT INTO secondary_table SELECT ip, MAX(ts), COUNT(*) FROM table WHERE id BETWEEN last_checked+1 AND current_max GROUP BY ip
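Wrapped in Perl with the bookkeeping filled in, that whole roll-up step might look like this (a sketch; rss_hits is the made-up primary table from above, and $last_checked has to be persisted between runs):

    # Every minute or two: fold new primary-table rows into
    # secondary_table (ip, ts, hitcount), bounded by id.
    # $last_checked is wherever you left off on the previous run;
    # persist it between runs (a file, a one-row table, whatever).
    my ($current_max) = $dbh->selectrow_array('SELECT MAX(id) FROM rss_hits');

    $dbh->do('INSERT INTO secondary_table (ip, ts, hitcount)
        SELECT ip, MAX(ts), COUNT(*)
        FROM rss_hits
        WHERE id BETWEEN ? AND ?
        GROUP BY ip', undef, $last_checked + 1, $current_max);

    $last_checked = $current_max;   # remember for the next run

    # Trim the secondary table back to the window you care about:
    $dbh->do('DELETE FROM secondary_table WHERE ts < NOW() - INTERVAL 1 HOUR');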

By limiting the id to a range, again, there is no blocking issue with the ongoing inserts. After doing that, you trim rows from secondary_table older than your time window, and then you're ready to do the only query that even approaches being expensive:

SELECT ip, SUM(hitcount) AS s FROM secondary_table GROUP BY ip HAVING s > your_limit

and you have your list of IP addresses that have exceeded your limit.
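Turning that list into enforcement takes one more loop (a sketch; the rss_ban table and the limit of 1,000 are made up):

    # Anything over the limit gets flagged for the access handler.
    my $offenders = $dbh->selectcol_arrayref(
        'SELECT ip FROM secondary_table GROUP BY ip HAVING SUM(hitcount) > ?',
        undef, 1000);

    for my $ip (@$offenders) {
        $dbh->do('REPLACE INTO rss_ban (ip, banned_at) VALUES (?, NOW())',
            undef, $ip);
    }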

What we do is use that data to update a table that keeps track of IP addresses that need to be banned from RSS, and have a PerlAccessHandler function that checks a (heavily cached) copy of that table to see whether the incoming IP gets to proceed to the response phase or not.
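The access-phase check then stays out of the database on almost every hit. Another sketch, with illustrative names and a one-minute cache refresh:

    # A PerlAccessHandler sketch: each httpd child caches the ban
    # list and refreshes it once a minute.
    package Slash::RSSGateSketch;
    use strict;
    use Apache::Constants qw(OK FORBIDDEN);
    use DBI;

    my %banned;        # ip => 1, cached per httpd child
    my $loaded_at = 0;

    sub handler {
        my $r = shift;
        if (time() - $loaded_at > 60) {
            my $dbh = DBI->connect('DBI:mysql:database=slash', 'user', 'pass',
                { RaiseError => 0, PrintError => 0 });
            if ($dbh) {
                my $ips = $dbh->selectcol_arrayref(
                    'SELECT INET_NTOA(ip) FROM rss_ban') || [];
                %banned = map { $_ => 1 } @$ips;
                $dbh->disconnect;
            }
            $loaded_at = time();
        }
        return FORBIDDEN if $banned{ $r->connection->remote_ip };
        return OK;    # on to the response phase
    }
    1;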

Slashdot's resource requirements are actually a lot higher than this, since we log every hit instead of just RSS, along with the query string, user-agent, and so on -- and also because we've voluntarily taken on the privacy burden of MD5'ing incoming IP addresses so we don't know where users are coming from. That makes our IP address field 28 bytes longer than it has to be. But even so, we don't have performance issues. Slashdot's secondary-table processing takes about 10-15 seconds every 2 minutes.
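The hashing itself is a one-liner in Perl; the cost is purely in row width. A sketch:

    use Digest::MD5 qw(md5_hex);
    my $ipid = md5_hex($remote_ip);   # 32 hex chars vs. the 4-byte INET_ATON int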

As for Dan's concern about IP addresses hidden behind address translation -- yep, that's a concern. (We don't bother checking user-agent because idiots writing RSS-bombing scripts would just spam us with random agents.) The good news is that you can set your limits pretty high and still function, since a large chunk of your incoming bandwidth is that top fraction of a percent of hits that are poorly-written scripts. Even a large number of RSS feeds behind a proxy shouldn't be that magnitude of traffic. We do get reader complaints, though, and for a sample of them, anyone thinking about doing this might want to read this thread first.


3 comments

trackbacks and pings (1)

tf23 (27474) | more than 9 years ago | (#11096769)

I would imagine the same scenario (or at least a similar one) could be used for managing trackbacks and pings, then.

It's funny how someone posts an article about "how such and such should be done" and slashcode's already been dealing with it for years.

Throttling RSS (1)

eggboard (315140) | more than 9 years ago | (#11098736)

I've been concerned that you'd wind up losing large numbers of aggregators behind big proxies or anonymizers. If you have 2,000 AOL users running the same aggregator software that checks for feeds at the top of the hour -- only the first 50 or 100 get the update? That doesn't seem to scale for RSS the way it would for Slashdot's throttling mechanism.

Re:Throttling RSS (1)

jamie (78724) | more than 9 years ago | (#11100454)

It's pretty easy to check that on our site; a pretty much fixed percentage of our users create accounts, log in, post comments, and generally contribute to the site. We log that grouped by IP as well, so when we see an IP whose RSS is blocked and which has activity from n logged-in users, we can estimate that there are k*n actual users behind it and it's probably a proxy. We manually look at those IPs, and allow the proxies a lot more RSS hits.

If your site doesn't support logged-in user participation it's harder to distinguish 2,000 users behind a proxy from a DoS, of course.

Oh, and yes, a ton of dumbass software does update exactly at the top of the hour. Typically we see 500 extra RSS hits within about 2 seconds. Stupid synchronized clocks. We get spikes at 10 minute intervals too.
