On Tuesday, something terrible happened. The effects rippled through the world. And Slashdot was hit with more traffic than ever before as people grabbed at any open line of communication. When many news sites collapsed under the load, we managed to keep stumbling along. Countless people have asked me questions about how Slashdot handled the gigantic load spike. I'm going to try to answer a few of these questions now. Keep reading if you're interested.
I woke up and it seemed like a normal day. Around 8:30 I got to the office and made a pot of coffee. I hopped on IRC, started rummaging through the submissions bin, and of course, began reading my mail. Within minutes someone told me on IRC what had happened just moments after the impact of the first plane. Just a minute or two later, submissions started streaming into the bin. And at 9:12 a.m. Eastern Time, I made the decision to cancel Slashdot's normal daily coverage of "News for Nerds, Stuff that Matters," and instead focus on something more important than anything we had ever covered.
I couldn't get to CNN, and MSNBC loaded only enough to show me my first picture of the tragedy. I posted whatever facts we had: these were coming from random links over the net, and from Howard Stern, who syndicates live from NY, even to my town. Over the next hour I updated the story as events happened. I updated when the towers collapsed. And the number of comments exploded as readers expressed their outrage, sadness, and confusion following the tragedy.
Not surprisingly, the load on Slashdot began to swell dramatically. Normally at 9:30 a.m., Slashdot is serving 18-20 pages a second. By 10 we were up to 30 and spiking to 40. This is when we started having problems.
At this point Jamie and Pudge were online and we started trying to sort out what we could do. The database crashed and Jamie went into action bringing it back up. I called Krow: he's on Western time, but he knows the DB best, and I had to wake him up. But worst of all, I had to tell him what had happened in New York. It was one of the strangest things I've ever done: it still hadn't settled in. I had seen a few grainy photos but I don't have a TV in my office and hadn't yet seen any of the footage. After I hung up the phone I almost broke down. It was the first time, but not the last.
The DB problem was a known bug and the decision was made to switch to the backup box. This machine was a replicated mirror of Slashdot, but running a newer version of MySQL. We hadn't switched the live box simply because it meant taking the site down for a few minutes. Well we were down anyway, and the box was a complete replica of the live DB, so we quickly moved.
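For readers curious what a switch like that involves, here is a rough sketch. This is not our actual procedure, and the exact syntax varies by MySQL version (3.23, current at the time, spelled some of these commands differently):

```sql
-- Hypothetical sketch of promoting a replicated backup box to be
-- the live database. Not Slashdot's actual procedure.

-- On the backup box: confirm it has applied everything the old
-- primary sent, then stop replicating.
SHOW SLAVE STATUS;   -- check that the replica has caught up
STOP SLAVE;

-- Forget the old replication coordinates so the box stands alone.
RESET SLAVE;

-- From here, repoint every web server's DB config at this box and
-- restart: a few minutes of downtime, which is why we had put it off.
```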
At this point the DB stopped being a bottleneck, and we started to notice new rate limits on the performance of the 6 web servers themselves. Recently we fixed a glitch with Apache::SizeLimit: functionally, it kills httpd processes that use more than a certain amount of memory, but the size limit was too low and processes were dying after serving just a few requests. This was complicated by the fact that the first story quickly swelled to more than a thousand comments ... we've tuned our caching to Slashdot's normal traffic: 5000-6000 comments a day, with stories having 200-500 comments. And this was definitely not the normal story. Our cache simply wasn't ready to handle this.
Our httpd processes cache a lot of data: this reduces hits to the database and just generally makes everything better. We turned down the number of httpd processes (from 60 on each machine to 40) and increased the RAM that each process could use (from 30 to 40, and later 45, megs). We also turned off reverse hostname lookups, which we use for geotargeting ads: the time required to do the rDNS is fine under normal load, but under huge loads we need that extra second to keep up with the primary job of spitting out pages as fast as possible.
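If you run a similar mod_perl setup, the knobs we turned look roughly like this. This is a sketch, not our real config; the numbers are the ones above, and the directive spellings come from the Apache 1.3 and Apache::SizeLimit docs of the era:

```apache
# Sketch of the tuning described above -- not Slashdot's actual config.

MaxClients       40     # down from 60: fewer, but fatter, httpd children
HostnameLookups  Off    # skip the per-request reverse DNS for ad targeting

# mod_perl: kill any child that grows past the limit. Set this too
# low and children die after a handful of requests, throwing away
# their warm caches -- the glitch described above. Value is in KB.
PerlFixupHandler Apache::SizeLimit
<Perl>
    use Apache::SizeLimit;
    $Apache::SizeLimit::MAX_PROCESS_SIZE = 45 * 1024;   # ~45 megs
</Perl>
```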
This was around noon or so. I was keeping a close eye on the DB and we noticed a few queries that were taking a little too long. Jamie went in and switched our search from our own internal engine to Google: search is a somewhat expensive call on our end right now, and this was necessary just to make sure that we could keep up. We were serving 40-50 pages/second ... twice our usual peak of "just" 25 pages a second. I drove the 10 minutes home so I could watch CNN and keep up better with what was happening.
We trimmed a few minor functions out temporarily just to reduce the number of updates going to frequently read tables. But it was just not enough: The database was now beginning to be overworked and page views were slowing down. The homepage was full of discussions that were 3-4x the average size. The solution was to drop a few boxes from generating dynamic pages to serving static ones.
Let me explain: most people (around 60-70%) view the same content. They read the homepage and the 15 or so stories on it. And they never mess with thresholds, filters, or logins. In fact, when we have technical problems, we serve static pages. They don't require any database load, and the apache processes use very little memory. So for the next few hours, we ran with 4 of our boxes serving dynamic pages and 2 serving static. This meant that 60-70% of people would never notice, and the others would only be affected when they tried to save something ... and then they would only notice if they hit a static box, which would happen only one time in 3. It's not the ideal solution, but at this point we were serving 60-70 pages a second: 3x our usual traffic, and twice what we designed the system for. We got a lot of good data and found a lot of bottlenecks, so the next time something causes our traffic to triple, we'll be much more prepared.
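The arithmetic behind that split, as a quick back-of-the-envelope sketch (the percentages are the rough estimates above, not measurements):

```python
# Back-of-the-envelope math for the 4-dynamic / 2-static split.
# The anonymous-reader fraction is an estimate, not a measurement.

STATIC_BOXES = 2
DYNAMIC_BOXES = 4
TOTAL_BOXES = STATIC_BOXES + DYNAMIC_BOXES

ANON_FRACTION = 0.65  # ~60-70% of readers never log in or tweak thresholds

def fraction_inconvenienced():
    """Logged-in readers only notice when the load balancer hands them
    one of the 2 static boxes out of 6 -- one time in three."""
    logged_in = 1.0 - ANON_FRACTION
    return logged_in * (STATIC_BOXES / TOTAL_BOXES)

print(f"{fraction_inconvenienced():.0%} of requests see reduced function")
```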
At the end of the day we had served nearly 3 million pages -- almost twice our previous record of 1.6M, and far more than our daily average of 1.4M. During the peak hours, average page serving time slowed by just 2 seconds per page ... and over 8000 comments were posted in about 12 hours, and 15,000 in 48 hours.
On Wed. we started to put additional web servers into the pool, but that ended up not being necessary. We stayed dynamic and had no real problems on all 6 boxes all day. We peaked at around 35-40 pages/second. We served about 2 million pages. Thursday traffic loads were high, but relatively normal.
Summary

So here is what we learned from the experience.
- We have great readers. I had only one single flame emailed to me in 24 hours, and countless notes of thanks and appreciation. We were all frazzled over here and your words of encouragement meant so much. You'll never know.
- Slashteam kicks butt. Jamie, Pudge, Krow, Yazz, Cliff, Michael, Timothy, CowboyNeal, you guys all rocked. From collecting links, to monitoring servers, to fixing bits of code in real time. It was good seeing the team function together so well ... I can't begin to describe the strangeness of seeing two separate discussions in our channel: one about keeping servers working, and another about bombs, terrorists, and war. But through it all these guys each did their part.
- Slash is getting really excellent. With tweaks that we learned from this, I think that our setup will soon be able to handle a quarter million pages an hour. In other words, it should handle 3x Slashdot's usual load, without any additional hardware. And with a more monstrous database, who knows how far it could scale.
- Watch out for Apache::SizeLimit if you are doing caching: set the limit too low and your processes die before their caches ever pay off.
- Writing to and reading from the same InnoDB MySQL tables can be done, since InnoDB does row-level locking. But as load increases, it can start being less than desirable.
- A layer of proxy is desirable so we can send static requests to a box tuned for static pages. For a long time now we've known that this was important, but it's a tricky task. It is super necessary for us to increase the size of our caches in order to ease DB load and speed up page generation time ... but along with that we need to make sure that pages that don't use those caches don't hog precious apache forks that have them. Currently only images are served separately, but anonymous homepages, xml, rdf, and many other pages could easily be handled by a stripped-down process.
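To illustrate the InnoDB note above (the table and column names here are made up, not Slash's real schema):

```sql
-- With MyISAM, every INSERT locks the whole table, so a flood of new
-- comments blocks the readers. InnoDB locks rows instead:
ALTER TABLE comments TYPE = InnoDB;   -- ENGINE = InnoDB in later MySQL

-- Now concurrent inserts and selects on different rows proceed in
-- parallel -- though under extreme write load, lock and log
-- contention still hurts, which is the "less than desirable" part.
INSERT INTO comments (story_id, author, comment_text)
VALUES (42, 'SomeReader', 'My two cents ...');
```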
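As a sketch of that proxy layer (the hostnames, paths, and cookie name here are invented for illustration, not our actual setup), a front-end rewrite rule set might look like:

```apache
# Hypothetical front-end rules for the split described above.
RewriteEngine On

# Anonymous readers (no login cookie) asking for the homepage or the
# RDF/XML feeds get prebuilt static files from a lightweight,
# cache-free process.
RewriteCond %{HTTP_COOKIE} !user=
RewriteRule ^/$ /static/index.html [L]
RewriteRule ^/slashdot\.(rdf|xml)$ /static/slashdot.$1 [L]

# Everything else is proxied to the heavyweight mod_perl pool that
# carries the big in-process caches.
RewriteRule ^/(.*)$ http://dynamic-pool/$1 [P]
```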
What happened on Tuesday was a terrible tragedy. I'm not a very emotional person but I still keep getting choked up when I see some new heartbreaking photo or camera angle, learn some new piece of heartbreaking information, or read about something wonderful that somebody has done. This whole thing has shaken me like nothing I can remember. But I'm proud of everyone involved with Slashdot for working together to keep a line of communication open for a lot of people during a crisis. I'm not kidding myself by thinking that what we did is as important as participating in the rescue effort, but I think our contribution was still important. And thanks to the countless readers who have written me over the last few days to thank us for providing them with what, for many, was their only source of news during this whole thing. And thanks to the whole team who made it happen. I'm proud of all of you.