Behind the Scenes At Sony's NOC

kdawson posted more than 6 years ago | from the 140-servers-each dept.

Role Playing (Games) 49

VonGuard writes "Earlier this year, I spoke to Mark Rizzo, the man who manages the people who run Sony's online game servers. Rizzo learned the ropes of MMO hosting back on Ultima Online, and we chatted about where the tough problems were then versus now. Rizzo compares the operation to a 24/7 scientific simulation, albeit with some sassier and more involved end-users. His favorite innovation since those early days? Rapidly provisioning and deploying Linux installations tailor-made to their purposes. Here's my article on Rizzo and his band of 50-some-odd sysadmin-cum-dungeon-masters, written for the new newspaper The Systems Management News."


testing testing... 1 .... 2.... 3.. is this .. (-1, Offtopic)

Anonymous Coward | more than 6 years ago | (#23635389)

thing on???

Tag? (-1, Troll)

Anonymous Coward | more than 6 years ago | (#23635393)

Tagged this cummaster.

Summary: We have scripts (5, Funny)

BadAnalogyGuy (945258) | more than 6 years ago | (#23635417)

So to sum up, they have lots of programs that are constantly watched by scripts. They get to heave server machines around to expand certain areas and replace old servers. Their lives are mostly taken up with making sure that the backups are properly done on time each day and that no one accidentally steps on the power cord.

Fascinating!

Re:Summary: We have scripts (2, Funny)

thermian (1267986) | more than 6 years ago | (#23635463)

So he worked on Ultima Online eh? I guess one of those scripts will have the line "if (lord_british) keepalive;" then.

Re:Summary: We have scripts (2, Funny)

LandDolphin (1202876) | more than 6 years ago | (#23638793)

Only after August 8, 1997

And Remedy :P (5, Interesting)

Moraelin (679338) | more than 6 years ago | (#23635595)

No shit. And they also use Remedy. (Same as half the companies out there.)

That said, if they claim to be also architects, IMHO they do a poor job too.

E.g., at one point, after much lurking, and after I already had a big list of veteran awards in SWG, I wanted to post a suggestion. I didn't have a forum handle yet (hadn't needed one before), but that's OK, I'll just go to the account management and create one. Turns out I'm sandboxed in a newbie forum no one else needs for the next two weeks, 'cause apparently the forum can't read from the database whether I'm on a trial account or a regular one. But it can read whether I have an active account, or whether I just deactivated it. (Sony's games were in the habit of asking why you quit. Post-NGE SWG was the only one which basically told me "go away, we don't take input from people with inactive accounts" after I filled in that form.) But it can't read whether I'm on a trial account or not.

Well, it sounds to me like those architects of the server room don't do a particularly great job, then. Whatever interface they use to that customer database (SOAP, XMLRPC, plain SQL, whatever) should be trivial to extend to fetch that one extra piece of information. If month after month no one can figure out how to do that, it doesn't come across as a particularly competent architecture.

That, or they have no qualms with lying to the customers.

Additionally, I kinda find this funny, and while pioneered by UO, it's become a typically _Sony_ excuse later: "While today most of the problems faced by Rizzo's team are technical or development related, back in the Ultima Online days, these were compounded by the unpredictable player base. In its day, no one had ever seen the psychological and sociological reactions of players in a massive online world before."

Erm, no. The vast majority of problems UO had were already known (and some even solved) by MUDs before. There was no excuse to repeat the same mistakes verbatim, and try the same things which were known not to work.

E.g., player justice was known not to work, as there's nothing you can do to the disposable character of a griefer, that its owner would care about. Plus, mobilizing whole posses to hunt down a griefer is, basically, just feeding the troll: he got some attention out of tens of people. Tens or hundreds of MUDs have tried that before, as it was the holy grail of being able to run a MUD without the non-fun headache of policing it, and it just didn't work without being backed by a lot of admin support. UO's recipe was known to fail, every time.

What really happened with UO was Lord British having his head so far up his own arse that he couldn't see there's a world outside. He didn't so much discover those issues as thoroughly ignore everything that had been discovered by anyone else. And then he repeated the same thing with Tabula Rasa.

And as for Sony, since a lot of people there seem very fond of the same excuse: you have even less right to use that excuse, guys. SWG was a _third_ generation MMO, EQ2 is even later. There wasn't really an excuse even for UO to ignore the lessons of MUDs before it. Ignoring a couple dozen MMOs before you, is even less excusable.

And finally: how were those social issues relevant, in any form or shape, for the IT guys running the servers? I mean, seriously, they (A) were poor game design issues, (B) created some work for the coders who had to keep implementing fixes (which created more problems and the need for the next fix), and (C) were a never-ending headache for the GMs who had to sort out the thousands of support requests resulting from that fuck-up. Daily. But for the guys monitoring the servers and doing backups? Exactly how does it affect them whether the MMO is a friendly place or a newbie-hostile gank-fest run amok?

Re:And Remedy :P (5, Insightful)

kjart (941720) | more than 6 years ago | (#23635683)

All the problems you are describing are engineering/development issues and don't have anything to do with operations. The architects would be for the infrastructure, deployment, monitoring, etc etc, not for the games themselves.

Re:And Remedy :P (-1, Troll)

Anonymous Coward | more than 6 years ago | (#23636269)

don't forget the cum-dungeon masters.

Re:And Remedy :P (1)

brkello (642429) | more than 6 years ago | (#23638773)

Well, hope the rant made you feel better. But really, the people who do the forum server administration are most likely not the same people who administer the world servers. Besides, it sounds more like a limitation that was there because of the programmers. In other words, you are yelling at the chef because the waiter dumped dinner on your lap.

Re:And Remedy :P (0)

Anonymous Coward | more than 6 years ago | (#23643405)

The World of Warcraft method of player forums is much better. Tie your forum login to the master game login server. That way when the game goes down no one can login to the forums to bitch.

Re:Summary: We have scripts (0)

Anonymous Coward | more than 6 years ago | (#23655101)

Boys, get over to Twitter!

Hmmm (0)

Anonymous Coward | more than 6 years ago | (#23635419)

Pity it took them 16 months to fix the Antonius Bayle server for EQ1 and the problems with instances still pop up occasionally.
Terrible management of servers that are supposed to support a paying playerbase.

Re:Hmmm (0)

Anonymous Coward | more than 6 years ago | (#23635935)

I'm just thinking: 'How many million a month for all their MMO assets, and they have a lousy 50 people adminning the whole place?' :)

so is this running on a beowolf cluster (-1)

Anonymous Coward | more than 6 years ago | (#23635421)

of digital calculator watches?

remember those in the 80's?

sysadmin-cum-dungeon-masters (5, Funny)

Anonymous Coward | more than 6 years ago | (#23635449)

sysadmin-cum-dungeon-masters


Anyone else have images of S&M runnin through their minds?

Re:sysadmin-cum-dungeon-masters (4, Funny)

somersault (912633) | more than 6 years ago | (#23635487)

Beowulf's been a naughty boy.. PRINT IT!!!

kill (1000+('od -An -N2 -i /dev/random')%2001)

Oh, you like that don't you!? Want me to do it again? First, I'm going to show you what a real glob is..

Re:sysadmin-cum-dungeon-masters (2, Funny)

RuBLed (995686) | more than 6 years ago | (#23635571)

And I thought Ballmer Peak [xkcd.com] was bad enough...

So anyone working on Sony's MMO division is a cum-dungeon bitch.. err.. minion now?

Re:sysadmin-cum-dungeon-masters (1, Funny)

Anonymous Coward | more than 6 years ago | (#23636935)

sounds like a badly titled porno.

Arr Eye Zee Zee Oh (1, Funny)

Anonymous Coward | more than 6 years ago | (#23635553)

Rizzo. Oh hang on a minute, that's Frank.

Re:Arr Eye Zee Zee Oh (1, Funny)

Anonymous Coward | more than 6 years ago | (#23635803)

Yeah... open your fuckin' ears, jackass!

What change management? (4, Interesting)

Antique Geekmeister (740220) | more than 6 years ago | (#23635593)

I see the article does not mention what they use for change management. I'm curious what they use: I like Bugzilla myself for ticket tracking, and it's potentially useful for configuration management as well, but needs significant revision to provide that or source control integration.

Most change control systems make odd choices between a business model of selling proprietary clients, strange choices of backend databases, and a focus on managing sales contact information, hardware inventory, software updates, filling out lots of forms for tracking minutes used doing the work, etc., etc. The choices of the change control system affect the workflow quite a lot: so I'm quite curious what they use. Does anyone here on Slashdot know?

Re:What change management? (0)

Anonymous Coward | more than 6 years ago | (#23635771)

Article hints to BMC Remedy Change Management Application [bmc.com]

Re:What change management? (2, Informative)

Zen (8377) | more than 6 years ago | (#23636925)

Did we read the same article? They use BMC Remedy along with some homegrown stuff.

Re:What change management? (0)

Anonymous Coward | more than 6 years ago | (#23644393)

Remedy is actually a pretty powerful piece of software. It allows the developer to create the front end logic a lot like Visual Basic, but without any coding. You basically draw your interface. All you need to do is set up the database tables to match your interface. I wrote a help desk system in Remedy once for a company that had accounts with J&J and Merck, and both companies loved it.

The real power of the product is that you don't need a team of programmers, or even a programmer at all, to make changes to the system, just someone who can add new screens, features, and abilities, and tie them into a database. So if your business needs to frequently update its Change Management requirements, Remedy is a lot better than canned packages that require code changes.

Re:What change management? (0)

Anonymous Coward | more than 6 years ago | (#23643043)

completely home grown change management system.. pretty cool actually. lots of knobs.

Promotion Strategy. (4, Funny)

Angostura (703910) | more than 6 years ago | (#23635745)

If you spend an awful lot of time grinding in the NOC you eventually become a level-something 'Network Architect' with no direct reports but with the ability to tell everyone what to do.

Taking a few management tips from in-game, perhaps?

Would love to hear more from these teams (5, Interesting)

magamiako1 (1026318) | more than 6 years ago | (#23635757)

I think a lot of people underestimate the requirements of running 24/7 online game servers for persistent worlds. There are definitely some serious architectural hurdles to overcome that don't necessarily exist in other areas of IT. In fact, one could say it's like "regular" IT work but on steroids.

For one, the server hardware has to be pretty powerful. Because it's doing a lot of high demand database work, everything from the lower layers of the hard disks to the file system to the software itself has to be fast and reliable.

For two, there is an increased demand for data reliability. If you manage an e-mail server and for some reason a flaw in the e-mail server doesn't pass e-mail on properly, you may be able to fix it and tell users to simply resend whatever e-mail they were sending and that's that. If a flaw comes up in the online game world that requires users to possibly "redo" something they did in the game, you will immediately lose a vast majority of your playerbase as they will see the game as unreliable.

That said also, the servers are very high demand 24/7. Even when the maintenance times are scheduled outages, people still complain. Generally in a normal business IT scenario, you can reboot a few servers here or there and nobody will notice anything during off time. So you've got change control windows that can occur 2 hours before anyone else gets to work and have to use the system, and they won't care one way or the other as long as everything's fine when they get into the office.

The databases are vast, doing constant read/write operations. Again, constantly changing database as players move about the world and interact. Exchanging items, gold, leveling up, learning new abilities.

Clustering and load balancing become very real problems for game servers. This is extremely apparent when you look at Blizzard, where they number over 200 separate, completely independent realms worldwide.

We won't even get into issues where the game world can't be dynamic and involving due to the technical limitations that we have, resulting in very limited forms of gameplay.

And again, you cannot forget the customer base. You know, if Joe cannot access e-mail for an hour because something is up with his e-mail account on the server, in most situations that's perfectly fine, he has something else he can do and you won't necessarily lose money on productivity. If Joe cannot access his online gaming character, you have the potential to lose a sale and a customer.

Very high demand indeed.

Re:Would love to hear more from these teams (1)

Jellybob (597204) | more than 6 years ago | (#23635915)

Very high demand indeed.

Not really. I read the article (yeah... I must be new), and it looked like the work done by every other NOC in the world.

Sure, there's a lot of servers to manage, but if you've got everything automated anyway, it doesn't really matter how many thousands of servers you have. If one goes down, reimage it, and get on with life. Maybe they have to go and change a blown hard disk now and again.

Re:Would love to hear more from these teams (1, Informative)

magamiako1 (1026318) | more than 6 years ago | (#23635939)

Jellybob:

As I stated, the environment is very high demand. It's not quite like every other datacenter environment from a systems architecture point of view because of the nature of the data.

Just from a systems architecture point of view (and pardon me if I'm not as well versed in database architecture as some others): in the community version of MySQL there are multiple ways to do backups and database tracking for recovery if needed. One question is how to track the database data in the event of a database failure. You can backup the SQL database files, but what if the data hadn't been paged out to disk yet? Stuck in cache somewhere that got erased when the machine powered off?

This sort of issue might not be too big of a deal if Joe Schmoe's forum user account needs to be restored from a backup, but it's a big problem when his game character loses a very important item that he just obtained or an achievement he received.

And while of course that might not be a "big problem" to you because it's a video game and "nobody should care that much", it's still a big data problem nonetheless that puts it on par with say, medical information on a patient that didn't get stored properly.

Re:Would love to hear more from these teams (3, Interesting)

Jellybob (597204) | more than 6 years ago | (#23636357)

You can backup the SQL database files, but what if the data hadn't been paged out to disk yet? Stuck in cache somewhere that got erased when the machine powered off?

You replay the binary logs of any transactions that were run since the last backup.
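
The recovery idea in this reply (restore the last backup, then replay whatever was logged after it) can be sketched in a few lines of Python. This is only a toy model under stated assumptions: it is not MySQL's binary log format or tooling, and the names `LogEntry` and `recover` are hypothetical, invented for illustration. It assumes every committed write was appended to a durable log with an increasing sequence number before being acknowledged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    """One logged write (hypothetical structure, not MySQL's binlog format)."""
    seq: int    # position in the log; assumed to increase with every write
    key: str    # e.g. a character attribute such as "gold"
    value: int  # the value written


def recover(snapshot: dict, snapshot_seq: int, log: list) -> dict:
    """Rebuild state from a backup plus the log entries written after it."""
    state = dict(snapshot)  # start from the last backup, leaving it untouched
    for entry in sorted(log, key=lambda e: e.seq):
        if entry.seq > snapshot_seq:  # skip writes the backup already contains
            state[entry.key] = entry.value
    return state


# Backup taken after write 2; writes 3 and 4 exist only in the log.
backup = {"gold": 100, "level": 9}
log = [
    LogEntry(1, "gold", 50),
    LogEntry(2, "gold", 100),
    LogEntry(3, "level", 10),
    LogEntry(4, "gold", 250),
]
restored = recover(backup, snapshot_seq=2, log=log)
```

In this sketch the item or level gained after the backup survives the crash only because its log entry was made durable first; anything that was never logged is still lost.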

I'm not saying it's not a big problem because it's a game - I play lots myself, and understand the frustration when things break. I'm saying it's not a big problem because whether you're tracking forum posts, medical records, or game players, when it gets to the database and hardware level, it's all the same thing.

These are solved problems. The headline may as well be "sysadmins administer systems for Sony". The only reason this is getting any coverage is because they mentioned MMOs at some point.

Re:Would love to hear more from these teams (2, Insightful)

sticks_us (150624) | more than 6 years ago | (#23636027)

As you say, there's no doubt these people are doing impressive things, but to anyone with experience dealing with e-commerce solutions (read: involving people's money), all of these measures will probably seem familiar.

The problems mentioned above about transactional integrity, backup/restore, availability, clustering, "five nines" uptime have all been largely addressed at places like Amazon, Bank of America, and so on.

Re:Would love to hear more from these teams (1)

magamiako1 (1026318) | more than 6 years ago | (#23636047)

True. As I said in my post, gaming ranks up there as one of the most high demand environments. It's an environment that a large number of users never have to deal with, because they don't handle data at that scale of availability.

It was just a post to put it into perspective for perhaps some readers who take video game server environments for granted because it's a video game :)

Re:Would love to hear more from these teams (1)

magamiako1 (1026318) | more than 6 years ago | (#23636075)

Even so, I would like to add that users in those environments don't have to interact with each other. If your processing load (say, server-side scripts and workloads) gets too high for a particular individual box, you can load balance with another machine and generally be okay with that.

You can't simply add "another box" to the game environment for a single instance of the game server since you run into issues where users interact with each other and movement data is processed and sent between server/client. You would need software that gracefully handles transmitting this data over a high bandwidth link to another server.

It's not something I've personally seen, but again, this is why I noted it would be great to really get into the nitty gritty to see how these environments actually pull off some of the things they do and where they're headed.

Re:Would love to hear more from these teams (1)

Lodragandraoidh (639696) | more than 6 years ago | (#23636989)

..."five nines" uptime ...

Snake Oil I say! In every implementation I've had the pleasure to be a part of, when the subject of '5-9s uptime' comes up it is quickly shown that ensuring such an outrageous performance level makes the cost of the service exceed the revenue generated.

'Five-Nines' is 99.999% uptime - which equates to approximately 5.25 minutes of down time per year - or ~ 6 seconds per week.

In my experience measurements are taken for the overall system -- so assuming you have a customer-facing system of 1000 or more servers, one machine failure of 6 minutes' duration just blew your numbers for the year. If you want redundancy on those 1000 servers, now you are talking about doubling the number of servers to 2000 - and similarly setting up any external devices/services that aren't customer facing per se (such as NAS, DBs etc). The costs can quickly become excessive. The larger your deployment, the more likely you are to blow your numbers.
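
The arithmetic in this comment is easy to check. The sketch below is illustrative only: the function names are made up for this example, and the last calculation assumes that server failures are independent and that any single server being down counts against the whole system, which is the measurement convention the poster describes rather than a universal one.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 (non-leap year)
SECONDS_PER_WEEK = 7 * 24 * 60 * 60    # 604,800


def downtime_budget(availability: float, period_seconds: int) -> float:
    """Seconds of downtime allowed in a period at the given availability."""
    return (1.0 - availability) * period_seconds


def all_up_probability(per_server: float, servers: int) -> float:
    """Chance that every server is up at once, assuming independent failures."""
    return per_server ** servers


if __name__ == "__main__":
    per_year = downtime_budget(0.99999, SECONDS_PER_YEAR) / 60
    per_week = downtime_budget(0.99999, SECONDS_PER_WEEK)
    fleet = all_up_probability(0.99999, 1000)
    print(f"budget per year: {per_year:.2f} minutes")
    print(f"budget per week: {per_week:.2f} seconds")
    print(f"1000 servers all up at once: {fleet:.4f}")
```

The yearly budget comes out at a little over five and a quarter minutes and the weekly one at about six seconds, matching the figures quoted. Under the independence assumption, 1000 servers that are each five-nines are all up together only about 99% of the time, which is why a single six-minute failure can consume the whole year's allowance.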

Re:Would love to hear more from these teams (4, Funny)

Minwee (522556) | more than 6 years ago | (#23637459)

The secret to achieving five nines uptime is not to improve the reliability of the systems, but instead to be very careful about how you define "uptime".

"Hey, about those two hours of downtime last night..."

"There wasn't any downtime."

"No, really, the phones were lit up with people complaining that the applications weren't answering properly..."

"So the applications were answering queries? Then they were up. It's not downtime."

"But they were answering queries with error messages."

"Then that's an application problem. The system was still up."

"But the error messages said 'No response from database'. The database servers were down."

"No they weren't. They were still running. They still had power. The servers were up. It's not as if they fell down out of the racks. You can't call it downtime just because a few programs aren't behaving exactly the way you want."

"So about this SLA..."

"Five nines, baby. We've still got five nines."

Re:Would love to hear more from these teams (3, Insightful)

Capitalist Piggy (1298699) | more than 6 years ago | (#23635927)

You make this whole thing sound like it's a 99.999% uptime venture, when I've seen EQ, WoW and Planetside servers go down for days at a time.

Having spent much of my grown life as a NOC monkey, I can assure you heads would roll at the ISPs I've worked at if we had nearly the number and lengths of outages experienced in the gaming world.

I don't see how this is more "involved" as far as the end user is concerned. What's going to happen on an MMORPG? People will post in forums and not ever see a response. That's not involvement. Involvement is when you've got three call centers with a two hour hold time, the random crazy person finding your NOC number, and directors having emergency meetings over even minor outages because these particular millions of customers have stocks to purchase, games to play, and email to check and they have a nice 1-800 number to dial instead of hitting a forum that's likely going to be down if your game servers are having trouble.

I think Sony is just doing a little self-appreciation in the article, as I don't really expect anyone at any company to say the guy monitoring the network at night is playing Q3 on his workstation or about the guy who shows up on meth sometimes.

Re:Would love to hear more from these teams (2, Insightful)

justthinkit (954982) | more than 6 years ago | (#23638269)

Having spent much of my grown life as a NOC monkey, I can assure you heads would roll at the ISPs I've worked at if we had nearly the number and lengths of outages experienced in the gaming world.

And the obvious difference is that with an ISP you don't have dozens or hundreds of people trying new ways to game the system. With fail over, live backup servers and cron jobs aplenty, you just swap out/swap in and you are good to go. With MMORPGs, someone hacks the system and you have to shut it down deliberately, pour yourself a double-shot and let out a loud WTF. Then study the hack, if you can, then engineer a work-around, then test it, then deploy it. Then bring the system back up. Yeah, these are very comparable systems alright.

Another difference is when a new MMORPG becomes popular and you go from a hundred test users to a thousand gamers to 100,000 to 1,000,000 in about a week. Gee, our roomful of servers has to become a building full. Should take just 0.001% of a year to do that. Capacity is a chicken and egg situation -- you aren't going to buy a thousand servers before you have even launched a game so you are forced to play catch up.

Re:Would love to hear more from these teams (1)

Capitalist Piggy (1298699) | more than 6 years ago | (#23640583)

And the obvious difference is that with an ISP you don't have dozens or hundreds of people trying new ways to game the system. With fail over, live backup servers and cron jobs aplenty, you just swap out/swap in and you are good to go. With MMORPGs, someone hacks the system and you have to shut it down deliberately, pour yourself a double-shot and let out a loud WTF. Then study the hack, if you can, then engineer a work-around, then test it, then deploy it. Then bring the system back up. Yeah, these are very comparable systems alright.

You apparently don't know what you are talking about when it comes to an ISP having totally redundant hardware that just requires a flip of the switch when something happens. ISPs are rather big, easy targets for someone with some skills. This is primarily due to them being just like any other organization with a large network. Upper level management decides they want more openness to streamline things because one director fell for an engineer's gripe about how they should have access to port whatever from anywhere on the Internet, resulting in a VP thinking it's a good idea while all of Operations and Information Security are advising against it, looking "difficult", and getting shit upon when more requests come down the pipe. The next thing you know, you've not slept in four days because the entire enterprise has been "pwned" and you've got thousands of machines to audit. All while a CNN van and local news truck are sitting in the parking lot, hassling employees as they leave work.

In other instances, many machines that are "critical" in at least two major ISPs I've worked for are not redundant. Again, if a Master Nerd was in charge of the company, this wouldn't be an issue, but that's not been the case since probably 1996 at any big provider. There are too many layers of people in charge of the money who don't understand what they are dealing with and who, due to internal politics, will base their decisions on who had the more eloquent argument.

Regardless, the job you are describing is a different type of engineering from what the Network Operations Center does. The issue you cite is more of a SysAdmin's or Developer's responsibility.

Even if my view is skewed, I don't think getting machines back online is nearly as high a priority as it is with the actual pipe connecting you to the game. Like I said, you don't have thousands of customers actually calling and demanding a month's free service because something failed for a couple of hours. It's just the way MMOs go, almost an expectation that the crack servers be down, lagged out, or having some random issue 10% of the time.

Re:Would love to hear more from these teams (1)

Gutter Dogg (640493) | more than 6 years ago | (#23638443)

What I would really love to hear about is the NOCs behind some of these big financial exchanges. Companies like the CME Group (earns on average $434k per business hour, allegedly powered by a Linux cluster), NYSE Arca, or ICE would no doubt have much more robust infrastructure than a gaming operation.

cum-dungeon-masters TAG! (0, Funny)

Anonymous Coward | more than 6 years ago | (#23635773)

Pretty pretty please tag 'cum-dungeon-masters' just for this!

Sony Malware Central? (0, Troll)

feepcreature (623518) | more than 6 years ago | (#23635847)

So is this where the Sony BMG rootkit and auto-updating malware was supposed to be controlled from?

Re:Sony Malware Central? (0)

Anonymous Coward | more than 6 years ago | (#23638845)

I see what you did there. You attempted to make a joke....

Funny mis-read ! (1, Funny)

Anonymous Coward | more than 6 years ago | (#23636177)

I mis-read part of the last sentence in the summary as "50-some-odd sadism-cum-dungeon-masters" which, oddly enough, makes some sense.

Yeah, I have a scary mind. Boo !

Linux? (1, Informative)

wilder_card (774631) | more than 6 years ago | (#23636599)

So Linux is good enough for the servers, but no one can be bothered to make the client compatible with Linux. Wouldn't be hard thanks to Wine, but nope.

Re:Linux? (3, Insightful)

magamiako1 (1026318) | more than 6 years ago | (#23636951)

The client and what the server does and has to do are entirely separate things and pretty much have no relation with regards to each other in any way except that they communicate data back and forth for one or the other to process.

...sigh... (0, Offtopic)

ravrazor (69324) | more than 6 years ago | (#23638827)

For a board that ostensibly cares about eliminating stupid comments and trolling, why would an editor ever post a story summary that includes a reference to "50-some-odd sysadmin-cum-dungeon-masters"? Not only is the whole sentence awkward, but it is going to produce (or already has produced) tons of asinine comments, and use up tons of mod points on comments getting modded up by idiots who can't imagine anything more hilarious than references to semen, and then getting modded down by less immature, less socially retarded readers who think those comments are just stupid.
Slashdot editors should really try getting a realistic perspective on who reads (or maybe just posts to) these discussions.

Shame they... (0)

Anonymous Coward | more than 6 years ago | (#23640897)

..kill their game servers off after only a few years!
MONSTER HUNTER

Let me tell you a story about UO (2)

genner (694963) | more than 6 years ago | (#23642767)

Blah blah blah bla blah and then the server crashed. All my stories end the same way.

Cum Dungeons? (0)

Anonymous Coward | more than 6 years ago | (#23645219)

Cum Dungeons?
