
Supercomputers' Growing Resilience Problems

samzenpus posted about 2 years ago | from the a-thousand-potential-cuts dept.

Supercomputing 112

angry tapir writes "As supercomputers grow more powerful, they'll also grow more vulnerable to failure, thanks to the increased amount of built-in componentry. Today's high-performance computing (HPC) systems can have 100,000 nodes or more — with each node built from multiple components: memory, processors, buses and other circuitry. Statistically speaking, all these components will fail at some point, and they halt operations when they do so, said David Fiala, a Ph.D. student at North Carolina State University, during a talk at SC12. Today's techniques for dealing with system failure may not scale very well, Fiala said."


Hardly A New Problem (5, Informative)

MightyMartian (840721) | about 2 years ago | (#42061787)

Strikes me as a return to the olden days of vacuum tubes and early transistor computers, when component failure was frequent and brought everything to a halt while the bad component was hunted down.

In the long run if you're running tens of thousands of nodes, then you need to be able to work around failures.

Re:Hardly A New Problem (1)

Anonymous Coward | about 2 years ago | (#42061891)

Furthermore, there's also the race for higher density: if you put 48 cores in a single box or on a single chip, expect that one core failing will cause issues that need to be addressed from somewhere else, and that replacing the "system" will probably mean 47 good cores get replaced along with it.

As always, if you condense everything, you can expect that the one piece of the whole system that breaks causes failures in the rest of the system.

Re:Hardly A New Problem (1)

Tontoman (737489) | about 2 years ago | (#42062151)

Here is an opportunity to leverage the power of the HPC system to detect, correct, and work around any integrity issues in the system. Not a new problem, just an opportunity for productive application of research.

Re:Hardly A New Problem (2)

CastrTroy (595695) | about 2 years ago | (#42062261)

I was thinking that if you had 100,000 nodes, a certain percentage of them could be dedicated to failover when one of the nodes goes down. The data would exist on at least 2 nodes, kind of like RAID but for entire computers. If one node goes down, another node can take its place. The cluster would only have to pause a short time while the new node got the proper contents into memory, along with the instructions it needed to run. You'd need to do some kind of coding so a server could pick up as close as possible to where the other one died, but at least it wouldn't require an actual human to walk in and replace a part for the whole cluster to continue.
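The buddy scheme described above can be sketched in a few lines. This is only an illustration of the idea, not a real cluster API; all the names (`Node`, `sync_pair`, `fail_over`) are made up for the example.

```python
# Sketch of the pairwise-replication idea: each node's state is mirrored on
# a buddy node, so a failure costs only the work since the last sync,
# not the whole run. All names here are illustrative.

class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.state = {}          # in-memory computation state
        self.replica = {}        # buddy's mirrored state
        self.alive = True

def sync_pair(a, b):
    """Mirror each node's state onto its buddy (like RAID 1, but for RAM)."""
    a.replica = dict(b.state)
    b.replica = dict(a.state)

def fail_over(failed, spare, buddy):
    """Bring a spare online with the failed node's last-synced state."""
    spare.state = dict(buddy.replica)   # restore from the buddy's mirror
    spare.node_id = failed.node_id      # take over the failed node's role
    return spare

n1, n2 = Node(1), Node(2)
n1.state = {"step": 41, "partial_sum": 3.14}
sync_pair(n1, n2)

n1.alive = False                        # node 1 dies mid-run
replacement = fail_over(n1, Node(99), buddy=n2)
print(replacement.state)                # resumes from step 41, not step 0
```

The cost, of course, is doubled memory and the synchronization traffic on every checkpoint, which is where the later comments about interconnect and IO bandwidth come in.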

Re:Hardly A New Problem (1)

guruevi (827432) | about 2 years ago | (#42062377)

This problem has already been solved by a number of HPC schedulers.

Whoever this person is doesn't know shit about running a cluster; the problem was solved somewhere back in the mid-'80s.

Re:Hardly A New Problem...and thus has been fixed (3, Insightful)

poetmatt (793785) | about 2 years ago | (#42062245)

The reality of hegemonous computing is that failure is almost of no concern. If you have 1/1000 nodes fail, you lose 1/1000th of your capability. Everything doesn't just instantly crash down. That's literally the purpose of basic cluster technology from probably 10 years ago.

How can they act like this is some new or magic issue? It doesn't exist if HPC people know what they're doing. Hell, usually they keep a known quantity of extra hardware out of use so that they can switch something on if things fail, as necessary.

Re:Hardly A New Problem...and thus has been fixed (2)

cruff (171569) | about 2 years ago | (#42062983)

The reality of hegemonous computing is that failure is almost of no concern. If you have 1/1000 nodes fail, you lose 1/1000th of your capability. Everything doesn't just instantly crash down.

You might think so, but I've seen a configuration with an interconnect fabric that was extremely sensitive to individual links falling back to the next lower link speed, causing all sorts of havoc cluster-wide.

Re:Hardly A New Problem...and thus has been fixed (1)

EETech1 (1179269) | about 2 years ago | (#42066613)

Is this something that can be partially avoided by using Itanium processors instead of x86? Or has all of the reliability stuff been included in the recent Xeon chips?

I understand that there's little the processor can do if the mobo dies (unless it notices an increase in failed computations right beforehand, in which case maybe it could send a signal), but is there any advantage to the type of processor used in some cases?

Just curious...


Re:Hardly A New Problem...and thus has been fixed (1)

cruff (171569) | about 2 years ago | (#42067129)

Is this something that can be partially avoided by using Itanium processors instead of x86? Or has all of the reliability stuff been included in the recent Xeon chips?

No, not related to the processor, just a specific inter-node interconnect fabric implementation. The moral of the story: buying stuff that has just become available and trying to deploy it at scales well beyond what others (including the vendor) have done before leads to an effectively experimental configuration, and it is not one you should expect to behave like a production environment for some time, until the kinks are worked out. Of course, this plays hell with the overly optimistic schedules management thought should be achievable.

Re:Hardly A New Problem...and thus has been fixed (3, Informative)

markhahn (122033) | about 2 years ago | (#42063371)

"hegemonous", wow.

I think you're confusing high-availability clustering with high-performance clustering. In HPC there are some efforts at making single jobs fault-tolerant, but it's definitely not widespread. Checkpointing is the standard, and it works reasonably well, though it is an IO-intensive way to mitigate failure.

Re:Hardly A New Problem...and thus has been fixed (1)

gentryx (759438) | about 2 years ago | (#42064859)

Checkpoint/restart is actually a rather poor workaround, as the ratio of IO bandwidth to compute performance shrinks with every new generation of supercomputers. Soon enough we'll spend more time writing checkpoints than doing actual computations.

I personally believe that we'll see some sort of redundancy on the node level in the mid-term future (i.e. the road to exascale), which will sadly require source code-level adaptation.

Re:Hardly A New Problem...and thus has been fixed (1)

Jah-Wren Ryel (80510) | about 2 years ago | (#42063373)

The reality of hegemonous computing is that failure is almost of no concern. If you have 1/1000 nodes fail, you lose 1/1000th of your capability.

Yeah, no surprise there. Historically, kings have never cared about what happens to the peons.

Re:Hardly A New Problem...and thus has been fixed (1)

sjames (1099) | about 2 years ago | (#42064151)

I don't think you understand the nature of the problem.

The more nodes N you use for a computation and the longer the computation runs, the greater the odds that a node will fail before the computation can complete. Small programs ignore the problem and usually work just fine. Should a node crash, the job is re-run from scratch.

More sophisticated programs write out a checkpoint periodically so that if (when) a node fails, it is replaced and the job restarts from the checkpoint. However, that is not without cost. Computation isn't happening when state is being written out. However, so far, it's been manageable.

The issue in TFA is that this will not scale. As you add nodes, the system MTBF gets much smaller (and so checkpointing has to become more frequent). Eventually you would reach a point where adding nodes actually slows the computation down because of the necessary overhead of checkpointing. Add enough nodes and eventually it does nothing but checkpoint over and over. That's a classic scalability problem, and it has no obvious solutions that don't involve a magic wand.
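This scaling wall is easy to show with back-of-envelope numbers using Young's classic approximation for the optimal checkpoint interval, t_opt ~ sqrt(2 * C * M), where C is the time to write one checkpoint and M is the system MTBF. The 5-year node MTBF and 10-minute checkpoint cost below are assumptions chosen only to illustrate the trend.

```python
import math

# Young's approximation: checkpoint every t_opt ~ sqrt(2 * C * M) hours,
# where C = checkpoint write time and M = system MTBF. As node count N
# grows, M shrinks like 1/N, so checkpoints must come ever more often.
NODE_MTBF_H = 5 * 365 * 24       # assume one node fails every ~5 years (hours)
CHECKPOINT_H = 10 / 60           # assume 10 minutes to write one checkpoint

for nodes in (1_000, 10_000, 100_000, 1_000_000):
    system_mtbf = NODE_MTBF_H / nodes      # failures arrive N times as often
    t_opt = math.sqrt(2 * CHECKPOINT_H * system_mtbf)
    # rough fraction of walltime spent checkpointing instead of computing:
    overhead = CHECKPOINT_H / (t_opt + CHECKPOINT_H)
    print(f"{nodes:>9} nodes: system MTBF {system_mtbf:8.2f} h, "
          f"checkpoint every {t_opt:.2f} h, overhead ~{overhead:.0%}")
```

With these made-up but plausible numbers, the overhead climbs from a few percent at 1,000 nodes to more than half the machine's time at a million nodes: exactly the "nothing but checkpoints" wall described above.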

easy solution (2)

cheekyboy (598084) | about 2 years ago | (#42065251)

Just make checkpointing cost zero time. How? Have each node really be a dual node (like most things in nature are): one half is always computing, and the other half is checkpointing, just like video games use double buffering for smooth frame rates. If checkpointing uses less CPU than computing, then swap the two halves' roles every N runs to give each core a temperature rest, to cool down and increase its life.

Sure, each node is bigger, but it could perhaps scale better; the overhead curve should be way better, and flat.

Hire some big PhDs to design mitigation mechanisms to increase MTBF through better environmental management. I dunno, give nodes a rest for 30 minutes every 5 days. More research needs to be done, and that takes a tonne of time (unless you can simulate it on a supercomputer), doh.

Just a thought.
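The double-buffering idea above can be sketched with a background writer thread: take a fast in-memory snapshot, then let the slow write to stable storage overlap with the next compute step. Everything here is illustrative; `persist()` is a stand-in for the real (slow) write to disk or a buddy node.

```python
import copy
import threading

# Double-buffered checkpointing sketch: the compute path only pays for a
# fast memory copy; the slow persist() runs concurrently with computation.
saved = []

def persist(snapshot):
    saved.append(snapshot)          # pretend this is a slow disk write

def run(steps):
    state = {"step": 0}
    writer = None
    for _ in range(steps):
        state["step"] += 1          # ... the actual computation ...
        if writer is not None:
            writer.join()           # previous snapshot must be safely flushed
        snapshot = copy.deepcopy(state)   # cheap copy; the slow I/O overlaps
        writer = threading.Thread(target=persist, args=(snapshot,))
        writer.start()
    if writer is not None:
        writer.join()
    return state

final = run(5)
print(final["step"], len(saved))    # computed 5 steps, kept 5 snapshots
```

Note this overlaps the I/O but doesn't eliminate it, which is the objection raised in the reply below in the thread: the checkpoint traffic still has to land somewhere, and half the hardware (or memory) is devoted to it.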

Re:easy solution (1)

sjames (1099) | about 2 years ago | (#42067101)

That doesn't actually make checkpointing take zero time; it allocates 50% of the hardware to it. Meanwhile, that scheme still requires the cluster to be small enough that, most of the time, there isn't a failure within a checkpoint interval.

Re:Hardly A New Problem...and thus has been fixed (1)

marcosdumay (620877) | about 2 years ago | (#42065625)

Well, it seems obvious that you need to distribute your checkpoints: instead of saving the global state, you save smaller subsets of it.

Now, writing a program that does that is no easy task. That's why it is a problem. But don't think it can't be solved, because it can.

Re:Hardly A New Problem...and thus has been fixed (1)

sjames (1099) | about 2 years ago | (#42067007)

I am already assuming that each node saves its own state to a neighboring node. That speeds up the checkpoints and improves scalability, but doesn't fundamentally address the problem.
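The neighbor scheme being assumed here can be sketched as a ring: node i keeps its checkpoint on node i+1, so any single failure is recoverable from the dead node's neighbor. This is an illustration only (the function names are invented), and it shows the limitation too: lose a node *and* its neighbor in the same interval and the state is gone.

```python
# Ring buddy-checkpoint sketch: states[i] is node i's in-memory state;
# node (i+1) % n holds a copy of it. A single failure is recoverable.

def take_checkpoints(states):
    """Each node stores a copy of its state on the next node in the ring."""
    n = len(states)
    return {(i + 1) % n: dict(states[i]) for i in range(n)}

def recover(failed, buddies):
    """After node `failed` dies, read its state back from its ring neighbor."""
    n = len(buddies)
    return buddies[(failed + 1) % n]

states = [{"node": i, "x": i * i} for i in range(4)]
buddies = take_checkpoints(states)

lost = recover(failed=2, buddies=buddies)
print(lost)   # node 2's state, recovered from node 3's copy
```

This is roughly the "partner" redundancy idea that node-local checkpoint libraries use; it avoids the shared filesystem entirely for single failures, which is why it speeds checkpoints up without removing the need for an occasional global checkpoint.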

Re:Hardly A New Problem...and thus has been fixed (1)

marcosdumay (620877) | about 2 years ago | (#42067287)

That does address the problem. Once each node saves its state to enough other nodes, you don't need to take a global checkpoint anymore.

But then you must write code that can work with just partial checkpoints, which is quite hard.

Re:Hardly A New Problem...and thus has been fixed (1)

sjames (1099) | about 2 years ago | (#42067773)

Saving state IS checkpointing, you know. And you cannot continue computation with missing state other than by re-computing it (which would be considered part of loading the checkpoint). The checkpoint images must also be coherent. Otherwise, you have lost state.


In the end, it just means that the maximum useful size of a cluster is limited by the reliability of the nodes.

Ultimately the problem may be sidestepped by going to non-stop style nodes (expensive as hell) but then your scalability is limited by how fast you can swap out the broken parts (alternatively, your run-time is limited by how many hot spares you can stuff in at the beginning). Now, the question comes down to how fast is the non-stop hardware (there has always been a speed penalty for that approach).

Re:Hardly A New Problem...and thus has been fixed (1)

marcosdumay (620877) | about 2 years ago | (#42067997)

The checkpoint images must also be coherent. Otherwise, you have lost state.

That's the restriction that must be removed. If your program can work with non-coherent subsets of the checkpoint, you won't have this scalability problem. And that is possible, for any computation.

Re:Hardly A New Problem...and thus has been fixed (1)

sjames (1099) | about 2 years ago | (#42068107)

I'll be needing to see proof of that one.

Re:Hardly A New Problem...and thus has been fixed (0)

Anonymous Coward | about 2 years ago | (#42064909)

With many simulations, say something like climate modelling or fluid dynamics, the output from one node (a particular region of space) is used as input to the adjacent neighbors. So if that node fails, it would propagate a delay. It seems that you would want some kind of "buddy" system where, if one node were to keel over, the neighbors could take up the slack.

Re:Hardly A New Problem (1)

hairyfeet (841228) | about 2 years ago | (#42062711)

But unlike in those days, when you had to basically go down row after row hunting for a bad solder joint or dead tube, this problem seems, frankly, easily fixed with just a little thought. After all, we already have diagnostic programs for all the basic components (RAM, CPU, HDD, etc.), so I don't see why those programs couldn't be cooked into a ROM and give us a simple pass/fail indicator light on the front of the node. If you wanted, you could have a pass/fail light for each of the major components, but frankly a simple pass/fail for the node is all it would take. Then you just yank and replace the node and worry about diagnosing what went wrong with the bad node later, or hell, even never: just have the unit sent off and refurbed, where they can test all the components and then sell off the refurbed unit or put it back in storage for when another node dies.

Frankly, with the ability to make little embedded ARM chips that draw less than a watt and can run simple programs like diagnostics off a ROM, I don't see why this would be a problem; it just takes a little thought in the design beforehand.

Re:Hardly A New Problem (2, Insightful)

Anonymous Coward | about 2 years ago | (#42064105)

I think you're missing the fact that when a node dies in the middle of a 1024-core job that has been running for 12h you normally lose all the MPI processes and, unless the job has been checkpointed, everything that has been computed so far.

It's not just about hunting and replacing the dead nodes, it's about the jobs' resilience to failure.

Re:Hardly A New Problem (1)

hairyfeet (841228) | about 2 years ago | (#42064863)

Well, frankly, you shouldn't be writing programs that need every damned core of a 1024-node machine while using an OS that can't shift the load on a failure, should you?

Hell, most of these units are running Linux, and if there is one thing I give Linux props for, it's being able to be customized for big-ass jobs like this. Have a 10%-20% leeway when it comes to nodes and have the OS simply move the load from node 146 to 187(B) for backup and call it a day. Again, while this isn't 2+2, frankly it's not the insurmountable problem they are making it out to be; this IS doable with a little thought beforehand.

Re:Hardly A New Problem (0)

Anonymous Coward | about 2 years ago | (#42064971)

Original AC here. You haven't done much supercomputing, have you? You seem to fail to grasp that there are dependencies between the running MPI tasks.

The job is a 1024-core job and it's not using all cores like you'd think -- rather, it is using a fraction of a, say, 50 Kcore machine that it just got allocated from the batch system. Even if there are nodes to spare from the backfill, there was data ("state") on the node that you've just lost due to, say, a mobo failure. You can't just "move the load", because you'd have to "undo" the entire job to the previous checkpoint. With big jobs it's not feasible to checkpoint, because there are hundreds of GB of address space to be saved to disk. Hence, your only option is to abort the job. Or wait for the new MPI standard which promises resilience.

Re:Hardly A New Problem (1)

cheekyboy (598084) | about 2 years ago | (#42065287)


If motherboard #325 dies and kills 8 nodes at once, and that requires a 12-hour job to re-run, why not always have several dozen spare idle nodes waiting to re-do that 12-hour job in 1 hour to catch up?

Or are you saying that out of 1 million nodes, 1 fails every half hour, so you need 24 spares to catch up? That's feasible.

Or are you saying the failure rate is 1 every minute? That would require 720 spare nodes every hour, which is a lot of new racks being installed daily.

If it scales linearly, then it's OK and won't drag to a standstill; but if it doesn't, it will eventually scale down until it stops falling and reaches equilibrium... and if hardware improves over time, it will start scaling up again automatically thanks to fewer failures.

Re:Hardly A New Problem (1)

Anonymous Coward | about 2 years ago | (#42065509)

If motherboard #325 dies and kills 8 nodes at once, and that requires a 12-hour job to re-run, why not always have several dozen spare idle nodes waiting to re-do that 12-hour job in 1 hour to catch up?

Let me rephrase, because I cannot seem to get my point across.

The typical distributed-memory machine has several tens of thousands of cores, bundled on, say, 4-core CPUs, bundled on, say, 2-CPU mobos. Each of these PCs is a node. Let us assume your optimistic view that you have spare nodes available.

Now, I have a job that takes 12h to complete using 1024 cores, so 128 nodes. The job runs as 1024 processes, 8 per node, which communicate over the interconnect through a software layer known as MPI. Memory is local to each process, and unless it's an embarrassingly parallel problem, the processes need to communicate by sending messages. Say it's the 10th hour of runtime when one of the nodes dies. You've just lost the state (RAM contents) of 8 processes out of 1024. No amount of spare nodes is going to bring it back. Sure, you can replace the node hardware by using one of the spare nodes, but not the state -- it's gone. It cannot be recreated from the state of the other nodes. It cannot be recreated by using 80 cores and re-doing "just this bit" 10 times faster, because the bits are interdependent -- you need to run the WHOLE thing again. There is no way I can catch up within 1 hour, well, not unless I have 10240 free cores lying around and a perfectly scaling problem and can re-run the WHOLE thing.

The usual mechanism for dealing with this is checkpointing -- a glorified way of dumping the state of all processes to disk. But with a 1024-core job this might mean dumping some 1 TB, perhaps 2 TB of data to disk every once in a while.
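For a sense of scale, here is the arithmetic on that 2 TB dump. The bandwidth and job-share figures are assumptions (a shared parallel filesystem sustaining ~50 GB/s aggregate, of which one job gets a slice), chosen only to show the order of magnitude.

```python
# Rough cost of one checkpoint dump, with assumed (illustrative) numbers.
CHECKPOINT_TB = 2                 # state to dump, as in the comment above
FS_BANDWIDTH_GBS = 50             # assumed aggregate parallel-FS write speed
JOB_SHARE = 0.25                  # fraction of that bandwidth one job gets
INTERVAL_S = 30 * 60              # assume a checkpoint every 30 minutes

seconds = CHECKPOINT_TB * 1024 / (FS_BANDWIDTH_GBS * JOB_SHARE)
overhead = seconds / INTERVAL_S
print(f"one checkpoint: ~{seconds:.0f} s of pure I/O, "
      f"~{overhead:.0%} of walltime if taken every 30 min")
```

A few minutes per dump sounds tolerable, but the point made elsewhere in the thread is that the filesystem bandwidth is shared across every job on the machine and grows far slower than core counts, so this fraction only gets worse with each generation.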

Re:Hardly A New Problem (1)

hairyfeet (841228) | about 2 years ago | (#42065593)

I think what the guy is trying to say is that his programs are written VERY badly: instead of saving state after each step of the calculation, it's an "all or nothing" kind of deal he has going, where ANY failure at all throws his badly-programming ass back to the start.

Which, if that is the case, I gotta say... he deserves to be sent back to the beginning every time any damned thing goes wrong, because he is a dumbass. If he can split the damned load to 1024 CPUs, he can damned well save the fucking state, because those same messages that HAVE TO BE SENT to keep the 1024 cores on the same page can ALSO be used to send state data.

So I think we can see the problem, cheekyboy: he's doing it back-asswards and wrong.

Re:Hardly A New Problem (1)

EETech1 (1179269) | about 2 years ago | (#42066895)

So you would dedicate, say, 1 or 2 CPUs per node, or a node per rack, to sniffing all of the intermediate data off of the local highest-speed interconnect as it's sent between nodes, and sending it out to a FIFO queue on the (or a separate) network to store the intermediate results in case of a failure?

If you had the last couple chunks of data that every node sent to every other node, it would make restarting from a failed node much easier: you would just have to reload the data, recompute it, compare against the previous outputs, and continue on with the job on a different node.

It would take additional resources per rack for the spare nodes and data recording nodes, as well as possibly a second network and local storage in the rack to save the data, but at the exascale level, this might make sense.

If you had 8 compute CPUs per node, and 1 more of them that had 16X the RAM and was a storage / supervisory CPU for the whole node, and a whole node per rack doing the same thing would it be able to keep up and would it cover most failures?

Is this something that could be built right in to future interconnect chips so they can store the last couple of transactions and reroute and dump the data to a different node if something fails?

It might be expensive, but so is the power required to run these huge computers and everything else about them.


Re:Hardly A New Problem (1)

hairyfeet (841228) | about 2 years ago | (#42067983)

Exactly. I mean, think about it: you ever PRICE a 1024-node rack? Even going AMD, that is a hell of a chunk of change, and having to start a calculation aaaaalllll the way back at the beginning if a node fails, as the AC was talking about, is simply unacceptable. When you look at the power bill alone for a 12-hour calc on 1024 nodes going full bore, the cost of throwing an 8-16 core box into each rack just to store current state is frankly cheap compared to the alternative.

And again, when you look at the cost, I'd say this is definitely doable: put in a RAID 5 of enterprise-class HDDs with a couple of 256GB SSDs for a fast store, and make sure that node has plenty of RAM so you'd have a nice fast cache with plenty of buffering... yeah, I'd say it's doable. You could probably cook some specialized ARM chips right into the interconnect for future designs, but even with what we have now with Fibre Channel I'd say it could be done. And again, look at how much money you lose by having 1024 nodes go full bore for 10 hours, like the AC said, only to have to start over at hour 11 because a node failed.

Remember, when you are talking cost on this scale, you have to figure in the cost of operation, and having to double that cost because a node fails is simply unacceptable; not when you could slap in a 16-core box with 64GB of RAM and some SSDs feeding into HDDs to keep current state backed up. That way, if a node fails, just like RAID, that data could be rebuilt on a fresh node and the calc could continue instead of starting over. So... yeah, I still say with some planning it should be doable.

Oh and Happy Turkey day to you and yours, I'm getting ready to go do the whole roast beast with the fam and in the spirit of the holiday my oldest is bringing a local gal who doesn't have any family, and I hope you and yours look around and make sure there isn't anybody sitting alone with nobody on the holiday. I brought the guest last year but this year the new owner of the apt building is having a HUGE blow out in the back of the building, with a bunch of smokers cooking turkey and ham, so nobody in this building is gonna be alone this Thanksgiving which i thought was DAMN nice and told him so. Happy Holidays!

Re:Hardly A New Problem (1)

loose electron (699583) | about 2 years ago | (#42063319)

Most MPP machines that I am familiar with have a system where the status and functionality of all nodes is checked as part of a supervisory routine, and failed nodes are mapped out of the system. Bad node? It goes on the list for somebody to go out and hot-swap the device. The processing load gets shifted to another machine.

Once the new device is in place that same routine brings that now functioning processor back into the system.

That sort of thing has existed for at least 10 years and probably longer.

A Cloudy Future (0)

Anonymous Coward | about 2 years ago | (#42061825)

I would think the same concurrency one must apply to processes within such a system could be applied in today's IaaS infrastructures. Perhaps some problems don't scale to the cloud. But then, I would think many of these could be run on several super-ish computers with a mix of the techniques for cloud and "local" concurrency. Is this really a problem for anyone but the sales staff offering supercomputers to large institutions?

"they halt operations when they [fail]" (2)

Let's All Be Chinese (2654985) | about 2 years ago | (#42061865)

Ah-ha-ha! You wish failing kit would just up and die!

But really, you can't be sure of that. So things like ECC become minimum requirements, giving you at least an inkling that something isn't working quite right. Because otherwise, your calculations may be off and you won't know about it.

And yeah, propagated error can, and on this sort of scale eventually will, suddenly strike completely out of left field and hit like a tacnuke. At which point we'll all be wondering, what else went wrong and we didn't catch it in time?
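To make the ECC point concrete, here is a toy Hamming(7,4) code: a single flipped bit is not only detected but located and corrected, so the corruption is no longer silent. Real DRAM ECC uses SECDED codes over 64-bit words, but the principle is the same; this small sketch is only an illustration.

```python
# Toy Hamming(7,4): 4 data bits protected by 3 parity bits. A single bit
# flip produces a nonzero syndrome that points at the flipped position.

def encode(d):                       # d: 4 data bits as a list
    p1 = d[0] ^ d[1] ^ d[3]          # covers positions 1,3,5,7
    p2 = d[0] ^ d[2] ^ d[3]          # covers positions 2,3,6,7
    p3 = d[1] ^ d[2] ^ d[3]          # covers positions 4,5,6,7
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]   # positions 1..7

def correct(c):                      # c: 7-bit codeword, possibly corrupted
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # recompute the three parity checks
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based position of the flipped bit
    if syndrome:
        c[syndrome - 1] ^= 1         # flip it back
    return c, syndrome

word = encode([1, 0, 1, 1])
word[4] ^= 1                         # a stray bit flip at position 5
fixed, where = correct(word)
print(where, fixed == encode([1, 0, 1, 1]))   # prints: 5 True
```

Without the parity bits, that flip would simply propagate into the results, which is exactly the "calculations may be off and you won't know" scenario above.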

Re:"they halt operations when they [fail]" (1)

Anonymous Coward | about 2 years ago | (#42062345)

global warming simulations come to mind.

Componentry? (2)

Beardydog (716221) | about 2 years ago | (#42061901)

Is an increasing amount of componentry the same thing as an increasing number of components?

Re:Componentry? (5, Funny)

Anonymous Coward | about 2 years ago | (#42061959)

Componentry is embiggened, leading to less cromulent supercomputers?

Re:Componentry? (1)

rubycodez (864176) | about 2 years ago | (#42066639)

only true for "noble componentry"

Re:Componentry? (2)

blade8086 (183911) | about 2 years ago | (#42062881)

an increasing amount of componentry has increased componentry with respect to an increasing number of components.

In other words,

componentry > components
increasing amount > increasing number.


componenty + increasing amount >> components + increasing number
componentry + increasing number =? increasing amount + components

unfortunately, to precisely determine the complexity of the componentry,
more components (or an increasing number of componentry) with respect to the original summary, are required.

Re:Componentry? (1)

Inda (580031) | about 2 years ago | (#42064763)

Dictionary definition: A bullshit term to feed to a client in order to stall them and placate their demands for results.

Other dictionaries may show different results.

Re:Componentry? (1)

Impy the Impiuos Imp (442658) | about 2 years ago | (#42066209)

Haw haw haw!

Back in the vacuum tube days, engineers wrote estimates of the maximum size of computers given the half-life of a vacuum tube. 50,000 tubes later, it's running for only minutes between breakdowns.

Re:Componentry? (0)

Anonymous Coward | about 2 years ago | (#42068557)

That reminds me of one of my pet peeves: functionalities.

I remember, I think it was towards the end of the 1980s, that suddenly my colleagues started to talk about functionalities for things we used to call the functions of a system. This may be different in English, but in Dutch the word 'functionaliteit' didn't have a plural as far as I was aware. According to the leading Dutch dictionary it meant (and still means) conformance with the function (I hope that's a reasonable translation), and it mentions the synonym 'doelmatigheid', which means effectiveness. So what exactly are effectivenesses? I couldn't find an answer in regular dictionaries, and IT dictionaries didn't mention the word at all at the time (this was before the Internet became ubiquitous). So I started asking my colleagues what they meant when they talked about functionalities, and what distinguished those from functions. Not a single one was able to answer that question; at best they made a clumsy guess. But they happily kept using the fancy-sounding word. I don't do it often anymore, but occasionally I still ask the question, and after more than 20 years I haven't got a single satisfying answer. I never use the word myself, no-one ever seems to have noticed that I don't, and I have no trouble expressing myself. There doesn't seem to be a reason for the word to exist.

After I got connected to the Internet, of course, I looked for definitions there; I did it fairly recently, even. The definitions in Dutch are inconsistent; sources don't agree on the meaning. I found a definition in English somewhere that might make sense: functionalities are functions as the user sees them. Ah, but before we had functionalities we just called those 'user functions' if the context didn't already make clear what was meant. That's a term a user actually understands without needing an explanation, and if that is what it means, we did things the wrong way around by creating jargon not for our own specialized meanings of a word but for the common meaning. And if so many people can't explain to me how functionalities are different from functions, how do they explain it to the users?

The correct question to ask probably isn't what componentry is but what componentries are.

Is this really a problem these days? (1)

Anonymous Coward | about 2 years ago | (#42061933)

Even my small business has component redundancy. It might slow things down, but surely not grind them to a halt. With all the billions spent, and people WAY smarter than me working on those things, I really doubt it's as bad as TFA makes it out to be.

Whilst obviously tripling cost... (1)

Anonymous Coward | about 2 years ago | (#42061955)

This just says to me that they need to buy three of every component, run a voting system, and throw a fail if one is way off the mark.

Re:Whilst obviously tripling cost... (1)

dissy (172727) | about 2 years ago | (#42064365)

They do. The problem is that this is a lot of waste, which does not scale well.

With 1000 nodes, triple redundancy means only ~333 nodes are producing results.
In a couple of years we will be up to 10,000 nodes, meaning over 6,000 nodes are not producing results.
In a few more years we will be up to 100,000 nodes, meaning 60,000 nodes are not producing results.

Those 60,000 nodes are using a lot of resources (power, cooling, not to mention cost), and the issue is that we need to develop and implement better methods to deal with this.
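The triple-redundancy-with-voting scheme being discussed can be sketched in a few lines: run the same computation on three nodes, accept the majority answer, and flag a failure when one replica disagrees. The names here are illustrative, not any real scheduler's API.

```python
from collections import Counter

# Triple modular redundancy sketch: majority vote over replicated results.
def vote(results):
    """Accept the majority answer; raise if no two replicas agree."""
    winner, count = Counter(results).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no two replicas agree -- cannot trust any result")
    if count < len(results):
        print("warning: one replica disagreed; mark its node for service")
    return winner

print(vote([42, 42, 42]))    # all three replicas healthy
print(vote([42, 7, 42]))     # one node returned garbage; majority still wins
```

The waste is visible right in the sketch: three nodes' worth of power and hardware per answer, which is why the 67% overhead figure below is considered unacceptable at scale.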

Re:Whilst obviously tripling cost... (1)

Shavano (2541114) | about 2 years ago | (#42066263)

67% overhead in a computing system is considered unacceptable in most applications.
How you respond to a failure is a big deal when you get to systems so large that they're statistically likely to have component failures frequently. It's often unacceptable to just throw out the result and start over. The malfunctioning system needs to be taken offline dynamically and the still-working systems have to compute around it without stopping the process. That's a tricky problem.

"and they halt operations when they do so" (5, Informative)

brandor (714744) | about 2 years ago | (#42061967)

This is only true in certain types of supercomputers. The only one we have that will do this is an SGI UV-1000. It surfaces groups of blades as a single OS image. If one goes down, the kernel doesn't like it.

The rest of our supercomputers are clusters and are built so that node deaths don't affect the cluster at large. Someone may need to resubmit a job, that's all. If they are competent, they won't even lose all their progress, thanks to check-pointing.

Sensationalist titles are sensationalist I guess.

Re:"and they halt operations when they do so" (1)

Meeni (1815694) | about 2 years ago | (#42062025)

Checkpoints won't scale to future generations. But what is amusing is to see some random Ph.D. student being cited here instead of the people who actually came to that conclusion some time ago already :)

Re:"and they halt operations when they do so" (1)

fintler (140604) | about 2 years ago | (#42062125)

Checkpoints will probably stick around for quite some time, but the model will need to change. Rather than serializing everything all the way down to a parallel filesystem, the data could potentially be checkpointed to a burst buffer (assuming a per-node design) or a nearby node (experimental SCR design). Of course, it's correct that even this won't scale to larger systems.

I think we'll probably have problems with getting data out to the nodes of the cluster before we start running into problems with checkpointing. The typical NFS home directory isn't going to scale. We'll need to switch over to something like udsl projections or another IO forwarding layer in the near future.

Re:"and they halt operations when they do so" (1)

Anthony (4077) | about 2 years ago | (#42062453)

This is my experience as well. I have supported a shared memory system and now a distributed memory cluster, the resilience of the latter means job resubmission of jobs related to the failing node is the standard response. A failed blade (Altix brick in my case) meant the entire numalink connected system would go down. Component-level resilience and predictive diagnostics help. Job suspension and resumption and/or migration is also useful to work around predicted or degraded component failure.

Re:"and they halt operations when they do so" (2)

Tirian (166224) | about 2 years ago | (#42062479)

Many supercomputers that utilize specialized hardware just can't take component failure. For example, on a Cray XT5, if a single system interconnect link (SeaStar) goes dead, the entire system will come to a screeching halt, because with SeaStar all the interconnect routes are calculated at boot and cannot be updated during operation. In any tightly coupled system these failures are a real challenge, not just because the entire system may crash, but also because users may submit jobs requesting 50,000 cores when only 49,900 cores are available.

Checkpoints are necessary, but in large-scale situations they are often difficult. You usually have a walltime allocation for your job and you certainly don't want to use 20% of it writing checkpoint files to Lustre (or whatever high-performance filesystem you are utilizing). Perhaps frequent checkpointing works on smaller systems/jobs, but for a capability job on a large system you are talking about a significant block of non-computational cycles being burned.

Re:"and they halt operations when they do so" (1)

postbigbang (761081) | about 2 years ago | (#42062959)

The problem is also with the classical Von Neumann model of the state machine. You can have many nodes that do work, then sync up at different points according to dependencies, as in a JCL-like program. When you have common nodes running at CPU clock, the amount of buffer and cache that gets dirty is small, and the sync time is the largest common denominator amongst the calculating nodes. When you bust that model by having an error in one or more nodes, the sync can't happen until the last node is caught up. Other non-dependent functions can continue, but at some point it grinds to a halt, waiting either for the initial node to complete or for a replacement to complete, but completion nonetheless.

When you run asynchronous jobs, the cycle time becomes less dependent on node failure, but more dependent on a competent coder's error handler that knows how to react, and spawns whatever's necessary to bring about an appropriate reaction to a node failure whilst keeping the rest of the jobs humming along.

I think ZFS was a great start to having large data sets operated on by concurrent objective code sets, but tightly coupled systems are a recipe for disaster because of their state-machine fragility.

Re:"and they halt operations when they do so" (1)

Anonymous Coward | about 2 years ago | (#42063881)

This is one of the primary differences between XT (Seastar) and XE/XK (Gemini) systems. A Gemini based blade can go down and traffic will be routed around it. Blades can even be "hotswapped" and added back into the fabric as long as the replacement is identically configured.

Re:"and they halt operations when they do so" (2, Informative)

Anonymous Coward | about 2 years ago | (#42063613)

Pretty much all MPI-based codes are vulnerable to single node failure. Shouldn't be that way but it is. Checkpoint-restart doesn't work when the time to write out the state is greater than MTBF. The fear is that's the path we're on, and will reach that point within a few years.

Re:"and they halt operations when they do so" (0)

Anonymous Coward | about 2 years ago | (#42064361)

A fault tolerant single image computer should be able to withstand a loss of a compute node. UV-1000 is obviously not meant to be a fault tolerant computer based on the product description.

Re:"and they halt operations when they do so" (2)

pereric (528017) | about 2 years ago | (#42064755)

Can you really predict if it will halt?

Re:"and they halt operations when they do so" (1)

cheekyboy (598084) | about 2 years ago | (#42065375)

After the first few thousand failures you will get similar curves for time-to-failure after installation.

Whoever did mod the parent up.... (2)

gentryx (759438) | about 2 years ago | (#42064921)

...doesn't understand the first thing about supercomputers, or even HPC. Currently virtually every HPC application uses MPI, and MPI doesn't take well to failing nodes. The supercomputer as a whole might still work, but the job will inevitably crash and need to be restarted. HPC apps are usually tightly coupled; that sets them apart from loosely coupled codes such as a giant website (e.g. Google and friends).

Fault tolerance is a huge problem in the community and we don't have the answers yet. Some say that fault tolerance within the MPI layer (e.g. here [] ) will be sufficient. I personally very much doubt that. My bet is on higher-level frameworks, e.g. HPX [] , which can "abstract away" the location of a task from the node where it's actually being executed.

Old problem (-1)

Anonymous Coward | about 2 years ago | (#42061979)

I learned 20 years ago that if a node times out (or returns a NAK), you give its work to another node. I guess the new generation has forgotten that components can fail?

Re:Old problem (1)

fintler (140604) | about 2 years ago | (#42062149)

How do you give the work to another node when the failed node contains the only copy of its state (like in an MPI job)? Duplicating the state on multiple nodes is way too expensive.

Re:Old problem (2, Informative)

Anonymous Coward | about 2 years ago | (#42062353)

you start the job over.

You make sure that a single job's run time times the number of nodes is not so large that the chance of that job running to completion becomes unreasonable.

On the previous systems I worked on, the cutoff was around 100 nodes for 5 days, which gives roughly a 60% chance of completion. That comes down to the chance of a single node surviving a given day being 0.999 (you lose 1 out of 1000 nodes each day to something). The math is rather simple: 0.999^500 = 60%. And in general you don't put in dual power supplies and you don't mirror the disks; rerunning the jobs that failed is cheaper than increasing the node price to add things that only marginally improve reliability while also increasing physical size.

If you have a single process bigger than that you need to setup a checkpointing system.

If you can split big jobs into lots of smaller pieces that can be pretty quickly put together at the end you do so.

On the previous one I was on they used both tricks depending on the exact nature of what was being processed.

For the most part it is not a complicated problem unless you expect unreasonably low failure rates and don't deal with reality.
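The arithmetic the parent describes can be sketched in a few lines (a toy illustration; the 1-in-1000 daily node loss rate is the figure quoted above):

```python
# Chance that a job finishes, assuming independent node failures at a
# fixed daily rate (the parent's figure: 1 node in 1000 lost per day).
def job_success_probability(nodes: int, days: float,
                            daily_survival: float = 0.999) -> float:
    """Probability that no node fails during the whole run."""
    return daily_survival ** (nodes * days)

# 100 nodes for 5 days = 500 node-days of exposure.
p = job_success_probability(nodes=100, days=5)
print(f"{p:.1%}")  # ~60.6% chance of running to completion
```

Scaling the same formula up to 1,000 nodes for 5 days drops the completion chance to under 1%, which is exactly why checkpointing or job splitting becomes mandatory for bigger runs.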

ummm, no. (0)

Anonymous Coward | about 2 years ago | (#42061981)

Ever heard of Google? They're doing the massive supercomputer thing just fine and have completely figured out the fault tolerance side of things.

Re:ummm, no. (3)

fintler (140604) | about 2 years ago | (#42062183)

Google is having the same problems that this article describes -- they haven't fixed it either.

If your problem domain can always be broken down into map-reduce, you can easily solve it with a hadoop-like environment to get fault tolerance. If your application falls outside of map-reduce (the applications this article is referring to), you need to start duplicating state (very expensive on systems of this scale) to recover from failures.

On that scale (2)

shokk (187512) | about 2 years ago | (#42061983)

On that scale, distributed parallelism is key, where the system takes into account downed nodes and removes them from duty until it can return to service, or can easily add a replacement node to handle the stream. That's why Google and Facebook don't go down when a node fails.

Google/FB/etc are Embarassingly Parallel (1)

Anonymous Coward | about 2 years ago | (#42062163)

EP is trivial to deal with.. The problem is with supercomputing jobs that aren't EP, and rely on computations from nodes 1-100 to feed to the next computational step in node 101-200, etc.

The real answer is some form of fault tolerant computing.. think EDAC for memory (e.g. forward error correction) but on a grosser scale. Or, even, designing your algorithm for the ability to "go back N", as in some ARQ communication protocols.

The theory on this is fairly well known (viz Byzantine Generals problem)

The problem is that big supercomputing software is hard enough to write already without throwing in non-deterministic behavior of the processing. It's not like the compiler does it for you, and even if you have libraries that use multiprocessing (matrix math), that's usually a small part of the problem.

Re:Google/FB/etc are Embarassingly Parallel (1)

BitZtream (692029) | about 2 years ago | (#42062461)

When you have multiple nodes, you aren't any different than Google.

Google uses Map Reduce but it isn't the only way things get done.

You have standards of coding to deal with the issues. MapReduce is only one of those ways of dealing with the issue.

And for reference, what you describe in your first paragraph is EXACTLY a MapReduce problem. First 100 nodes Map, second hundred nodes Reduce the results. Rinse, repeat.

Re:Google/FB/etc are Embarassingly Parallel (2)

Jah-Wren Ryel (80510) | about 2 years ago | (#42063265)

And for reference, what you describe in your first paragraph is EXACTLY a MapReduce problem. First 100 nodes Map, second hundred nodes Reduce the results. Rinse, repeat.

No it's not. The problem with your description is the "rinse, repeat" part. He's not talking about repeating with new input data. He's talking about a serialized workload where, for example, the output of the first 100 jobs is the input for the next 100 jobs, which then creates output that is the input for the next 100 jobs. It's not a case of repeating, it's a case of serialization, where if you haven't done a state checkpoint and things crater, you have to start from the beginning to get back where you were. No "standard of coding" can fix that.

Re:Google/FB/etc are Embarassingly Parallel (1)

sirambrose (919153) | about 2 years ago | (#42065081)

Most problems can't be solved by a single map reduce. Map reduce tasks are normally written as a series of map reduce jobs; each map runs on the output of the previous reduce.

Since map reduce jobs write their output to disk at every step, it can be thought of as a form of check pointing. The difference between map reduce and mpi check pointing is that mpi needs to restart the whole job at the checkpoint, but map reduce frameworks can rerun just the work assigned to failed nodes. In the map reduce model, the checkpoint interval stays constant when adding additional nodes. With mpi, the checkpoint interval decreases as nodes are added because adding nodes increases the chance of at least one node failing in a given time interval and forcing the entire job to restart at the checkpoint.

Not all mpi jobs can easily be rewritten as map reduce jobs, but map reduce does address the problem discussed in the article.

Re:Google/FB/etc are Embarassingly Parallel (1)

Anonymous Coward | about 2 years ago | (#42063775)

Uh, no..

This kind of problem isn't one where you are rendering frames (each one in parallel) or where your work quanta is large. Think of doing a finite element code on a 3d grid. The value at each cell at time t=n+1 depend on the value of all the adjacent cells at time t=n. The calculation of cell x goes bonk, and you have to stop the whole shebang.

Re:On that scale (0)

Anonymous Coward | about 2 years ago | (#42064449)

Oh MAN! Imagine if you had a Beowulf cluster of those...
wait... that's what it is... or...
a Beowulf cluster of Beowulf clusters...
I digress.

Google and Facebook are data-center-centric search engines backed by database clusters, which are disk-bound, whereas a supercomputer is compute-bound.

The reason that database clusters don't go down (aside from the Madonna factor) is that they have a serious amount of redundancy, including location redundancy. Supercomputers have notorious single points of failure. What comes to mind was the AT&T WGS6300 that was used as the gateway to the Cray at Ames. The plug for the gateway was just stuck in the wall, so when someone kicked the plug out, the machine was not very useful.

IBM solved the reliability problem long ago with Project Stretch, which was at first incredibly unreliable. So they designed checkpointing into the system, so that if anything went wrong (and a lot did), they could fix the problem, back up to the instruction that was executing when the failure occurred, and continue from there.

In summary, Some supercomputers have single points of failure, others have multiple points of failure, and some have mega redundancy, like distributed computing and database clusters.

This problem is not new (0)

Anonymous Coward | about 2 years ago | (#42062031)

Back in the late 40's, ENIAC (the supercomputer of the day) ran on thousands of vacuum tubes. MTBF was about 5.6 hours; then a technician would have to figure out which tube to replace.

Really? (1)

Pandaemonium (70120) | about 2 years ago | (#42062175)

Haven't these problems already been solved by large-scale cloud providers? Sure, hurricanes take out datacenters' diesel, but Google runs racks with dead nodes until it hits a percentage where it makes sense to 'roll the truck', so to speak, and get a tech onsite to repair the racks.

Re:Really? (1)

Anonymous Coward | about 2 years ago | (#42062441)

Pretty much also solved in all but the smallest HPC context.

They mention that overhead for fault recovery can take up 65% of the work. That's why so few large systems bother anymore. They run jobs, and if a job dies, oh well, the scheduler starts it over. Even if a job breaks twice in a row near the end, you still break even, and that's a relatively rare occurrence (a job failing twice is not unheard of, but both times right at the end of a job is highly unlikely).

This is the same approach as in 'cloud' applications. They have more throughput than they need, and hard failures manifest simply as a transient drop in throughput. Yes, if you need predictable runtime, running the same job multiple times concurrently may be warranted (again, running the exact same job three times concurrently is *still* no worse than checkpointing overhead, and much simpler). The speaker is hawking their own MPI specifically toward this goal, but I'm not sure if such a thing is even needed based on what I've seen.

Their method is nothing new. (1)

Jartan (219704) | about 2 years ago | (#42062395)

This is the same method used to handle distributed computing with untrusted nodes. Simply hand off the same problem to multiple nodes and recompute if differences arise.

The real solution is going to involve hardware as well. The nodes themselves will become supernodes with built in redundancy.
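A minimal sketch of that untrusted-node scheme: run the same task on several nodes, accept the majority answer, and recompute on fresh nodes if the results diverge. (Here `nodes` is just a list of callables standing in for real remote dispatch.)

```python
from collections import Counter

def run_with_redundancy(task, nodes, replicas=3):
    """Hand the same task to `replicas` nodes and accept the majority result.
    If the results diverge with no clear majority, recompute on fresh nodes."""
    results = [node(task) for node in nodes[:replicas]]
    value, count = Counter(results).most_common(1)[0]
    if count > replicas // 2:
        return value
    # No majority: discard these nodes and retry on the next batch.
    return run_with_redundancy(task, nodes[replicas:], replicas)

# Toy example: one of three "nodes" returns a corrupted answer.
good = lambda x: x * x
bad = lambda x: x * x + 1
print(run_with_redundancy(7, [good, bad, good]))  # 49 wins the vote 2-1
```

Note the well-known weakness of plain voting: if a majority of replicas fail the same way, the wrong answer wins, which is part of why real systems pair this with hardware-level redundancy as the parent suggests.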

Re:Their method is nothing new. (1)

Yvan Fournier (1286876) | about 2 years ago | (#42062855)

Actually, things are much more complex, and as another poster mentioned, these issues are the continuing subject of research and have been anticipated by the supercomputing community for quite a few years (simply projecting current statistics, the time required to checkpoint a full-machine job would at some point become bigger than the MTBF...).
The PhD student mentioned seems to be just one of many working on this subject. Different research teams have different approaches, some trying to hide as much as possible in the runtimes and hardware, others experimenting with more robust algorithms in applications.

Tradeoffs on HPC clusters are not the same as on "business" type computers (high throughput vs. high availability). For tightly coupled computations, a lot of data is flying around the network, and networks on these machines are fast, high-throughput, and especially low-latency, with specific hardware and, in quite a few cases, partial offload of message management using DMA writes or other techniques, which may make checkpointing message queues a tad complex. The new MPI-3 standard has only minimal support for error handling, simply because this field is not mature enough, in the sense that not everyone agrees on the best solutions yet, and these may depend on the problem being solved and its expected running time. Avoiding too much additional application complexity and major performance hits is not trivial.

In addition, up to now, when medium to large computations are batch jobs that may run a few hours to a few days on several thousand cores, re-running a computation that failed due to hardware failure once in a while (usually much less than one time in 10) is much more cost-effective than duplicating everything, in addition to being faster. These applications do not usually require real-time results, and even for many time-constrained applications (such as tomorrow's weather), running almost twice as many simulations (or running them twice as fast, in the case of ideal speedup) might often be more effective. This logic only breaks down with very large computations.

Also, regarding the similarity to the cloud: when 1 node goes down on most clusters, the computation running on it will usually crash, but when a new computation is started by a decent resource manager/queuing system, that node will not be used, so not everything needs to be replaced immediately (that issue is at least solved). So most jobs running on 1/100th or 1/10th of a 100,000- to 1-million-node cluster will not be much affected by random failures, but a job running on the full machine will be much more fragile.

So, as machines get bigger and these issues become statistically more pressing, an increasing portion of the HPC hardware and software effort needs to be devoted to them, but the urgency is not quite the same as if your bank had forgotten to use high-availability features for its customers' account data, and the tradeoffs reflect that.

Oblig Bad Car Analogy (1)

PPH (736903) | about 2 years ago | (#42062433)

Come on now. We've known how to run V8 engines on 7 or even 6 cylinders for years now. Certainly this technology must be in the public domain by now.

I have a demonstration unit I would be happy to part with for a mere $100K.

Re:Oblig Bad Car Analogy (1)

afgam28 (48611) | about 2 years ago | (#42063267)

Demonstration units can be had for much less :P

For example, a Chrysler 300C [] costs $36,000 and has a Hemi V8 with Chrysler's Multi-Displacement System []

Tandem (1)

SpaceLifeForm (228190) | about 2 years ago | (#42062551)

Distributed fault tolerance. Difficult to scale large while keeping supercomputer performance. But still room for improvement.

In addition... (0)

Anonymous Coward | about 2 years ago | (#42062613)

In most modern problems simply scaling the code to that level of parallelism is almost always an unsolved problem. There is still a lot of work to be done in this arena. Sparse linear algebra for example has serious scaling issues beyond a certain number of nodes, irregular memory access and communication patterns really bungle up parallelism.

Possible Solution (1)

Master Moose (1243274) | about 2 years ago | (#42062973)

Could they try turning it off then on again?

Not Really New (3, Insightful)

Jah-Wren Ryel (80510) | about 2 years ago | (#42063199)

The joke in the industry is that supercomputing is a synonym for unreliable computing. Stuff like checkpoint-restart was basically invented on super-computers because it was so easy to lose a week's worth of computations to some random bug. When you have one-off systems or even 100-off systems you just don't get the same kind of field testing that you get regular off-the-shelf systems that sell in the millions.

Now that most "super-computers" are mostly just clusters of off-the-shelf systems, we get a different root cause but the same results. The problem now seems to be that because the system is so distributed, so is its state: with a thousand nodes you've got a thousand sets of processes and RAM to checkpoint, and you can't keep the checkpoints local to each node, because if the node dies you can't retrieve its state.

On the other hand, I am not convinced that the overhead of checkpointing to a neighboring node once every few hours is really all that big of a problem. Interconnects are not RAM speed, but with gigabit+ speeds you should be able to dump the entire process state from one node to another in a couple of minutes. Back-of-the-napkin calculations say you could dump 32GB of RAM across a gigabit ethernet link in 10 minutes with more than 50% margin for overhead. Doing that once every few hours does not seem like a terrible waste of time.
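The napkin math checks out. As a quick sanity check (assuming the link runs at only 50% efficiency):

```python
def checkpoint_minutes(ram_gb: float, link_gbps: float,
                       efficiency: float = 0.5) -> float:
    """Minutes to push a node's RAM image across the interconnect."""
    gigabits = ram_gb * 8                      # bytes -> bits
    return gigabits / (link_gbps * efficiency) / 60

print(f"{checkpoint_minutes(32, 1.0):.1f} min")       # ~8.5 min over gigabit ethernet
print(f"{checkpoint_minutes(32, 56.0) * 60:.1f} s")   # ~9.1 s over a 56 Gb/s link
```

The second line matches the reply further down the thread: on a modern high-speed interconnect the same 32GB image moves in seconds rather than minutes.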

Re:Not Really New (1)

markhahn (122033) | about 2 years ago | (#42063525)

yes, checkpointing is a reasonable answer. no large machines use Gb, though.

Re:Not Really New (1)

Jah-Wren Ryel (80510) | about 2 years ago | (#42064925)

Yeah, I was just using a conservative interconnect that most people here could relate to.

Re:Not Really New (0)

Anonymous Coward | about 2 years ago | (#42064405)

Actually, with modern FDR InfiniBand infrastructure it would take you about 10 seconds or less.

Checkpointing overhead (1)

Anonymous Coward | about 2 years ago | (#42064721)

The point of the article was that as the number of nodes involved in the calculation increases, the frequency at which at least one of them fails increases too (provided that the individual node failure rate is kept constant). Since you want at least one checkpoint between each typical failure, you would therefore have to checkpoint more and more often as the number of nodes is increased. Hence, the overhead involved with checkpointing goes up as the number of nodes involved increases, and with 100 times more nodes than most clusters use now, this overhead grows to overwhelm the amount of resources used for the actual calculation.

This problem applies to all global modes of handling failures, i.e. when the whole system stops due to the failure of one or a few nodes.
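This scaling argument is captured by Young's classic approximation for the optimal checkpoint interval, T_opt ≈ sqrt(2 × C × MTBF_system), where the whole system's MTBF shrinks as 1/N. A sketch with made-up figures (the 5-year per-node MTBF and 10-minute checkpoint cost are illustrative assumptions, not numbers from the article):

```python
import math

def optimal_interval_hours(node_mtbf_h: float, nodes: int,
                           ckpt_cost_h: float) -> float:
    """Young's approximation: T_opt = sqrt(2 * C * MTBF_system),
    with the system MTBF taken as the per-node MTBF divided by N."""
    return math.sqrt(2 * ckpt_cost_h * node_mtbf_h / nodes)

NODE_MTBF = 5 * 365 * 24   # hypothetical: one node failure per 5 years
CKPT_COST = 10 / 60        # hypothetical: 10 minutes to write a checkpoint
for n in (1_000, 100_000, 1_000_000):
    t = optimal_interval_hours(NODE_MTBF, n, CKPT_COST)
    print(f"{n:>9} nodes -> checkpoint every {t:.2f} h")
# At a million nodes the optimal interval (~7 min) drops below the
# 10-minute checkpoint cost itself: the machine would spend more time
# checkpointing than computing, which is the failure mode described above.
```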

Re:Checkpointing overhead (1)

Jah-Wren Ryel (80510) | about 2 years ago | (#42064943)

The point of the article was that as the number of nodes involved in the calculation increases, the frequency at which at least one of them fails increases too (provided that the individual node failure rate is kept constant). Since you want at least one checkpoint between each typical failure, you would therefore have to checkpoint more and more often as the number of nodes is increased. Hence, the overhead involved with checkpointing goes up as the number of nodes involved increases, and with 100 times more nodes than most clusters use now, this overhead grows to overwhelm the amount of resources used for the actual calculation.

Thanks for spelling that out. I don't know why I missed that reading the article. Your post deserves to be modded +5 informative.

Kei (1)

JanneM (7445) | about 2 years ago | (#42063291)

A machine like Kei in Kobe does live rerouting and migration of processes in the event of node or network failure. You don't even need to restart the affected job or anything. Once the number of nodes and cores exceed a certain level you really have to assume a small fraction are always offline for whatever reason, so you need to design for that.

DARPA's Exascale Report (1)

Anonymous Coward | about 2 years ago | (#42063435)

DARPA's report ( [] ) has a lot of interesting information for those who want to read more on exascale computing. I may be a bit biased being a grad student in HPC too, but the linked article didn't impress me.

Old news (0)

Anonymous Coward | about 2 years ago | (#42064029)

So, exactly the same problem everybody with a large data centre has had for years. If it was competently designed, few, if any, of the failures will lead to the whole computer failing.


mspohr (589790) | about 2 years ago | (#42064081)

Reminds me of ENIAC. I went to the Moore School at U of Penn and ENIAC parts were in the basement (may still be there, for all I know). The story was that since it was all vacuum tubes, at the beginning they were lucky to get it to run for more than 10 minutes before a tube blew.

That being said, I can't believe that supercomputers don't have some kind of supervisory program to detect bad nodes and schedule around them.

Phd with no clue... (2)

Fallen Kell (165468) | about 2 years ago | (#42064477)

I am sorry to say it, but this Ph.D. student has no clue. Dealing with a node failure is not a problem with proper, modern supercomputing programming practices as well as OS/system software. There is an amazing programming technique called "checkpointing", developed a while ago. It allows you to periodically "checkpoint" your application, essentially saving off the system call stack, the memory, register values, etc. to a file. The application is also coded to check whether that file exists, and if it does, to load all those values back into memory, registers, and the call stack, then continue running from that point. So in the event of a hardware failure, the application/thread is simply restarted on another node in the cluster. That is application-level checkpointing; there is also OS-level checkpointing, which essentially does the same thing but at the OS level, regardless of the processes running on the system, allowing anything running on the entire machine to be checkpointed and restarted from that spot.

Then there is the idea of a master dispatcher, which essentially breaks the application down into small chunks of tasks and sends those tasks to be calculated/performed on nodes in the cluster. If it does not get a corresponding return value from the system it sent the task to within a certain amount of time, it re-sends the task to another node (marking the first node as bad and not sending future tasks to it until that mark is cleared).

Both of these methods fix the issue of having possible nodes which die on you during computation.
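A toy sketch of the application-level variant described above (pickling a couple of in-memory values; a production HPC code would use a dedicated checkpoint system rather than this):

```python
import os
import pickle
import tempfile

CKPT = os.path.join(tempfile.gettempdir(), "demo_state.ckpt")

def long_computation(total_steps: int = 1_000_000) -> int:
    """Sum 0..total_steps-1, checkpointing every 100k steps so that a
    restarted run resumes from the last saved (step, accumulator) pair."""
    step, acc = 0, 0
    if os.path.exists(CKPT):                     # restart path
        with open(CKPT, "rb") as f:
            step, acc = pickle.load(f)
    while step < total_steps:
        acc += step                              # stand-in for real work
        step += 1
        if step % 100_000 == 0:                  # periodic checkpoint
            with open(CKPT + ".tmp", "wb") as f:
                pickle.dump((step, acc), f)
            os.replace(CKPT + ".tmp", CKPT)      # atomic: a crash can't corrupt it
    return acc

if os.path.exists(CKPT):
    os.remove(CKPT)                              # start fresh for the demo
print(long_computation())                        # 499999500000
```

Kill the process mid-run and invoke it again: it resumes at the last multiple of 100,000 instead of step 0, which is exactly the restart behavior the parent describes.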

Re:Phd with no clue... (1)

Anonymous Coward | about 2 years ago | (#42064643)

There is an amazing programming technique called "checkpointing", developed a while ago.

No shit, Sherlock. Now tell me what happens when a 2048-core job needs to write its 4TB of RAM to disk every hour.

Re:Phd with no clue... (0)

Anonymous Coward | about 2 years ago | (#42064745)

And a 2048-core job is peanuts compared to what the article was talking about: millions of nodes, each with several cores. The amount of RAM and disk available per core is not expected to increase much, but since the aggregate failure rate will be proportionally higher as you add more cores, these systems will fail more than a thousand times more often than your tiny 2048-core example, and hence need to checkpoint thousands of times more often. So that would be a checkpoint every second or so. This is the real killer, and why checkpointing does not scale to exaflops performance.

Re:Phd with no clue... (1)

Shavano (2541114) | about 2 years ago | (#42066273)

No, that doesn't solve the problem in general. It only solves the problem in the case where getting a result in a timely fashion is not critical.

Try Leveraging Database Technology (0)

Anonymous Coward | about 2 years ago | (#42065705)

There has been over 30 years research and utilization of ACID and High Availability techniques;
    2 Phase Commit
    Hot Standby
    Replication ...

in database technology and products. Surely it's time the HPC community started looking at and using some of these techniques rather than re-inventing the wheel.

It may seem preposterous, but it's perfectly feasible to run CUDA and OpenCL code as database stored procedures on GPUs. Stored procs implement a natural Map-Reduce paradigm. Database technology isn't the complete solution, but it's a good starting point.

redunancy solution (1)

ILongForDarkness (1134931) | about 2 years ago | (#42066993)

The guy in the article that says that running the redundant copies on the same nodes would reduce i/o traffic: I'd love to speak to him. There are two options I see:

1) Assuming there is common source data that both instances need to churn on: then the data isn't redundant, so what exactly are you proving by getting the same result? Ditto with the CPU, integer unit, etc.; the same hardware is not a redundant solution.

2) No shared data, but you are generating data on each node: then they still have to chat with each other to synchronize and such. So now you have 1/3 of a CPU node and 1/3 of the I/O. But wait, that's not all: you are also thrashing the hell out of your cache, and if any of the data is on disk you are pretty much guaranteeing that at least one of the instances on the node has to wait on drive latency, since one or the other is going to get the disk first, etc.

Either way you still have the problem of a node blowing up and all the simultaneous copies of the sim dying together: In short you can't call something a redundant system without actually having redundancy.
