
Whose Bug Is This Anyway?

Soulskill posted about 2 years ago | from the it's-nobody's-fault-and-everybody's-angry dept.


An anonymous reader writes "Patrick Wyatt, one of the developers behind the original Warcraft and StarCraft games, as well as Diablo and Guild Wars, has a post about some of the bug hunting he's done throughout his career. He covers familiar topics — crunch time leading to stupid mistakes and finding bugs in compilers rather than game code — and shares a story about finding a way to diagnose hardware failure for players of Guild Wars. Quoting: '[Mike O'Brien] wrote a module ("OsStress") which would allocate a block of memory, perform calculations in that memory block, and then compare the results of the calculation to a table of known answers. He encoded this stress-test into the main game loop so that the computer would perform this verification step about 30-50 times per second. On a properly functioning computer this stress test should never fail, but surprisingly we discovered that on about 1% of the computers being used to play Guild Wars it did fail! One percent might not sound like a big deal, but when one million gamers play the game on any given day that means 10,000 would have at least one crash bug. Our programming team could spend weeks researching the bugs for just one day at that rate!'"
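The post doesn't include the OsStress source, so here is a minimal sketch of the general idea in C. The names, the buffer size, and the particular calculation are all invented for illustration, and the real module presumably ships a precomputed answer table rather than capturing one on the first call:

#include <stdint.h>
#include <string.h>

#define STRESS_WORDS 1024

static uint32_t known_answers[STRESS_WORDS];
static int have_answers;

static void stress_calc(uint32_t *buf)
{
    uint32_t x = 0x12345678u;
    for (int i = 0; i < STRESS_WORDS; i++) {
        x = x * 1664525u + 1013904223u;   /* simple integer recurrence (an LCG) */
        buf[i] = x ^ (x >> 16);           /* deterministic on working hardware  */
    }
}

/* Call once per frame from the game loop; returns nonzero if this machine
   just computed a different answer than it did before. */
int OsStressStep(void)
{
    uint32_t buf[STRESS_WORDS];
    stress_calc(buf);
    if (!have_answers) {                  /* simplification: capture answers on first call */
        memcpy(known_answers, buf, sizeof buf);
        have_answers = 1;
        return 0;
    }
    return memcmp(buf, known_answers, sizeof buf) != 0;
}

A nonzero return means the machine disagreed with its own earlier arithmetic, which software alone should never cause.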


The memory thing... (5, Informative)

Loopy (41728) | about 2 years ago | (#42332529)

...is pretty much what those of us that build our own systems do anytime we upgrade components (RAM/CPU/MB) or experience unexplained errors. It's similar to running the Prime95 torture tests overnight, which also checks calculations in memory against known data sets for expected values.

Good stuff for those that don't already have a knack for QA.

Re:The memory thing... (0, Troll)

Gothmolly (148874) | about 2 years ago | (#42332705)

Nobody does that anymore. The defect rate on hardware is so low you don't need to - buy your stuff from Newegg, assemble, and install. Either it's DOA or runs forever.

Re:The memory thing... (5, Interesting)

AaronLS (1804210) | about 2 years ago | (#42332753)

"The defect rate on hardware is so low you don't need to"

I think the point of the article is to cast significant doubt on statements like this.

Re:The memory thing... (3, Informative)

Sir_Sri (199544) | about 2 years ago | (#42333371)

Even if you have a small calculation failure rate, it's not practical for an end user to recognize that as a partial hardware failure rather than a software bug.

From the perspective of the average user, yes, it either works or it doesn't. If you use something big (like WoW/Guild Wars or the like) and they can diagnose it for you, then you might have an argument. But even then, 1% could be overclocking or, as the author of TFA says, heat or PSU undersupply issues. That's not 'defective' hardware, that's temperamental hardware or the user doing it wrong. And because it's rare it's not necessarily serious; most users can handle the odd application crash in something like an MMO once every few days.

It does mean a bug hunter needs to know what is happening though.

Re:The memory thing... (5, Informative)

DMUTPeregrine (612791) | about 2 years ago | (#42332757)

Unless you're trying to overclock.
Admittedly that's a small percentage of the populace, even among people who build their own systems.

Re:The memory thing... (3, Insightful)

Greyfox (87712) | about 2 years ago | (#42333731)

Heh, back in the day when I was doing OS/2 phone support, I had a customer call up with a trap zero during install. Now I'd seen a lot of odd shit during an OS/2 install, but I'd never seen a trap zero. Turns out that was a divide by zero error. Fucker made me start filling out the paperwork to send him to level 2 before admitting that he was trying to overclock his processor. If memory serves me correctly (which it might not, nearly two decades later) he was trying to go from 8 MHz to 20 MHz, and was also getting a lot of crashes in DOS and DOS applications. I told him that was probably what his problem was and if I tried to send this on to level 2 it'd be rejected with a "Don't do that," so I was just going to save him some time and tell him "don't do that" now.

Re:The memory thing... (3, Informative)

Runaway1956 (1322357) | about 2 years ago | (#42332949)

" Either it's DOA or runs forever."

Nonsense. I bought 8 gig of memory about 4 years ago, for an Opteron rig. That computer recently started having serious problems, with corrupted data and crashing. I looked at all the other components first, then finally ran memory tests. Memtest failed immediately. I removed three modules and ran memtest again, it failed immediately. Replaced with another module, memtest ran for awhile, then failed. The other two modules proved to be good, so I am now running that aging Opteron with 4 gig of memory.

Yeah, yeah, yeah - I realize a single person's anecdotal evidence doesn't carry much weight. I wonder what the statistics are though? As AaronLS already pointed out, these tests seem to indicate that my situation isn't very unusual. Components age and wear out.

Re:The memory thing... (4, Informative)

scheme (19778) | about 2 years ago | (#42333061)

Yeah, yeah, yeah - I realize a single person's anecdotal evidence doesn't carry much weight. I wonder what the statistics are though? As AaronLS already pointed out, these tests seem to indicate that my situation isn't very unusual. Components age and wear out.

Check out "A study of DRAM failures in the field" from the supercomputing 2012 proceedings. They have some interesting stats based on 5 million DIMM days of operation.

Re:The memory thing... (1)

Anonymous Coward | about 2 years ago | (#42333345)

Paywall.

I thought you were going to bring up the Google study; that one's questionable anyway since they were using low-grade RAM.

Re:The memory thing... (0)

Anonymous Coward | about 2 years ago | (#42333137)

Add in that hardware can fail in strange ways. Years and years ago I had a client's computer that would run Windows for a little while and then suddenly crash. Running diagnostics on it didn't show anything wrong unless the diagnostics ran long enough. The diagnostics thoroughly checked the operation of each motherboard component one after the other. Running the diagnostics briefly didn't turn up a damn thing. Only by letting them run for hours did the cause of the crashes make itself known. One of the interrupt controllers would suddenly start failing after 30000 or so tests. It took hours for the diagnostics to hit the interrupt controller enough to make it fail. When running Windows of course the interrupt controller got to the point of failure a whole heck of a lot faster.

Re:The memory thing... (5, Informative)

Alwin Henseler (640539) | about 2 years ago | (#42333015)

The defect rate on hardware is so low you don't need to - buy your stuff from Newegg, assemble, and install. Either it's DOA or runs forever.

Look up "bathtub curve" sometime. Even well-built, perfectly working gear ages; aging usually translates into "reduced performance / reliability", and any electronic part will fail sometime. Possibly gradually. Especially the just-makes-it-past-warranty crap that's sold these days. And there may be instabilities / incompatibilities that only show under very specific conditions (like when a system is pushed really hard).

That's ignoring things like ambient temperature variations, CPU coolers clogging with dust over the years, sporadic contact problems on connectors, or the odd cosmic ray that nukes a bit in RAM (yes, that happens [wikipedia.org], too). A lot of things must come together to have (and keep) a reliable working computer, so a lot of things can go wrong and put an end to that.

Re:The memory thing... (3, Interesting)

TubeSteak (669689) | about 2 years ago | (#42333229)

Especially the just-makes-it-past-warranty crap that's sold these days.

Actually, to get 95% of your product past the warranty period, you have to overengineer because, statistically, some of your product will fail earlier than you expect.

So if you have a 3 year warranty, you better be engineering for 4+ years or you're going to spend a lot on replacements for the near end of the bathtub curve.

I've had an unfortunate amount of experience with made in china crap that's ended up being replaced a few times within the warranty period.

Re:The memory thing... (5, Interesting)

DigiShaman (671371) | about 2 years ago | (#42333339)

I think it's a crying shame that the PC industry hasn't made ECC a mandatory standard. Servers and workstations have it, and with memory as cheap as it is to fab, there's absolutely -zero- excuse not to use ECC!!! With transistors packed this densely and this small, errors will occur. I'll go a step further and even recommend ECC across all the motherboard bridge buses. End-to-end error correction should be a requirement!
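DIMM ECC is a wider SECDED code implemented in the memory controller, not something done in software, but the principle is easy to demonstrate. A minimal sketch using a Hamming(7,4) code, purely for illustration: a single flipped bit produces a nonzero syndrome that names the bad bit position, so it can be corrected.

#include <stdio.h>
#include <stdint.h>

/* Encode 4 data bits into a 7-bit codeword.
   Bit positions (1-indexed): 1=p1 2=p2 3=d1 4=p3 5=d2 6=d3 7=d4 */
static uint8_t hamming74_encode(uint8_t nibble)
{
    uint8_t d1 = (nibble >> 0) & 1, d2 = (nibble >> 1) & 1;
    uint8_t d3 = (nibble >> 2) & 1, d4 = (nibble >> 3) & 1;
    uint8_t p1 = d1 ^ d2 ^ d4;
    uint8_t p2 = d1 ^ d3 ^ d4;
    uint8_t p3 = d2 ^ d3 ^ d4;
    return (uint8_t)(p1 << 0 | p2 << 1 | d1 << 2 | p3 << 3 |
                     d2 << 4 | d3 << 5 | d4 << 6);
}

/* Correct any single flipped bit and return the 4 data bits. */
static uint8_t hamming74_decode(uint8_t cw)
{
    uint8_t b[8];
    for (int i = 1; i <= 7; i++) b[i] = (cw >> (i - 1)) & 1;
    uint8_t s1 = b[1] ^ b[3] ^ b[5] ^ b[7];
    uint8_t s2 = b[2] ^ b[3] ^ b[6] ^ b[7];
    uint8_t s3 = b[4] ^ b[5] ^ b[6] ^ b[7];
    int syndrome = s1 | (s2 << 1) | (s3 << 2);
    if (syndrome)                 /* nonzero syndrome = position of the bad bit */
        b[syndrome] ^= 1;
    return (uint8_t)(b[3] | (b[5] << 1) | (b[6] << 2) | (b[7] << 3));
}

int main(void)
{
    for (uint8_t n = 0; n < 16; n++) {
        uint8_t cw = hamming74_encode(n);
        for (int flip = 0; flip < 7; flip++) {
            uint8_t damaged = cw ^ (uint8_t)(1 << flip);   /* simulate one bit flip */
            if (hamming74_decode(damaged) != n)
                printf("correction failed for nibble %d, bit %d\n", (int)n, flip);
        }
    }
    printf("all single-bit flips corrected\n");
    return 0;
}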

Re:The memory thing... (1)

scheme (19778) | about 2 years ago | (#42333045)

That's not true. There was a recent paper looking at memory defects and causes on the Jaguar supercomputer, and memory errors were moderately common. Just as surprisingly, there were cases where a single DIMM going bad would cause errors for all the DIMMs on that channel.

So, memory does go bad and it does that more frequently than you'd expect.

Wait its possible?! (5, Funny)

Anonymous Coward | about 2 years ago | (#42332533)

You mean all those times when my code was 'fine' and i gave up it really could have been the compiler or a memory problem

shit i'm a much better programmer than i realized

Re:Wait its possible?! (1)

meerling (1487879) | about 2 years ago | (#42333035)

I started bringing my personal laptop to my programming classes for a simple reason. About 20% (seemed like 65%, but that's probably just a trick of memory) of the class computers had their compilers borked by another student at any particular time. You had no idea that somebody had put in a weird setting in the compiler, or had just outright broken something, until after you'd done way too much troubleshooting. I found I got a whole lot more done on my personal box that nobody else could mess up. :)

Re:Wait its possible?! (2)

hazah (807503) | about 2 years ago | (#42333407)

Are you talking about actual compilers or the IDE?

Re:Wait its possible?! (0)

Anonymous Coward | about 2 years ago | (#42333437)

How is that possible? Students shouldn't have sufficient privileges to change anything on the computer. Also, the computers should be resetting themselves at least every night, if not at every log in. Your school has incompetent IT staff.

Re:Wait its possible?! (0)

Anonymous Coward | about 2 years ago | (#42333625)

Students shouldn't have sufficient privileges to change anything on the computer.

Yes they should. They can do as much damage with a shared user (which I imagine was the case) as with root. And what if they want to install something, like valgrind, vim or clang? It's much easier to do "sudo aptitude install something-with-way-too-many-dependencies" than hunting every single dependency. Sure public machines may be running keylogger software, but if you trust a public machine then you're already screwed one way or the other (how can you be sure it isn't running a hardware keylogger, for instance?).

Re:Wait its possible?! (0)

Anonymous Coward | about 2 years ago | (#42333365)

No, he's saying that about 1% of your users could have issues that aren't related to the program, and crash reports from said systems are likely suspect, and should be ignored.

Re:Wait its possible?! (5, Insightful)

disambiguated (1147551) | about 2 years ago | (#42333499)

You're a better programmer for assuming it's not a compiler bug and trying harder to figure out what you did wrong.

I've been programming professionally for over 20 years, mostly in C/C++ (MSVC, GCC, and recently Clang (and others back in the olden days)). I've seen maybe two serious compiler bugs in the past 10 years. They used to be common.

On the other hand, I can't count how many times I've seen coders insist there must be a compiler bug when after investigation, the compiler had done exactly what it should according to the standard (or according to the compiler vendor's documentation when the compiler intentionally deviated from the standard).

By "serious", I mean the compiler itself doesn't crash, issues no warnings or errors, but generates incorrect code. Maybe I've just been lucky. (Or maybe QA just never found them ;-)

Oh, and btw, yes I realize you were joking (and I found it funny.)

Even better! (5, Funny)

Roger W Moore (538166) | about 2 years ago | (#42333587)

the compiler had done exactly what it should according to the standard...

That's even better - it means that you've found a bug in the standard! ;-)

OsStress (5, Informative)

larry bagina (561269) | about 2 years ago | (#42332567)

Microsoft found similar [msdn.com] impossible bugs when overclocking was involved.

Re:OsStress (2, Informative)

Anonymous Coward | about 2 years ago | (#42332739)

That's not too surprising. For instance if you try to read too fast from memory, the data you read may not be what was actually in the memory location. Some bits may be correct, some may not. Sometimes the incorrect values may relate to the data that was on the bus last cycle, eg there has not been enough time for the change to propagate through. This can easily lead to the data apparently read being a value that should not be possible. This is why overclocking is not a good idea for mission critical systems, although of course it can be fun to push a system a bit harder to get better performance for non critical applications.
John

Re:OsStress (2, Insightful)

DNS-and-BIND (461968) | about 2 years ago | (#42332857)

We all realize that when Intel bakes a bunch of processors, they come out all the same, and then Intel labels some as highspeed, some as middle, and some as low. They are then sold for different prices. However, they are the exact same CPU.

Overclocking isn't the issue, because the CPUs are the same. The problem arises when aggressive overclocking is done by ignorant hobbyists or money-grubbing computer retailers. They overclock the computer to where it crashes, and then back off just a little bit. "There! Now I've got a real MEAN MACHINE," he thinks.

Re:OsStress (5, Insightful)

Anonymous Coward | about 2 years ago | (#42333049)

Bullshit. While Intel does occasionally bin processors into lower speeds to fulfill quotas and such, often times those processors are binned lower because they can't pass the QA process at their full speed. But they can pass the QA process when running at a lower speed. These processors were meant to be the same as the more expensive line, but due to minor defects can't run stably or reliably at the higher speed. Or at least not enough for Intel to sell them at full speed.

Which is a large part of why some processors in the same batch can handle it when others can't.

As much as I hate Intel, I think we could at least realize that they are often times doing this with good reason.

Re:OsStress (2)

DNS-and-BIND (461968) | about 2 years ago | (#42333373)

Nope! It's the same processor. Sure, some come out different, but oftentimes there are loads of perfectly good processors that get underclocked for marketing reasons only. It's not like the ratios come out perfectly every time, which is what you seem to be implying. They often times don't do it with good reason. Intel is very big into marketing. If they were an engineering firm, they'd sell one product at one price and be done with it.

Re:OsStress (0)

Anonymous Coward | about 2 years ago | (#42333685)

Sure, some come out different

Which contradicts:

they come out all the same

So which is it?

Re:OsStress (2, Informative)

Anonymous Coward | about 2 years ago | (#42333145)

We all realize that when Intel bakes a bunch of processors, they come out all the same, and then Intel labels some as highspeed, some as middle, and some as low. They are then sold for different prices. However, they are the exact same CPU.

This is not 100% correct. When Intel or other microprocessor fabricators make the things, they do use the same "mold" to "stamp out" all the processors in large batches based on the same design; however, they don't get the exact same result each time. There are little differences from chip to chip: on this chip some transistors ended up a few atoms closer together than the optimum distance, so that part of the processor heats up more in use; on that chip someone coughed* during the process and smeared the result and it's totally unusable; on this other chip part of the cache memory is fubared and has to be disabled.

The end result is they have a chip and they have to test it to see how well it performs, because of all these variables in the manufacturing process. One chip might be 100% reliable and operate under the desired temperature at clock speed A, while another chip, due to its unique manufacturing imperfections, has problems at clock speed A (either it's too hot, or needs too much voltage, or has calculation errors), but when lowered to clock speed B it works just fine. I believe they call this process "binning" and it's the main thing that separates the chips into different speeds and capabilities.

IT IS HOWEVER a known practice that the chip manufacturers will sometimes take a processor that is just fine and dandy at clock speed A but label it as a slower clock speed B part, because they are running low on clock speed B parts and it makes better financial sense to sell it as such instead of lowering the price on their clock speed A parts. Sometimes it's more than a clock speed; sometimes it's the intentional disabling of capabilities of the processor to make it match their budget models, like disabling some of the on-board cache memory or some of the (working) cores.

What it comes down to is that it costs the processor manufacturers the exact same amount to make all the different speed processors in a given family, but they don't all come out the same. The worst ones are put on the low end, the better quality ones on the high and expensive end, and sometimes there is a perfect high quality one that is sold as a low end one because they need to produce and ship more low quality ones. If you get one of those, then consider yourself lucky and overclock the shit out of it. All processors can be overclocked, as the manufacturers make the official speed the 100% stable and error free operation with a normal (not aftermarket) cooling solution that will last for the lifetime of the warranty. You just sometimes get lucky and have a processor that is super easily overclocked because it could have been labeled and sold as a higher speed to begin with. This is never guaranteed however.

*Yeah, it's more complicated than that; they aren't using molds that they press, they are using some complicated look-it-up-and-read-if-you-are-really-interested stuff to make the things so amazingly small.

Re:OsStress (1)

Anonymous Coward | about 2 years ago | (#42333169)

That's so shockingly wrong that I hope no one confuses it for the truth. I assume you have no actual engineering experience with Si process technology, validation of processors, or anything related to electrical engineering.

A fab process has a target set of parameters. Sometimes there are deviations. A processor is designed to yield well at a certain frequency for the process targets. Each wafer is slightly different. Individual die on a wafer are different. Some are outright flawed and thrown away. Some have higher static leakage and must be run at lower voltage to hit the TDP spec, and often are down-binned. Some are slower and some are faster. There is a complex process on a high volume tester to find the sweet spot of frequency, voltage, and TDP to bin a part in the right SKU. Then, each part is dynamically fused with various voltage levels for min frequency, max frequency, turbo frequency, etc.

On a really mature process node with a really mature design that's not being pushed to the limits of performance, all the processors may be mostly the same. Not the latest and greatest stuff.

Re:OsStress (0)

Anonymous Coward | about 2 years ago | (#42333201)

We all realize that when Intel bakes a bunch of processors, they come out all the same, and then Intel labels some as highspeed, some as middle, and some as low. They are then sold for different prices. However, they are the exact same CPU.

No, the CPUs are not exactly the same. Each and every CPU is unique, just like a snowflake. None of them is entirely perfect; otherwise they'd all run exactly the same at any speed. And we know that they don't.

Re:OsStress (5, Interesting)

Anonymous Coward | about 2 years ago | (#42332985)

Then again, it might not be overclocking after all [msdn.com] .

More relevantly, Microsoft has access to an enormous wealth of data about hardware failures from Windows Error Reporting. This paper [microsoft.com] has some fascinating data in it:

- Machines with at least 30 days of accumulated CPU time over an 8 month period had a 1 in 190 chance of crashing due to a CPU subsystem fault
- Machines that crashed once had a probability of 1 in 3.3 of crashing a second time
- The probability of a hard disk failure in the first 5 days of uptime is 1 in 470
- Once you've had one hard disk failure, the probability of a second failure is 1 in 3.4
- Once you've had two failures, the probability of a third failure is 1 in 1.9

Conclusion: When you get a hard disk failure, replace the drive immediately.

Caution: (5, Funny)

fahrbot-bot (874524) | about 2 years ago | (#42332597)

Bug hunts on LV-426 [wikipedia.org] often end badly.

Re:Caution: (1)

asliarun (636603) | about 2 years ago | (#42332643)

It's them damn cosmic rays, I tell ya.

The death of Moore's law, they will be.

Re:Caution: (1)

fragMasterFlash (989911) | about 2 years ago | (#42333069)

It's them damn cosmic rays, I tell ya.

The death of Moore's law, they will be.

Or the reason semiconductor houses switch from conventional (bulk CMOS) processes to Silicon-on-Insulator [wikipedia.org]. Many SOI processes are rad-hardened by default.

Re:Caution: (0)

Anonymous Coward | about 2 years ago | (#42333535)

Maybe they want to control leakage at small geometries?
Rad hardening is a side effect, not a feature they are after.

Besides, to have true rad hardening, there are other things that need to be taken care of, not just the chip, e.g.:
- Shielding the package against radioactivity, to reduce exposure of the bare device.
- Shielding the chips themselves by use of depleted boron (consisting only of isotope Boron-11) in the borophosphosilicate glass passivation layer protecting the chips, as boron-10 readily captures neutrons and undergoes alpha decay (see soft error).
- Redundant elements can be used at the system level.
- Redundant elements may be used at the circuit level.
- Hardened latches may be used.
http://en.wikipedia.org/wiki/Radiation_hardening

Re:Caution: (2)

Roger W Moore (538166) | about 2 years ago | (#42333651)

Many SOI processes are rad hardened by default.

Rad hard usually means that they are not damaged by radiation e.g. you can stick them close to an LHC beam as part of a detector and the massive radiation dose they receive will not cause the device to permanently cease functioning (or at least last longer before it fails). On the other hand cosmic rays which slow down and stop in material can cause a large amount of local ionization. This can be enough to flip the state of a memory bit which can cause crashes. As devices get smaller the charge needed to flip the state gets less and so more cosmic rays are capable of depositing enough charge to make a difference. These two processes are different: one is a permanent failure of the device the other is just temporarily flipping the state.

Re:Caution: (0)

Anonymous Coward | about 2 years ago | (#42332789)

Obligatory:
Nuke them from orbit. It's the only way to be sure.

I don't believe 1% of computers give wrong answers (0, Flamebait)

PineGreen (446635) | about 2 years ago | (#42332659)

I think this is bull. I just don't believe 1% of computers give wrong answers. There are many reasons why a precomputed table might differ - threading, reordering of floating point operations, etc. Basically, compilers guarantee a certain precision, not a bit-for-bit deterministic result (unless you set certain IEEE flags, which are not on by default).

Re:I don't believe 1% of computers give wrong answ (1)

Desler (1608317) | about 2 years ago | (#42332743)

I think this is bull. I just don't believe 1% of computers give wrong answers.

Why would he lie about it?

Re:I don't believe 1% of computers give wrong answ (5, Insightful)

PaladinAlpha (645879) | about 2 years ago | (#42332769)

You don't have any idea what you're talking about, and that's why you don't understand what he's talking about.

Re:I don't believe 1% of computers give wrong answ (4, Informative)

godrik (1287354) | about 2 years ago | (#42332781)

I actually believe it. I am sure they thought of the floating point precision problem. But most likely they only used integers. That's what Prime95 and memtest are doing. Integer and memory operations uncover the most common hardware failures. I have encountered many computers with hardware that was faulty under stress. And I am sure Guild Wars was stressful.

Re:I don't believe 1% of computers give wrong answ (2)

UnknownSoldier (67820) | about 2 years ago | (#42332955)

> I actually believe it. I am sure they thought of the floating point precision problem.

I can believe it. Ten years ago, on one of the PC games I worked on, there were significant floating-point differences between Intel and AMD. Fortunately it was an RTS so we could get away with fixed-point. If we had been forced to deal with floats it would have been a hassle to keep them "in sync."

Floating-point is an approximation anyways, so IMHO

a) the server should be making the authoritative decision(s), and
b) should be sending a quantized result to the clients.
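A minimal sketch of the kind of fixed-point arithmetic the parent describes (Q16.16 here; the names are illustrative): because it is all integer math, every conforming CPU produces identical bits, so lock-step clients can't drift apart the way mixed Intel/AMD float paths could.

#include <stdio.h>
#include <stdint.h>

typedef int32_t fix16;                         /* Q16.16: 16 integer, 16 fraction bits */
#define FIX_ONE ((fix16)1 << 16)

static fix16 fix_from_int(int32_t x) { return (fix16)((uint32_t)x << 16); }
/* >> on the negative product is an arithmetic shift on all the usual targets */
static fix16 fix_mul(fix16 a, fix16 b) { return (fix16)(((int64_t)a * b) >> 16); }
static fix16 fix_div(fix16 a, fix16 b) { return (fix16)(((int64_t)a * FIX_ONE) / b); }

int main(void)
{
    fix16 speed = fix_div(fix_from_int(3), fix_from_int(2));   /* 1.5, exactly  */
    fix16 dist  = fix_mul(speed, fix_from_int(10));            /* 15.0, exactly */
    printf("speed=0x%08x dist=%d\n", (unsigned)speed, dist / FIX_ONE);
    return 0;
}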

Re:I don't believe 1% of computers give wrong answ (2)

tepples (727027) | about 2 years ago | (#42333255)

Fortunately it was an RTS so we could get away with fixed-point.

Does it really vary by genre? For a game world the size of Liechtenstein, a 32-bit fixed-point length gives precision down to 10 microns or so. And even in a vast open world, you start to get glitches like the far lands in Minecraft [minecraftwiki.net] if you stray more than 12.5 million units from the origin.
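A quick back-of-the-envelope check of that precision figure; the 25 km length used for Liechtenstein here is a rough assumption of mine:

#include <stdio.h>

int main(void)
{
    double world_m = 25000.0;        /* Liechtenstein is roughly 25 km end to end      */
    double steps   = 2147483648.0;   /* 2^31 positive steps of a signed 32-bit axis    */
    printf("%.1f micrometers per unit\n", world_m / steps * 1e6);   /* about 11.6 um */
    return 0;
}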

Re:I don't believe 1% of computers give wrong answ (5, Insightful)

Jeremi (14640) | about 2 years ago | (#42332851)

I think this is bull. I just don't believe 1% of computers give wrong answers

1% of all computers? Probably not.

1% of gamers' computers, in an era when PC gaming technology was progressing very quickly, and so gamers were often running overclocked (or otherwise poorly set up) hardware? Sounds plausible enough.

Re:I don't believe 1% of computers give wrong answ (0)

Anonymous Coward | about 2 years ago | (#42333093)

I believe it, I am actually surprised by the low number. The problem with memory is that a simple memory test often fails to detect errors, so a POST will not find anything but the easiest faults.

memtest86+ is actually really smart; it tries a lot of different kinds of access methods and bit patterns for certain types of errors. Even memtest86+ can fail to detect memory errors.

Interestingly enough, the best test for memory errors they found is using the gcc compiler: try to compile Gentoo (including X, KDE, and Perl (Perl is huge)) with many parallel compiles so that it will hit swap slightly, and hope that it won't segfault.

This article sounds interesting, I haven't read it yet. On Dr. Dobbs there was also an article not long ago about checking (like fsck) your application data structures continuously to check for errors. I am almost thinking that you should run a CRC on all your data structures. Nothing is worse than a customer who has faulty memory.

Re:I don't believe 1% of computers give wrong answ (2)

Alwin Henseler (640539) | about 2 years ago | (#42332861)

I won't go into the specific reasons you mention, but it is perfectly possible to write code that has a known, fully deterministic result. After all: compilers produce machine code, and the bulk of that is integer operations which have exactly defined behavior with zero room for interpretation (when it comes to digital logic like CPUs, "defined" means deterministic). Maybe there are exceptions (like floating point? don't count on it), and maybe for some types of operations you need to sidestep the compiler and code some assembly directly, but that's beside the point.

With that in hand, expect some of the computed results to turn out wrong. Knowing what junk parts go into computers sometimes, how shoddily some machines are built, and how some people abuse their computers, I'd think a 1% failure rate is probably on the low end of the scale.

For example, try running Memtest86 [memtest.org] sometime, leave it running for a few hours, repeat for other computers you encounter, and see how many computers you need to try before you see it spit out errors. You might be surprised.
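The core of such a test is just writing patterns and reading them back. A user-space approximation in C for illustration; the real Memtest86 runs bare-metal against physical RAM, while a normal process only exercises whatever pages the OS hands it:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* Write a pattern across the buffer, then read it back and count mismatches. */
static size_t pattern_pass(volatile uint64_t *mem, size_t words, uint64_t pattern)
{
    size_t errors = 0;
    for (size_t i = 0; i < words; i++)
        mem[i] = pattern;
    for (size_t i = 0; i < words; i++)
        if (mem[i] != pattern)
            errors++;
    return errors;
}

int main(void)
{
    const size_t words = 64u * 1024 * 1024 / sizeof(uint64_t);   /* test 64 MiB */
    volatile uint64_t *mem = malloc(words * sizeof(uint64_t));
    if (!mem) return 1;
    const uint64_t patterns[] = { 0x0000000000000000ull, 0xFFFFFFFFFFFFFFFFull,
                                  0xAAAAAAAAAAAAAAAAull, 0x5555555555555555ull };
    size_t errors = 0;
    for (size_t p = 0; p < sizeof patterns / sizeof patterns[0]; p++)
        errors += pattern_pass(mem, words, patterns[p]);
    printf("%zu errors\n", errors);
    free((void *)mem);
    return errors != 0;
}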

Re:I don't believe 1% of computers give wrong answ (5, Insightful)

MtHuurne (602934) | about 2 years ago | (#42332865)

He said 1% of computers that were used to play Guild Wars gave wrong answers. Gaming PCs are more likely to be overclocked too far, have under-dimensioned power supplies or overheating issues than the average PC. 1% doesn't sound unrealistically high to me.

Re:I don't believe 1% of computers give wrong answ (2)

Nyder (754090) | about 2 years ago | (#42333343)

He said 1% of computers that were used to play Guild Wars gave wrong answers. Gaming PCs are more likely to be overclocked too far, have under-dimensioned power supplies or overheating issues than the average PC. 1% doesn't sound unrealistically high to me.

Guild Wars runs okay on a crappy netbook, let alone anyone's PC. If you need to OC to play Guild Wars then you are using a Pentium III.

Just saying.

I've fixed a lot of PCs. In fact, I pride myself on the fact that I can usually figure out what is wrong with a computer, software or hardware. And I've seen some funky hardware in my time. And I've lost a lot of hardware to it going bad.

I bet more than 1% of computers are problematic and the owners don't know it. The last chunk of memory could be bad, but if they don't use all the memory, they might never find that out (unless they test it). They might dismiss random shutdowns without understanding there is a problem.

In my experience, there is a lot of computer hardware out there that is crappily made and shouldn't be sold, let alone in someone's computer.

Re:I don't believe 1% of computers give wrong answ (1)

gadzook33 (740455) | about 2 years ago | (#42333559)

I don't think this should be modded as flamebait. Personally I view the article with a similar degree of skepticism and incredulity. I have a good friend who works at a major chip manufacturer and specializes in fault detection. He related to me that, essentially they have never seen a case of an undetected (by the CPU) fault, despite running tests like this on massively huge systems. Between a game programmer and the company who makes their bread and butter doing this, I'm going to have to go with the latter until someone posts code or something more concrete than what I view as a lot of speculation.

stress test (1)

Anonymous Coward | about 2 years ago | (#42332665)

In my field, if you can survive a gcc (gcc.gnu.org) testsuite run, twice, and get the same answer, you have a verified good system. If not, you have a steaming pile of trash you should throw away. That begins, and ends, all the stress testing you need to do.

Re:stress test (5, Funny)

SJHillman (1966756) | about 2 years ago | (#42332729)

In my field, I have a bunch of grass, a few shrubs and even a small tree. Lots of rodents and birds. If a computer can survive two weeks sitting in my field and still power on, you have a damned good system. If not, you're left with people wondering why you left your computer in my field for two weeks.

Re:stress test (5, Funny)

AaronLS (1804210) | about 2 years ago | (#42332773)

He didn't say anything about a computer: "In my field, if YOU can survive"... scary...

Re:stress test (1)

AaronLS (1804210) | about 2 years ago | (#42332831)

Funny though, I like what you did there.

Re:stress test (1)

godrik (1287354) | about 2 years ago | (#42332799)

so do you suggest guildwars incorporate a gcc testsuite run in parallel with the game?

except if GCC is wrong (1)

decora (1710862) | about 2 years ago | (#42333163)

which, well, it can be.

Memory modules (0)

stanlyb (1839382) | about 2 years ago | (#42332767)

I found this out the hard way: by buying different DIMM modules, combining them, and of course NOT combining them. Nevertheless, when you do a lot of multitasking, play 2-3 games... ok, ok, only one, but the other two are still in the background, have some VMs running, etc., you will find out how fragile computers are nowadays. The solution? Buy some DELL, and be happy with the most stable, and least performant, computer.

How to deal with compiler bugs (5, Insightful)

MtHuurne (602934) | about 2 years ago | (#42332779)

If you suspect the compiler is generating invalid machine code, try to make a minimal test case for it. If you succeed, file a bug report and add that test case; the compiler developers will appreciate it. If you don't succeed in finding a minimal test case that triggers the same issue, it's likely not a compiler bug but an issue in your program in some place where you weren't expecting it.

Re:How to deal with compiler bugs (0)

Anonymous Coward | about 2 years ago | (#42332881)

No kidding captain obvious, that's standard bug reporting 101.

http://www.chiark.greenend.org.uk/~sgtatham/bugs.html

How to lose time and sanity (4, Interesting)

Okian Warrior (537106) | about 2 years ago | (#42333089)

If you suspect the compiler is generating invalid machine code, try to make a minimal test case for it. If you succeed, file a bug report and add that test case; the compiler developers will appreciate it. If you don't succeed in finding a minimal test case that triggers the same issue, it's likely not a compiler bug but an issue in your program in some place where you weren't expecting it.

Yeah, right. Let's see how that works out in practice.

I go to the home page of the project with bug in hand (including sample code). Where do I log the problem?

I have to register with your site. One more external agent gets my E-mail, or I have to take pains to manage multiple E-mails to avoid spam. (I don't want to be part of your community! I just thought you wanted to make your product better.)

Once registered, I'm subscribed to your newsletter. (My temp E-mail has been getting status updates from the GCC crowd for years. My mail reader does something funky with the subject line, so responding with "unsubscribe" doesn't work for me.)

Once entered, my E-mail and/or name is publicly available on the bug report for the next millennium. In plain text in the bug report, and sometimes in the publicly-accessible changelog - naked for the world to see (CPAN is especially fragrant).

Sometimes the authors think it's the user's problem (no, really? This program causes gcc to core dump. How can that be *my* fault?). Sometimes the authors interpret the spec differently from everyone else (Opera - I'm looking at you). Sometimes you're just ignored, sometimes they say "We're rewriting the core system, see if it's still there at the next release", and sometimes they say "it's fixed in the next release, should be available in 6 months".

What you really do is figure out the sequence of events that causes the problem, change the code to do the same thing in a different way (which *doesn't* trigger the error), and get on with your life. I've given up reporting bugs. It's a waste of time.

That's how you deal with compiler bugs: figure out how to get around them and get on with your work.

No, I'm not bitter...

Re:How to lose time and sanity (1)

BertieBaggio (944287) | about 2 years ago | (#42333577)

Once entered, my E-mail and/or name is publicly available on the bug report for the next millennium. In plain text in the bug report, and sometimes in the publicly-accessible changelog - naked for the world to see (CPAN is especially fragrant).

Well, at least it smells nice.

Compilers (4, Funny)

Mullen (14656) | about 2 years ago | (#42332847)

For such a skilled developer, I can't believe he didn't think that Dev/Test/Prod build environments running different versions of the compiler would be an issue (obviously, until it was an issue).

That's Development Cycle 101.

Re:Compilers (0)

Anonymous Coward | about 2 years ago | (#42333071)

It's funny the amount of 'not my problem' that creeps into the dev process. You are working on your code. There is a guy in the corner whose JOB is to make sure it builds. But only for dev. QA has their own build guy because the head of QA wants to build it himself because of 'process'. So now you have 2 builds. You have people who ignore known bugs because they 'do not want to rock the boat'. You have people shipping with ancient tools for years on end because 'upgrading is too big of a hassle'.

Then you have people thinking Visual Studio 6 is cutting edge, who do not want to upgrade and only grudgingly move up to 2002 ('more well tested').

People are funny about dev tools. They think they somehow get better with age.

Re:Compilers (1)

disambiguated (1147551) | about 2 years ago | (#42333665)

Wow, I feel for you. If QA is not testing against the same build the developers are using, they're doing it horribly wrong. Or did you mean QA is doing their own build for their testing tools? That I can understand.

12 hours a day for weeks on end (2)

decora (1710862) | about 2 years ago | (#42333175)

i can't believe you don't understand that the brain doesn't work 100% reliably when you force it past the breaking point like this. it's work 101.

Re:Compilers (0)

Anonymous Coward | about 2 years ago | (#42333197)

All screw-ups of this nature are totally obvious in hindsight.

This isn't a case of him saying "it doesn't matter whether the build server is up to date". This is a case of nobody remembering to update the build server. If they had been a large enough company to have a single person whose full-time job was to support build servers, it would have been that person's job, but I'm going to guess that Guild Wars was made by a relatively small team who wore multiple hats. The story certainly makes it sound that way. So it was, at the same time, everyone's fault and no-one's fault; anyone could have remembered to update the server, but no-one did, and no single person was uniquely in charge of maintaining the build server.

The ease of making mistakes is why very important jobs, like airplane pilot, rely so much on checklists. If there had been a "Development Cycle 101 Checklist" available to the Guild Wars team at the time of the story, I'm sure they would have used it, and when going down the checklist they would have said "oh, check that the build machine is up to date."

So, it's easy to laugh at them for this mistake, but I think of mistakes I have made and I don't look down on them for this one. He posted it so that others could learn from his experience rather than learning this the hard way.

QA fail (2)

Alwin Henseler (640539) | about 2 years ago | (#42333367)

Worse, the article hints at a bigger problem:

"We had "pushed" a new build out to end-users, and now none of them could play the game!"

Which I read as: developers write & debug code, that code goes through a build server which builds it & combines it with game data etc., and the result of that is pushed to users. The obvious step missing here: make sure the exact same stuff you're pushing to users is working & tested thoroughly before release. Seems like a gaping Quality Assurance fail right there, forget differences between developer and production systems.

Skip that step and you're implicitly assuming that correct code (like, what's known to work well on the developer's system) will produce a correct working end product. Even if the developer's system and production systems are configured 100% the same, that assumption is still flawed: there's always the possibility of file corruption, e.g. a random single-bit error that occurs somewhere during the build process, or anything else that goes into the end product which a developer doesn't check directly.

Of course it's best to make sure individual steps in the process are reliable, but whatever you do: at the very least check what you kick out the door. QA 101.

shows what happens pulling 12+ hour days does (0)

Joe_Dragon (2206452) | about 2 years ago | (#42332899)

Shows what pulling 12+ hour days does; at least nobody died because of it this time. Yes, that has happened in the past with trains, trucks, airplanes, etc.

Yep, seen it all (5, Insightful)

russotto (537200) | about 2 years ago | (#42332959)

I've had compilers miscompile my code, assemblers mis-assemble it, and even in a few cases CPUs mis-execute it consistently (look up CPU6 and msp430). Random crashes due to bad memory/cpu... yep. But on very rare occasions, I find that the bug is indeed in my own code, so I check there first.

Re:Yep, seen it all (1)

Belial6 (794905) | about 2 years ago | (#42333143)

I hate it when the bugs are not in my code.

Typical for safety cert programs (5, Interesting)

Okian Warrior (537106) | about 2 years ago | (#42332971)

We deal with this type of bug all the time in safety-certified systems (medical apps, aircraft, &c).

Most of the time an embedded program doesn't use up 100% of the CPU time. What can you do in the idle moments?

Each module supplies a function "xxxBIT" (where "BIT" stands for "Built In Test") which checks the module variables for consistency.

The serial driver (SerialBIT) checks that the buffer pointers still point within the buffer, checks that the serial port registers haven't changed, and so on.

The memory manager knows the last-used static address for the program (ie - the end of .data), and fills all unused memory with a pattern. In its spare time (MemoryBIT) it checks to make sure the unused memory still has the pattern. This finds all sorts of "thrown pointer" errors. (Checking all of memory takes a long time, so MemoryBIT only checked 1K each call.)
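A self-contained sketch of what such a check might look like; the names, sizes, and the simulated region are invented, and on the real target the region would be the actual unused RAM between the end of .data and the stack:

#include <stdint.h>
#include <stddef.h>

#define FILL_PATTERN 0xA5u
#define SLICE_BYTES  1024u

/* Stand-in for the real "unused RAM" region. */
static uint8_t unused_region[16 * 1024];

static void MemoryINIT(void)
{
    for (size_t i = 0; i < sizeof unused_region; i++)
        unused_region[i] = FILL_PATTERN;
}

/* Called from idle time: checks one 1K slice per call so a full sweep is
   spread across many calls. Returns 0 if a thrown pointer scribbled here. */
static int MemoryBIT(void)
{
    static size_t cursor;
    for (size_t i = 0; i < SLICE_BYTES; i++) {
        if (unused_region[cursor] != FILL_PATTERN)
            return 0;
        cursor = (cursor + 1) % sizeof unused_region;
    }
    return 1;
}

int main(void)
{
    MemoryINIT();
    return MemoryBIT() ? 0 : 1;
}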

The stack pointer was checked - we put a pattern at the end of the stack, and if it ever changed we knew something went recursive or used too much stack.

The EEPROM was checksummed periodically.

Every module had a BIT function and we check every imaginable error in the processor's spare time - over and over continuously.

Also, every function began with a set of ASSERTs that check the arguments for validity. These were active in the released code. The extra time spent was only significant in a handful of functions, so we removed the ASSERTs only in those cases. Overall the extra time spent was negligible.
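Since <assert.h> asserts are normally compiled out by NDEBUG in release builds, keeping them active means rolling your own macro. A sketch of the idea, with the macro name and logging function invented:

#include <stdio.h>

static void log_fault(const char *file, int line, const char *expr)
{
    /* In the real system this would go to a fault log, not stderr. */
    fprintf(stderr, "ASSERT failed: %s at %s:%d\n", expr, file, line);
}

/* Unlike assert(), this is never compiled out, and it logs instead of aborting. */
#define BIT_ASSERT(cond)                              \
    do {                                              \
        if (!(cond))                                  \
            log_fault(__FILE__, __LINE__, #cond);     \
    } while (0)

static int set_speed(int percent)
{
    BIT_ASSERT(percent >= 0 && percent <= 100);   /* argument check at function entry */
    return percent;
}

int main(void)
{
    set_speed(50);
    set_speed(250);   /* out of range: logged, but execution continues */
    return 0;
}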

The overall effect was a very "stiff" program - one that would either work completely or wouldn't work at all. In particular, it wouldn't give erroneous or misleading results: showing a blank screen is better than showing bad information, or even showing a frozen screen.

(Situation specific: Blank screen is OK for aircraft, but not medical. You can still detect errors, log the problem, and alert the user.)

Everyone says to only use error checking during development, and remove it on released code. I don't see it that way - done right, error checking has negligible impact, and coupled with good error logging it can turbocharge your bug-fixing.

More error checking (5, Interesting)

Okian Warrior (537106) | about 2 years ago | (#42333621)

My previous was modded up, so here's some more checks.

During boot, the system would execute a representative sample of CPU instructions, in order to test that the CPU wasn't damaged. Every mode of memory storage (ptr, ptr++, --ptr), add, subtract, multiply, divide, increment &c.

During boot, all memory was checked - not a burn-in test, just a quick check for integrity. The system wrote 0x00, 0xFF, 0xA5, 0x5A and read the output back. This checked for wires shorted to ground/VCC, and wires shorted together.

During boot, the .bss segment was filled with a pattern, and as a rule, all programs were required to initialize all of their static variables. Each routine had an xxxINIT function which was called at boot. You could never assume a static variable was initialized to zero - this caught a lot of "uninitialized variable" errors.

(This allowed us to reinitialize specific subsystems without rebooting the whole system. Call the SerialINIT function, and don't worry about reinitializing that section's static vars.)

The program code was checksummed (1K at a time) continuously.

When filling memory, what pattern should you use? The theory was that any program using an uninitialized variable would crash immediately because of the pattern. 0xA5 is a good choice:

1) It's not 0, 1, or -1, which are common program constants.
2) It's not a printable character
3) It's a *really big* number (negative or unsigned), so array indexing should fail
4) It's not a valid floating point or double
5) Being odd, it's not a valid pointer

Whenever we use enums, we always start the first one at a different number; ie:

enum Day { Sat = 100, Sun, Mon... }
enum Month { Jan = 200, Feb, Mar, ... }

Note that the enums for Day aren't the same as Month, so if the program inadvertently stores one in the other, the program will crash. Also, the enums aren't small integers (ie - 0, 1, 2), which are used for lots of things in other places. Storing a zero in a Day will cause an error.

(This was easy to implement. Just grep for "enum" in the code, and ensure that each one starts on a different "hundred" (ie - one starts at 100, one starts at 200, and so on).)

The nice thing about safety cert is that the hardware engineer was completely into it as well. If there was any way for the CPU to test the hardware, he'd put it into the design.

You could loopback the serial port (ARINC on aircraft) to see if the transmitter hardware was working, you could switch the A/D converters to a voltage reference, he put resistors in the control switches so that we could test for broken wires, and so on.

(Recent Australian driver couldn't get his vehicle out of cruise-control because the on/off control wasn't working. He also couldn't turn the engine off (modern vehicle) nor shift to neutral (shift-by-wire). Hilarity ensued. Vehicle CPU should abort cruise control if it doesn't see a periodic heartbeat from the steering-wheel computer. But, I digress...)

If you're interested in software safety systems, look up the Therac-25 sometime. In particular, the analysis of its software bugs. Had the system been peppered with ASSERTs, no deaths would have occurred.

P.S. - If you happen to be building a safety cert system, I'm available to answer questions.

Stress testing: most critical Overclocking step! (2)

epyT-R (613989) | about 2 years ago | (#42332983)

This is why stress testing is so important. The system may seem stable at overclocked speeds, but only while it is lightly or even moderately loaded, and not every error will result in a kernel panic. The hardest instabilities to track down are often the subtle errors that cause cascades elsewhere, minutes or hours after the load has finished.

I start by getting it stable enough to pass memtest86+ tests 5 and 7 at (or as close as possible to) my target frequencies/dividers. This is pretty easy to do nowadays, but it's a good sanity check starting point before booting the OS and minimizes gross misconfigurations that cause filesystem corruption. Then I run prime95, then linpack, then y-cruncher, then loops of a few 3dmark versions. Sometimes I run the number crunchers simultaneously across all cores, first configured to stress the cpu/cache, then with large sets to stress ram (but not swap! in fact turn swap off for this). The minimum time for all of this really should be 12 hrs.. 24 is best, or more if you're paranoid. A variety of loads over this time is important because the synthetic ones are often highly repetitious, and this can sometimes fail to expose problems despite the load the system's under. The 3dmark (or pick a scriptable util of your choice) stresses bus IO as well as all the really cranky and picky gfx driver code. As a unique stressor, I use a quake 3 map compile that eats most of the ram and pegs the cpu for hours.. q3map2 is a bitch and it usually finds those subtle 'non-fatal' hardware errors if they exist.

If the boot survives without an application or kernel crash (or other wonky behavior), I run a few games in timedemo loops. In the old days this was quake1/2/3, but these days I stick with games like Metro 2033 which have their own bench utilities. These tests are still valid even if your intended use is 'workstation' class work and you don't game much, but still want to squeeze as much performance as you can from your hardware. I do both with mine and have had great success with this method.

Random bluescreens (0)

Anonymous Coward | about 2 years ago | (#42332999)

I get hilarious impossible errors on my gaming rig all the time. Memtest revealed my RAM is throwing errors about once in a couple million memory transactions. Leaky gate I guess. Not about to replace the RAM any time soon, though -- nobody is really making DDR2 anymore, and the utterly random errors I get when data comes back incorrect are very instructive as to which companies practice sanity checking and which skip it. Oh, and how annoying it can be when Windows decides device driver errors are worthy of halting the OS over.

Re:Mod Down (0)

Anonymous Coward | about 2 years ago | (#42333113)

Ok trolling is one thing, but what you wrote is a sick joke.

Re:Random bluescreens (2)

Just Brew It! (636086) | about 2 years ago | (#42333141)

If every value read from memory had to be sanity checked there would be little CPU horsepower left to perform useful work. Furthermore, if the bad value happens to be code instead of data, the application (or OS) is probably gonna crash before you even have a chance to check anything.

Don't forget to do out of the box testing (1)

Joe_Dragon (2206452) | about 2 years ago | (#42333009)

Don't forget to do out-of-the-box testing / testing for stuff that you may not think of offhand.

Reminded me of my first C application (3, Interesting)

mykepredko (40154) | about 2 years ago | (#42333039)

I can't remember the exact code sequence, but in a loop, I had the statement:

if (i = 1) {

Where "i" was the loop counter.

Most of the time, the code would work properly, as other conditions would take program execution elsewhere, but every once in a while the loop would continue indefinitely.

I finally decided to look at the assembly code and discovered that in the conditional statement, I was setting the loop counter to 1 which was keeping it from executing.

I'm proud to say that my solution to preventing this from happening is to never place a literal last in a condition; instead it always goes first, like:

if (1 = i) {

So the compiler can flag the error.

I'm still amazed at how rarely this trick is taught in programming classes and how many programmers it still trips up.

myke
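A quick illustration of the point; the middle line is left commented out precisely because it will not compile, which is what makes the habit useful:

#include <stdio.h>

int main(void)
{
    int i = 0;

    if (i = 1)           /* legal C: assigns 1, then tests it; gcc -Wall only warns */
        printf("typo silently treated as true\n");

    /* if (1 = i)           error: lvalue required as left operand of assignment   */

    if (1 == i)          /* the intended comparison, written literal-first          */
        printf("comparison\n");

    return 0;
}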

Re:Reminded me of my first C application (0)

Anonymous Coward | about 2 years ago | (#42333085)

how would "assign 1 to value i" ever return false?

Re:Reminded me of my first C application (0)

Anonymous Coward | about 2 years ago | (#42333207)

how would "assign 1 to value i" ever return false?

It wouldn't, but partly because it wouldn't compile.

Re:Reminded me of my first C application (1, Informative)

mykepredko (40154) | about 2 years ago | (#42333241)

The statement:

if (i = 1) {

is equivalent to:

i = i;
if (i) { // Always true because i = 1 and i != 0

myke

Re:Reminded me of my first C application (5, Informative)

richardcavell (694686) | about 2 years ago | (#42333413)

I just want to correct this, not to prove how smart I am but because there are novice programmers out there who will learn from this case. The statement:

if (i = 1) {

is equivalent to:

i = 1; /* correction */
if (i) {

Re:Reminded me of my first C application (0)

Anonymous Coward | about 2 years ago | (#42333283)

Took me a second too, but what he was saying is that he had an error that the compiler didn't flag because the syntax was correct. By putting the literal first, the assignment would need the literal to be an lvalue, which it isn't, and thus the compiler will throw an error.

Re:Reminded me of my first C application (0)

Anonymous Coward | about 2 years ago | (#42333121)

One of the reasons why C++, C# and Java use the "==" operator for Boolean evaluation. AFAIK, most modern C compiler implementations warn about this as well, to prevent Boolean evaluations from becoming assignments. The only time I really remember encountering this issue was decades ago when I was still learning to program BASIC on my VIC 20, and C for me was still just the third letter in the alphabet.

another trick: stop mixing up testing + assignment (1)

decora (1710862) | about 2 years ago | (#42333203)

anything that both assigns and tests a loop index is by definition a fucking accident waiting to happen. it's like driving a car without wearing a seatbelt and then deciding the 'solution' is to put the steering wheel in the back seat instead of the front.

Re:another trick: stop mixing up testing + assignm (1)

mykepredko (40154) | about 2 years ago | (#42333225)

I agree, but

if (i = 1) {

is a perfectly valid "C" (and Java) statement - there was no intention of putting an assignment in a conditional statement.

Modern compilers now issue warnings on statements like this, but at the time nothing was returned.

myke

Re:another trick: stop mixing up testing + assignm (0)

Anonymous Coward | about 2 years ago | (#42333649)

It's not valid Java. In Java the condition must be a boolean, so an int assignment like that inside an if or loop condition won't compile.

Re:Reminded me of my first C application (4, Informative)

safetyinnumbers (1770570) | about 2 years ago | (#42333213)

That's known as "Yoda style" [codinghorror.com]

Re:Reminded me of my first C application (1)

mykepredko (40154) | about 2 years ago | (#42333301)

The reference, like, do I.

myke

Re:Reminded me of my first C application (1)

dido (9125) | about 2 years ago | (#42333363)

Which is why I always compile with -Wall -Werror on gcc. I get: "warning: suggest parentheses around assignment used as truth value [-Wparentheses]" for code which looks like that. I consider code that generates compiler warnings as being a bad sign, and always make it a point to clean them up before considering any code suitable. I don't know why this doesn't seem to be as widely done as it should be.

Re:Reminded me of my first C application (0)

Anonymous Coward | about 2 years ago | (#42333379)

Most modern IDEs are smart enough to flag this visually, and can usually be configured to flag this at an "error" level. What should concern you is the lack of interest shown by most developers in making the best use of their IDE.

If you're a die-hard VI or Emacs dude... well, there really is no hope then, is there?

Confusing poing in parent article (1)

mykepredko (40154) | about 2 years ago | (#42333403)

Hiya,

I just noticed the confusion when I put in "keeping it from executing" when I meant to say "keeping it from exiting [the loop]".

Sorry about that,

myke

Re:Confusing poing in parent article (1)

mykepredko (40154) | about 2 years ago | (#42333411)

Should I even note that I misspelled "point"?

myke

Re:Reminded me of my first C application (1)

MichaelSmith (789609) | about 2 years ago | (#42333449)

if ( 1 == i ) {

Re:Reminded me of my first C application (0)

Anonymous Coward | about 2 years ago | (#42333673)

if ( 1 == i ) {

What type of error/warning does that result in?

Re:Reminded me of my first C application (1)

Bill Currie (487) | about 2 years ago | (#42333687)

Ick, that's what real compilers (eg, gcc) are for: good warning messages (such as "suggest parenthesis around assignment used as truth value"), and better yet, -Werror. "if (1 == i)" is completely unnatural (for an English speaker anyway), which makes it more likely to forget to do 1 == i than it is to forget to double the equals sign. I too used to make the same mistake when I first started with C (having come from Pascal: that was fun := became =, = became ==), but I quickly learned to double check my tests first when bug hunting. While mixing up = and == has become extremely rare for me (it helps that I usually test against some variant of 0 and thus can avoid using any operator other than !), I often mix up my other tests...

Bugs A Noy (1)

ios and web coder (2552484) | about 2 years ago | (#42333091)

In my own coding, I tend to *gasp* make mistakes. Sometimes, really, really dumb ones.

One of the biggest problems with my coding, is that I am often the only real coder looking at it. Even my FOSS work seldom gets reviewed by coders.

I can't say enough about peer review. I wish I had more. It can really suck, as one thing that geeks LOVE to do is cut down other geeks. However, they are sometimes right, and should be heard.

Negative feedback makes the product better. Positive feedback makes the producer feel better.

I prefer a better product, but that's just me.

I had an interesting bug just the other day in my FOSS project. It's an iOS (iPhone/iPad) app that uses the MapKit Framework API [apple.com] .

The bug was on this line [github.com] .

The original code is here [github.com] .

So that folks don't have to look at a whole bunch of source, here's the problematic two lines:

[mapSearchView addAnnotation:myMarker];
[mapSearchView setDelegate:self];

When iOS 6 came out (with Apple's...wonderful...new maps), the black marker suddenly started showing as the default marker (this only works on iPads, so no one seemed to see it).

I went nuts trying to figure it out (actually, I've been nuts for a long time, but now I have something to blame it on).

I traced into the callbacks, and saw that they were being called with an empty annotation. Whiskey Tango Foxtrot?

Then, just for s's and g's, tried this:

[mapSearchView setDelegate:self];
[mapSearchView addAnnotation:myMarker];

Damn if that didn't fix it.

It was a case of an ambiguous API contract. The Apple maps call the annotation setup as soon as the annotation is added, while the old Google API waited until a few things were set up, so setting the delegate after adding the annotation still worked.

I could rail against the framework, but it was really my own fault, and I am just glad I figgered it out.

What? (-1)

Anonymous Coward | about 2 years ago | (#42333099)

Our programming team could spend weeks researching the bugs for just one day at that rate!'

English motherfucker, do you speak it?

Yes, hardware errors happen! (1)

Just Brew It! (636086) | about 2 years ago | (#42333177)

It's just a matter of whether you realize it or not.

The blatant ones cause an application or OS crash. But depending on what got corrupted, it might just cause a momentary application glitch, or even cause an alteration in the contents of a file that you won't notice for weeks... if ever.

When I build PCs, they get an overnight Memtest run at a minimum. Most of the time I also use ECC RAM to protect against random flipped bits and DIMMs that fail after being in use for a while.
