Whose Bug Is This Anyway?

An anonymous reader writes "Patrick Wyatt, one of the developers behind the original Warcraft and StarCraft games, as well as Diablo and Guild Wars, has a post about some of the bug hunting he's done throughout his career. He covers familiar topics — crunch time leading to stupid mistakes and finding bugs in compilers rather than game code — and shares a story about finding a way to diagnose hardware failure for players of Guild Wars. Quoting: '[Mike O'Brien] wrote a module ("OsStress") which would allocate a block of memory, perform calculations in that memory block, and then compare the results of the calculation to a table of known answers. He encoded this stress-test into the main game loop so that the computer would perform this verification step about 30-50 times per second. On a properly functioning computer this stress test should never fail, but surprisingly we discovered that on about 1% of the computers being used to play Guild Wars it did fail! One percent might not sound like a big deal, but when one million gamers play the game on any given day that means 10,000 would have at least one crash bug. Our programming team could spend weeks researching the bugs for just one day at that rate!'"
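
As a rough sketch of the kind of check the summary describes (the workload, names, and buffer size here are illustrative assumptions, not the actual OsStress code), the idea is simply: fill a buffer deterministically, compute a result, and compare it against a known-good answer on every pass through the game loop.

#include <stdint.h>
#include <stdlib.h>

#define STRESS_BYTES 4096u

/* Deterministic workload: fill the block and mix it into a checksum. */
static uint32_t stress_compute(uint8_t *block, size_t n)
{
    uint32_t acc = 2166136261u;                /* FNV-1a offset basis */
    for (size_t i = 0; i < n; ++i) {
        block[i] = (uint8_t)(i * 31u + 7u);
        acc = (acc ^ block[i]) * 16777619u;    /* FNV-1a prime */
    }
    return acc;
}

/* Call once per game-loop iteration.  On healthy hardware the result is
 * identical every time, so a baseline captured at startup stands in here
 * for the shipped table of known answers.  Returns nonzero on mismatch. */
int os_stress_tick(void)
{
    static uint32_t baseline;
    static int have_baseline;

    uint8_t *block = malloc(STRESS_BYTES);
    if (!block)
        return 0;                              /* out of memory is a different problem */

    uint32_t got = stress_compute(block, STRESS_BYTES);
    free(block);

    if (!have_baseline) {
        baseline = got;
        have_baseline = 1;
        return 0;
    }
    return got != baseline;                    /* flaky RAM or CPU suspected */
}

The real module compared against precomputed answers rather than a startup baseline, which also catches machines that are already failing at launch.
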
  • The memory thing... (Score:5, Informative)

    by Loopy ( 41728 ) on Tuesday December 18, 2012 @09:15PM (#42332529) Journal

    ...is pretty much what those of us that build our own systems do anytime we upgrade components (RAM/CPU/MB) or experience unexplained errors. It's similar to running the Prime95 torture tests overnight, which also checks calculations in memory against known data sets for expected values.

    Good stuff for those that don't already have a knack for QA.

  • by Anonymous Coward on Tuesday December 18, 2012 @09:16PM (#42332533)

    You mean all those times when my code was 'fine' and I gave up, it really could have been the compiler or a memory problem?

    Shit, I'm a much better programmer than I realized.

    • I started bringing my personal laptop to my programming classes for a simple reason. About 20% (seemed like 65%, but that's probably just a trick of memory) of the class computers had their compilers borked by another student at any particular time. You had no idea that somebody had put in a weird setting in the compiler, or had just outright broken something, until after you'd done way too much troubleshooting. I found I got a whole lot more done on my personal box that nobody else could mess up. :)
      • by hazah ( 807503 )
        Are you talking about actual compilers or the IDE?
      • by caseih ( 160668 )

        Wow, you had really crappy computer installations to work with. In my labs, gcc was in /usr/bin and I didn't have any write permission to that directory at all to mess things up. gcc and make just always worked for me.

    • by disambiguated ( 1147551 ) on Tuesday December 18, 2012 @11:58PM (#42333499)
      You're a better programmer for assuming it's not a compiler bug and trying harder to figure out what you did wrong.

      I've been programming professionally for over 20 years, mostly in C/C++ (MSVC, GCC, and recently Clang, and others back in the olden days). I've seen maybe two serious compiler bugs in the past 10 years. They used to be common.

      On the other hand, I can't count how many times I've seen coders insist there must be a compiler bug when, after investigation, the compiler had done exactly what it should according to the standard (or according to the compiler vendor's documentation when the compiler intentionally deviated from the standard).

      By "serious", I mean the compiler itself doesn't crash, issues no warnings or errors, but generates incorrect code. Maybe I've just been lucky. (Or maybe QA just never found them ;-)

      Oh, and btw, yes I realize you were joking (and I found it funny.)
      • by Roger W Moore ( 538166 ) on Wednesday December 19, 2012 @12:15AM (#42333587) Journal

        the compiler had done exactly what it should according to the standard...

        That's even better - it means that you've found a bug in the standard! ;-)

        • So true. It's a dysfunctional love/hate relationship I have with the C++ standard. And just like most abusive relationships, I refuse to leave her. :)

          I wish D [dlang.org] would gain some momentum.
      • by tlhIngan ( 30335 ) <slashdot.worf@net> on Wednesday December 19, 2012 @02:00AM (#42334051)

        By "serious", I mean the compiler itself doesn't crash, issues no warnings or errors, but generates incorrect code. Maybe I've just been lucky. (Or maybe QA just never found them ;-)

        I saw this once - took me weeks to solve it. Basically I had a flash driver that would occasionally erase the boot block (bad!). It was odd because we had protected the boot block both in the higher level OS as well as the code itself.

        Well, it happened and I ended up tracing through the assembly code - it turned out the optimizer worked a bit TOO well - it completely optimized out a macro call used to translate between parameters (the function to erase the block required a sector number. The OS called with the block number, so a simple multiplication was needed to convert). End result, the checks worked fine, but because the multiplication never happened, it erased the wrong block. (The erase code erased the block a sector belonged to - so sectors 0, 1, ... NUM_SECTORS_PER_BLOCK-1 erased the first block).

        A little #pragma to disable optimizations on that one function and the bug was fixed.
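
        A hedged reconstruction of the shape of that bug (the names, the sectors-per-block ratio, and the pragma are illustrative assumptions; the post doesn't name the toolchain, and the MSVC-style pragma is just one way to pin optimization settings to a single function):

        #define SECTORS_PER_BLOCK 64u
        /* The conversion the optimizer dropped: turn an OS block number into
         * the sector number the low-level erase routine expects. */
        #define BLOCK_TO_SECTOR(blk) ((blk) * SECTORS_PER_BLOCK)

        extern int flash_erase_sector(unsigned sector); /* erases the whole block
                                                           containing this sector */

        /* With the multiply optimized away, "erase block 1" became "erase the
         * block containing sector 1", i.e. block 0, the boot block.  Disabling
         * optimization for just this function worked around the compiler bug. */
        #pragma optimize("", off)   /* MSVC syntax; GCC would use
                                       __attribute__((optimize("O0"))) instead */
        int flash_erase_block(unsigned block)
        {
            return flash_erase_sector(BLOCK_TO_SECTOR(block));
        }
        #pragma optimize("", on)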

  • OsStress (Score:5, Informative)

    by larry bagina ( 561269 ) on Tuesday December 18, 2012 @09:20PM (#42332567) Journal
    Microsoft found similar [msdn.com] impossible bugs when overclocking was involved.
    • Re: (Score:2, Informative)

      by Anonymous Coward

      That's not too surprising. For instance if you try to read too fast from memory, the data you read may not be what was actually in the memory location. Some bits may be correct, some may not. Sometimes the incorrect values may relate to the data that was on the bus last cycle, eg there has not been enough time for the change to propagate through. This can easily lead to the data apparently read being a value that should not be possible. This is why overclocking is not a good idea for mission critical system

    • Re: (Score:2, Insightful)

      We all realize that when Intel bakes a bunch of processors, they come out all the same, and then Intel labels some as highspeed, some as middle, and some as low. They are then sold for different prices. However, they are the exact same CPU.

      Overclocking isn't the issue, because the CPUs are the same. The problem arises when aggressive overclocking is done by ignorant hobbyists or money-grubbing computer retailers. They overclock the computer to where it crashes, and then back off just a little bit. "The

      • Re:OsStress (Score:5, Insightful)

        by Anonymous Coward on Tuesday December 18, 2012 @10:48PM (#42333049)

        Bullshit. While Intel does occasionally bin processors into lower speeds to fulfill quotas and such, often times those processors are binned lower because they can't pass the QA process at their full speed. But they can pass the QA process when running at a lower speed. These processors were meant to be the same as the more expensive line, but due to minor defects can't run stably or reliably at the higher speed. Or at least not enough for Intel to sell them at full speed.

        Which is a large part of why some processors in the same batch can handle it when others can't.

        As much as I hate Intel, I think we could at least realize that they are often times doing this with good reason.

        • Nope! It's the same processor. Sure, some come out different, but oftentimes there are loads of perfectly good processors that get underclocked for marketing reasons only. It's not like the ratios come out perfectly every time, which is what you seem to be implying. They often times don't do it with good reason. Intel is very big into marketing. If they were an engineering firm, they'd sell one product at one price and be done with it.
          • The problem is that you can't really test an overclocked CPU. CPUs do not work perfectly or catastrophically fail (crash the software). There is a range of errors between these two extremes. Some of the overclocking induced errors are quite subtle, a simple incorrect answer. The problem is that these subtle errors are sometimes dependent upon a certain sequence of instructions or certain data patterns and the sequences and data, as well as the failing instruction and the clock speed where these errors manif
          • Re:OsStress (Score:4, Informative)

            by epine ( 68316 ) on Wednesday December 19, 2012 @05:48AM (#42334983)

            Nope! It's the same processor. Sure, some come out different, but oftentimes there are loads of perfectly good processors that get underclocked for marketing reasons only.

            When the day arrives that we achieve molecular assembly, even then for two devices identically assembled with atom-for-atom correspondence, there will likely be enough variation in molecular or crystalline conformation remaining to classify the two devices at the margin as "not quite the same".

            Binning levels are determined by the weakest transistor out of billions, the one with a gate thickness three deviations below the mean, and a junction length a deviation above. There is probably some facility for defective block substitution at the level of on-chip SRAM (cache memory), and maybe you can laser out an entirely defective core or two.

            As production ramps, Intel has a rough model of how the binning will play out, but this is a constantly moving target. Meanwhile, marketing is making promises to the channel on prices and volumes at the various tiers. There's no sane way to do this without sometimes shifting chips down a grade from the highest level of validation in order to meet your promises at all levels despite ripples experienced in actual production.

            Intel is also concerned--for good reason--about dishonest remarking in the channel. There's huge profit in it, and it comes mainly at the expense of Intel's reputation. Multiplier locks help to discourage this kind of shady business practice. So yeah, a few chips do get locked into a speed grade less than the chip could feasibly achieve. This is all common sense from gizzard to gullet. What's your point, then?

            If they were an engineering firm, they'd sell one product at one price and be done with it.

            Where do you even find so many stupid engineers? The College of Engineering for Engineers Who Think Statistics is One Big Cosmic Joke, presided over by the Edwin J. Goodwin [wikipedia.org] Chair of Defining Pi As Equal to 22/7?

      • Re: (Score:2, Informative)

        by Anonymous Coward

        We all realize that when Intel bakes a bunch of processors, they come out all the same, and then Intel labels some as highspeed, some as middle, and some as low. They are then sold for different prices. However, they are the exact same CPU.

        This is not 100% correct. When Intel or other microprocessor fabricators make the things, they do use the same "mold" to "stamp out" all the processors in large batches based on the same design, but they don't get the exact same result each time. There are little differences from chip to chip: on this chip some transistors ended up a few atoms closer together than the optimum distance, so that part of the processor now heats up more when in use, or on this chip someone coughed* during th

    • Re:OsStress (Score:5, Interesting)

      by Anonymous Coward on Tuesday December 18, 2012 @10:33PM (#42332985)

      Then again, it might not be overclocking after all [msdn.com].

      More relevantly, Microsoft has access to an enormous wealth of data about hardware failures from Windows Error Reporting. This paper [microsoft.com] has some fascinating data in it:

      - Machines with at least 30 days of accumulated CPU time over an 8 month period had a 1 in 190 chance of crashing due to a CPU subsystem fault
      - Machines that crashed once had a probability of 1 in 3.3 of crashing a second time
      - The probability of a hard disk failure in the first 5 days of uptime is 1 in 470
      - Once you've had one hard disk failure, the probability of a second failure is 1 in 3.4
      - Once you've had two failures, the probability of a third failure is 1 in 1.9

      Conclusion: When you get a hard disk failure, replace the drive immediately.

  • Caution: (Score:5, Funny)

    by fahrbot-bot ( 874524 ) on Tuesday December 18, 2012 @09:25PM (#42332597)
    Bug hunts on LV-426 [wikipedia.org] often end badly.
    • Its them damn cosmic rays, I tell ya.

      The death of Moore's law, they will be.

      • Its them damn cosmic rays, I tell ya.

        The death of Moore's law, they will be.

        Or the reason semiconductor houses switch from conventional (bulk CMOS) processes to Silicon-on-Insulator [wikipedia.org]. Many SOI processes are rad hardened by default.

        • Many SOI processes are rad hardened by default.

          Rad hard usually means that they are not damaged by radiation e.g. you can stick them close to an LHC beam as part of a detector and the massive radiation dose they receive will not cause the device to permanently cease functioning (or at least last longer before it fails). On the other hand cosmic rays which slow down and stop in material can cause a large amount of local ionization. This can be enough to flip the state of a memory bit which can cause crashes. As devices get smaller the charge needed to f

  • by MtHuurne ( 602934 ) on Tuesday December 18, 2012 @09:52PM (#42332779) Homepage

    If you suspect the compiler is generating invalid machine code, try to make a minimal test case for it. If you succeed, file a bug report and add that test case; the compiler developers will appreciate it. If you don't succeed in finding a minimal test case that triggers the same issue, it's likely not a compiler bug but an issue in your program in some place where you weren't expecting it.

    • by Okian Warrior ( 537106 ) on Tuesday December 18, 2012 @10:57PM (#42333089) Homepage Journal

      If you suspect the compiler is generating invalid machine code, try to make a minimal test case for it. If you succeed, file a bug report and add that test case; the compiler developers will appreciate it. If you don't succeed in finding a minimal test case that triggers the same issue, it's likely not a compiler bug but an issue in your program in some place where you weren't expecting it.

      Yeah, right. Let's see how that works out in practice.

      I go to the home page of the project with bug in hand (including sample code). Where do I log the problem?

      I have to register with your site. One more external agent gets my E-mail, or I have to take pains to manage multiple E-mails to avoid spam. (I don't want to be part of your community! I just thought you wanted to make your product better.)

      Once registered, I'm subscribed to your newsletter. (My temp E-mail has been getting status updates from the GCC crowd for years. My mail reader does something funky with the subject line, so responding with "unsubscribe" doesn't work for me.)

      Once entered, my E-mail and/or name is publicly available on the bug report for the next millennium. In plain text in the bug report, and sometimes in the publicly-accessible changelog - naked for the world to see (CPAN is especially fragrant).

      Sometimes the authors think it's the user's problem (no, really? This program causes gcc to core dump. How can that be *my* fault?). Sometimes the authors interpret the spec differently from everyone else (Opera - I'm looking at you). Sometimes you're just ignored, sometimes they say "We're rewriting the core system, see if it's still there at the next release", and sometimes they say "it's fixed in the next release, should be available in 6 months".

      What you really do is figure out the sequence of events that causes the problem, change the code to do the same thing in a different way (which *doesn't* trigger the error), and get on with your life. I've given up reporting bugs. It's a waste of time.

      That's how you deal with compiler bugs: figure out how to get around them and get on with your work.

      No, I'm not bitter...

      • Once entered, my E-mail and/or name is publicly available on the bug report for the next millennium. In plain text in the bug report, and sometimes in the publicly-accessible changelog - naked for the world to see (CPAN is especially fragrant).

        Well, at least it smells nice.

      • by kwerle ( 39371 )

        ...I have to register with your site. One more external agent gets my E-mail, or I have to take pains to manage multiple E-mails to avoid spam. (I don't want to be part of your community! I just thought you wanted to make your product better.)...

        Let me help with one aspect.

        If your email address is:
        your_address@gmail.com
        then you supply
        your_address+domain.name@gmail.com

        And if you don't use gmail, then maybe your email supplier does something similar. Or you should learn procmail if you're still managing your own.

        p.s. It looks like your www.o...r.com domain/host is down.

    • I've only seen a compiler issue once in my entire career. Granted, I have probably worked on more mature platforms than some of the cowboys that started the industry but here is what it was: It was using GCC 3 on MontaVista 2.4 Linux. I was building an insanely complex system library w/ a lot of templated code and a large number of classes inheriting from a parent. The problem is that a pthread_mutex_t was not being constructed properly. When we would try and lock the mutex, it would crash the entire a
  • Compilers (Score:5, Funny)

    by Mullen ( 14656 ) on Tuesday December 18, 2012 @10:05PM (#42332847)

    For being a skilled developer, I can't believe he didn't consider Dev/Test/Prod build environments running different versions of the compiler to be an issue (obviously, until it was an issue).

    That's Development Cycle 101.

    • I can't believe you don't understand that the brain doesn't work 100% reliably when you force it past the breaking point like this. It's Work 101.

    • Worse, the article hints at a bigger problem:

      "We had "pushed" a new build out to end-users, and now none of them could play the game!"

      Which I read as: developers write & debug code, that code goes through a build server which builds it & combines it with game data etc., and the result of that is pushed to users. The obvious step missing here: make sure the exact same stuff you're pushing to users is working & tested thoroughly before release. Seems like a gaping Quality Assurance fail right there, forget differences between developer and production systems.

      Skip that step and you're implic

  • Yep, seen it all (Score:5, Insightful)

    by russotto ( 537200 ) on Tuesday December 18, 2012 @10:26PM (#42332959) Journal

    I've had compilers miscompile my code, assemblers mis-assemble it, and even in a few cases CPUs mis-execute it consistently (look up CPU6 and msp430). Random crashes due to bad memory/cpu... yep. But on very rare occasions, I find that the bug is indeed in my own code, so I check there first.

  • by Okian Warrior ( 537106 ) on Tuesday December 18, 2012 @10:32PM (#42332971) Homepage Journal

    We deal with this type of bug all the time in safety-certified systems (medical apps, aircraft, &c).

    Most of the time an embedded program doesn't use up 100% of the CPU time. What can you do in the idle moments?

    Each module supplies a function "xxxBIT" (where "BIT" stands for "Built In Test") which checks the module variables for consistency.

    The serial driver (SerialBIT) checks that the buffer pointers still point within the buffer, checks that the serial port registers haven't changed, and so on.

    The memory manager knows the last-used static address for the program (ie - the end of .data), and fills all unused memory with a pattern. In its spare time (MemoryBIT) it checks to make sure the unused memory still has the pattern. This finds all sorts of "thrown pointer" errors. (Checking all of memory takes a long time, so MemoryBIT only checked 1K each call.)

    The stack pointer was checked - we put a pattern at the end of the stack, and if it ever changed we knew something went recursive or used too much stack.

    The EEPROM was checksummed periodically.

    Every module had a BIT function, and we checked every imaginable error in the processor's spare time - over and over, continuously.

    Also, every function began with a set of ASSERTs that checked the arguments for validity. These were active in the released code. The extra time spent was only significant in a handful of functions, so we removed the ASSERTs in just those cases; overall the cost was negligible.

    The overall effect was a very "stiff" program - one that would either work completely or wouldn't work at all. In particular, it wouldn't give erroneous or misleading results: showing a blank screen is better than showing bad information, or even showing a frozen screen.

    (Situation specific: Blank screen is OK for aircraft, but not medical. You can still detect errors, log the problem, and alert the user.)

    Everyone says to only use error checking during development, and remove it on released code. I don't see it that way - done right, error checking has negligible impact, and coupled with good error logging it can turbocharge your bug-fixing.
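
    A minimal sketch of the MemoryBIT idea (the region symbols are illustrative assumptions; on a real target they come from the linker script):

    #include <stdint.h>
    #include <stddef.h>

    #define FILL_PATTERN  0xA5u
    #define CHUNK_BYTES   1024u          /* check 1K per call, as above */

    extern uint8_t unused_ram_start[];   /* first byte past the program's .data */
    extern uint8_t unused_ram_end[];     /* end of the unused RAM region */

    /* Called from the idle loop.  All unused memory was filled with the
     * pattern at boot; any byte that no longer matches means a stray
     * pointer wrote somewhere nothing should ever write. */
    int MemoryBIT(void)
    {
        static size_t offset;
        size_t region = (size_t)(unused_ram_end - unused_ram_start);

        for (size_t i = 0; i < CHUNK_BYTES && offset < region; ++i, ++offset) {
            if (unused_ram_start[offset] != FILL_PATTERN)
                return -1;               /* "thrown pointer" detected */
        }
        if (offset >= region)
            offset = 0;                  /* wrap around and keep scanning */
        return 0;
    }

    SerialBIT, the stack sentinel, and the EEPROM checksum follow the same shape: a cheap, bounded slice of verification on every idle call.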

    • More error checking (Score:5, Interesting)

      by Okian Warrior ( 537106 ) on Wednesday December 19, 2012 @12:21AM (#42333621) Homepage Journal

      My previous post was modded up, so here are some more checks.

      During boot, the system would execute a representative sample of CPU instructions, in order to test that the CPU wasn't damaged. Every memory addressing mode (ptr, ptr++, --ptr), plus add, subtract, multiply, divide, increment, &c.

      During boot, all memory was checked - not a burn-in test, just a quick check for integrity. The system wrote 0x00, 0xFF, 0xA5, 0x5A and read the output back. This checked for wires shorted to ground/VCC, and wires shorted together.
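
      That quick check might look roughly like this (a sketch; the base address and length would come from the linker script, and the test has to run before .data/.bss are populated):

      #include <stdint.h>
      #include <stddef.h>

      static const uint8_t patterns[] = { 0x00, 0xFF, 0xA5, 0x5A };

      /* Write each pattern across the region and read it back; a mismatch
       * points at a data line stuck at ground/VCC or shorted to a neighbor. */
      int ram_quick_check(volatile uint8_t *base, size_t len)
      {
          for (size_t p = 0; p < sizeof patterns; ++p) {
              for (size_t i = 0; i < len; ++i)
                  base[i] = patterns[p];
              for (size_t i = 0; i < len; ++i)
                  if (base[i] != patterns[p])
                      return -1;
          }
          return 0;
      }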

      During boot, the .bss segment was filled with a pattern, and as a rule, all programs were required to initialize all of their static variables. Each routine had an xxxINIT function which was called at boot. You could never assume a static variable was initialized to zero - this caught a lot of "uninitialized variable" errors.

      (This allowed us to reboot specific subsystems without rebooting the whole system. Call the SerialINIT function, and don't worry about reinitializing that section's static vars.)

      The program code was checksummed (1K at a time) continuously.

      When filling memory, what pattern should you use? The theory was that any program using an uninitialized variable would crash immediately because of the pattern. 0xA5 is a good choice:

      1) It's not 0, 1, or -1, which are common program constants.
      2) It's not a printable character
      3) It's a *really big* number (negative or unsigned), so array indexing should fail
      4) It's not a valid floating point or double
      5) Being odd, it's not a valid pointer

      Whenever we use enums, we always start the first one at a different number; ie:

      enum Day { Sat = 100, Sun, Mon... }
      enum Month { Jan = 200, Feb, Mar, ... }

      Note that the enums for Day aren't the same as Month, so if the program inadvertently stores one in the other, the program will crash. Also, the enums aren't small integers (ie - 0, 1, 2), which are used for lots of things in other places. Storing a zero in a Day will cause an error.

      (This was easy to implement. Just grep for "enum" in the code, and ensure that each one starts on a different "hundred" (ie - one starts at 100, one starts at 200, and so on).)

      The nice thing about safety cert is that the hardware engineer was completely into it as well. If there was any way for the CPU to test the hardware, he'd put it into the design.

      You could loopback the serial port (ARINC on aircraft) to see if the transmitter hardware was working, you could switch the A/D converters to a voltage reference, he put resistors in the control switches so that we could test for broken wires, and so on.

      (Recent Australian driver couldn't get his vehicle out of cruise-control because the on/off control wasn't working. He also couldn't turn the engine off (modern vehicle) nor shift to neutral (shift-by-wire). Hilarity ensued. Vehicle CPU should abort cruise control if it doesn't see a periodic heartbeat from the steering-wheel computer. But, I digress...)

      If you're interested in the software safety systems, look up the Therac-25 some time. Particularly, the analysis of the software bugs. Had the system been peppered with ASSERTs, no deaths would have occurred.

      P.S. - If you happen to be building a safety cert system, I'm available to answer questions.

    • > Everyone says to only use error checking during development, and remove it on released code. I don't see it that way - done right, error checking has negligible impact,

      That depends on the _type_ of app. In the games industry a _debug_ build runs TOO slow to be even practical. You are forced to run optimized code if you want to have any hope of going above 1 fps.

      TINSTAAFL. Error checking costs. If I was doing software where somebody's life depended on it -- hell yeah, you're spot on! But for a "game" yo

  • by epyT-R ( 613989 ) on Tuesday December 18, 2012 @10:33PM (#42332983)

    This is why stress testing is so important. The system may seem stable at overclocked speeds, but only while it is lightly or even moderately loaded, and not every error will result in a kernel panic. The hardest errors to pin down are often the subtle ones that cause cascades elsewhere, minutes or hours after the load finishes.

    I start by getting it stable enough to pass memtest86+ tests 5 and 7 at (or as close as possible to) my target frequencies/dividers. This is pretty easy to do nowadays, but it's a good sanity check starting point before booting the OS and minimizes gross misconfigurations that cause filesystem corruption. Then I run prime95, then linpack, then y-cruncher, then loops of a few 3dmark versions. Sometimes I run the number crunchers simultaneously across all cores, first configured to stress the cpu/cache, then with large sets to stress ram (but not swap! in fact turn swap off for this). The minimum time for all of this really should be 12 hrs.. 24 is best, or more if you're paranoid. A variety of loads over this time is important because the synthetic ones are often highly repetitious, and this can sometimes fail to expose problems despite the load the system's under. The 3dmark (or pick a scriptable util of your choice) stresses bus IO as well as all the really cranky and picky gfx driver code. As a unique stressor, I use a quake 3 map compile that eats most of the ram and pegs the cpu for hours.. q3map2 is a bitch and it usually finds those subtle 'non-fatal' hardware errors if they exist.

    If the system survives without an application or kernel crash (or other wonky behavior), I run a few games in timedemo loops. In the old days this was quake1/2/3, but these days I stick with games like Metro 2033 which have their own bench utilities. These tests are still valid even if your intended use is 'workstation' class work and you don't game much, but still want to squeeze as much performance as you can from your hardware. I do both with mine and have had great success with this method.

  • Don't forget to do out-of-the-box testing / testing for stuff that you may not think of offhand.

  • by mykepredko ( 40154 ) on Tuesday December 18, 2012 @10:45PM (#42333039) Homepage

    I can't remember the exact code sequence, but in a loop, I had the statement:

    if (i = 1) {

    Where "i" was the loop counter.

    Most of the time, the code would work properly, as other conditions would take program execution elsewhere, but every once in a while the loop would continue indefinitely.

    I finally decided to look at the assembly code and discovered that in the conditional statement, I was setting the loop counter to 1 which was keeping it from executing.

    I'm proud to say that my solution to preventing this from happening is to never place a literal last in a condition; instead it always goes first, so if I make the same slip:

    if (1 = i) {

    the compiler can flag the error.

    I'm still amazed at how rarely this trick is taught in programming classes and how many programmers this bug still trips up.

    myke

    • Anything that both assigns and tests a loop index is by definition a fucking accident waiting to happen. It's like driving a car without wearing a seatbelt and then deciding the 'solution' is to put the steering wheel in the back seat instead of the front.

      • I agree, but

        if (i = 1) {

        is a perfectly valid C statement (Java, unlike C, actually rejects it because the condition isn't a boolean) - there was no intention of putting an assignment in a conditional statement.

        Modern compilers now issue warnings on statements like this, but at the time no warning was given.

        myke

      • Excuse me but...WHOOSH...
        The GP's tip is about avoiding accidentally inserting an assignment operator "=" when you intended to put an equality operator "==". If you follow the GP's habit there can be no such accidents, because it's illegal to assign a value to a constant. The coding tip itself has been circulating for at least the 23 years that I personally know about; in those days compilers did not warn you about assignments within conditionals - the compiler saw valid syntax and assumed you knew what you were doing.
    • by safetyinnumbers ( 1770570 ) on Tuesday December 18, 2012 @11:15PM (#42333213)
      That's known as "Yoda style" [codinghorror.com]
    • by dido ( 9125 )

      Which is why I always compile with -Wall -Werror on gcc. I get: "warning: suggest parentheses around assignment used as truth value [-Wparentheses]" for code which looks like that. I consider code that generates compiler warnings as being a bad sign, and always make it a point to clean them up before considering any code suitable. I don't know why this doesn't seem to be as widely done as it should be.
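
      For instance (hypothetical file name; the warning text is what a current gcc prints with -Wall):

      /* yoda.c: build with `gcc -Wall -Werror yoda.c`.
       * gcc reports "suggest parentheses around assignment used as truth
       * value [-Wparentheses]" for the if below, so the =/== slip becomes a
       * hard build error instead of an infinite loop found at runtime. */
      #include <stdio.h>

      int main(void)
      {
          int i = 0;
          if (i = 1)              /* accidental assignment, not a comparison */
              printf("always taken\n");
          return 0;
      }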

    • Hiya,

      I just noticed the confusion when I put in "keeping it from executing" when I meant to say "keeping it from exiting [the loop]".

      Sorry about that,

      myke

    • if ( 1 == i ) {

    • Ick, that's what real compilers (e.g., gcc) are for: good warning messages (such as "suggest parentheses around assignment used as truth value"), and better yet, -Werror. "if (1 == i)" is completely unnatural (for an English speaker anyway), which makes it more likely to forget to do 1 == i than it is to forget to double the equals sign. I too used to make the same mistake when I first started with C (having come from Pascal, that was fun: := became =, = became ==), but I quickly learned to double check my tes

  • In my own coding, I tend to *gasp* make mistakes. Sometimes, really, really dumb ones.

    One of the biggest problems with my coding is that I am often the only real coder looking at it. Even my FOSS work seldom gets reviewed by coders.

    I can't say enough about peer review. I wish I had more. It can really suck, as one thing that geeks LOVE to do is cut down other geeks. However, they are sometimes right, and should be heard.

    Negative feedback makes the product better. Positive feedback makes the producer feel

  • It's just a matter of whether you realize it or not.

    The blatant ones cause an application or OS crash. But depending on what got corrupted, it might just cause a momentary application glitch, or even cause an alteration in the contents of a file that you won't notice for weeks... if ever.

    When I build PCs, they get an overnight Memtest run at a minimum. Most of the time I also use ECC RAM to protect against random flipped bits and DIMMs that fail after being in use for a while.

  • by Josh Coalson ( 538042 ) on Wednesday December 19, 2012 @04:12AM (#42334611) Homepage
    I used to get bug reports for FLAC caused by this very same problem.

    FLAC has a verify mode when encoding which, in parallel, decodes the encoded output and compares it against the original input to make sure they're identical. Every once in a while I'd get a report that there were verification failures, implying FLAC had a bug.

    If it were actually a FLAC bug, the error would be repeatable* (same error in the same place) because the algorithm is deterministic, but upon rerunning the exact same command the users would get no error, or (rarely) an error in a different place. Then they'd run some other hardware checker and find the real problem.

    Turns out FLAC encoding is also a nice little hardware stressor.

    (* Pedants: yes, there could be some pseudo-random memory corruption, etc but that never turned out to be the case. PS I love valgrind.)
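
    The verify-while-encoding pattern is roughly this (a generic sketch rather than libFLAC's actual code; encode_block/decode_block are hypothetical stand-ins, while real libFLAC exposes the feature via FLAC__stream_encoder_set_verify()):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical codec entry points standing in for the real encoder/decoder. */
    int encode_block(const int32_t *pcm, size_t n, uint8_t *out, size_t *out_len);
    int decode_block(const uint8_t *in, size_t in_len, int32_t *pcm, size_t n);

    /* Encode a block, immediately decode it, and compare with the input.
     * Returns 0 on a clean round trip.  Because the algorithm is
     * deterministic, a mismatch that isn't repeatable in the same place
     * points at the hardware rather than the codec. */
    int encode_with_verify(const int32_t *pcm, size_t n,
                           uint8_t *out, size_t *out_len, int32_t *scratch)
    {
        if (encode_block(pcm, n, out, out_len) != 0)
            return -1;
        if (decode_block(out, *out_len, scratch, n) != 0)
            return -1;
        return memcmp(pcm, scratch, n * sizeof *pcm) != 0 ? 1 : 0;
    }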

    • FLAC has a verify mode when encoding which, in parallel, decodes the encoded output and compares it against the original input to make sure they're identical. Every once in a while I'd get a report that there were verification failures, implying FLAC had a bug.

      While hardware could definitely be at fault, have you considered the possibility of a race condition causing the error? Race conditions may occur very infrequently and can be incredibly difficult to discover. I'm not trying to disparage the FLAC co
