AMD Confirms CPU Bug Found By DragonFly BSD's Matt Dillon

Soulskill posted more than 2 years ago | from the it's-not-me-it's-you dept.

AMD 292

An anonymous reader writes "Matt Dillon of DragonFly BSD just announced that AMD confirmed a CPU bug he found. Matt quotes part of the mail exchange and it looks like 'consecutive back-to-back pops and (near) return instructions can create a condition where the processor incorrectly updates the stack pointer.' The specific manifestations in DragonFly were random segmentation faults under heavy load."

292 comments

Sickening (-1)

Anonymous Coward | more than 2 years ago | (#39257853)

What a bunch of freaking homosexual gay-babies!

Gayin' up the place!

another horrible cpu bug (-1)

Anonymous Coward | more than 2 years ago | (#39257865)

And who reveals it? Open source. Clearly it is the threat to the integrity of our Taiwanese-made American CPUs!

Re:another horrible cpu bug (4, Insightful)

Taco Cowboy (5327) | more than 2 years ago | (#39257957)

What has Taiwan got to do with this ?

I mean, was the CPU bug somehow introduced by TSMC ?

Re:another horrible cpu bug (1)

justforgetme (1814588) | more than 2 years ago | (#39258187)

Ohh, I'm sure AMD will want you to believe that :-)

This isn't nearly as bad as the division bug (4, Insightful)

Omnifarious (11933) | more than 2 years ago | (#39257895)

Though, it's still very serious. At least it generally causes your program to crash rather than spitting out a wrong answer. And it sounds like the sequence of instructions that causes it is not commonly found.

I can well understand the guy who found it being all excited. The CPU is the last place you'd look for a bug, and finding one is pretty impressive, especially a really elusive one like this.

Re:This isn't nearly as bad as the division bug (4, Insightful)

XDirtypunkX (1290358) | more than 2 years ago | (#39257977)

Both are equally bad from the perspective of a software developer who spends a month trying to work out just exactly what is wrong with their code, especially if something like this occurs on a test machine but not on a development machine.

Re:This isn't nearly as bad as the division bug (4, Funny)

GoodNewsJimDotCom (2244874) | more than 2 years ago | (#39258019)

I found out about the division bug as a beginner programmer! I was trying to write the first MMORPG using Quick Basic. I remember division not being exactly accurate, so the solution I needed to use was to round up and down results that are really close. It fixed it, but new programmers shouldn't be forced to deal with stuff like that.

I've preferred AMDs to Intels because AMD was one of the first sponsors to Esports back in 99. Too bad Columbine happened and I suspect they wanted to distance themselves from Quake tournaments. Another thing I like about AMD was that their processors don't melt if they get hot because they have a self preservation shutdown mode. People said Intel had this, but I melted a processor just a few months ago on SWTOR.

you are mistaken (0)

Anonymous Coward | more than 2 years ago | (#39258055)

You were not hitting the division bug, it happens only with certain combinations of numbers and is quite rare.
You probably just had the rounding mode set wrong.

Re:you are mistaken (4, Informative)

Sir_Sri (199544) | more than 2 years ago | (#39258361)

A floating point precision error. Floating points cannot represent quite a diverse collection of numbers, this is especially problematic when you're doing intersections with small objects. Say a ray projected from an object will, because of the minute errors in floating point, collide with the same object (which produces some cool patterns).

Floating points are kind of crappy. Not that I have a better option with viable performance on a desktop machine. That's not a division bug, that's just the nature of representing numbers in binary with a fixed number of bits.

Re:you are mistaken (0)

Anonymous Coward | more than 2 years ago | (#39258779)

Floating points are kind of crappy. Not that I have a better option with viable performance on a desktop machine.

How about fixpoint or even integer?
512bit calculations aren't that expensive and should deal with a reasonably sized universe in planck-length precision.
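
For a sense of scale, here is a rough back-of-the-envelope sketch of how many bits a fixed-point coordinate would need to span the observable universe in Planck-length steps. The two constants are assumed round figures that do not come from the thread, and the sketch says nothing about how fast such arithmetic would be, which is the point disputed in the reply below.

```python
import math

# Assumed round figures, not taken from the thread.
OBSERVABLE_UNIVERSE_M = 8.8e26   # approximate diameter in metres
PLANCK_LENGTH_M = 1.616e-35      # approximate Planck length in metres


def bits_needed(extent_m: float, resolution_m: float) -> int:
    """Bits for an unsigned integer counting resolution-sized steps across extent."""
    steps = extent_m / resolution_m
    return math.ceil(math.log2(steps))


universe_bits = bits_needed(OBSERVABLE_UNIVERSE_M, PLANCK_LENGTH_M)
print(universe_bits)             # bits per coordinate axis
print(universe_bits < 512)       # does it fit in a 512-bit integer?
```

Under these assumptions the count comes out far below 512 bits, so range is not the obstacle; cost per operation is a separate question.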

Re:you are mistaken (3, Insightful)

Rockoon (1252108) | more than 2 years ago | (#39259111)

512bit calculations aren't that expensive

Yes they are.

Re:you are mistaken (4, Insightful)

neokushan (932374) | more than 2 years ago | (#39259225)

Except I very much doubt that would solve whatever "problems" this guy was having. As a newbie programmer, it's entirely understandable that he wouldn't know about the fun you can (or can't) have with floating point operations. However, I very much doubt that sheer accuracy was the issue; rather, he was probably assuming that a chain of operations which should land on exactly 0.0 (say, 1.0 minus ten hits of 0.1) really does, when in reality the result isn't necessarily exactly 0.0. Considering it's an MMO, he probably had something like "Why is this guy not dying, he has 4 HP left and this attack does exactly 4 damage? Must be a bug!".
Really, it doesn't matter a huge amount, if such "accuracy" is important to your game then instead of doing "if(Health is less than 0.0) /* die */", you do something like "if (Health is less than 0.0 + epsilon) /* die */", with "epsilon" being a very small number (such as 0.00000001).
The real fun with floats, however, is that each platform does something different. It's possible that the OP ran the game on Intel hardware and got one result (which may have seemed more "correct"), then ran it on an AMD machine and got a different (seemingly less-correct) result - you can see why he naturally jumped to the conclusion that the AMD system had a bug.
In reality, chances are both systems were "wrong" anyway, they just happen to use different implementations for floating-point logic. To solve this, once again higher precision isn't the answer, but rather there's a compiler switch (/fp:strict in VS) that will use the ISO standard floating point model. It's not as fast as the other methods, but you will at least get the same results across different platforms (assuming that CPU has implemented the standard correctly which these days is almost certain).

There's LOTS of fantastic info on this here: http://gafferongames.com/networking-for-game-programmers/floating-point-determinism/ [gafferongames.com]
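
The epsilon comparison described above can be sketched roughly as follows. This is a minimal illustration, not anyone's actual game code: `is_dead`, `EPSILON` and the ten hits of 0.1 are hypothetical names and numbers chosen for the example.

```python
EPSILON = 1e-8  # assumed tolerance; pick one that suits the game's units


def is_dead(health: float, epsilon: float = EPSILON) -> bool:
    """Treat health within epsilon of zero as zero."""
    return health < 0.0 + epsilon


# Ten hits of 0.1 against 1.0 health; the running result may not land on exactly 0.0.
health = 1.0
for _ in range(10):
    health -= 0.1

print(repr(health))      # whatever residue rounding left behind, if any
print(health <= 0.0)     # naive check
print(is_dead(health))   # tolerant check
```

The tolerant check gives the expected answer whether the residue ends up slightly above or slightly below zero, which is the whole point of the epsilon.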

Re:you are mistaken (0)

Anonymous Coward | more than 2 years ago | (#39259187)

A floating point precision error. Floating points cannot represent quite a diverse collection of numbers, this is especially problematic when you're doing intersections with small objects. Say a ray projected from an object will, because of the minute errors in floating point, collide with the same object (which produces some cool patterns).
Floating points are kind of crappy. Not that I have a better option with viable performance on a desktop machine.

Back when the Intel bug was an issue, you would have been better off using fixed point math, which is what most of the top performing games and raytracing programs were using.

These days, most of those calculations are going to be handled on a GPU which should be much more capable of dealing with the levels of precision you require.

Re:This isn't nearly as bad as the division bug (5, Funny)

Smauler (915644) | more than 2 years ago | (#39258083)

I was trying to write the first MMORPG using Quick Basic.

Sounds like the division bug was the least of your problems....

Re:This isn't nearly as bad as the division bug (5, Interesting)

GoodNewsJimDotCom (2244874) | more than 2 years ago | (#39258503)

Heh. I coded a nice tile based RPG out of it, but I couldn't make it MMOG because there is no socket code in Quick Basic. The trick to making big games in Quick Basic is to write your own Virtual Disk so you can get past the 640k memory limit. Once you have a virtual disk, you can write an interpreted language inside Quick Basic, then your code is simply loaded up in a custom database. I rewrote the whole thing in C/C++ because people told me I could get socket libraries in it, but I gave up on my game entirely when Ultima Online came out because I felt I wouldn't be able to build up a market because my graphics are so bad. I was partially right in thinking there is only enough room for one MMORPG at a time back in 97, but I think I shouldn't have given up after having coded for thousands of hours with things like Farmville succeeding today.

Re:This isn't nearly as bad as the division bug (0)

drolli (522659) | more than 2 years ago | (#39258693)

What is the problem with Quick Basic? It came for free and it was quite ok.

Re:This isn't nearly as bad as the division bug (3, Informative)

Pieroxy (222434) | more than 2 years ago | (#39258863)

What is the problem with Quick Basic? It came for free and it was quite ok.

No network access? Might be fine for you, but for an MMORPG programmer on the other hand...

Re:This isn't nearly as bad as the division bug (2, Insightful)

Anonymous Coward | more than 2 years ago | (#39258883)

> It came for free

You meant QBasic.
QuickBasic is for money.

Re:This isn't nearly as bad as the division bug (4, Funny)

Corbets (169101) | more than 2 years ago | (#39258097)

I found out about the division bug as a beginner programmer! I was trying to write the first MMORPG using Quick Basic.

I've never heard "choosing the wrong programming language" described as a bug, but hey, however you want to play it off, man.

Re:This isn't nearly as bad as the division bug (5, Insightful)

Anonymous Coward | more than 2 years ago | (#39258117)

Floating point operations are never fully precise. Simple numbers such as 4.0 would be represented as 4.0000000000000213 or 3.99999999999973 if you arrive at this after doing a bunch of calculations.

This is an inherent limitation of how floating point works, and not something that has been "fixed". Programmers still have to worry about this.
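
A small sketch of the effect, using ordinary Python floats (IEEE doubles on typical platforms). The loop and the choice of 0.1 are just an illustration; the value 0.1 has no exact binary representation, so the error accumulates across additions.

```python
import math

total = 0.0
for _ in range(10):
    total += 0.1   # each addition rounds to the nearest representable double

print(repr(total))                 # close to, but not exactly, 1.0
print(total == 1.0)                # exact comparison fails
print(math.isclose(total, 1.0))    # tolerant comparison succeeds
```

This is why exact equality tests on computed floats are usually replaced by a tolerance check.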

Re:This isn't nearly as bad as the division bug (1)

phantomfive (622387) | more than 2 years ago | (#39258165)

I remember division not being exactly accurate, so the solution I needed to use was to round up and down results that are really close.

Just so you know, division is never accurate in floats, even when the CPU doesn't have bugs. If you're using doubles you'll get better accuracy, but with a 32 bit floating point number, you shouldn't be surprised to find errors in the third digit after the decimal point.

This is why calculators use decimal arithmetic (1)

perpenso (1613749) | more than 2 years ago | (#39258267)

I remember division not being exactly accurate, so the solution I needed to use was to round up and down results that are really close.

Just so you know, division is never accurate in floats, even when the CPU doesn't have bugs. If you're using doubles you'll get better accuracy, but with a 32 bit floating point number, you shouldn't be surprised to find errors in the third digit after the decimal point.

Just to be clear it's not limited to division. Hell, errors can creep in just by converting a decimal number to floating point. This is why calculators use decimal arithmetic, well some of them - like Perpenso Calc for iPhone iPad [perpenso.com] RPN Scientific Stats Business Hex. Try "0.5 - 0.4 - 0.1" in your favorite calculator app, it might indicate whether the app is using the FPU or decimal arithmetic. Of course the app may be doing something naive like the "BASIC MMORPG", rounding results. It's naive because it is another source of rounding error, some results are legitimately a little bit off from a nicely round number.
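
The "0.5 - 0.4 - 0.1" test is easy to try outside a calculator app. The sketch below uses Python's built-in binary floats and its standard `decimal` module as a stand-in for decimal arithmetic; it is not a claim about how any particular calculator is implemented.

```python
from decimal import Decimal

# Binary floating point: 0.4 and 0.1 are only approximated.
binary_result = 0.5 - 0.4 - 0.1

# Decimal arithmetic: the same literals are held exactly.
decimal_result = Decimal("0.5") - Decimal("0.4") - Decimal("0.1")

print(repr(binary_result))   # a tiny nonzero residue
print(decimal_result)        # exactly zero
```

A nonzero answer to this expression suggests binary floating point underneath; an exact zero suggests decimal arithmetic or rounding of the displayed result.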

Re:This is why calculators use decimal arithmetic (2)

sgunhouse (1050564) | more than 2 years ago | (#39258481)

Division is division, regardless of the base used. The issue is that in base 10 (aka decimal numbers), division by 2 and 5 always comes out to a finite decimal; in binary numbers only division by 2 comes out to a finite binary expansion. Dividing by any primes other than 2 and 5 (and numbers involving those primes) will require rounding in both bases (and they may not necessarily round the same way). That is, unless you're only dividing by combinations of 2 and 5, there really is no preferred base.

The main problem with a QBASIC "single" (a 32-bit float) is the extremely limited precision of that type, and not so much how rounding is done. Most calculators these days can handle 8-12 digits, you have to use 64-bit floats (a QBASIC "double") to get anything like that from your program.
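
The rule about which divisions come out finite can be checked mechanically: 1/n terminates in a base exactly when every prime factor of n also divides the base. The helper name `terminates` below is made up for this sketch.

```python
from math import gcd


def terminates(denominator: int, base: int) -> bool:
    """True if 1/denominator has a finite expansion in the given base."""
    d = denominator
    g = gcd(d, base)
    while g > 1:          # strip every factor shared with the base
        d //= g
        g = gcd(d, base)
    return d == 1         # anything left over forces a repeating expansion


for n in (2, 3, 5, 8, 10):
    print(n, terminates(n, 10), terminates(n, 2))
```

Dividing by 10 is the interesting row: finite in decimal, repeating in binary, which is why 0.1 cannot be stored exactly in a binary float.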

Re:This isn't nearly as bad as the division bug (0)

Anonymous Coward | more than 2 years ago | (#39258637)

Your comment could be misinterpreted. I can't tell if that's because you misunderstand, or because it was just ambiguously stated.

Division is very accurate in floats. IEEE division returns the closest float to the correct answer. The same rules apply for addition, subtraction, multiplication, and square root. The problem I think you are referring to is that the float format, having a finite number of bits, cannot exactly represent the infinite range of real numbers. Thus, while the result of an IEEE float division will be the closest float possible, it is usually not the exact real number that you wanted.

This happens with IEEE floats, base-10 floats, and indeed any finite representation tasked with performing arbitrary arithmetic.

The error after a single operation should be on the order of 1/16 million. If you are finding errors in the third digit after the decimal point then either you have done a lot of operations, perhaps using an unstable algorithm, or your answer is 10,000 and your error is 0.001.
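
The "1/16 million" figure is roughly 2^-24, half the spacing between adjacent 32-bit floats near 1.0. Python has no native 32-bit float, so the sketch below emulates one by round-tripping through `struct`; `to_float32` is a hypothetical helper, and dividing in double precision and then rounding is assumed here to match a true single-precision divide.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (a double) to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


exact = 1.0 / 3.0                                   # double-precision reference
quotient = to_float32(to_float32(1.0) / to_float32(3.0))

relative_error = abs(quotient - exact) / exact
print(repr(quotient))
print(relative_error)
print(relative_error <= 2 ** -24)                   # within half a unit in the last place
```

An error in the third decimal place is several orders of magnitude larger than this bound, which is why it points to accumulated error or a large magnitude rather than a single division.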

Re:This isn't nearly as bad as the division bug (1)

Daniel Phillips (238627) | more than 2 years ago | (#39258831)

Just so you know, division is never accurate in floats, even when the CPU doesn't have bugs.

What kind of lather are you people working up? The subject was, a division bug. Out of spec operation. Not normal IEEE precision issues.

Re:This isn't nearly as bad as the division bug (3, Insightful)

phantomfive (622387) | more than 2 years ago | (#39258855)

Just because you find an error in a division when you were programming your MMORPG in visual basic doesn't mean you've found the pentium bug. If you noticed it happening a lot, it probably wasn't the bug, just normal IEEE precision issues.

Re:This isn't nearly as bad as the division bug (1)

Joce640k (829181) | more than 2 years ago | (#39259209)

with a 32 bit floating point number, you shouldn't be surprised to find errors in the third digit after the decimal point.

So working with millimeters is completely impossible then?

Bummer. There goes my plan to write a CAD system using metric measurements.

Re:This isn't nearly as bad as the division bug (4, Informative)

JWSmythe (446288) | more than 2 years ago | (#39258501)

    Anyone who's programmed long enough has found unexplainable bugs that are eventually traced down to some bad hardware. :)

    I've preferred AMD over Intel for years. Long ago, in a distant computer store, far away.... We sold 386s, 486s, and Pentiums (or their reasonable clone) from Intel, IBM, AMD, and Cyrix. At the time, I didn't really care who made the chip, they were just built out for the customer.

    Over the years, I learned to prefer AMD for both the price and performance. Plenty of people will argue "but this Pentium is faster than that AMD". Well, it's all nice, but I don't *have* to stay bleeding edge. I never liquid cooled my CPU, video card, and memory. Friends did. I was always impressed with how much they wasted. I'd just wait 6 months or so, and get something better, faster, and cheaper. :) I do like having a high performance computer, so I upgrade every year or so.

    For example, I just set up a couple servers from COTS parts. They used AMD FX-8120's (8 core, 4.0Ghz turbo) for $199.99/ea. It seems the comparable Intel is the i7-980 (6 core, 3.6Ghz), which is selling at $589.99. For the difference in price, I could build out a 3rd server, and still have money left over. Toms hardware suggests the i5-2500K (4 core, 3.7Ghz turbo) for $224.99 or i7-2600K (4 core 3.8Ghz turbo) for $324.99 as comparable. If I wanted to spend a little more, I could have gone with the AMD FX-8150 (8 core, 4.2Ghz turbo) for $249.99. Was $50 for .2Ghz worth it? Not really. Something bigger, better, and faster will be out next year, and the year after, and then I'll buy something new.

      I used newegg.com for all the prices, so it would be fairly even.

    The servers actually use as many cores as I can throw at them, so it's extremely beneficial to have more cores at high speeds.

    My desktop/gaming machine still has a Phenom IIx6 1100T in it. All the games I play, I can leave all the settings turned all the way up. Maybe if I ran benchmarks, I'd see something else gets a slightly faster frame rate, but I can't see any difference. As we all know, various benchmarks show different things.

Re:This isn't nearly as bad as the division bug (2)

billcopc (196330) | more than 2 years ago | (#39258857)

Pardon my curiosity, but it sounds like you're building ~$400 servers out of basic desktop components. What kind of workload are you putting on these boxes that scales so well, yet doesn't justify the added expense of high-end server class hardware ? Maybe I'm at the other end of the spectrum, but I wouldn't dream of running a server without redundant power supplies and premium boards that have been built and tested to rigorous specs. The added hardware expense more than makes up for decreased maintenance and downtime.

I used to work for a guy who built servers out of whatever spare parts he had lying around - obsolete desktops, refurbs, ebay junk, whatever. For a while, we were spending at least 10-15 hours a week keeping those things up, or driving down to the datacenter to physically reboot them. I eventually convinced him to spend a LOT more money on fancier hardware, with IPMI, redundant everything and high-efficiency power supplies. Spending that extra thousand up-front meant we could boot them up and practically forget them, uptimes have gone way up and we were logging more billable hours instead of juggling cheap gear. The results spoke for themselves.

Re:This isn't nearly as bad as the division bug (3, Insightful)

wvmarle (1070040) | more than 2 years ago | (#39259003)

Google is known to build their servers from cheap parts.

Like a RAID, but then a RAIS (Redundant Array of Independent Servers). Load distribution may be an issue as it has to seamlessly reassign tasks when a server is down for whatever reason. But for sufficiently large operations (five servers or more) this sounds to me like the way to go. Instead of trying to make every individual server highly reliable, go with the still very reliable user-grade stuff and get your reliability by redundancy. And companies like Google need more than one server anyway.

Re:This isn't nearly as bad as the division bug (1)

the linux geek (799780) | more than 2 years ago | (#39259035)

Take a look at the benchmarks [anandtech.com] . The FX-8150 really doesn't come out looking good against the 2500k, much less against the i7-980.

Re:This isn't nearly as bad as the division bug (2)

Kjella (173770) | more than 2 years ago | (#39259223)

For example, I just set up a couple servers from COTS parts. They used AMD FX-8120's (8 core, 4.0Ghz turbo) for $199.99/ea. It seems the comparable Intel is the i7-980 (6 core, 3.6Ghz), which is selling at $589.99.

Modded informative? Only on slashdot... Also you compare turbo speeds (and GHz is silly anyway due to the difference in IPC), yet say:

The servers actually use as many cores as I can throw at them, so it's extremely beneficial to have more cores at high speeds.

If all cores are 100% loaded, you're not going to get anywhere close to max turbo. That's the extra boost it can give if only one core is working.

Toms hardware suggests the i5-2500K (4 core, 3.7Ghz turbo) for $224.99 or i7-2600K (4 core 3.8Ghz turbo) for $324.99 as comparable.

Tomshardware never tested the FX-8120, so that's a lie. They tested the FX-8150 and found [tomshardware.com] :

In the very best-case scenario, when you can throw a ton of work at the FX and fully utilize its eight integer cores, it generally falls in between Core i5-2500K and Core i7-2600K

The FX-8120 has 500 MHz lower base frequency which is far more significant than the 200 MHz lower max turbo. Not many have tested it but xbitlabs did [xbitlabs.com] :

Slower eight-core modification, AMD FX-8120, looks even less convincing, because it has significantly lower clock frequencies. In terms of performance, this processor ranks even below the quad-core competitor solutions. Moreover, FX-8120 is also slower than the top previous-generation AMD CPU - Phenom II X6 1100T.

So just admit it, you use AMD because you like AMD but clearly you have no clue what the competition offers.

Re:This isn't nearly as bad as the division bug (1)

Simon Rowe (1206316) | more than 2 years ago | (#39258835)

It fixed it, but new programmers shouldn't be forced to deal with stuff like that.

You're new here aren't you? Software always has to fix up the screwups the hardware engineers made.

Re:This isn't nearly as bad as the division bug (4, Insightful)

sjames (1099) | more than 2 years ago | (#39258059)

Crash bugs are frustrating, but nowhere NEAR as scary as a bug that results in an incorrect but plausible computation. If the program crashes, you KNOW it crashed and you know the runs before that didn't crash are OK.

Note that IRL the two cases can overlap. That is, a bug that might trigger a crash or might trigger an incorrect computation that might be plausible depending on luck of the draw.

Re:This isn't nearly as bad as the division bug (0)

tlhIngan (30335) | more than 2 years ago | (#39258213)

Crash bugs are frustrating, but nowhere NEAR as scary as a bug that results in an incorrect but plausible computation. If the program crashes, you KNOW it crashed and you know the runs before that didn't crash are OK.

Well, there are several problems - in a production environment, you're looking at a possible DoS issue for this. If it happens in the kernel, it can BSoD or kernel panic - putting the whole system offline. Or even worse, it'll continue for a little while and corrupt data in weird and wonderful ways before the misaligned stack finally causes the CPU to walk off the plank.

If it's an application, having some Line of Business app continually crashing causes its own share of problems. Especially since developers may not realize it's a CPU issue and spend weeks debugging lines of code "that should work". And probably not found if you single-step.

Even worse, it happens during heavy load. I'm sure if Anonymous decides to DDoS you, having your server crash just adds icing to the cake. Or if it experiences heavy load during some of the bigger shopping days.

On the bright side, it probably happens very rarely so most production servers probably WON'T see it.

Re:This isn't nearly as bad as the division bug (5, Insightful)

sjames (1099) | more than 2 years ago | (#39258319)

Imagine, there is a tiny bug that makes your floating point results just slightly wrong once in 1000 times. You run an iterative dynamic simulation of a bridge under load that runs for a million cycles. The results LOOK right...

Re:This isn't nearly as bad as the division bug (-1)

Anonymous Coward | more than 2 years ago | (#39259127)

Or a contractor making the actual bridge tries to save a buck by using cheaper bolts than specified. The results LOOK right ....

Which is more likely?

Re:This isn't nearly as bad as the division bug (3, Informative)

icebike (68054) | more than 2 years ago | (#39257991)

And it sounds like the sequence of instructions that causes it is not commonly found.

Really?
Pop two off the stack and ret to the calling routine seems fairly common to me. Lots of functions use two arguments and are called with near calls in various programming languages.

x86_64 ABI (4, Interesting)

DrYak (748999) | more than 2 years ago | (#39258527)

Pop two off the stack and ret to the calling routine seems fairly common to me. Lots of functions use two arguments and are called with near calls in various programming languages.

That might have been true on 386s.

But currently we're in 2012 and the most widely used instruction set for Linux on AMD processors is x86_64. Because these 64bit processors feature a big number of registers, the two arguments will be passed as registers, not on the stack. So the sequence of instructions isn't indeed common.

Re:This isn't nearly as bad as the division bug (1)

Rockoon (1252108) | more than 2 years ago | (#39259203)

Pop two off the stack and ret to the calling routine seems fairly common to me. Lots of functions use two arguments and are called with near calls in various programming languages.

The target function of the call has no business pushing or popping its arguments, ever. It doesn't work. Never has. The caller pushes the arguments and then in some calling conventions (such as STDCALL) the target function removes them from the stack using the return instruction itself ("ret 8" will remove 8 bytes of parameters) while in others the caller itself is responsible for removing the parameters (such as CDECL).

Let me repeat that what you are describing is not possible. When the target function begins executing, the top of the stack is the return address. A pop will be popping that valuable return address, not the first or last parameter which are under it. To be specific, [esp] is the return address and [esp + 4] is the first or last parameter.

Now don't speak when you don't know what you are talking about. Let's be honest here.. you knew that you didn't... now get off my lawn.
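
The stack layout described above can be modelled with a plain list. This is a toy sketch, not real x86 code: the function name, the argument values and the return address are invented, and arguments are assumed to be pushed right to left as in CDECL and STDCALL.

```python
def stack_at_entry(args, return_address):
    """Model the stack as seen by the callee; the last element is [esp]."""
    stack = []
    for arg in reversed(args):      # caller pushes arguments right to left
        stack.append(arg)
    stack.append(return_address)    # CALL pushes the return address last
    return stack


stack = stack_at_entry([11, 22], return_address=0x401000)

top = stack[-1]          # [esp]     -> return address
first_arg = stack[-2]    # [esp + 4] -> first parameter
second_arg = stack[-3]   # [esp + 8] -> second parameter

popped = stack.pop()     # a POP at function entry takes the return address
print(hex(top), first_arg, second_arg, hex(popped))
```

In this model a pop at entry removes the return address rather than a parameter, which matches the objection being made.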

Re:This isn't nearly as bad as the division bug (0)

Anonymous Coward | more than 2 years ago | (#39257995)

Well, I may have come across this bug a few times already.

Phenom X4 820

make -j4 seems to trigger it inside GCC. GCC dies with "internal error" compiling some of my software. Running make again, no problem, intermittent.

IMHO, a test case for CPU errors would be something run under DOS (FreeDOS, etc.) so you have 100% control over the CPU at all times.

Re:This isn't nearly as bad as the division bug (1)

Mprx (82435) | more than 2 years ago | (#39258871)

Running under DOS typically does not give you 100% control over the CPU:

http://en.wikipedia.org/wiki/System_Management_Mode [wikipedia.org]

Re:This isn't nearly as bad as the division bug (4, Insightful)

synthesizerpatel (1210598) | more than 2 years ago | (#39258163)

If your program is 'the kernel' then that qualifies as 'as bad as the division bug' && 'it's a big deal'.

Re:This isn't nearly as bad as the division bug (4, Interesting)

Forever Wondering (2506940) | more than 2 years ago | (#39258249)

Though, it's still very serious. At least it generally causes your program to crash rather than spitting out a wrong answer. And it sounds like the sequence of instructions that causes it is not commonly found.

I can well understand the guy who found it being all excited. The CPU is the last place you'd look for a bug, and finding one is pretty impressive, especially a really elusive one like this.

Actually, it could be occurring in other places/programs that aren't crashing but are [silently] producing bad results. The floating point bug, once isolated, could be probed for, and compensated for.

From what I can tell from reading the assembly code, the function is unremarkable except for the fact that it's recursive. It isn't doing anything exotic with the stack (e.g. just pushes at prolog and pops at epilog). The epilog is starting at +160 and the only thing I notice is that there are several conditional jumps there and just above it is a recursion call with a fall through. But, from the AMD analysis, it appears that it's the specific order of the push/pops that is the culprit. In this instance, it's r14, r13, r12, rbp, rbx

The workaround for this bug might be that the compiler has to put a nop at the start of all function epilogs (e.g. a nop before the pop sequence) on every function because you can't predict which function will be susceptible. Or, you have to guarantee that the push/pop sequence doesn't emit the sequence that causes the problem (e.g. move the rbp push to the first in sequence as I suspect that putting it in the middle is what is causing the problems)

Re:This isn't nearly as bad as the division bug (5, Interesting)

bzipitidoo (647217) | more than 2 years ago | (#39258265)

Oh, I've found CPU bugs before. But I never found one others hadn't already found. The 16MHz 80386 had a bug with counters. If you did a REP MOVSW or similar instruction in a 16 bit mode, starting on an odd address, and you made the pointer registers roll over, the CPU would lock up. Couldn't handle the transition from 0xFFFF to 0x0001 in either direction. That was fixed in all the faster 386's. As I recall, there were about a dozen bugs in the 386. Of course later processors were all checked for those specific bugs, so they never happened again.

Then there's unintended features such as pipeline oddities. If you have self modifying code, and it changes the destination of a jump instruction immediately before executing it, the computer will jump to the old address. Step through those same instructions in a debugger, and it will jump to the new address. Strictly speaking, jumping to the old address is incorrect, but it doesn't break any good code and fixing it would wreck pipelining. This behavior has been known for a long time, and every CPU from at least the 386 to the Pentium 4 behaves this way. It wasn't an important problem because so little code was self modifying. Wasn't any good as a copy protection method either, as only an amateur would be fooled by it. I think it's been resolved in at least 2 ways. First, by amending the documentation for the instruction set to expressly state that behavior is undefined in such a case, and second, by proving that there is never any need for self modifying code. And making the separation between code and data explicit. Now we have No eXecution bits.

There are sometimes even Easter eggs. For some processors, a few unassigned opcodes performed a useful operation. It wasn't by design. Is that a bug? Another case was the use of out of bounds values. For instance, the ancient 6502 supports this packed decimal arithmetic mode, in which 0x99 meant 99. So what happened when some joker gave it an illegal value such as 0xFF? 0xFF was interpreted as 15*10+15 = 165, and one could perform some math on it and get correct results. Divide 0xFF by 2 (shift right), and it would compute the correct result of 0x82. That sort of thing makes life tough for emulators, and I have yet to find an Apple II emulator that reproduces that behavior faithfully.
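
The arithmetic in the packed decimal example can be reproduced with a short sketch. It only models the digit-by-digit reading described above (high nibble times ten plus low nibble); it makes no claim about what a real 6502 does with out-of-range digits, and both helper names are invented.

```python
def decode_packed_bcd(byte: int) -> int:
    """Read a byte as two decimal digits, without rejecting digits above 9."""
    high, low = byte >> 4, byte & 0x0F
    return high * 10 + low


def encode_packed_bcd(value: int) -> int:
    """Pack a value in the range 0..99 into one byte, one digit per nibble."""
    tens, ones = divmod(value, 10)
    return (tens << 4) | ones


print(decode_packed_bcd(0x99))                           # a legal value
print(decode_packed_bcd(0xFF))                           # an out-of-range value
print(hex(encode_packed_bcd(decode_packed_bcd(0xFF) // 2)))
```

Under this reading, 0xFF decodes to 165, and halving it and re-encoding gives 0x82, the result quoted in the comment.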

Re:This isn't nearly as bad as the division bug (1)

billcopc (196330) | more than 2 years ago | (#39258905)

there is never any need for self modifying code

There is when you're on a memory-constrained platform, which admittedly the PC is not. Selfmod code is still used in demo coding, especially with 256-byte and 4096-byte competitions, but that is exclusively an academic exercise.

On an embedded system with just a few kbytes of memory, like say an ARM-powered gadget, self-modifying code is still relevant, even in 2012. Just because we can put 4 gigs of Ram in a toaster doesn't mean we should.

Re:This isn't nearly as bad as the division bug (3, Interesting)

AmiMoJo (196126) | more than 2 years ago | (#39258999)

Most of the undocumented op-codes on older CPUs were down to the fact that they were designed by hand rather than having the circuits computer generated. A computer will make sure all illegal op-codes are caught and generate an exception, but human beings didn't bother. Designers put in test op-codes as well which were usually just left in there for production. Even the way humans design circuits makes them more likely to produce useful undocumented op-codes and side-effects.

It was somewhat risky to use them though because the manufacturer might decide to change CPU. The Z80 design was licensed out and any number of companies could supply them, all with their own unique bugs. Some games liked to use these features for copy protection and then broke when the producer switched supplier.

Re:This isn't nearly as bad as the division bug (0)

Anonymous Coward | more than 2 years ago | (#39259073)

Couldn't handle the transition from 0xFFFF to 0x0001 in either direction. That was fixed in all the faster 386's

IIRC even 80286 would fire a GPF for accessing a word at 0xffff since it technically accesses 65537th byte of the segment and the limit in real mode is set at 65536.
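The limit check being described is simple enough to sketch. This assumes a 64 KiB segment whose last valid offset is 0xFFFF, as in the comment; the function name is hypothetical and the sketch says nothing about which exception a given CPU actually raises.

```python
SEGMENT_LIMIT = 0xFFFF  # last valid offset in a 64 KiB segment


def access_exceeds_limit(offset, size, limit=SEGMENT_LIMIT):
    """Return True if any byte of the access lies past the segment limit."""
    last_byte = offset + size - 1
    return last_byte > limit


print(access_exceeds_limit(0xFFFF, 1))  # byte at the last offset: fits
print(access_exceeds_limit(0xFFFE, 2))  # word ending at the last offset: fits
print(access_exceeds_limit(0xFFFF, 2))  # word needs offset 0x10000: does not fit
```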

and second, by proving that there is never any need for self modifying code.

Even on pentium it was still important: add reg,[mem] is 3 times slower than add reg,const. This meant quite a bit in a trifiller. Only later it started to incur such devastating penalties that it didn't pay off anymore.

For instance, the ancient 6502 supports this packed decimal arithmetic mode, in which 0x99 meant 99.

Similar well known thing is that some x86 instructions which have "decimal" in the name contain an interesting 0xa byte in the opcode.. turns out it doesn't need to be 0xa.

Re:This isn't nearly as bad as the division bug (0)

Anonymous Coward | more than 2 years ago | (#39259097)

"For some processors, a few unassigned opcodes performed a useful operation. It wasn't by design. Is that a bug?"

No, it was luck. Those opcodes are not unheard of at all - they were described to me in my computer science degree during a course on designing CPUs (we designed our own very simple one). Essentially the opcodes all get checked through combinations of NAND/NOR etc gates to do the useful things. Any unused opcodes have undefined behaviour and the circuitry is simpler (smaller, faster, cheaper) if you use fewer gates so some of the combinations will lock an operation down to a single defined opcode, but the gates can overlap with undefined opcodes too because you haven't bothered adding extra gates that rule the undefined ones out. Some of the undefined opcodes will just duplicate other opcodes, while others may overlap with multiple defined opcodes and as a result do 2 operations in a single opcode which are the useful ones. Problem is some of them won't be that useful because they'll interfere with each other, but others can indeed be very useful. You're always running a risk though - they're completely undefined and just working because of the combination of gates on your particular CPU. Your program'll work fine on that CPU and possibly other similar ones but eventually it'll run on another CPU that uses a slightly different gate layout and your code'll suddenly stop working.
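A minimal sketch of that kind of incomplete decoding, using an invented two-bit opcode field. The opcode values, mnemonics and control-line names are all hypothetical; the point is only that a decoder which tests individual bits, and never rules out the unused pattern, ends up doing two documented operations at once.

```python
# Documented opcodes of the imaginary CPU; 0b11 is deliberately unassigned.
DOCUMENTED = {0b01: "LOAD_A", 0b10: "LOAD_X"}


def decode(opcode):
    """Return the control lines asserted for a 2-bit opcode.

    Each line is driven by a single opcode bit. No extra gate rejects
    0b11, so that pattern asserts both lines.
    """
    lines = set()
    if opcode & 0b01:
        lines.add("load_A")
    if opcode & 0b10:
        lines.add("load_X")
    return lines


for opcode in range(4):
    name = DOCUMENTED.get(opcode, "(undocumented)")
    print(format(opcode, "02b"), name, sorted(decode(opcode)))
```

A different gate layout for the same documented opcodes could just as easily make 0b11 do nothing, which is why relying on it is a gamble.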

Re:This isn't nearly as bad as the division bug (0)

Anonymous Coward | more than 2 years ago | (#39259167)

"proving that there is never any need for self modifying code"
I'm curious about where the hell this is proved.

Re:This isn't nearly as bad as the division bug (1)

Anonymous Coward | more than 2 years ago | (#39258353)

Hmmm... I don't want to sound cocky, but when a *BSD crashes... it's usually the hardware.

I remember when my FreeBSD based Server crashed regularly, I figured out that the Xeon CPU was broken (cache defect, appeared only under very heavy load).

Re:This isn't nearly as bad as the division bug (2)

mysidia (191772) | more than 2 years ago | (#39258381)

Though, it's still very serious. At least it generally causes your program to crash rather than spitting out a wrong answer. And it sounds like the sequence of instructions that causes it is not commonly found.

It may be uncommon to be found... but that doesn't equate to not exploitable

Re:This isn't nearly as bad as the division bug (3, Insightful)

Darinbob (1142669) | more than 2 years ago | (#39258485)

CPUs have plenty of bugs. It's not necessarily the last place to look, especially for less popular processors. The only reason it's rarer with Intel and Intel-copying CPUs is because the market is so much bigger and therefore the resources for QA. Actually the bigger and more complex the processors are becoming the more likely it is to have bugs. Of course most are things people don't worry about or that can be worked around by following advice in the errata.

In fact, the common assumption that CPUs have bugs only in the rarest of cases makes it hard to convince others that you have actually found a bug that's not in the errata. The same thing happens with compilers: you tell people that the bug must be in the compiler and they roll their eyes at you.

So is this the fanboy way to deflect from it? (4, Interesting)

Sycraft-fu (314770) | more than 2 years ago | (#39258575)

You try and find something that "the other guy" had a problem with and bring it up as worse so as to try and "protect" the thing you are a fan about? Because I see nothing about the FDIV bug anywhere but your post.

Oh and you know what that bug applied to, right? The Intel Pentium, the ORIGINAL Pentium. Not the Pentium MMX, not the Pentium Pro, not the Pentium II, not the Pentium III, not the Pentium 4, not the Core, not the Core 2, not the Core i, not the second generation Core i. And yes, that's how many major processor versions from Intel there have been since then (with another to launch in the next couple weeks). The original Pentium chips that had this problem came out almost 2 decades ago, 1993.

So seriously, leave off it. I get tired of any time there is a problem with $Product_X fans of it will point out how $Product_Y had a similar or worse error way back in the day and that somehow changes things.

No it doesn't. The story is about the AMD chips, nobody gives a shit about the FDIV bug and I'll wager there are people reading Slashdot who weren't alive when it happened.

The good news for AMD is that processors can often patch around this shit in microcode these days so a recall may not be needed. Have to see, but the potential is there for a software (so to speak) fix.

Re:So is this the fanboy way to deflect from it? (1)

Daniel Phillips (238627) | more than 2 years ago | (#39258907)

I don't disagree with your rant, however it is just not a good idea to dismiss a processor bug as "happened a long time ago". The point is, it happened. Processor bugs happen. And here is one that happened last year [ibm.com] if you must.

Re:So is this the fanboy way to deflect from it? (1)

gnasher719 (869701) | more than 2 years ago | (#39259001)

So seriously, leave off it. I get tired of any time there is a problem with $Product_X fans of it will point out how $Product_Y had a similar or worse error way back in the day and that somehow changes things.

Unless the $Product_X fans also point out that the maker of $Product_Y paid an awful lot of money to replace the broken CPUs.

Re:This isn't nearly as bad as the division bug (0)

Anonymous Coward | more than 2 years ago | (#39258853)

The CPU is the last place you'd look for a bug

If you were in the HW business you'd know better:
CPU bugs happen all the time (e.g., here's a 30+ bug list for the core: http://blog.pi3.com.pl/?p=55).
Now a CPU bug that is actionable in user space, that is indeed not easy to find, but I'm not even sure that it is the case here (heck, the guy said he made his own OS to demonstrate it).

Re:This isn't nearly as bad as the division bug (0)

Anonymous Coward | more than 2 years ago | (#39259195)

With the division bug, it was possible to detect it and automatically switch over to a software stack when it was found.
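For reference, the FDIV detection usually came down to one known-bad division. The sketch below uses the widely circulated operand pair 4195835 / 3145727; on a correct FPU the residual is zero (give or take rounding), while flawed Pentiums were widely reported to return 256. The function names and the tolerance are assumptions for illustration, not taken from any particular detection routine.

```python
def fdiv_residual(x=4195835.0, y=3145727.0):
    """Compute x - (x / y) * y, which should be (almost exactly) zero."""
    return x - (x / y) * y


def looks_like_fdiv_bug(tolerance=1.0):
    """Return True if the residual is far larger than rounding error allows."""
    return abs(fdiv_residual()) > tolerance


if looks_like_fdiv_bug():
    print("division looks wrong; fall back to software division")
else:
    print("hardware division looks fine")
```

The stack-pointer bug in the story depends on timing and instruction sequence rather than on particular operands, so a one-shot probe of this kind may not carry over.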

Is there any way to do the same for this bug?

Congratulations to Matt Dillon on finding it. Race conditions and issues that only crop up under severe loads are a real bear to debug.

He sounds wicked smart. (1)

Zorque (894011) | more than 2 years ago | (#39257897)

I wonder if AMD likes apples.

Re:He sounds wicked smart. (0)

Anonymous Coward | more than 2 years ago | (#39257935)

I'm inclined to say AMD doesn't need a hit to its rep right now. I don't like AMD, but Intel needs competition pronto, and I'd prefer the lesser of two evils.

Re:He sounds wicked smart. (0)

Anonymous Coward | more than 2 years ago | (#39258049)

You're thinking of the wrong Matt.

Re:He sounds wicked smart. (1)

Anonymous Coward | more than 2 years ago | (#39258115)

Matt has been a personal hero of mine since he wrote the DICE compiler for the Amiga (late 80's, 90's). He produced so much code back then, that I didn't really think it was a single person producing it all. I think he did some RF work with the 68000 back then too. Really talented person!

Microcode patch (0)

zonker (1158) | more than 2 years ago | (#39257923)

I assume they'll be able to fix it via a microcode patch. Intel had to learn that the hard way...

Re:Microcode patch (3, Informative)

Omnifarious (11933) | more than 2 years ago | (#39257955)

I'm wondering if they will. This seems like a very odd timing issue that may be a problem in the electronics. Of course, I suppose they could just put in some microcode to wait after certain operations to make sure things settle and so avoid the hardware bug.

Re:Microcode patch (1)

Anonymous Coward | more than 2 years ago | (#39258183)

Interestingly enough, after mentioning this to my dad, his reply was 'It sounds like crosstalk in the decoder logic' (this may be slightly inaccurate since my memory is lousy). So hopefully it's microcode fixable, but given how long it's taken to track down, I assume it hasn't bitten nearly as many people as it should have. Although I have had a number of quirky crashes running an overclocked Sempron that sound very similar to what this was doing.

Re:Microcode patch (1)

lightknight (213164) | more than 2 years ago | (#39258221)

Indeed. I instantly thought of the microcode update as well.

Their other options are to do a processor recall (like Intel + the infamous Pentium bug), or inform the compiler manufacturers that 'there be the dragons' (special case inserted into the code for the affected processor / architecture to bypass the affliction).

cool, but...? (1)

bcrowell (177657) | more than 2 years ago | (#39257937)

This is cool, but...?

Why does it matter that it's the lead developer of DragonflyBSD?

Re:cool, but...? (2)

drobety (2429764) | more than 2 years ago | (#39257997)

I suppose to be sure he is not confused with the other Matt Dillon.

Re:cool, but...? (1)

icebike (68054) | more than 2 years ago | (#39258005)

Because if it were Joe Random Programmer AMD would not have even listened to him?

Re:cool, but...? (1)

Provocateur (133110) | more than 2 years ago | (#39258023)

Because people read the name and they think it's Matt "There's Something About Mary", "Wild Things" oh ya with the two babes at once Dillon.

Same reason they have to specify millionaire-playboy Bruce Wayne.

Re:cool, but...? (3, Insightful)

ffflala (793437) | more than 2 years ago | (#39258027)

It matters because it's impressive. It also seems fair to associate some of the positive impression with DragonflyBSD, and I cannot see any downside to throwing good PR at any BSD flavor.

Re:cool, but...? (5, Interesting)

wrook (134116) | more than 2 years ago | (#39258113)

Matt Dillon is a rather famous programmer (as programmers go). I assume that's why they mention him by name. I think a very large percentage of old Amiga hackers know who he is. He's also done work on the Linux kernel. Despite all that, he's best known for his work on FreeBSD and on his DragonflyBSD project. While a lot of old timers will know that, not everyone else will.

Re:cool, but...? (0)

Anonymous Coward | more than 2 years ago | (#39259169)

Yes. Matt is a well known code hacker in the Amiga arena, and we respected the hell out of him while he was cranking the hell out of Amiga code while attending UC Berkeley. Everyone shut up and listened when Matt spoke up in the BIX (Byte Information Exchange) forums, and when he went on to work on BSD, we all knew he'd shine in that arena as well. Rock on, Matt.

 

Re:cool, but...? (2)

phantomfive (622387) | more than 2 years ago | (#39258171)

Because now we know not only that something cool was done, but also who did it. Both are relevant.

Re:cool, but...? (1)

maroberts (15852) | more than 2 years ago | (#39258273)

This is cool, but...?

Why does it matter that it's the lead developer of DragonflyBSD?

Because they want to ensure that the world famous programmer and developer is not confused with the little known actor with a similar name.

Re:cool, but...? (1)

MichaelSmith (789609) | more than 2 years ago | (#39258455)

Why does it matter that it's the lead developer of DragonflyBSD?

As a kernel developer he works on code which manipulates CPUs at a low level. That's why he found the bug.

Re:cool, but...? (1)

jones_supa (887896) | more than 2 years ago | (#39258521)

Why does it matter that it's the lead developer of DragonflyBSD?

It's nice to mention what the guy is known for. Besides, the bug came up when he was tinkering with DragonFly BSD.

If we were completely pedantic, it ultimately does not matter. Anyone interested in computers and programming with enough talent could have found the bug. But yeah.

Nice work (1)

gkndivebum (664421) | more than 2 years ago | (#39257951)

Nice work tracking that one down. It must have been very frustrating - what we used to call a "ring-tailed b1tch"

hidden bugs (1)

AbrasiveCat (999190) | more than 2 years ago | (#39257969)

After reading the links, and knowing how hard a time I have trying to find out why something doesn't work right, I think I understand why he is so stoked at finding the root of the problem. Good for Matt, maybe they will send you a fixed processor someday.

Buttholes! (1)

For a Free Internet (1594621) | more than 2 years ago | (#39257979)

Buggy buggy buttholes!!!!

Kudos (5, Insightful)

Mannfred (2543170) | more than 2 years ago | (#39258063)

I can only imagine the time and effort spent on tracking down this problem - a rare CPU condition is exponentially more difficult to narrow down than most programming mistakes. A lot of progress in IT depends on engineers like this, who obsessively solve problems even when it's much easier to just ignore them, try to hack around them or pass the buck around. Kudos.

Re:Kudos (1)

justforgetme (1814588) | more than 2 years ago | (#39258467)

Yep, indeed. Kudos to Matt and his insight.

Re:Kudos (2)

paradigm82 (959074) | more than 2 years ago | (#39259211)

I agree, but remember that most engineers are working on company time. For most companies it wouldn't be rational to have an engineer working months to isolate/reproduce this CPU bug. After all, this work will not particularly benefit this company over all the other companies, and at any rate it would be much cheaper to just do the workaround (which might be necessary anyway). However, a good engineer probably couldn't resist looking into this in his free time (and maybe in company time with nobody looking!) at least to prove that he was right. Those engineers are usually so much more valuable than the average engineer that even if they sometimes spend their time on things that are not rational for the company to spend their time on, it is still worth it to have them on the payroll :)

Affected CPUs (5, Informative)

Anonymous Coward | more than 2 years ago | (#39258093)

A pertinent addition to the submission would be which CPUs have been found to be affected.
The second link says Opteron 6168 and Phenom II X4 820. For a second I thought that bulldozer hasn't managed to do anything right, but these two examples are pre bulldozer.
No doubt this is not an exhaustive list.

Re:Affected CPUs (1)

unixisc (2429386) | more than 2 years ago | (#39258149)

That's exactly what I was wondering. The e-mail exchange didn't seem specific about which CPU was impacted. Please don't tell me it's every Opteron or Phenom in AMD's lines.

Re:Affected CPUs (1)

oldhack (1037484) | more than 2 years ago | (#39258423)

Mod the parent up. Any link to full AMD announcement with the list of CPU models affected and the status on any workaround in the works?

Re:Affected CPUs (1)

fa2k (881632) | more than 2 years ago | (#39259037)

I don't even know where to check for more info... I definitely want to get my CPU replaced if it has this bug. I reckon AMD will not be eager to publicize this (if they linked to it from their support page, they would already be better than Intel)

And Linux? (1)

aglider (2435074) | more than 2 years ago | (#39258105)

Windows? Does this mean that those users and devs aren't so important as far as total CPU load?

Linux - not so much concerned (1)

DrYak (748999) | more than 2 years ago | (#39258557)

Linux in 64-bit mode uses registers to pass arguments to functions.
So the most common sequence at the end of a function isn't "bunch of pops then a (near) return", but "move the results into target registers and then return".
Thus the bug sequence doesn't happen that often.

Re:Linux - not so much concerned (2)

dragonk (140807) | more than 2 years ago | (#39258921)

This would be insightful and all -- except that it isn't -- because DragonFly BSD uses the same x86-64 calling conventions as Linux.

Test case (1)

geminidomino (614729) | more than 2 years ago | (#39258177)

Anyone know if that "Test case" image is available? I'd like to check if my Phenom II x6 is affected.

Wouldn't worry about it (2)

Sycraft-fu (314770) | more than 2 years ago | (#39258605)

Presumably AMD will announce affected CPUs fairly soon, after they get done testing. This isn't the kind of thing they would be able to sit on, even if they wanted to. If your CPU has been working for you in general it isn't like it is going to suddenly go and beat up your cat or something, it'll be fine for a bit longer while AMD figures out which ones are all affected and figures out how to fix it.

As I noted in another post, depending on the bug, it may be possible to fix it via microcode. CPUs aren't "pure" hardware these days. They have a bit of software that tells them how to do things and on some of them (Intel CPUs I know for sure) it is field upgradable. So they may find a way to patch out the bug.

Just keep an eye on their page, maybe send them an e-mail saying you'd like a notice when they know. Should be soon I'd imagine.

Confirmed CPUs (4, Informative)

Jah-Wren Ryel (80510) | more than 2 years ago | (#39258197)

FWIW:

The failure has been observed on three different machines, all running AMD cpus. A quad opteron 6168 (48 core) box, and two Phenom II x4 820 boxes.

Re:Confirmed CPUs (1)

DigiShaman (671371) | more than 2 years ago | (#39258359)

Interesting. That sounds like a logical bug that can quickly be patched with a microcode update (BIOS or OS level).

security exploit? (3, Interesting)

Anonymous Coward | more than 2 years ago | (#39258251)

I have to worry about stack smashing bugs here... can there be a way for (say) a data pattern in a media file, or carefully crafted javascript or java code that's been JIT-compiled, to break out of its sandbox? What about a hostile OS kernel running inside a VPS container taking over the hypervisor or bare iron? Hmm.

Matt Dillon of Dragon Fly (4, Funny)

hcs_$reboot (1536101) | more than 2 years ago | (#39258405)

Matt Dillon [imdb.com] , desperate after chasing unsuccessfully mary in Something about Mary [imdb.com] radically changed jobs and started to study computer science...

Re:Matt Dillon of Dragon Fly (1)

Forever Wondering (2506940) | more than 2 years ago | (#39258507)

Matt Dillon [imdb.com] , desperate after chasing unsuccessfully mary in Something about Mary [imdb.com] radically changed jobs and started to study computer science...

Matt Dillon and Cameron Diaz had been dating for three years prior to "There's Something About Mary", but split up shortly thereafter. So, maybe, not totally unsuccessful [from a certain point of view--Obiwan] ...

Shlemmie Blakkita (-1)

Anonymous Coward | more than 2 years ago | (#39258567)

flurpy slippah

War story from the network trenches (0)

Anonymous Coward | more than 2 years ago | (#39258573)

In a way this bug reminds me of a problem I found on a certain software-hardware combination involving a ten gigabit Ethernet card. We were seeing mysteriously corrupted SSL connections when there was no reason for them to be corrupted. This condition occurred roughly every couple dozen gigabytes of transferred data on a connection that carried both MSS-sized and very short data segments. After a couple of weeks of debugging, I was able to conclude that the TCP receive offload engine on the card occasionally injected Ethernet padding as data into the TCP stream, which the SSL library eventually processed as its input! Neither Ethernet nor TCP checksums saved us from this extra data, since it was conceptually generated on a higher layer; SSL caught it, though. Only God knows how many connections this semi-hardware (but at least software-fixable) bug had corrupted in a less obvious manner elsewhere before it was found out...

A guy like matt (1)

maestroX (1061960) | more than 2 years ago | (#39258983)

I'm pretty stoked... it isn't every day that a guy like me gets to find an honest-to-god hardware bug in a major cpu!

for a guy like me, this is pretty much an honest-to-god bughunt ;)
