Slashdot: News for Nerds

Flawed AMD Chip Can Lead To Data Corruption

Zonk posted more than 8 years ago | from the crunchy-mistakes dept.

Brandonski writes "Apparently AMD allowed some flawed chips to slip through their detection grid. The problem affects only a small number of chips and only single core 2.6 and 2.8 GHz CPUs." From the article: "It is believed that the glitch is triggered when the affected chip's FPU is made to loop through a series of memory-fetch, multiplication and addition operations without any condition checks on the result of the calculations. The loop has to run over and over again for long enough to cause localized heating which together with high ambient temperatures could combine to cause the result of the operation to be recorded incorrectly, leading to data corruption."
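For readers wondering what such a loop looks like, here is a minimal sketch of the shape the article describes: fetch, multiply, add, with no branch that depends on the computed result. The function name and inputs are hypothetical, and an interpreted Python loop would not stress the FPU the way hand-tuned native code would, so treat it as an illustration of the pattern only, not a reproduction of AMD's test.

```python
def multiply_accumulate(values, weights, passes):
    """Fetch, multiply, and add repeatedly, never branching on the running total."""
    total = 0.0
    for _ in range(passes):
        for value, weight in zip(values, weights):
            # One memory fetch per operand, one multiplication, one addition.
            total += value * weight
    return total


if __name__ == "__main__":
    # Hypothetical inputs; the article does not publish the real test data.
    print(multiply_accumulate([1.0, 2.0], [3.0, 4.0], passes=2))
```

The only conditions here are the loop counters; nothing ever inspects `total` until the loop is finished, which matches the "no condition checks on the result" part of the description.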

203 comments

RED HATE (-1, Offtopic)

fyrie (604735) | more than 8 years ago | (#15226351)

RED ALERT

I dub thee (2, Funny)

Anonymous Coward | more than 8 years ago | (#15226372)

Fetch Div, son of Eff Div, Heir to Count Zero, and Lord of a new generation of digital serfs, soon to be labeled as having "emotional problems."

What? (0)

Anonymous Coward | more than 8 years ago | (#15226375)

Overheating leading to data corruption? Since when is this a flaw in chip design?

Re:What? (2, Interesting)

qbwiz (87077) | more than 8 years ago | (#15226379)

Generally, chips aren't supposed to have localized heating problems. Either it should all have a problem, or none of it should.

If my chip is overheating... (1)

Ethan Allison (904983) | more than 8 years ago | (#15226578)

I sure wouldn't want to find out which spots were hotter than others... touching an overheated chip that much would hurt! (Plus, if it gets too hot the CMOS should kill the machine anyway)

Corruption (2, Informative)

XanC (644172) | more than 8 years ago | (#15226387)

Corruption is the cardinal sin of a CPU. If it can't compute a result accurately, it should shut down rather than give a wrong answer.

Re:Corruption (1)

frosty_tsm (933163) | more than 8 years ago | (#15226404)

As in, commit electronic seppuku?

Re:Corruption (5, Insightful)

leendertv (969527) | more than 8 years ago | (#15226558)

No CPU can guarantee to be free of corruption, the goal of the designer is just to minimize the likelihood of corruption. The design margins are usually such that proper operation is ensured, except for the statistical outliers. However, even CPUs with several error checking and correcting mechanisms can still corrupt data, it is just extremely unlikely. A CPU can never know for sure if it can compute a result accurately, or if an operation was performed correctly, just like no communications system can achieve bit error rates of 0.

Data corruption in integrated circuits can come from several different sources. Cosmic rays are likely to alter memory values, especially so in DRAM cells. Typically, only ICs for space applications are actually radiation hardened. Much less likely, transistor device noise can corrupt data. Transistor device noise is usually more an issue in RF circuits. Finally, not all manufacturing defects can be found during manufacturing test, since most test sequences don't even achieve 100% fault coverage under currently used fault models, and this does not even consider how closely the models represent the actual circuit failure modes.

Really, for most people this floating point data corruption is probably a non-issue. It is even more unlikely that errors in floating point data lead to exploits. It is more likely that some bits of your DRAM memory will get corrupted. On my system with ECC RAM that is a few years old, logs show that I get about 1 or 2 (correctable) errors per day...

Re:Corruption (1)

smash (1351) | more than 8 years ago | (#15226685)

Granted, for most people this may well be a non-issue, and data corruption is a fact of life.

However, when a CPU is KNOWN DEFECTIVE in a repeatable, data-corrupting way, it is the vendor's responsibility to replace/fix it.

Similar to vehicle recalls. Most people would never be affected by many of the things vehicles are recalled for, but that doesn't mean that known *serious* defects are simply let go.

smash.

Re:Corruption (1)

dubl-u (51156) | more than 8 years ago | (#15226710)

If it can't compute a result accurately, it should shut down rather than give a wrong answer.

Congress could learn a lot from this.

Who cares (1)

Kwiik (655591) | more than 8 years ago | (#15226736)

don't "we" commonly measure code in terms of how many errors per KLoC [wikipedia.org] it contains? How does any of this pale in comparison?

Some information derived from an old leaked MS email: The internal rule in Redmond is "4 bugs per KLoC".

that's one bug per 256 lines of code..one bug every couple of functions, for the smaller functions.. several bugs per function, for the more intensive functions.

and it's not like OSS is safe from this either. Sure, having error prone hardware will make perfection even more impossible, but seriously, you are far more likely to receive faulty RAM out of the box than to run into an issue with one of these suckers.
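The lines-per-bug figure quoted above depends on whether a KLoC is read as 1000 or 1024 lines. A quick sketch of the arithmetic, with a hypothetical helper name and the "4 bugs per KLoC" rate taken from the comment rather than from any verified source:

```python
def lines_per_bug(bugs_per_kloc, lines_per_kloc=1000):
    """Average number of lines of code between bugs at a given defect rate."""
    return lines_per_kloc / bugs_per_kloc


if __name__ == "__main__":
    # Conventional KLoC of 1000 lines.
    print(lines_per_bug(4))
    # Binary-flavoured KLoC of 1024 lines, which yields the 256 quoted above.
    print(lines_per_bug(4, lines_per_kloc=1024))
```

Either way the order of magnitude is the same: roughly one bug every couple of hundred lines.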

Re:What? (0)

Anonymous Coward | more than 8 years ago | (#15226425)

if it were intel, you would be saying its the worst flaw ever!

I Have an AMD CPU (5, Funny)

ozmanjusri (601766) | more than 8 years ago | (#15226384)

Hey, I have an AMD 2.8Ghz. Maybe I should stop refresðN9'óI]öR9ù¥Î6ýPoe}+èa(ê{

Re:I Have an AMD CPU (5, Funny)

zaguar (881743) | more than 8 years ago | (#15226564)

ðN9'óI]öR9ù¥Î6ýPoe}+èa(ê{

Interesting Perl script.

Re:I Have an AMD CPU (0)

Anonymous Coward | more than 8 years ago | (#15226625)

I just figured it was Tengwar.

Re:I Have an AMD CPU (1)

MadUndergrad (950779) | more than 8 years ago | (#15226640)

Mod parent "LotR nerd". Me too, for knowing what that is.

Re:I Have an AMD CPU (1)

ozmanjusri (601766) | more than 8 years ago | (#15226819)

Interesting Perl script.

Let's see...

$ ./line_noise.pl

Warning: Programmer attempting to re-invent the wheel. There's a function that does the exact same thing on CPAN. Sometimes it actually works.
ERROR: Unable to create life
Exiting

Re:I Have an AMD CPU (1)

arkhan_jg (618674) | more than 8 years ago | (#15226897)

You have a 2.8Ghz Opteron in your desktop PC at home? Don't you have someone to press the refresh button for you?

Kernel fix? (0)

lostngone (855272) | more than 8 years ago | (#15226391)

I'm sure someone will have a kernel patch to prevent this from happening in linux in very short order. The big question is will someone write malware/virus to somehow take advantage of this flaw?

Re:Kernel fix? (4, Insightful)

Umbral Blot (737704) | more than 8 years ago | (#15226411)

The big question is will someone write malware/virus to somehow take advantage of this flaw?

I am curious how a virus could possibly exploit this. It would have to a) hog the resources so that it ran nearly exclusively, which would mean the virus already had control, and b) somehow cause a floating point error to result in a privileges error (privileges and security routines rarely use floating point numbers). Also, why would a kernel patch be released for this? It would hurt performance for the rest of us; customers with defective chips should simply return and replace them.

Re:Kernel fix? (1)

Alien Being (18488) | more than 8 years ago | (#15226455)

"...customers with defective chips should simply return and replace them."

Simple for whom? It can be a real pain in the ass to swap a CPU.

AMD says that from now on, chips that have this problem will be rerated to lower clock speeds. It would be nice if they offered customers the option of turning down the clockspeed in exchange for a partial refund.

Overclocking ... (1)

AHumbleOpinion (546848) | more than 8 years ago | (#15226595)

AMD says that from now on, chips that have this problem will be rerated to lower clock speeds..

And then end users will overclock these CPU ...

Re:Kernel fix? (1)

Lucractius (649116) | more than 8 years ago | (#15226607)

Yes i have a faulty cpu..

Turn its clock down, right, yep done that.

So now I'll never be affected by this obscure glitch that is almost totally unreproducible outside of synthetic testing, oh thanks very much.

can i have the check now please ?

*check arrives*
*cashes check*
*clocks cpus back up*

Re:Kernel fix? (1)

Alien Being (18488) | more than 8 years ago | (#15226643)

If his overclocking causes a problem he can kick his own ass.

Re:Kernel fix? (1)

Lucractius (649116) | more than 8 years ago | (#15226701)

as long as i have the nice fat "refund" check for clocking down the cpus, who cares :P

Re:Kernel fix? (1)

KDR_11k (778916) | more than 8 years ago | (#15226979)

You could have bought a "downrated" chip and overclocked it, too. The clock rate on the box is merely specification, the chip can remain operational at higher clock rates, the manufacturer just won't be responsible for what happens then.

Re:Kernel fix? (1)

Crazyscottie (947072) | more than 8 years ago | (#15226527)

In theory, a malicious user could exploit this vulnerability in a routine that's already calling that particular series of instructions. In practice, however, I think it would be nearly impossible to do anything useful, because you'll have no control over the values being written to memory; you'll just know that they aren't correct.

Re:Kernel fix? (0)

Anonymous Coward | more than 8 years ago | (#15226665)

If there were a patch, you would need to select to compile it into the kernel. It would fix your not-so-likely problem. If you don't have an opteron/don't have one that's affected, you wouldn't be compiling the patch in the first place. Victims get a patch, nobody else needs it so they don't use it.

Re:Kernel fix? (5, Informative)

larry bagina (561269) | more than 8 years ago | (#15226534)

I'm sure someone will have a kernel patch to prevent this from happening in linux in very short order.

Not likely. This is valid user code that is being executed. On other CPUs, the same code wouldn't cause a problem. Something like the F00F bug is fixable in the kernel by mucking with exception handler. This is pure user-land code.

Mod parent up please (1)

btarval (874919) | more than 8 years ago | (#15226588)

Agreed; the GP doesn't understand the problem. At best, you might modify gcc; but I suspect that might be a pain, considering it's such a limited problem (according to the rumor mentioned in TFA).

There's no way the kernel can do anything about it, from the description of the problem.

And, contrary to AMD's attempts to downplay this issue, there are two immediate areas that I can think of which are affected. The first are certain scientific calculations (even worse, those involving Beowulf clusters). The second are CAD simulations.

Both areas can involve calculations which run for days at a time; far in excess of the hours mentioned in the fine article.

In general, people don't really seem to pay much attention to either the reliability of the CPU or the quality of the RAM though. Witness the number of really cheap systems that people buy for this type of work. Perhaps this will be a "heads up" that yes, even the most basic subsystem of your computer can go haywire, skewing your results, and wasting your time.

Re:Mod parent up please (1)

something_wicked_thi (918168) | more than 8 years ago | (#15226663)

From what I read, you don't really understand the problem, either.

Even scientific applications are probably not going to be affected by this because even checking the counter on a for loop seems to provide enough of a break to let the FPU cool off. They also say that no applications that were tested exhibited the problem, and I'd say it's pretty likely the first thing they thought of when they tried applications were heavy floating point applications (scientific apps, GIMPS, etc.).

Maybe you'd see the problem on some really tight FPU code, like something in GIMPS during the torture test, but I doubt it. I think the only way you'd manage to get this problem is if you wrote code specifically to do it.

That said, I agree with your thesis: people really need to be aware that computers can fail for *no good reason*. There are techniques to correct that (e.g. quorum sensing), and any large-scale cluster that requires accuracy should be using such techniques, anyway, as MTTF is inversely related to the number of components being utilized.
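To make the redundancy idea concrete, here is a rough sketch of majority voting over repeated runs, which is one common reading of the technique mentioned above. All names are hypothetical. The MTTF helper assumes independent components with constant failure rates, which is a simplification and not something the comment states.

```python
from collections import Counter


def quorum_result(results):
    """Return the value reported by a strict majority of runs, else raise."""
    if not results:
        raise ValueError("no results to vote on")
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise ValueError("no majority among redundant results")
    return value


def run_redundantly(compute, replicas=3):
    """Run the same computation several times and vote on the answer."""
    return quorum_result([compute() for _ in range(replicas)])


def first_failure_mttf(component_mttf_hours, components):
    """Expected time to the first failure among identical components.

    Assumes independent components with constant failure rates.
    """
    return component_mttf_hours / components


if __name__ == "__main__":
    # Two nodes agree, one returned a corrupted value.
    print(quorum_result([42.0, 42.0, 41.9]))
    print(first_failure_mttf(100000.0, 100))
```

In a real cluster the replicas would run on different machines; running three times on the same flaky CPU only helps if the fault is intermittent.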

An old problem (4, Informative)

AndrewStephens (815287) | more than 8 years ago | (#15226392)

Something similar used to happen on very old processors, back in the day. If certain instructions were executed in tight loops, the chips would experience localised heating and eventually malfunction (sometimes with permanent damage).

I'm too young to remember the details (I think it goes back to the early eighties at least), but perhaps some of the elder gods that lurk around here might be able to supply more details.

Re:An old problem (3, Funny)

Alien Being (18488) | more than 8 years ago | (#15226476)

I used to burn out a lot of abacus beads.

Re:An old problem (1)

dhall (1252) | more than 8 years ago | (#15226503)

Beads? Using our sexagesimal system, we didn't have the true concept of zero as a number!

Re:An old problem (4, Funny)

Jerf (17166) | more than 8 years ago | (#15226480)

Do not meddle in the affairs of the Elder Gods [wikipedia.org] , for you are crunchy, and good with ketchup.

Re:An old problem (2, Informative)

Dadoo (899435) | more than 8 years ago | (#15226531)

If certain instructions were executed in tight loops, the chips would experience localised heating and eventually malfunction (sometimes with permanent damage).

You're thinking about magnetic cores.

Whenever you reverse a core's magnetic field, its temperature rises a little. Keep reversing the field fast enough and for a long enough period of time, and the core (or maybe the wires running through it?) will melt, permanently damaging that bit.

Re:An old problem (5, Informative)

Mister Transistor (259842) | more than 8 years ago | (#15226538)

You may be referring to the early MC6800 8-bit processors. The first ones had a major problem in that the internal registers were dynamic RAM style memory, and synchronized to the internal state clock. If you halted the processor for an extended period of time, the refresh clock to them ceased and the registers got hot, drew too much current and burned up!

I'm pretty sure that gave rise to the joke "Halt and Catch Fire"...

I always figured that if you were to burn out a register from overuse, it would be the carry bit ;)

Anyway, as to the story at hand, it sounds like this would only ever occur a) to only 3000 processors total - MAYBE, and b) would only ever happen under such an artificially contrived laboratory stress-test/benchmark situation. Any CPU running in a real system would a) have to do other things like service hardware interrupts, and b) wouldn't do something useless like perform a looping calculation without checking to see if it was done periodically. It really sounds like this is a big non-issue in reality.

Re:An old problem (2, Insightful)

AndrewStephens (815287) | more than 8 years ago | (#15226561)

I agree with your comments on the current story. In reality, all modern processors have flaws that only occur in extremely unlikely circumstances. This one is not any different.

Re:An old problem (4, Insightful)

Mister Transistor (259842) | more than 8 years ago | (#15226596)

I'll go you one better - I have formed my own personal postulate/theory/law that:

No sufficiently complex system can ever be completely bug-free.

and its corollary:

It is impossible to completely test a sufficiently complex system in every possible way to be certain that it's bug-free.

In that vein, someone once said "Foolproof is impossible because fools are so ingenious", and "As soon as an idiot-proof system is devised, they go and invent a better idiot!"

Re:An old problem (1)

something_wicked_thi (918168) | more than 8 years ago | (#15226746)

No sufficiently complex system can ever be completely bug-free.

Not to come off as being too sarcastic, but, wow! You came up with that all by yourself?

Seriously, though, that's one of the main tenets of software engineering. Hardware is no different. The phenomenon is well studied and it basically results from the fact that any change you make to a complex system has a certain likelihood of introducing new problems. Also, as you reduce the number of bugs, the complexity of the remaining bugs increases, thus requiring more complicated fixes. Usually, these factors result in a "steady state" number of bugs (that is, there is a point at which the number of bugs you are fixing is equal to the number of bugs you are adding by fixing them).

Therefore, once a system becomes sufficiently complex, it is impossible to eliminate all the bugs. The only way to fix more bugs, therefore, is to reduce complexity. There are other techniques that can be employed to reduce the number of bugs that are introduced per change (e.g. automated regression testing), but that just lowers the steady-state bug count.

Re:An old problem (2, Interesting)

Soul-Burn666 (574119) | more than 8 years ago | (#15226788)

Actually hardware IS different. As complex as hardware is, it is much less complex than software and has much simpler logic to check. This allows for systems for "formal verification" which happen to work exceedingly well for hardware. For example IBM's "RuleBase" is a system that uses temporal logic to verify a certain piece of "code" (which will later be compiled to hardware) against a set of logical rules.
When the system can be used, it helps clear out logic bugs very efficiently.

That being said, today's microprocessors are huge and therefore have to be split to modules in order to test like this. Moreover, it only tests logic. Other systems have to be used to test issues of overheating, cross-talk and actual physical design.

Re:An old problem (1)

smash (1351) | more than 8 years ago | (#15226669)

Now, not having a go at the parent post, but if intel was to release a statement like this, the /. community would be having a field day.

A defect that is known to give incorrect calculations is a serious issue that should be rectified via microcode update or exchange CPU for free (if microcode can't fix it).

Intel got raked over the coals for the FDIV problem, and so should AMD unless they do the right thing and offer an exchange/free fix so that users get the functional CPU they intended to purchase.

smash.

Re:An old problem (5, Informative)

something_wicked_thi (918168) | more than 8 years ago | (#15226767)

RTFA. They are offering a free replacement. However, the FDIV bug was overblown. For most people, it didn't matter (few people were using software that required division precise enough to be affected). This bug is even less worrisome. Its effect is, at the moment, completely unobserved in the wild using real world applications. The FDIV bug was apparent to anyone with a calculator.

I'm not saying AMD should be let off the hook completely, but the bug isn't a big problem, they are offering free replacements, and they publicized it. The FDIV bug was bigger (though still hardly catastrophic), Intel refused (at first) to offer replacements, and they sat on it. The two scenarios are nowhere near similar. Maybe AMD just has more character than Intel, or maybe they were watching in 94/95 when the FDIV bug happened and they've actually learned from Intel's mistakes. Regardless, this whole story is more of a heads-up to concerned buyers than a criticism of AMD.

Re:An old problem (1)

Bill Dog (726542) | more than 8 years ago | (#15226768)

Bet there won't be any "I am AMD of Borg, you will be approximated" jokes. After all, sacred cows and humor do not mix.

Re:An old problem.. now usedto fight the overlords (1)

keen (86192) | more than 8 years ago | (#15226612)

These flaws only occur in unlikely circumstances, but they will be useful tools when fighting our new computer overlords.

Re:An old problem (1)

dubl-u (51156) | more than 8 years ago | (#15226705)

For those wondering: Jargon File [catb.org] and Wikipedia [wikipedia.org] entries for HCF.

Re:An old problem (1)

Warg! The Orcs!! (957405) | more than 8 years ago | (#15226642)

Well I am no Elder God but I knew them when they were kids.

There used to be ways of programming certain early personal computers to make smoke come out of them. I think the BBC computer and the ZX80 were the main ones. The BBC was vulnerable (dredges memory) to a POKE command that would make it fall over and die howibbly

Prince Ludwig - You will die howibbly!
Blackadder - Howibbly?
Prince Ludwig - Howibbly howibbly

Re:An old problem (1)

Tim C (15259) | more than 8 years ago | (#15226865)

Yes and no; I've never heard any such rumour about the ZX80 or the BBC Micro. I heard that POKEing a certain memory location on the Commodore Pet would cause it to burst into flames, but never saw it happen so can't confirm it. A quick google turned up this page [old-computers.com] , which has details about the Pet rumour and the BBC Micro one, but nothing about the ZX80.

Re:An old problem (1)

Warg! The Orcs!! (957405) | more than 8 years ago | (#15226887)

swiftly researched...

It's an 'urban myth' which was made up about the BBC Micro. However, it was based on a true story about the PET - there was a location you could poke to do with the graphics frequency which, if you set it wrong, could drive the HT supply in the monitor way over-voltage, which would sometimes break down the transformer. This came up in the PCW magazine* after someone wrote "it is impossible to damage a computer with bad software".

So I've been urban-mythed again, eh? I apologise to the designers of the BBC Micro for perpetuating the notion that their machine would combust.

Fearmongering? (2, Interesting)

zaguar (881743) | more than 8 years ago | (#15226395)

Is it reasonable to be afraid of this? To exploit this, in a way to allow running of arbitrary code, you would need a buffer overflow - which is what this AMD weakness is purporting to allow. However, how many are affected? Only a few of the AMD chips, and AMD has only what, 30% of the market. So to code an exploit, you would be writing to a very limited audience, to a point where it is futile. Why not just exploit the latest createTextRange or WMF exploit in IE/Windows? Much more money in that.

Re:Fearmongering? (2, Insightful)

Saven Marek (739395) | more than 8 years ago | (#15226443)

> Only a few of the AMD chips, and AMD has only what, 30% of the market.

The intel fanboys have been too noisy lately! AMD has more than 50% of the market since this year already!

Re:Fearmongering? (0)

Anonymous Coward | more than 8 years ago | (#15226453)

yes but intel has more sold in the past...

AMD has sold around the same amount last year but intel still has many more chips in people's homes and offices...

Re:Fearmongering? (1)

andy_t_roo (912592) | more than 8 years ago | (#15227099)

do you have a reference for that figure? (for the record i just bought an AMD)

Fearmongering? No, you misunderstand ... (2, Informative)

AHumbleOpinion (546848) | more than 8 years ago | (#15226604)

Is it reasonable to be afraid of this? To exploit this, in a way to allow running of arbitrary code, you would need a buffer overflow

I think you are misunderstanding the nature of the problem. This is not data corruption as in buffer overflow, this is data corruption as in the calculation comes up with an incorrect answer. For some people that is not acceptable.

Re:Fearmongering? No, you misunderstand ... (2, Funny)

larry bagina (561269) | more than 8 years ago | (#15226720)

yeah, 1.0 + 1.0 = 3.0, for sufficiently large values of 1.0

Re:Fearmongering? No, you misunderstand ... (0)

Anonymous Coward | more than 8 years ago | (#15226940)

No, you have that wrong. 1.0 + 1.0 = 2.1 for sufficiently large values of 1.0

Your problem was specifying 1.0 which indicates the accuracy is to 1 decimal place (or maybe 2 significant figures, but I'm not sure about that and it wouldn't change anything). The true value of 1.0 has to be between 0.95 and 1.05 so it could never add up to 3, but the sum of the true values could possibly be rounded to 2.1

You did indeed mean: 1 + 1 = 3 for sufficiently large values of 1.
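The rounding argument above can be checked with a few lines. This is a sketch assuming round-half-up to one decimal place, which the comment does not specify; the helper name and the sample value 1.04 are hypothetical. It uses `decimal` to avoid binary floating-point noise.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_to_one_place(x):
    """Round a Decimal to one decimal place, with halves rounding up."""
    return x.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


if __name__ == "__main__":
    true_value = Decimal("1.04")  # a "sufficiently large" 1.0
    print(round_to_one_place(true_value))               # displayed operand
    print(round_to_one_place(true_value + true_value))  # displayed sum
```

Under that rounding rule, any value that displays as 1.0 is below 1.05, so two of them sum to less than 2.10 and can never display as anything larger than 2.1.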

Re:Fearmongering? (0)

Threni (635302) | more than 8 years ago | (#15226994)

> Is it reasonable to be afraid of this.

I'd imagine it could be quite annoying if it were part of the payload of a virus, and if the corruption affected other processes being handled by the CPU, yes.

having this happen (1)

mikesd81 (518581) | more than 8 years ago | (#15226413)

To trigger the effect, the loop has to be run millions of times, an AMD customer source told Reg Hardware, potentially for hours at a time with no other operations being introduced during the run.

A flaw is a flaw, no doubt. However, how likely is this particular scenario to happen other than in a benchmark test? And 3000 CPUs? Given the news lately, this almost categorizes as an oopsy. Security firms are losing millions of customers' SSNs and everything. AMD could probably tell you how to identify the CPU and afford to set up a program to exchange.

Re:having this happen (0)

Anonymous Coward | more than 8 years ago | (#15226438)

AMD does tell you how to ID it and is exchanging them

Sounds familiar (1)

swansontec (953822) | more than 8 years ago | (#15226418)

Memory fetch, multiplication, addition... where have I heard this before? Oh, I know. 3D graphics. Typically, those results go right to the screen and don't cause much damage if they get corrupted. I would be more worried about video or audio encoding, though, since those results do make a difference. Otherwise, I can't think of much else that would trigger this bug.

Re:Sounds familiar (1)

smash (1351) | more than 8 years ago | (#15226655)

Business spreadsheets (price = cost + (cost*markup%))? Scientific modelling?

There, that wasn't so hard to think of?

smash.

Re:Sounds familiar (1)

swansontec (953822) | more than 8 years ago | (#15226673)

It would only work if your spreadsheet had a few million cells in it. Now that you mention scientific modelling, though, I feel stupid. That is probably the single biggest user of repetitive floating-point operations around.

Re:Sounds familiar (4, Informative)

smash (1351) | more than 8 years ago | (#15226749)

Hmm.... I doubt you'd need a few million cells though.

Some of the tendering spreadsheets i've seen for a few companies i've worked for have had quite a lot of calculation going on in them - change a few cells that others depend on that have others depending on them, etc.... do that all day, it adds up quick.

You only need 1 of those operations in that instance to screw up and you could be down a few million dollars, if it's not picked up.

Even forgetting that it's just the moral thing to do...Risk vs replacement cost = no brainer. If only 3000 cpus are affected at say $300 each for amd to sell retail (i'm sure their cost is FAR less), they'd be mad not to just do it (maybe even offer a free speed bump) and reap the positive PR.

All it needs is for ONE company to blame a budget blowout on them and it's well and truly paid for...

smash.
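For scale, the replacement cost argued above works out as follows. The 3000-chip count and the $300 retail price are the figures guessed at in the comment, not confirmed numbers, and the helper name is hypothetical.

```python
def replacement_cost(affected_cpus, price_per_cpu):
    """Upper bound on the cost of replacing every affected chip at retail price."""
    return affected_cpus * price_per_cpu


if __name__ == "__main__":
    # Figures as guessed in the comment above.
    print(replacement_cost(3000, 300))
```

That puts the retail-price ceiling under a million dollars, below the "few million dollars" a single bad tender could cost.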

Re:Sounds familiar (1)

corychristison (951993) | more than 8 years ago | (#15226659)

This actually happened to a customer of mine.

He's a heavy gamer... for the longest time we thought it was something like the power supply or maybe the RAM... turns out it was the processor.

Long story short, we replaced the processor and I haven't heard any complaints yet.

Obligatory (1)

suv4x4 (956391) | more than 8 years ago | (#15226422)

loop through a series of memory-fetch, multiplication and addition operations without any condition checks on the result of the calculations

I've been saying that for ages, check your results, but naah! Them young'uns and their series of memory-fetch, multiplication and addition operations.

Bug or chip quality? (1)

daybot (911557) | more than 8 years ago | (#15226430)

So is it a flaw in the design or simply a few high temp FPU batches that cook when hot? Do those outside the 3K "affected" chips have the same inherent flaws but simply don't get so hot?

It's not the first time their server chips have experienced heat problems...

Uh oh.. (5, Funny)

BigZaphod (12942) | more than 8 years ago | (#15226450)

Wow! AMD has invented a way to crash an infinite loop! Awesome! Intel? I bet their solution will take twice as long to crash this loop:

10 PRINT "HELLO WORLD"
20 GOTO 10

AMD is always innovating.

Re:Uh oh.. (1)

ultranova (717540) | more than 8 years ago | (#15226674)

Wow! AMD has invented a way to crash an infinite loop! Awesome! Intel? I bet their solution will take twice as long to crash this loop:

10 PRINT "HELLO WORLD"
20 GOTO 10

This loop won't crash. Memory fetch, addition and multiplication, remember ? So you'd need something like this:

10 I = 10
20 K = I
30 K = K + 2
40 K = K * 2
50 GOTO 20

Re:Uh oh.. (1)

mOOzilla (962027) | more than 8 years ago | (#15226799)

10 K = 10
20 K = K + 2
30 K = K * 2
40 GOTO 10

Nothing to see here (-1, Offtopic)

Isotopian (942850) | more than 8 years ago | (#15226461)

Move along please...

Deja Vu: Intel Processor's Bug in 1994 (-1, Redundant)

Anonymous Coward | more than 8 years ago | (#15226481)

In 1994, Intel's Pentium processor suffered from a division error [willamette.edu] . Intel handled the problem by initially requiring customers to "prove" that the error caused a serious impact on the customers' lives before Intel would agree to replace the defective chips. Later, after much pressure and lost credibility, Intel agreed to replace all the defective chips without requiring the customer to "prove" his case.

AMD has a unique opportunity to do the right thing: offering to replace all the defective chips. If AMD does the right thing, then it will only help AMD in its litigation against Intel and in various attempts to increase marketshare. After all, would you not prefer to buy from a reputable company instead of a dishonest, shifty company?

Re:Deja Vu: Intel Processor's Bug in 1994 (4, Informative)

Anonymous Coward | more than 8 years ago | (#15226505)

"The company is also working with OEMs to identify affected parts and contact customers who could be affected - if they are, they will be offered free replacements."

Fourth paragraph in TFA.

Re:Deja Vu: Intel Processor's Bug in 1994 (1)

cowbutt (21077) | more than 8 years ago | (#15226521)

Later, after much pressure and lost credibility, Intel agreed to replace all the defective chips without requiring the customer to "prove" his case.

AMD has a unique opportunity to do the right thing: offering to replace all the defective chips. If AMD does the right thing, then it will only help AMD in its litigation against Intel and in various attempts to increase marketshare. After all, would you not prefer to buy from a reputable company instead of a dishonest, shifty company?

AMD have probably learnt from Intel's PR disaster. Without Intel's FP bug and the precedent it set, there's a good chance AMD would attempt to handle this problem the same way Intel did theirs. Business is business. An indication of this is that every component in the computer you're using probably has both documented and undocumented errata, and recalls are pretty unusual events.

CALL ESP (3, Interesting)

Myria (562655) | more than 8 years ago | (#15226680)

Probably the easiest erratum to come by is the instruction "CALL ESP" (or "CALL RSP"). On AMD CPUs, "CALL ESP" will jump to the address in ESP, *then* push the return address. However, on Intel CPUs, it will push the return address first, then jump to the value it just pushed. This is, of course, disastrous if you try to use it.

According to Intel errata documents, this is a bug in the Pentium Pro that has been kept for several generations. The Pentium and below, except the 8086 and 8088, worked correctly with this instruction.

If you want to differentiate Intel and AMD in your program and don't want to use CPUID, you can set up a test with CALL ESP.

Melissa

Re:Deja Vu: Intel Processor's Bug in 1994 (1)

dukiebbtwin (912572) | more than 8 years ago | (#15226939)

True, but think of all the money and resources lost by Intel for a rare error that would not affect the vast majority of the customers using the chip. How many perfectly decent chips were just thrown away, passing the cost on to the consumer, who had to make up for the money lost in remanufacturing these chips? Here is a report done by Intel on how often an average user might see an error: http://www.intel.com/support/processors/pentium/fdiv/wp/6.htm [intel.com] It's certainly bad PR for AMD and they will most likely offer an exchange program like Intel, but the practical need for exchanges isn't really there (if what I am reading in other comments is correct).

Re:Deja Vu: Intel Processor's Bug in 1994 (1, Insightful)

mojotooth (53330) | more than 8 years ago | (#15226570)

Jesus. The things that people attribute to AMD's "moral superiority" here on Slashdot... It's astounding.

If AMD does "the right thing" it won't be because of a moral high road. It's because Intel already stepped on a similar PR landmine long ago. Learning from your rival's huge mistakes is not worth high praise. It's just common sense.

Re:Deja Vu: Intel Processor's Bug in 1994 (0)

Anonymous Coward | more than 8 years ago | (#15226733)

You don't understand. We're all excruciatingly dimwitted here. So we can only comprehend things in dirt simple terms, like "good" and "evil". And if something is "evil", its rival must be "good", because if there was an ounce more of nuance or complexity involved than that, us Slashdotters' brains would explode!

nice! (3, Interesting)

B3ryllium (571199) | more than 8 years ago | (#15226506)

Wow, that was fast. FreeBSD already has a patch for this.

Judging from the posting date, I *really* need to be updating my sources more often. :)

20060419: p7 FreeBSD-SA-06:14.fpu
                Correct a local information leakage bug affecting AMD FPUs.

(could be an unrelated correction, I guess, it doesn't provide much more information in /usr/src/UPDATING)

Re:nice! (1)

B3ryllium (571199) | more than 8 years ago | (#15226541)

Ah, I believe I may be incorrect - the longer description sounds like an unrelated FPU bug:

FXSAVE and FXRSTOR [freebsd.org]

Re:nice! (-1, Troll)

Anonymous Coward | more than 8 years ago | (#15226545)

Information leakage != overheating and crashing. Just so you know.

Re:nice! (3, Informative)

larry bagina (561269) | more than 8 years ago | (#15226547)

it is an unrelated correction:

...As a result of this discrepancy remaining unnoticed until now, the FreeBSD kernel does not restore the contents of the FOP, FIP, and FDP registers between context switches.

source [net-security.org]

Re:nice! (1)

B3ryllium (571199) | more than 8 years ago | (#15226552)

Caught that already. Sorry for the disinformation. :)

It's like you're overclocking when you're not (4, Insightful)

IvyMike (178408) | more than 8 years ago | (#15226554)

This is different from the Intel bug; that was a logic flaw, where the chip computed a floating point quantity using an incorrect algorithm. This is an implementation error. In fact, the article mentions that they're going to re-spec the parts and they'll be fine. So if you've got a 2.8 GHz part, and you run this loop at 2.8 GHz (within the old spec), it's like you're "overclocking" (because you're actually outside of AMD's new spec). My guess is that if you over-bought your heatsink and got something better than the stock OEM cooling solution, you would be fine even if you ran this loop all day. Yay, arctic silver!

Re:It's like you're overclocking when you're not (0)

Anonymous Coward | more than 8 years ago | (#15226757)

If you accidentally saw off your leg, doctors say rush to the hospital or you'll probably bleed to death, but my guess is that if you use plenty of paper towels, you would be fine even if you ran a marathon.

Golly! (0)

Anonymous Coward | more than 8 years ago | (#15226568)

That would be the first thing I do with my CPU. How about you!?

Quality Assurance? (0)

Anonymous Coward | more than 8 years ago | (#15226602)

From TFA

AMD said it has introduced another screening test to catch any further affected parts. Chips caught in this test in future will be re-rated at a lower clock speed to prevent the problem.

Don't you find this to be a bit disturbing? (pun intended)

I wonder how hot the circuit had to be to fail. Let's say 100°C.
Now, that means AMD test their chips at less than that, let's say 90°C.
At least this bug was found. How many more like it are there, but we simply don't have the proper trace to find it?

Re:Quality Assurance? (1)

fabs64 (657132) | more than 8 years ago | (#15226637)

Actually it's very common for CPU manufacturers to just underclock overheating chips.
Also, an AMD chip is only rated up to around 75°C anyway from memory.

Re:Quality Assurance? (1)

Bert64 (520050) | more than 8 years ago | (#15226970)

Intel had a very similar problem with some of their Itanium chips recently too; however, I don't recall them offering free replacements. I believe they just told customers to clock down affected processors!

However, very few people cared because very few people use Itanium chips, and those who do are used to them not performing as advertised.

Could be worse (2, Funny)

Khith (608295) | more than 8 years ago | (#15226618)

Just imagine if you had one of those Pinnacle chips and accidentally pressed @[=g3,8d]\&fbb=-q]/hk%fg followed by delete..

Prime95 as a detection tool? (2, Informative)

Antiocheian (859870) | more than 8 years ago | (#15226622)

I have used Prime95 in the past to identify problematic configurations. It's a tool whose main goal is to find prime numbers, but it can be used as an excellent stress test for the processor and memory units.

Could Prime95 be used to identify those AMD chips?

Mod parent up! (1)

Mixel (723232) | more than 8 years ago | (#15226945)

I've run Prime95 on two of my boxes. Fine on one. My AMD Athlon XP 2000 (Socket A), which I often suspected to be unstable, reliably dies after less than an hour of running the Prime95 code (originally discovered from running the Seventeen or Bust [seventeenorbust.com] client, which includes the same Prime95 code). BIOS says temperature is normal, so I'll just blame the motherboard caps for now.

Either way, Prime95 has given proof to my suspicion like no other tool could. I no longer run important/intensive apps there.

Re:Prime95 as a detection tool? (1)

PatrickThomson (712694) | more than 8 years ago | (#15227037)

I was always under the impression that prime numbers weren't floating point. I've never seen a prime finding algorithm that exclusively used floating point.
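For context on the floating-point question: Prime95 hunts for Mersenne primes (numbers of the form 2^p - 1) with the Lucas-Lehmer test, and it performs the repeated squaring of very large numbers using floating-point FFT multiplication, which is why it loads the FPU so heavily. The sketch below shows only the test itself, using Python's built-in big integers rather than an FFT; the function name `lucas_lehmer` is just an illustrative choice, not anything taken from Prime95.

```python
def lucas_lehmer(p: int) -> bool:
    """Return True if the Mersenne number 2**p - 1 is prime.

    Assumes p itself is prime; p == 2 is handled as a special case
    because the recurrence below is only valid for odd p.
    """
    if p == 2:
        return True  # 2**2 - 1 == 3 is prime
    m = (1 << p) - 1  # the Mersenne number under test
    s = 4
    # Repeated squaring modulo m. Prime95 does this step with
    # floating-point FFTs; plain integers are used here for clarity.
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0


if __name__ == "__main__":
    exponents = [3, 5, 7, 11, 13, 17, 19, 23, 31]
    print([p for p in exponents if lucas_lehmer(p)])
```

The squaring step is where nearly all the time goes for large exponents, so a real implementation spends its life in multiply and add instructions, roughly the kind of workload the article describes.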

Uh oh (3, Funny)

IAMTHEMEDIA (869196) | more than 8 years ago | (#15226632)

How about that, I was wondering why my computer was giving me the message "all your base belong to us". Heh, OK, that was dumb, but hey! It's Slashdot, I know one of you laughed. But seriously, I do have this chip and my computer is evil, therefore, it must be the chip! Not the fact I have 98 gigabytes of music porn and uhh porn.

Re:Uh oh (1)

IAMTHEMEDIA (869196) | more than 8 years ago | (#15226802)

Why do I keep getting scores of -1?? Are my jokes THAT bad?

Data corruption? (-1, Redundant)

cciRRus (889392) | more than 8 years ago | (#15226656)

I'm using an AMD 2.6GHz CPU and so far everything has been goo#C@^&*(ewa;'1`
NO CARRIER...

Quality Control at AMD must be good. (5, Interesting)

MROD (101561) | more than 8 years ago | (#15226690)

Having read a lot about this flaw it's actually amazing that AMD's quality control found the problem in the first place.

The actions needed to cause the problem to arise are so extreme that they'd never happen in the field. i.e. Loop through tight floating-point only instructions without any comparisons for maybe hours before the error occurs.

This would *NEVER* happen in the field. Firstly, in any modern OS the process would have been pre-empted long before any problem could occur (causing other instructions to run and hence stopping the overheating). Secondly, no real-world program would ever do this sort of thing as there would always be a comparison in the loop within the timeframe.

This is a theoretical problem only in the real world, especially as it only affects about 3000 processors in total (it has been quoted). This is why AMD gave it such a low priority. We should just forget about it and move on.
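To make the access pattern concrete, here is a rough sketch of the kind of loop the article describes: fetch two operands, multiply, add to a running total, and never branch on the result. It also shows one simple way to notice a silently wrong result, namely running the same deterministic computation several times and comparing the totals. The names `multiply_accumulate` and `results_agree` are made up for illustration, and since the Python interpreter itself executes plenty of branches, this would not reproduce the thermal condition on an affected chip; a real test would need compiled or hand-written floating-point code.

```python
import random


def multiply_accumulate(values, coeffs, passes):
    """Fetch, multiply, add: no comparison is ever made on the running total."""
    acc = 0.0
    n = len(values)
    for _ in range(passes):
        for i in range(n):
            acc += values[i] * coeffs[i]
    return acc


def results_agree(runs=3, size=1000, passes=50, seed=1234):
    """Repeat the same computation and check that every run matches exactly.

    The inputs and the order of operations are identical each time, so on
    healthy hardware the totals should be bit-for-bit equal.
    """
    rng = random.Random(seed)
    values = [rng.random() for _ in range(size)]
    coeffs = [rng.random() for _ in range(size)]
    results = [multiply_accumulate(values, coeffs, passes) for _ in range(runs)]
    return all(r == results[0] for r in results)


if __name__ == "__main__":
    print("runs agree:", results_agree())
```

A mismatch between runs would point at the hardware rather than the code, which is the same idea stress tools rely on when they compare against known-good results.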

Re:Quality Control at AMD must be good. (1)

Izrath (922686) | more than 8 years ago | (#15226861)

Having read a lot about this flaw it's actually amazing that AMD's quality control found the problem in the first place
I'm an AMD fanboy but I used to know a guy that worked QA for one of the Intel plants here in AZ. He said they run the chips through very intense stress tests and such for days... if one has a problem they toss the whole batch.

Not too keen on the manufacturing process of chipsets myself, but I would think AMD's QA is comparable.

Re:Quality Control at AMD must be good. (0)

Anonymous Coward | more than 8 years ago | (#15226937)

They were probably specifically trying to find ways to overheat parts of the chip, by devising instruction patterns that would exercise specific regions of the chip the most, and then testing these instruction sequences with their chips. I imagine finding the worst instruction sequences is a difficult combinatorial problem on its own, considering the complexity of modern CPUs.

This will not happen to you (4, Informative)

Bloater (12932) | more than 8 years ago | (#15226828)

If you have any interrupts coming in, or your loop has a termination condition. I think you have to have your hardware set to send an interrupt many hours in the future then start an otherwise nonterminating loop.

So under normal conditions on normal PC hardware, this simply won't happen.