Tracking Down a Single-Bit RAM Error

timothy posted more than 4 years ago | from the you'll-need-a-nice-microscope dept.

Hanji writes "We have discussed here before the potential effects of and protections against cosmic ray radiation, but for the average computer user, it's an obscure threat that doesn't affect them in any real way. Well, here's a blog post that describes a strange segfault and, after extensive debugging, traces it down to a single bit flip, probably caused by a stray cosmic ray. Lots of helpful descriptions of Linux debugging techniques in this one, and a pretty clear demonstration that this can be a real problem. I know I'm never buying a desktop without ECC RAM ever again!" The author acknowledges that it might not have been a cosmic ray-based error, but the troubleshooting steps are interesting no matter what the cause.

Ugh, single bit errors (3, Interesting)

Kufat (563166) | more than 4 years ago | (#32685178)

One of my computers had an intermittent failure in a RAM chip/line/something somewhere that mostly manifested as SHA/MD5 failures when I was checksumming large files that I'd downloaded. Never showed up in Memtest86, but eventually I eliminated every other possibility. IIRC, I solved it by underclocking the machine and then replacing it when I was able.
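One way to chase this kind of intermittent checksum mismatch is to hash the same file several times and check whether the digests ever disagree. The sketch below is only an illustration built on Python's hashlib; the helper names and the throwaway demo file are assumptions made for this example. A clean result proves little, since a marginal RAM cell may only fail occasionally.

```python
import hashlib
import os
import tempfile

def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large downloads need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def digests_agree(path, runs=5, algorithm="sha256"):
    """Return True if repeated hashes of the same file all match."""
    return len({file_digest(path, algorithm) for _ in range(runs)}) == 1

# Demo on a throwaway 4 MiB file (a stand-in for a large download).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(4 << 20))
demo_ok = digests_agree(tmp.name)
os.unlink(tmp.name)
print("digests stable:", demo_ok)
```

On healthy hardware the digests should always agree; a single disagreement on an unchanged file points at RAM, CPU, or the I/O path rather than at the download.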

Re:Ugh, single bit errors (1)

Kepesk (1093871) | more than 4 years ago | (#32685538)

And I thought debugging a MythTV install was hard...

Cosmic rays, my ass. Occam's Razor time. (5, Insightful)

Anonymous Coward | more than 4 years ago | (#32685622)

You are on the right track. As someone with over a quarter century of background in combined embedded software and hardware design (the most recent decade for life-dependent systems), it always amazes me how quickly pseudo-technical people jump to wild speculation for observations that they cannot explain.

They fail to understand that a hardware system is an imperfect representation of the theory (probably the biggest failure in the schooling of software developers, and even some hardware developers, is the failure to get this message into their heads). While they feel comfort in the theory of a binary system, they utterly fail to understand that our real systems, like us, are imperfect and, like us, live in an analog world. Simple things like temperature variations, noise from common (rather than cosmic) sources, marginal design timing, imperfect components, simple intermittents, etc., are 10^24 times more likely to be the cause.

But they're not as fascinating as wild speculation, are they?

Re:Cosmic rays, my ass. Occam's Razor time. (1)

GNUALMAFUERTE (697061) | more than 4 years ago | (#32685960)

Not only that, but they are also systems we can only approach from a very abstract perspective when it comes to debugging. Our options to debug complex hardware are very abstract, inaccurate, and incomplete.

Re:Cosmic rays, my ass. Occam's Razor time. (5, Interesting)

Anonymous Coward | more than 4 years ago | (#32686188)

On the subject of the imperfect nature of machines, I found this post by Richard D. James (aka Aphex Twin, a noted electronic music composer) quite interesting. He describes how the physical machinery of analog electronic music machines means it is near impossible to duplicate them in digital programs.

link [archive.org]

Author: analord
Date: 02-07-05 03:14

some people bought the analogue equipment when it was unfashionable and very cheap though.
some of us are over 30 you know!
anyone remember when 303`s were £50? and coke was 16p a tin? crisps 5p

also you have overlooked A LOT of other points because its not all about the overall frequency response of the recording system its how the sound gets there in the first place.
here are some things which you can`t get from a plugin,they are often emulated but due to their hugely complex nature are always pretty crass aproximations..

the sound of analogue equpiment including EQ, changes very noticably over even a few hours due to temperature changes within a circuit.
Anyone who has tried to make tracs on a few analogue synths and make them stay in tune can tell you this,you leave a trac running for a few hours come back and think Im sure I didnt fucking write that,I must be going mental!

this affects all the components in a synth/EQ in an almost infinte amount of tiny ways.
and the amount differs from circuit to circuit depending on the design.

the interaction of different channels and their respective signals with an analogue mixer are very complex,EQ,dynamics....
any fx, analogue or digital that are plugged into it all have their own special complex characteristics and all interact with each other differently and change depending on their routing.
Nobody that ive heard of has even begun to start emulating analogue mixer circuitry in software,just the aesthetics,it will come but im sure it will be a crap half hearted effort like most pretend synth plugins are.
they should be called PST synths, P for pretend not virtual.

Every piece of outboard gear has its own sound ,reverbs,modulation effects etc
real room reverb, this in itself companies have spent decades trying to emulate and not even got close in my opinion, even the best attempts like Quantec and EMT only scratch the surface.

analogue EQ is currently impossible in theory to be emulated digitally,quite intense maths shit involed in this if youre really that interested,you could look it up...good luck.

your soundcard will always make things sound like its come from THAT soundcard..they ALL impose their different sound characteristics onto whatever comes out of them they are far from being totally neutral devices.

all the components of a circuit like resistors and capacitors subtley differ from each other depending on their quality but even the most high quality milatary spec ones are never EXACTLY the same.

no two analogue synths can ever be built exactly the same,there are tiny human/automated errors in building the circuits,tweaking the trimpots for example which is usually done manually in a lot of analogue shit.
just compare the sound of 2 808 drum machines next to each other and you will see what I mean,you always thought an 808 was an 808 right?
same goes for 303`s they all sound subltey different,different voltage scaling of the oscillator is usually quite noticable.

VST plugins are restricted by a finite number of calculations per second these factors are WAY beyond their CURRENT capability.

Then there is the question of the physicallity of the instrument this affects the way a human will emotionally interact with it and therfore affect what they will actually do with it! often overlooked from the maths heads,this is probably the biggest factor I think.
for example the smell of analogue stuff as well as the look of it puts you in a certain mental state which is very different from looking at a computer screen.

then there is analogue tape...ah this really could go on forever....

im quite drunk cant be bothered to type anymore...
so yeah,whatever, you obviously dont have to have analogue equipment to make `good` music in case thats the impression im giving,EVERYTHING has its uses .And not all anlaogue equipment is expensive you can still get bargains like old high end military audio devices,tape machines fx etc just go for the unfashionable stuff.

Richard.

Re:Ugh, single bit errors (0)

sortius_nod (1080919) | more than 4 years ago | (#32685822)

I've had almost the same problem in the past with an old DDR2/AMD machine. Clocked to full speed, it'd fall over repeatedly, clocked down it was fine. I took one of the RAM sticks out, ran fine at full speed.

I'm not sure why you'd want ECC RAM in a desktop, unless it's some sort of business-critical machine that you're willing to spend 5 or 6 times what a normal desktop costs on. For day-to-day use, ECC is overkill. You can get a warranty on most chips for 1 to 3 years depending on the manufacturer, and if it's out of warranty, either buy a new machine or buy new RAM. All in all, it'll cost more to run ECC due to the board required to effectively utilise the ECC capabilities. I'm not even sure some consumer boards are capable of taking ECC (ASRock and the like that come in cheap desktops).

As someone who has worked in support for 15 years, the troubleshooting shouldn't be "interesting", it's basic diagnosis. The idea that it was "cosmic radiation" is just, well, bullshit. Chips die, have manufacturing faults, or just get old. Nothing new here.

Re:Ugh, single bit errors (3, Informative)

rudy_wayne (414635) | more than 4 years ago | (#32685876)

I'm not sure why you'd want ECC ram in a desktop, unless it's some sort business critical machine that you're willing to spend 5 or 6 times what a normal desktop costs.

This may have been true at one time, but ECC RAM is no longer that expensive. I just looked at prices on Newegg:

8 GB DDR3 $214.99

8 GB DDR3 ECC $274.99

In some cases, depending on the brand and the speed, ECC is actually *CHEAPER*.

Re:Ugh, single bit errors (2, Informative)

besalope (1186101) | more than 4 years ago | (#32686084)

You'll also need a consumer-level motherboard with ECC support. Those are not common, which means you'll be stuck with a server-grade motherboard, which costs more and can change CPU compatibility, case compatibility, and the features on the board itself.

There's a lot more to making the change from non-ECC to ECC than just swapping out your RAM.

Re:Ugh, single bit errors (2, Informative)

Timothy Brownawell (627747) | more than 4 years ago | (#32686120)

You'll also need a consumer-level motherboard with ECC support. Which are not common, which means you'll be stuck with a server-grade motherboard

Or, you know, go AMD. Because they don't limit ECC to only server parts.

Re:Ugh, single bit errors (4, Interesting)

hawguy (1600213) | more than 4 years ago | (#32685918)

I think the original article showed why you'd want ECC in a desktop machine -- random bit errors do happen in real life. I don't see how a warranty makes this less of an issue -- if my machine silently corrupts data due to a bit error, getting a $50 replacement DIMM isn't really going to satisfy me. Does ECC really cost 5X over non-ECC?

If he was processing data or editing a spreadsheet, then that bit error could have corrupted his data. If he was compiling a program for distribution (perhaps to thousands of machines), that bit error could have corrupted his executable, causing errors on all of the machines it was deployed to.

After reading this article, the question that comes to mind is why am I *not* running ECC on my desktop?
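To make concrete what ECC buys you, here is a toy Hamming(7,4) code in Python that corrects any single flipped bit in a 7-bit codeword. This is a sketch for illustration only: real ECC DIMMs use wider codes (typically single-error-correct, double-error-detect over 64 data bits), and the function names here are made up for the example.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits as 7 bits laid out as p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(codeword):
    """Return (data_bits, syndrome); syndrome is the 1-based bad position or 0."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], syndrome

word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1  # simulate a single-bit upset at position 5
recovered, position = hamming74_decode(word)
print(recovered, "corrected position", position)
```

Without the three parity bits, the flipped bit would simply be read back as wrong data, which is the silent-corruption case described above.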

Re:Ugh, single bit errors (2, Informative)

hawguy (1600213) | more than 4 years ago | (#32685980)

I went to Dell's site and configured a few Dell Desktops (non-ECC) and Workstations (with ECC), and prices were similar for comparable systems. Though the Workstations that supported ECC didn't support many low-end processors, so if I didn't want ECC and didn't care about processor performance I could have gotten a desktop for about 60% of the price of the cheapest workstation with ECC. But I didn't see a 5x increase for ECC.

Re:Ugh, single bit errors (5, Insightful)

Timothy Brownawell (627747) | more than 4 years ago | (#32686022)

I'm not sure why you'd want ECC ram in a desktop, unless it's some sort business critical machine that you're willing to spend 5 or 6 times what a normal desktop costs. For day to day use, ECC is overkill.

My desktop has 8GB of ECC in it. This cost I think $40 more than non-ECC, and meant I got an Athlon II X4 instead of a Core i5. That "5 or 6 times what a normal desktop costs" is either bullshit or Intel-onlyism (which is just another kind of bullshit).

erm.... (-1)

gandhi_2 (1108023) | more than 4 years ago | (#32685182)

so, every time you run the debugger, a cosmic ray hits your memory bus in the exact same way?

Re:erm.... (3, Informative)

JesseL (107722) | more than 4 years ago | (#32685226)

Would it really be so hard to read the article before posting?

Re:erm.... (1)

gandhi_2 (1108023) | more than 4 years ago | (#32686060)

I did read it. I liked the article, actually.

I didn't take into account that he probably never reboots, thereby always using the cached copy.

The k-splice ad on TFA made me laugh in this case.

guess we should put echo 3 > /proc/sys/vm/drop_caches in crontab.

Re:erm.... (1)

JWSmythe (446288) | more than 4 years ago | (#32685512)

I was ready to send him a link to purchase a tinfoil hat (and tinfoil server cover too), but in his article, he says it could be cosmic radiation, or flaky hardware. I'd lay money on the second, and not the first.

I used to joke that cosmic radiation made particular servers crash. We couldn't find any other reason for it, even with a fresh OS (that was identical to our other servers), and swapping various parts. Ya, cosmic radiation went through the building above us, to the server about 30 feet underground, and hit one in the middle of the rack, and not all the ones over it. It was good for the wheel of excuses, but (obviously) not a real answer. Oddly enough, the cosmic radiation stopped messing with that server when we finally took it out of service, and the one that went in the same position, with the same job, running the same OS did fine. :)

Ya, ya, I know, it's probably whatever part we didn't replace (the motherboard), but cosmic radiation sounded better. :) At the time there were quite a few news stories about it, so I was able to link to those in my report blaming cosmic radiation. :)

They call me crazy. I call myself eccentric with a sense of humor. :) My girlfriend at the time even made me a tinfoil hat, that I'd sometimes wear around the house as I babbled nonsense about impending alien invasions. :)

Re:erm.... (2, Funny)

Thing 1 (178996) | more than 4 years ago | (#32685932)

My girlfriend at the time even made me a tinfoil hat, that I'd sometimes wear around the house as I babbled nonsense about impending alien invasions. :)

I am both shocked and amazed that you eventually broke up.

Re:erm.... (1)

JWSmythe (446288) | more than 4 years ago | (#32685998)

Well fear not, it's been a series of upgrades since then. :) My girlfriend now is perfect, I can't imagine a better upgrade from here.

Re:erm.... (2, Funny)

sakdoctor (1087155) | more than 4 years ago | (#32685536)

My RAM is shielded against cosmic rays by my mother's basement.

Re:erm.... (1)

Moodie-1 (966737) | more than 4 years ago | (#32685890)

You live below your mother's basement??? (LOL!!) You'd have to, to be well-shielded from cosmic rays. Living in a basement doesn't shield you from rays coming from above. And even so, some rays are so energetic that they'll reach you even if you live a mile underground in a mine.

WOW! Cosmic rays? (1)

aqk (844307) | more than 4 years ago | (#32685190)

No wonder I have so many errors on my old laptop! This explains everything!

ask Voyager 2 program managers (1)

jschen (1249578) | more than 4 years ago | (#32685192)

I was hoping this would be more info about the Voyager 2 incident that occurred recently. No doubt, a detailed account of what they recently went through to find and fix the problem would be most interesting.

Takes me back (4, Interesting)

tsotha (720379) | more than 4 years ago | (#32685196)

When I was in college one of my physics professors told us he doubted programs would ever get bigger than a few hundred kilobytes because cosmic rays would cause the larger programs to fail too frequently.

Re:Takes me back (1)

griffjon (14945) | more than 4 years ago | (#32686136)

There's a Redmond joke in here somewhere. Regardless, I'm going to start blaming all my typos on bitflips caused by cosmic rays.

Easter Earthquake (5, Interesting)

ushering05401 (1086795) | more than 4 years ago | (#32685198)

I don't know about cosmic rays, but immediately following the Easter day Earthquake in Guadalupe Victoria (about three hundred miles from where I was located) I tried to fire up my laptop and then my desktop, both of which had been suspended to RAM. Neither one would wake up, though the lappie displayed a garbled screen. No errors in the log files (Ubuntu 9.10 on the sys76 lappie, Deb Lenny on desktop).

Re:Easter Earthquake (3, Insightful)

Darkness404 (1287218) | more than 4 years ago | (#32685236)

Wouldn't that be more likely caused by fluctuations in the power supply though? I'm not an electrical engineer nor an expert on earthquakes, but wouldn't it be possible that a quick loss of power or too high of power for a split second could mess up the data on the RAM?

Re:Easter Earthquake (1)

ushering05401 (1086795) | more than 4 years ago | (#32685268)

I was thinking EMP related to the seismic activity. IIRC that is still somewhat controversial though.

Re:Easter Earthquake (1)

timeOday (582209) | more than 4 years ago | (#32686064)

In all my years of attempting to use suspend-to-RAM on Linux, it has always been, ahem, highly probabilistic.

RAM error? (5, Interesting)

Camel Pilot (78781) | more than 4 years ago | (#32685208)

Forget a RAM error, I have seen a bit on a file on the disk flip.

After years of successful operation a Perl script quite working. On investigation a G was transformed to a W a difference of one bit. The file mod date was years old.
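The "one bit" claim is easy to check by XOR-ing the two ASCII codes and counting the set bits. A small Python sketch (the helper name is made up for this example):

```python
def bit_difference(a, b):
    """Return (xor_mask, number_of_differing_bits) for two single characters."""
    diff = ord(a) ^ ord(b)
    return diff, bin(diff).count("1")

mask, flipped = bit_difference("G", "W")
print(f"G={ord('G'):08b} W={ord('W'):08b} xor={mask:08b} bits flipped={flipped}")
# G=01000111 W=01010111 xor=00010000 bits flipped=1
```

So 'G' (0x47) and 'W' (0x57) differ only in the bit with value 0x10, which is consistent with a single flipped bit.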

Re:RAM error? (1)

hhedeshian (1343143) | more than 4 years ago | (#32685276)

Yes, so have I (though not quite as entertaininWly as your account). I keep secretly hoping that people's disks corrupt themselves every time they laugh at me for suggesting that ZFS and BTRFS have a place in this world.

Ah, Grasshoppah (1)

gyrogeerloose (849181) | more than 4 years ago | (#32685986)

"What is the sound of one bit flipping?"

Or

"If a disk crashes in a server farm and there's no one there to hear it, does it make a sound?"

Re:Ah, Grasshoppah (1)

robthebloke (1308483) | more than 4 years ago | (#32686172)

no, because we run ssd's.... :p

Re:RAM error? (5, Interesting)

marcansoft (727665) | more than 4 years ago | (#32685278)

I experienced almost exactly that issue with a RAM error. My system was apparently stable, and then one day I got a syntax error in a system Perl script: one character had changed. The script was owned by root and otherwise untouched. After puzzling over it for quite a while I realized it could be a RAM error and ran memtest86. It reported a single permanently stuck bit in my 512MB of RAM. I found a kernel patch to manually mark problem RAM areas as reserved and kept on running with that RAM for a few years.

Are you sure that perl script issue was caused by a drive error? A RAM error can cause the same apparent problem, if the corruption happens in the kernel's cache. However, it shouldn't be permanent as it will not be written back to disk (the cache won't be dirty) unless someone actually modifies the file.

Re:RAM error? (1)

Saint Stephen (19450) | more than 4 years ago | (#32685344)

Would the perl script be loaded at the same address in RAM every time? Wouldn't that likely be a one-time unrepeatable problem?

Re:RAM error? (2, Informative)

marcansoft (727665) | more than 4 years ago | (#32685428)

The perl script will stay cached until something else pushes it out of RAM or until you reboot the system. In general, files are loaded once and stick around for quite a while unless you're low on RAM. In my case, it stayed cached while I investigated it, and I could see the broken character with various viewers. Bad RAM could also cause an intermittent issue if it happened to affect memory used by the Perl interpreter to load the file (that would change each time), but in this case it affected the kernel's file cache, which is quite persistent in the medium or even long term.

I probably had the RAM error for a long time and never noticed. It likely caused a few kernel panics and segfaults along the way, but I probably attributed those to stuff like buggy X11 drivers. The broken Perl script was the first odd thing that I could directly attribute to a RAM problem, later confirmed with memtest86 (the broken bit also matched the change that happened to the character).
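If you suspect the cached copy rather than the disk, one thing to try is asking the kernel to drop that file's cached pages and then reading it again. The Python sketch below uses os.posix_fadvise with POSIX_FADV_DONTNEED, which is only a hint: the kernel may ignore it, and dirty pages are not dropped. The helper names and the temporary demo file are assumptions for this example.

```python
import hashlib
import os
import tempfile

def sha256_of(path):
    """Hash the whole file; fine for scripts and other small files."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def hash_cached_and_reread(path):
    """Hash a file, hint the kernel to evict its cached pages, then hash again."""
    cached = sha256_of(path)  # most likely served from the page cache
    fd = os.open(path, os.O_RDONLY)
    try:
        if hasattr(os, "posix_fadvise"):  # not available on every platform
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)  # length 0 = whole file
    finally:
        os.close(fd)
    reread = sha256_of(path)  # ideally fetched from disk this time
    return cached, reread

# Demo on a throwaway file standing in for the suspect script.
with tempfile.NamedTemporaryFile(delete=False, suffix=".pl") as tmp:
    tmp.write(b"print 'Hello, world';\n")
cached, reread = hash_cached_and_reread(tmp.name)
os.unlink(tmp.name)
print("same content after re-read:", cached == reread)
```

If the two digests differ on a file nobody has modified, the corruption was in the cached copy and not on the disk.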

Re:RAM error? (0)

Anonymous Coward | more than 4 years ago | (#32685958)

I'm sorry. I messed with my butterflies and flipped your bit. I'll take more care next time.

Re:RAM error? (2, Informative)

Chris Burke (6130) | more than 4 years ago | (#32685434)

Would the perl script be loaded at the same address in RAM every time? Wouldn't that likely be a one-time unrepeatable problem?

If the stuck bit was in the file cache, then it would be repeatable for as long as the script stayed cached, plus you could load the file up in a text editor and see the changed character, etc. Then it would mysteriously go away.

Also (5, Informative)

Sycraft-fu (314770) | more than 4 years ago | (#32685526)

Disks have a lot, and I mean a LOT of ECC on them. It is not a situation of "I need to write a 1 so I'll place one at this location on the drive." They use a complex encoding scheme so that bit errors on the disk don't yield data errors to the user.

Then there's the fact that bits aren't even stored as bits really. All current drives use (E)PRML which is (Enhanced) Partial Response Maximum Likelihood. What this means is bits aren't encoded as a high-low state or FM wave or any of that. They are written using flux reversals, but the level is not carefully controlled, it can't be. So when you read the data the drive actually looks at an analogue wave. It encodes the partial response it gets, and then finds the most likely pattern that matches.

Sounds like voodoo but works really well. Things are not simple thresholds or the like, it is a complex system and ends up being quite robust and resilient to error.
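A heavily simplified way to picture "partial response, maximum likelihood" is the toy model below. It assumes an ideal PR4 channel, in which each read sample depends on the current bit and the bit two positions earlier, adds some noise, and then picks the bit sequence whose ideal waveform is closest to what was read. Everything here (the channel model, the noise values, the brute-force search) is an assumption for illustration; real drives use far more elaborate channels and a Viterbi-style detector, not exhaustive search.

```python
from itertools import product

def pr4_response(bits):
    """Ideal PR4 (1 - D^2) samples: y[k] = x[k] - x[k-2], with zero initial state."""
    return [b - (bits[k - 2] if k >= 2 else 0) for k, b in enumerate(bits)]

def squared_error(samples, candidate):
    """Distance between the noisy read-back and a candidate's ideal waveform."""
    return sum((s - y) ** 2 for s, y in zip(samples, pr4_response(candidate)))

def ml_detect(samples):
    """Brute-force maximum-likelihood detection: try every possible bit pattern."""
    candidates = product((0, 1), repeat=len(samples))
    return list(min(candidates, key=lambda cand: squared_error(samples, cand)))

written = [1, 0, 1, 1, 0, 0, 1, 0]
noise = [0.1, -0.15, 0.05, 0.1, -0.1, 0.1, -0.05, 0.15]  # made-up analogue noise
read_back = [y + n for y, n in zip(pr4_response(written), noise)]
detected = ml_detect(read_back)
print(detected == written)
```

The point is that no individual sample is thresholded on its own; the detector judges the whole waveform, which is why modest analogue noise does not translate into bit errors.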

So it is highly unlikely that you had a bit flipped on a disk. It would require some amazing circumstances to happen. The RAM error is far more likely. Not just the cosmic ray thing but, as the parent noted, bad RAM. Normally when RAM fails, it fails catastrophically and it is immediately apparent. Not always though. It can not only fail on single bit locations, but only during certain ops. That is why memtest does so many different tests. One kind might work fine, another might fail. Rare, but I've seen it on a few systems.
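Why one test pattern can pass while another fails is easy to show with a simulated memory that has a single stuck bit. The Python sketch below is not how memtest86 is implemented; the FlakyMemory class and the fault location are hypothetical, chosen only to illustrate the idea.

```python
class FlakyMemory:
    """Simulated byte-addressable RAM with one bit stuck at a fixed value."""

    def __init__(self, size, bad_addr, bad_bit, stuck_value):
        self.cells = bytearray(size)
        self.bad_addr = bad_addr
        self.mask = 1 << bad_bit
        self.stuck_value = stuck_value

    def write(self, addr, value):
        if addr == self.bad_addr:
            # The faulty cell ignores whatever is written to its stuck bit.
            value = (value | self.mask) if self.stuck_value else (value & ~self.mask)
        self.cells[addr] = value & 0xFF

    def read(self, addr):
        return self.cells[addr]

def pattern_test(mem, size, patterns):
    """Write each pattern everywhere, read it back, collect (addr, wrong_bits)."""
    failures = []
    for pattern in patterns:
        for addr in range(size):
            mem.write(addr, pattern)
        for addr in range(size):
            got = mem.read(addr)
            if got != pattern:
                failures.append((addr, pattern ^ got))
    return failures

mem = FlakyMemory(size=64, bad_addr=37, bad_bit=4, stuck_value=0)
zeros = pattern_test(mem, 64, [0x00])                         # misses the fault
walking = pattern_test(mem, 64, [1 << b for b in range(8)])   # walking ones
print(zeros, walking)
```

A bit stuck at 0 is invisible to an all-zeros pass but shows up as soon as a pattern tries to set that bit, which is the reason for running several different tests.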

Re:Also (2, Informative)

marcansoft (727665) | more than 4 years ago | (#32685796)

However, single-bit errors are possible with faulty disk hardware. The cache RAM on the disk or its interface can be flaky, and for PATA disks a bad cable can cause single-bit errors. SATA disks usually catch IO errors since they use a more complicated encoding and make use of checksums.

Re:Also (1)

Vellmont (569020) | more than 4 years ago | (#32686186)


However, single-bit errors are possible with faulty disk hardware.

I'm sure you're right, but in this case there's essentially no way a disk hardware failure is going to cause the same bit to fail the same way, but no other bits fail.

In this case, I'd expect it's a bit flip in the OS disk cache.

Re:Also (5, Funny)

Scaba (183684) | more than 4 years ago | (#32686014)

Then there's the fact that bits aren't even stored as bits really. All current drives use (E)PRML which is (Enhanced) Partial Response Maximum Likelihood. What this means is bits aren't encoded as a high-low state or FM wave or any of that. They are written using flux reversals, but the level is not carefully controlled, it can't be. So when you read the data the drive actually looks at an analogue wave. It encodes the partial response it gets, and then finds the most likely pattern that matches.

I doubt this is true. The disk would have to be spinning at 88 mph in order to activate the flux capacitor, and the power brick would need to supply 1.21 gigawatts to the drive, which exceeds the capacity of even the most tricked-out gaming PC. I think you'd better check your science, my friend.

Re:RAM error? (1)

JWSmythe (446288) | more than 4 years ago | (#32685550)

I'd second the idea of a filesystem error. I had a mystery error show up similar to what he described. Someone modified one of my files, only changing one character. I was the only one with access to the machine. I fixed it, and voila, problem solved. A few weeks later, filesystem errors started showing up in the system log. It was a failing drive, not just a dirty filesystem. It must have been cosmic radiation that damaged the disk. :)

Re:RAM error? (2, Funny)

History's Coming To (1059484) | more than 4 years ago | (#32685340)

Aha, my plan worked perfectly *rubs hands in delight*. I hack the entire internet at once by flipping single bits on a large number of machines. The maths is kind of chaotic. It's fun to track viruses as ant-algorithm analogies too.

Re:RAM error? (1)

Vellmont (569020) | more than 4 years ago | (#32685508)

How did you verify it was actually on the disk, and not read from disk cache in memory?

Disk sectors have CRC checksums on them, so it's just extremely unlikely the bits flipped on the physical medium. It seems even less likely the bit got flipped somehow that caused a write to disk (and your file mod date would suggest this was unlikely as well).
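To see why a per-sector check makes a silent on-disk flip so unlikely, here is a small Python sketch using zlib.crc32 as a stand-in. Real drives use their own, stronger error-correcting codes; the "sector" contents and the helper name are invented for this example.

```python
import zlib

def flip_bit(data, byte_index, bit):
    """Return a copy of data with one bit inverted."""
    damaged = bytearray(data)
    damaged[byte_index] ^= 1 << bit
    return bytes(damaged)

sector = b"print 'Hello, world';\n" * 8   # pretend sector payload
stored_crc = zlib.crc32(sector)           # check value saved alongside the data

damaged = flip_bit(sector, byte_index=5, bit=4)
crc_matches = zlib.crc32(damaged) == stored_crc
print("corruption detected:", not crc_matches)
```

A CRC-32 catches every single-bit error, so a flipped bit on the medium would surface as a read error (or be corrected) instead of quietly returning a different character.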

Re:RAM error? (1, Funny)

Anonymous Coward | more than 4 years ago | (#32685574)

Just goes to show you, computers are a bit pedantic.

Re:RAM error? (1)

hondo77 (324058) | more than 4 years ago | (#32685650)

Forget a RAM error, I have seen a bit on a file on the disk flip.

After years of successful operation a Perl script quite working. On investigation a G was transformed to a W a difference of one bit. The file mod date was years old.

Ditto, except it was something like a w to a 7.

Re:RAM error? (2, Funny)

Rinikusu (28164) | more than 4 years ago | (#32685728)

/*After years of successful operation a Perl script quite working*/

And a bit flipped to an e?

Re:RAM error? (1)

Xyrus (755017) | more than 4 years ago | (#32685826)

Quit working? I'm surprised that didn't turn your perl script into pong.

Re:RAM error? (0)

Anonymous Coward | more than 4 years ago | (#32685904)

RAID-1 with checksumming + ZFS + full scrub every week.

It will detect possible flips and restore proper file content / metadata from second hard drive.

Actually ZFS, even without RAID-1 (mirror), automatically keeps 2 or 3 copies of metadata (depending on importance) even on a single disk (they are stored in different areas of the disk for additional safety), so even with a single disk it can help protect against bit flips. This can be enabled for data using the copies=2 or copies=3 (paranoia) attributes.
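The general idea of keeping redundant copies plus a checksum, then repairing a bad copy from a good one on read, can be sketched in a few lines of Python. This is a toy model and not ZFS code: ZFS has its own on-disk layout and checksum algorithms, and the RedundantBlock class is invented for this example.

```python
import zlib

class RedundantBlock:
    """Toy block store: N copies of the data plus one checksum."""

    def __init__(self, data, copies=2):
        self.checksum = zlib.crc32(data)
        self.copies = [bytearray(data) for _ in range(copies)]

    def read(self):
        """Return a verified copy and overwrite any copy that fails the checksum."""
        good = next((bytes(c) for c in self.copies
                     if zlib.crc32(c) == self.checksum), None)
        if good is None:
            raise IOError("every copy failed its checksum")
        for i, copy in enumerate(self.copies):
            if zlib.crc32(copy) != self.checksum:
                self.copies[i] = bytearray(good)  # self-heal from the good copy
        return good

block = RedundantBlock(b"important metadata", copies=2)
block.copies[0][3] ^= 0x10                 # simulate a bit flip in the first copy
data = block.read()
healed = bytes(block.copies[0]) == b"important metadata"
print(data, healed)
```

With a single copy the checksum can only detect the damage; the second copy is what makes repair possible.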

It's not cosmic. It's from the die/package (5, Informative)

EmagGeek (574360) | more than 4 years ago | (#32685214)

Soft errors in DRAM are far more likely to be the result of alpha particles emitted by radioactive decay in the die and packaging materials.

Re:It's not cosmic. It's from the die/package (2, Interesting)

cusco (717999) | more than 4 years ago | (#32685266)

People don't realize that lead is mildly radioactive, and the decay from solders on the connectors or chassis can also cause bit flips. Very old processed lead, such as that used for the roofs of some European cathedrals, has been used to build supercomputers since more of the radioactivity has decayed.

Re:It's not cosmic. It's from the die/package (2, Interesting)

Vellmont (569020) | more than 4 years ago | (#32685376)


Very old processed lead, such as that used for the roofs of some European cathedrals, has been used to build supercomputers since more of the radioactivity has decayed.

That sounds a bit fishy.

I _think_ I might be willing to believe the radioactivity of lead, presumably from contamination through some other source radioactive mineral in the ore that decays into radioactive lead. What I have a hard time believing though is that supercomputer makers wouldn't just use non-lead solder, which has been around for years and has actually been mandated for use in recent years in electronics.

Re:It's not cosmic. It's from the die/package (1)

JesseL (107722) | more than 4 years ago | (#32685462)

Never worked with lead-free solder have you?

It's only very recently that it's become practical for widespread use and it's still not settled how well it will work in applications that require maximum reliability. The problems with higher melting points, reduced wetting, tin whiskers, appropriate fluxes, etc. took a long time to sort out.

I'm sure that when a lot of early supercomputers were being built the components used would have been destroyed by the temperatures required to solder without lead.

Re:It's not cosmic. It's from the die/package (2, Interesting)

Vellmont (569020) | more than 4 years ago | (#32685808)

Maybe. It just sounds like an urban legend to me. I was also able to find a 25 year old patent claiming that gold-tin solder assured high reliability in chip making.

http://www.google.com/patents/about?id=MZY1AAAAEBAJ&dq=4512950 [google.com]

Re:It's not cosmic. It's from the die/package (0)

Anonymous Coward | more than 4 years ago | (#32686232)

I think maybe the point is more to find metals smelted before we detonated thousands of nuclear bombs in the atmosphere. Smelting takes a lot of air to work, and modern metals will have incorporated radionuclides during that process.

Re:It's not cosmic. It's from the die/package (1, Interesting)

Anonymous Coward | more than 4 years ago | (#32685388)

And in lead-free solders, frequently the indium. Sure, it decays really, really, really slowly, but when you're looking at literally single particles potentially causing an issue, it's just another possible cause.

Re:It's not cosmic. It's from the die/package (1)

rolfeb (1218438) | more than 4 years ago | (#32685488)

People don't realize that lead is mildly radioactive, and the decay from solders on the connectors or chassis can also cause bit flips. Very old processed lead, such as that used for the roofs of some European cathedrals, has been used to build supercomputers since more of the radioactivity has decayed.

I'm unclear as to how this "processing" of the lead has reduced its natural radioactivity...

--
"Think about how stupid the average person is. Now, realise that half of them are dumber than that." - George Carlin

OK, I guess you win!

Re:It's not cosmic. It's from the die/package (3, Informative)

Anonymous Coward | more than 4 years ago | (#32686204)

People don't realize that lead is mildly radioactive, and the decay from solders on the connectors or chassis can also cause bit flips. Very old processed lead, such as that used for the roofs of some European cathedrals, has been used to build supercomputers since more of the radioactivity has decayed.

I'm unclear as to how this "processing" of the lead has reduced its natural radioactivity...

Pb-210 is in the U-238 and Rn-222 decay chains, so lead ore in the ground has a constant source of Pb-210 being generated due to uranium contamination. Likewise, radon gas can seep into the lead ore deposits and provide a fresh influx of Pb-210. Once the lead is smelted and purified, the uranium contamination is removed and it's not being exposed to radon, so the number of Pb-210 atoms in the sample starts decreasing significantly.

This would be important (0, Troll)

overshoot (39700) | more than 4 years ago | (#32685784)

People don't realize that lead is mildly radioactive

That is an important consideration for old computers (prior to 2005 or so.) The newer ones are pretty much lead-free.

Very old processed lead, such as that used for the roofs of some European cathedrals, has been used to build supercomputers since more of the radioactivity has decayed.

Billions of years in the ground, and only a few centuries on the roof and all of the radioactivity is gone! Wow!

Re:This would be important (4, Interesting)

Vellmont (569020) | more than 4 years ago | (#32685948)


Billions of years in the ground, and only a few centuries on the roof and all of the radioactivity is gone! Wow!

The author needs to provide a reference, but there are a few ways I can think of that a processing stage plus a few centuries would produce something less radioactive than something produced more recently. I think all of them stem from the ore containing a source material that gets separated out by the refining process while the daughter products from that source don't. Here's one scenario:

Ore = Lead + radio-isotope A + radio-isotope B.

radio-isotope A decays to radio-isotope B

radio-isotope A: 4 billion year half-life.
radio-isotope B: 20 year half life, decays to stable isotope C.

During refining, radio-isotope A gets nearly completely refined out, down to parts per trillion. Radio-isotope B is chemically similar to lead and remains at 1 part per million (at time of refining).

200 years go by (10 half-lives of radio-isotope B).
Radio-isotope B is now at 1/2^10 of its original concentration, or about 1 part per billion. Significantly less than when it was first refined. The added radioactivity from radio-isotope A decaying into B is negligible due to the long half-life of A.

These numbers and process are obviously made up to show how it MIGHT work. It still remains to be seen if it's actually true or not.
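The arithmetic in that made-up scenario is easy to check. Here is a minimal Python sketch, assuming only the illustrative numbers from the comment (1 part per million of isotope B at refining, a 20-year half-life, 200 years elapsed) and ignoring any fresh B produced by A. The function and variable names are hypothetical, chosen for this example.

```python
def remaining_fraction(elapsed_years: float, half_life_years: float) -> float:
    """Fraction of a radioactive isotope left after elapsed_years."""
    return 0.5 ** (elapsed_years / half_life_years)


# Illustrative numbers from the scenario above, not measured values.
initial_ppm = 1.0      # isotope B at time of refining, parts per million
half_life_b = 20.0     # years
elapsed = 200.0        # years on the cathedral roof

fraction = remaining_fraction(elapsed, half_life_b)
remaining_ppb = initial_ppm * fraction * 1000.0  # 1 ppm = 1000 ppb

print(f"fraction left: 1/{round(1 / fraction)}")   # fraction left: 1/1024
print(f"concentration: {remaining_ppb:.3f} ppb")   # concentration: 0.977 ppb
```

Ten half-lives cut the activity by a factor of about a thousand, which is the whole point of using old lead.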

Re:This would be important (1)

robthebloke (1308483) | more than 4 years ago | (#32686210)

Billions of years in the ground, and only a few centuries on the roof and all of the radioactivity is gone! Wow!
it was blessed....

faulty RAM (4, Interesting)

mojo-raisin (223411) | more than 4 years ago | (#32685228)

I've been working with some large microarray datasets recently, and so had to double my computer's memory to 8GB.

As I've done for years, I went to Fry's to get some Corsair chips... installed F13 64bit to replace my older 32bit distro... and crash-o-matic began. Mostly from Chrome and Mercurial.

I ran memtest86+ and sure enough, verified my first purchase of faulty memory.

So, I went back to Fry's and exchanged for another pair of Corsair 2GB chips. This time, I ran memtest86+ first thing... ANOTHER bad set, so back it went to Fry's.

*Third* set of memory was Kingston, and a trip through memtest86+ verified no errors. Yay!

Computer has been stable, too.

With more and more RAM in computers, my next box will have ECC.

Re:faulty RAM (0)

Anonymous Coward | more than 4 years ago | (#32685352)

Got bitten by "performance" RAM twice. First, one stick of a 2*1GB Kingston HyperX DDR kit developed a single-bit error after about a year. Second was OCZ DDR2: the box suddenly locked up and wouldn't even POST with that stick installed; swap sticks, memtest, 1000s of errors = dead chip. After that I got Kingston ValueRAM ECC for my AMD64 boxes, no problems ever since.

Re:faulty RAM (1)

jd (1658) | more than 4 years ago | (#32685516)

As RAM gets ever-larger, densities get ever-greater, and the energy requirements for corruption get ever-smaller, the amount of error-correction needed is going to increase. That seems obvious. Well, to an extent. There are space-rated chips that use lead-lined casing to make them radiation-resistant. Having the motherboard run cooler will decrease the thermally-generated random noise in the system. If you're using a full-immersion system, the coolant might easily absorb some of the cosmic rays not otherwise blocked. So you have plenty of options in that direction.

Re:faulty RAM (2, Informative)

Burdell (228580) | more than 4 years ago | (#32685844)

Did you buy all new RAM, or add to existing? If you added to existing, did you test just the new RAM, or with the existing in there as well?

Lots of RAM has different timings these days, and even when the timing is supposed to be the same, I've seen new RAM cause problems with old RAM to surface (possibly also from temperature changes). I had a system with 2G (2x1G) Corsair RAM, and then I added another 2G (2x1G) of the same model Corsair; the system started crashing. I assumed (as most would) that the problem was the new RAM. I ran memtest86+ for about 18 hours on just the new RAM and had no problems. I stuck the original 2G back in and the system crashed; I ran memtest86+ on just the old RAM; no problem. With all 4 sticks in, memtest86+ would show errors. By moving sticks around and figuring out the address mapping on my system, I tracked it down to one of the original sticks. I then ran memtest86+ for about 48 hours on just that stick, and it did eventually show an error (Corsair replaced it and I have had no more problems).

RAM generates a good bit of heat these days, and adding RAM generates even more heat in a small space. My faulty RAM has the heat spreaders included, but the motherboard puts the RAM slots so close together there's still little space for heat to dissipate.

fascinating (4, Insightful)

vux984 (928602) | more than 4 years ago | (#32685254)

It's interesting to me because my first instinct would have been to assume something got corrupted, and my first step would have been to reboot. If the problem persisted through a reboot then I might have gone down the rabbit hole in similar fashion to try and find and fix the root cause.

There are enough software bugs, kernel bugs, driver bugs, hardware hiccups due to marginal equipment, power fluctuation, interference, random noise... and I suppose even cosmic radiation that I would rarely think to spend the time to trace a transient problem unless it was reproducible across reboots, or at least happened on multiple separate occasions.

Too bad many consumer mainboards don't support ECC (1)

Goyuix (698012) | more than 4 years ago | (#32685256)

Some of the nicer boards will tolerate ECC memory being inserted, but won't actually do any meaningful error correction (like scrubbing) - but a disturbingly large number of consumer boards (BIOS limitation perhaps?) don't actually do ANYTHING with ECC memory, and the really cheap ones won't even boot with it present. I used to have the same mindset of purchasing only ECC RAM for the same reason - but the unfortunate truth is that hardware support for it just isn't there without spending $$$ on a decent board too.

Re:Too bad many consumer mainboards don't support (2, Insightful)

Mad Merlin (837387) | more than 4 years ago | (#32685422)

This is one area where AMD is light years ahead of Intel. With Intel, you have to buy a Xeon and a server chipset to have ECC support, which basically is going to run you at least a grand or two just for the CPU and motherboard (at least if you want an i7 based Xeon). AMD on the other hand supports ECC across the board, and you just need a motherboard which supports it, which is most of them (total cost: <$500).

Thanks for the gouging Intel!

Re:Too bad many consumer mainboards don't support (1)

X0563511 (793323) | more than 4 years ago | (#32685514)

Wrong. A few Dell PE servers have P4s in them, and -require- ECC memory.

Re:Too bad many consumer mainboards don't support (0)

Anonymous Coward | more than 4 years ago | (#32685858)

He's referring to all embedded memory controller Intel Parts. Supposedly the Core i3/i5 chips have ECC support enabled on them, but unfortunately none of the consumer boards support them (you'd need an 1156 server mobo + core i3/i5 cpu).

AMD's chips since Socket 939 have supported ECC out of the box. I haven't had a chance to test it myself, but if the Nforce M430 mobo I have will run ECC with a cheap low-end sempron, all my future cpu/mobo purchases will be AMD for just this reason.

Re:Too bad many consumer mainboards don't support (0)

Anonymous Coward | more than 4 years ago | (#32685862)

You do have to carefully check whether the motherboard manufacturer has included the BIOS support. The upper end models from ASUS and Gigabyte do generally support ECC, but the lower end models and the models from other popular "consumer grade" motherboard manufacturers don't generally include the support. Intel's recent Westmere i3 and i5 models do support ECC, but you need BIOS support once again, which in the case of an Intel based "consumer grade" motherboard is even more difficult to find. I don't know a single one. OEMs have their own BIOS versions and support for ECC with their single socket server models, like the other comment states. I'm writing this with my consumer oriented Phenom 9750 and 8GB of ECC memory, so any typos are most likely not my computer's fault. ;)

radioactive isotope in the chip (3, Interesting)

mirix (1649853) | more than 4 years ago | (#32685262)

I would think it's more likely there are trace radioactive elements in the epoxy the chip is encapsulated in.

Actually, I recall reading that in the early solid state memory days, they had problems with this. I don't remember what the solution was, but I thought it was to make the circuit somewhat resilient to it, as it was impossible to get 100% neutral epoxy, there's always going to be traces of something radioactive.

I think they tested the cosmic ray theory by running the same chip with and without lead shielding, and did not find a significant difference in errors, they then assumed it was impurities in the chips themselves decaying.

Radioactive packaging (2, Interesting)

overshoot (39700) | more than 4 years ago | (#32685754)

I recall reading that in the early solid state memory days, they had problems with this. I don't remember what the solution was, but I thought it was to make the circuit somewhat resilient to it, as it was impossible to get 100% neutral epoxy,

The worst problem was with ceramic DIP packages -- the really good ones for when you needed reliability (partly because the plastic ones tended to allow moisture to get in, and then condensation on thermal cycling.) The standard ceramic packaging material contained trace amounts of thorium, which is an alpha emitter. The alpha bombardment was enough to flip bits.

There have been several fixes since then. Using materials that don't contain radioactive species was one. The one you're probably remembering is that the manufacturers apply a polymer coating to the surface of the die, which is enough to stop a lot of alpha particles and a fair number of electrons. Getting rid of lead in packaging is also good, because lead tends to contain some radioactive traces.

On the other hand, there's flat nothing to be done about cosmic rays and damn little to be done about X-rays and thermal noise (you do keep your memory cold, don't you? Thermal noise is proportional to KT/qe after all.) So at some point we get to where there are too many bits which need minimal energy to flip them -- and then you have errors.

Pity that so few mobos actually support ECC, though.

Old, old story (5, Interesting)

jmichaelg (148257) | more than 4 years ago | (#32685318)

Back in the early 80's, HP published a paper on random bit errors in RAM. They looked at chips from a variety of vendors and determined that the RAM coming out of Japan was the most reliable. That paper caused a lot of US RAM vendors to shutter their doors as there was a sea change in purchasing habits.

A few years later, I ran into John Sculley while we were waiting for a flight. I mentioned the paper to him and asked him how Apple could seriously expect to sell a Macintosh specifically aimed at the scientific community if it didn't have ECC. He blithely said "it's not a problem..." 20+ years hence and most of us still don't have ECC, so it seems he was right.

Re:Old, old story (5, Informative)

Anonymous Coward | more than 4 years ago | (#32685572)

For a more recent analysis (by folks at Google and U.Toronto) see "DRAM Errors in the Wild: A Large-Scale Field Study" in ACM SIGMETRICS/Performance 09.

They did an extensive analysis of DRAM failures from many vendors and debunk several myths as well as indicating that the soft error rate can be much higher than previously thought.

Well worth a read...

http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

 

Re:Old, old story (2, Informative)

Timothy Brownawell (627747) | more than 4 years ago | (#32685894)

as well as indicating that the soft error rate can be much higher than previously thought.

I'm not sure it really does; true they had enormous average (mean) error rates, but it sounded like this was misleading due to an incredibly skewed distribution. Going by the number of servers with zero errors, one error, and multiple errors over a year, and the failures-vs-age data, I came to the conclusion that there's about a 1/5 chance that you'll see one random single-bit error over a typical lifetime (I think I used 5-6 years), but also a similar chance that part of your ram will go bad after a couple years and give you a sudden flood of errors. It would have been very nice if they'd counted servers with 0,1,2,3,...10-20, 20-50, ... etc errors/year (preferably with a pretty graph), instead of only breaking it into zero, one, many.

Re:Old, old story (1)

antifoidulus (807088) | more than 4 years ago | (#32686126)

Actually all of Apple's "pro" products (i.e. Mac Pros and Xserves) DO have ECC RAM, a decision that actually caused quite an uproar in the Mac community when it was first introduced (with the G5 Power Mac, IIRC). However, it has yet to trickle down into any of Apple's other products, which are all 100%* based on laptop components. Do laptops even have ECC RAM? With RAM densities increasing, bit errors like the one mentioned in the article are only going to increase.

*The quad core iMacs have Desktop CPUs in them, but every other component(including memory) is a part made for laptops.

Cosmic Ray Protection... (2, Funny)

r00tyroot (536356) | more than 4 years ago | (#32685356)

I'm putting tinfoil hats on all of my servers, right away!

Re:Cosmic Ray Protection... (1)

treeves (963993) | more than 4 years ago | (#32685778)

Thick sheets of lead would work better than tin foil. (Pre-emptive whooosh.)

Google Study of DRAM Error Rates (Link Inside) (0)

Anonymous Coward | more than 4 years ago | (#32685414)

Link to Google's PDF is contained in this story : http://www.computerworld.com/s/article/9139161/Google_DRAM_error_rates_vastly_higher_than_previously_thought

From the article,

" A study released this week by Google Inc. and the University of Toronto showed that data error rates on dynamic RAM memory modules are vastly higher than previously thought and may be more responsible for system shutdowns and service interruptions.

The study (download .pdf), which used tens of thousands of Google's servers, showed that about 8.2% of all dual in-line memory modules (DIMM) are affected by correctable errors and that an average DIMM experiences about 3,700 correctable errors per year.

"Our first observation is that memory errors are not rare events. About a third of all machines in the fleet experience at least one memory error per year, and the average number of correctable errors per year is over 22,000," the report states.

"These numbers vary across platforms, with some platforms seeing nearly 50% of their machines affected by correctable errors, while in others only 12%-27% are affected."

The median number of errors per year on a Google server that had at least one error ranged from 25 to 611..."

Re:Google Study of DRAM Error Rates (Link Inside) (1)

Timothy Brownawell (627747) | more than 4 years ago | (#32685976)

About a third of all machines in the fleet experience at least one memory error per year, and the average number of correctable errors per year is over 22,000," the report states.

But also 93% of those with errors, have multiple errors. This permits a bit of number crunching [reddit.com] , to conclude that 3% have single random errors in a year and 30% probably have bad ram or other hardware issues.

All data channels are noisy (0)

Anonymous Coward | more than 4 years ago | (#32685452)

Any electronics/communications engineer will tell you that every data channel is noisy and you must expect corruption at some point, even if the odds seem vanishingly small. And no doubt in these super high transistor count and clock frequency CPUs and chips we are using these days there must be devices and methods used inside them to keep the logic transfer and computation validity on the straight and narrow.

Re:All data channels are noisy (2, Informative)

Chris Burke (6130) | more than 4 years ago | (#32685500)

And no doubt in these super high transistor count and clock frequency CPUs and chips we are using these days there must be devices and methods used inside them to keep the logic transfer and computation validity on the straight and narrow.

Other than ECC on the cache arrays... No. Not a scrap.

If you want reliability on every internal signal and register against cosmic ray strikes, because you're a military or aerospace contractor, you pay boku bucks for it and settle for having way less than what we would currently call performance. And even then I highly doubt anyone is actually putting ECC on each and every bus or set of latches. You just radiation harden the device as much as possible, and then use three of them so if one gets the wrong answer because of a particle strike, the other two will out-vote it.
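The "use three of them and let two out-vote the third" idea is usually called triple modular redundancy. Here is a minimal Python sketch of the voting step only, assuming the three devices each hand back an integer result; `majority_vote` is a hypothetical name for this example, and real voters are built in hardware.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 vote: each output bit takes the value that at
    least two of the three inputs agree on."""
    return (a & b) | (a & c) | (b & c)


correct = 0b1011_0010
upset = correct ^ (1 << 4)   # one copy suffers a single bit flip

# The flipped copy is out-voted no matter which position it sits in.
print(majority_vote(correct, upset, correct) == correct)   # True
print(majority_vote(upset, correct, correct) == correct)   # True
```

Because the vote is taken per bit, it even survives two copies being hit, as long as they are not hit in the same bit position.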

Re:All data channels are noisy (1, Funny)

Anonymous Coward | more than 4 years ago | (#32685912)

you pay boku bucks for it

Is "boku" some sort of retarded mangling of beaucoup?

Re:All data channels are noisy (2, Funny)

StikyPad (445176) | more than 4 years ago | (#32686046)

Walla!

not cosmic most likely (0)

Anonymous Coward | more than 4 years ago | (#32685490)

People talking about bits flipping in RAM or on disk -- these are external bus errors due to noise before the data gets to the memory or disk drive.

"Could not reproduce" (0)

Anonymous Coward | more than 4 years ago | (#32685542)

Yeah, but can he find a way to reproduce the error?

Interesting tracking, unimportant issue. (0)

Anonymous Coward | more than 4 years ago | (#32685598)

Sure, ECC will be nice at some point on desktop boards.
But unlike what most of these studies *speculate*, RAM error rates are still negligible, even over years. And let's face it, there's no point panicking over a possible bit flip every month. The odds that your OS goes bad for another reason (disk corruption, bus errors, buggy software) are far higher.
In any case, the point is that unless your data is bit-important, it does not matter. And very few applications need bit important data, and practically none that any desktop computer should be dealing with.

The one thing that would be interesting would be large-sample statistics broken down by manufacturer. After all, we all remember the cheap generic DDR debacle a couple years ago.

Had a MySQL problem once. (Once... ha.) (2, Interesting)

falzer (224563) | more than 4 years ago | (#32685704)

I had a mysql replication server which was reading SQL commands from a binary log on a master server. One day after years of operation I noticed an update failed. I didn't see anything at first by looking at the query, but when I looked closely I noticed the query had a single character changed, and of that character only one bit had changed. It was something like a P becoming a Q and thus giving a syntax error.

True story.
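A P turning into a Q is exactly what a single flipped bit looks like in ASCII. Here is a small Python sketch showing it; the helper names are hypothetical and the characters are the ones from the anecdote above.

```python
def differing_bits(x: str, y: str) -> int:
    """Number of bit positions in which two single characters differ."""
    return bin(ord(x) ^ ord(y)).count("1")


def single_bit_neighbours(ch: str) -> list[str]:
    """All characters reachable from ch by flipping one of its low 8 bits."""
    return [chr(ord(ch) ^ (1 << bit)) for bit in range(8)]


print(format(ord("P"), "08b"))    # 01010000
print(format(ord("Q"), "08b"))    # 01010001
print(differing_bits("P", "Q"))   # 1
print("Q" in single_bit_neighbours("P"))   # True
```

When a corrupted string differs from the original by one character, checking whether the two characters are single-bit neighbours is a quick way to tell a hardware flip from a typo or an encoding problem.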

Cosmic Rays Tend to Flip Multiple Bits (1)

bezenek (958723) | more than 4 years ago | (#32685760)

Cosmic ray events tend to affect multiple neighboring transistors. For this reason, they tend to affect multiple bits. However, by laying out memory cells so immediate neighbors are from different locations, the ability of single-bit-correction-double-bit-detection (SECDED) methods to detect most events is usually preserved.

The main concern is for structures with no error correction, such as the gates in the processor pipeline. Several research ideas have been put forward. See here (PDF) [umich.edu] for a good overview of the issues.

-Todd
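The interleaving trick can be illustrated with a toy model. This is a sketch only, assuming a made-up layout in which physically adjacent cells are assigned round-robin to different ECC words; the function name and the numbers are hypothetical and real DRAM layouts are more involved.

```python
def flips_per_word(first_cell: int, burst_len: int, interleave: int) -> dict[int, int]:
    """Count how many flipped bits each ECC word receives when a particle
    strike flips burst_len physically adjacent cells.

    Toy layout: physical cell i belongs to ECC word (i % interleave).
    interleave=1 means no interleaving (neighbours share a word).
    """
    counts: dict[int, int] = {}
    for cell in range(first_cell, first_cell + burst_len):
        word = cell % interleave
        counts[word] = counts.get(word, 0) + 1
    return counts


# A strike that upsets 3 neighbouring cells:
print(flips_per_word(first_cell=9, burst_len=3, interleave=1))  # {0: 3}
print(flips_per_word(first_cell=9, burst_len=3, interleave=4))  # {1: 1, 2: 1, 3: 1}
```

Without interleaving, one word takes all three flips, which is beyond what SECDED can even detect reliably. With 4-way interleaving, each affected word sees a single flip, which SECDED corrects.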

TFA (1)

talcite (1258586) | more than 4 years ago | (#32685878)

I just read the article and it's quite good. The author goes into detail about how he used a series of checksums and source verification to find the bug, isolate it and fix it. I found it quite fascinating and I recommend reading it if you have a few minutes of time.

Just being pedantic (1)

solarium_rider (677164) | more than 4 years ago | (#32685886)

There is no such thing as ECC RAM. The ECC (usually a Hamming code) is performed by the memory controller. You can't just buy a 72-bit DIMM and use that in any old PC. You have to have a memory controller that supports ECC. It should also be noted that this kills performance by increasing latency (decode and encode the ECC bits) and may also require read-modify-writes.
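For anyone wondering what the controller actually computes, here is a minimal Python sketch of a Hamming(7,4) code: 4 data bits protected by 3 check bits, enough to locate and fix any single flipped bit. It is a textbook toy, not what any particular memory controller implements; real ECC memory uses wider codes with an extra parity bit for double-error detection, and the function names here are hypothetical.

```python
def encode(data_bits: list[int]) -> list[int]:
    """Encode 4 data bits into 7 code bits (positions 1..7).
    Check bits sit at positions 1, 2 and 4; data at 3, 5, 6 and 7."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def decode(code_bits: list[int]) -> tuple[list[int], int]:
    """Return (data bits, syndrome). A non-zero syndrome is the 1-based
    position of the single flipped bit, which is corrected before the
    data bits are extracted."""
    c = list(code_bits)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1   # flip the offending bit back
    return [c[2], c[4], c[5], c[6]], syndrome


word = [1, 0, 1, 1]
stored = encode(word)
stored[4] ^= 1                 # simulate a single-bit upset at position 5
recovered, position = decode(stored)
print(recovered == word, position)   # True 5
```

The decode step is a handful of XORs, which is why the latency cost is small compared with a DRAM access.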

Re:Just being pedantic (1)

Timothy Brownawell (627747) | more than 4 years ago | (#32686100)

It should also be noted that this kills performance

By something like 1-5% if I remember correctly, which only matters in benchmarks and dicksize contests.

and may also require read-modify-writes

Um, yeah... that's only possible if you haven't read and don't read anything on that same cache line, and even then it mightn't happen depending on what assumptions your cache makes, or might be no different than non-ECC if your cache is only able to talk to your memory in units of a full cache line anyway.

Reboot? (1)

fava (513118) | more than 4 years ago | (#32685956)

The article author has obviously never used Windows. SOP would be a reboot, which would have solved the problem.

The whole thing would have taken minutes.

Re:Reboot? (1, Insightful)

Anonymous Coward | more than 4 years ago | (#32686104)

And leave you in a state of utter ignorance. It isn't about solving it, it's about understanding it.

Ksplice ... go figure (5, Interesting)

GNUALMAFUERTE (697061) | more than 4 years ago | (#32686048)

The guy that posted this is a Ksplice developer. In case you didn't know, Ksplice allows you to patch your running kernel without rebooting. Nice.

Anyway, this guy sees a random memory error. He conveniently goes on a debugging rampage, while we all know the most logical first step would be rebooting that damn machine. Random memory errors do happen.

He says he "hasn't gotten around" to memtesting his RAM yet. So, let me get this straight ... he implies that random cosmic rays caused the error, but he hasn't yet tested his RAM for what is the most likely cause of the issue?

Then he goes on to explain that you don't even need to reboot your machine due to damn cosmic radiation. Or kernel updates. Because you have Ksplice.

Come on.

to increase ram or not to increase ram? (0)

Anonymous Coward | more than 4 years ago | (#32686112)

So, here I am with my paltry 2 gigs of RAM in my system, drooling over the idea of having some much larger amount, like this fellow's 12 gigs, and then I find out that it's a likely source of errors due to persistent caching of hard drive reads.

Memory failures due to alpha particle switching, one of my faves, or cosmic rays (are we sure we can't get neutrinos in there as well?) were a known evil, but it looks like having the cache overwritten more frequently might be an advantage of having smaller amounts of memory (at least, non-ECC memory).

Now I have to run off and see if my motherboard will accept ECC memory before I go out and buy more memory.

A cheaper solution.. (0)

Anonymous Coward | more than 4 years ago | (#32686132)

I found ECC RAM was too expensive for my home server..

so does anybody know where I can get a cheap, THICK lead sheet?

I've seen this (2, Informative)

Eil (82413) | more than 4 years ago | (#32686214)

A few years ago I came across a thread on a FreeBSD mailing list where a build of some package was failing and the submitter couldn't tell why because he wasn't a developer. The failure was unusual and no one else could reproduce it. Eventually, the problem was traced back to a character in the source differing from the original. The character was a one-bit difference from the correct character, and it was suggested to the submitter that he reboot and memtest his memory. Sure enough, one single bad bit out of around 512MB.
