Google Finds DRAM Errors More Common Than Believed

kdawson posted about 5 years ago | from the forget-me-not dept.

Data Storage

An anonymous reader writes "A Google study of DRAM errors in their data centers found that they are hundreds to thousands of times more common than has been previously believed. Hard errors may be the most common failure type. The DIMMs themselves appear to be of good quality, and bad mobo design may be the biggest problem." Here is the study (PDF), which Google engineers published with a researcher from the University of Toronto.

Percentage? (4, Interesting)

Runaway1956 (1322357) | about 5 years ago | (#29661065)

"a mean of 3,751 correctable errors per DIMM per year."

I'm much too lazy to do the math. Let's round up - 4k errors per year has to be a vanishingly small percentage for a system that is up 24/7/365, or 5 nines. The fact that these DIMMs were "stressed" makes me wonder about the validity of the test. Heat stress, among other things, will multiply errors far beyond what you will see in normal service.

Re:Percentage? (0)

sopssa (1498795) | about 5 years ago | (#29661079)

Well, hasn't Google always used thousands and thousands of normal PCs in their server farms instead of powerful, actual premium server-grade hardware?

Not really a surprise that they tend to break more.

Re:Percentage? (5, Informative)

Runaway1956 (1322357) | about 5 years ago | (#29661191)

No, I don't believe so. They use server boards, custom made to their specs. And, I'm pretty sure that those specs include ECC memory - that is the standard for servers, after all. http://news.cnet.com/8301-1001_3-10209580-92.html [cnet.com] If you're really interested, that story gives you a starting point to google from.

Re:Percentage? (2, Insightful)

The Archon V2.0 (782634) | about 5 years ago | (#29661509)

No, I don't believe so. They use server boards, custom made to their specs.

I suppose it depends on how you define "server board". Room for tons of ECC RAM and two CPUs is server or serious-workstation class (or maybe I-just-use-Notepad-and-my-sales-guy-is-on-commission class), but I think once you're on to custom boards that only use certain voltages of electricity, you've moved into a class by yourself.

And, I'm pretty sure that those specs include ECC memory - that is the standard for servers, after all.

Section 7: "All DIMMs were equipped with error correcting logic (ECC) to correct at least single bit errors."

So, yes, it's ECC.

Re:Percentage? (5, Funny)

silent_artichoke (973182) | about 5 years ago | (#29661535)

You know, maybe googling it isn't the best idea in this case. Memory errors and all...

Re:Percentage? (1)

mcgrew (92797) | about 5 years ago | (#29662293)

You know, maybe googling it isn't the best idea in this case. Memory errors and all...

I was going to debunk that but I forgot what was on my mind. Damn dimms!

Re:Percentage? (2, Interesting)

skirtsteak_asshat (1622625) | about 5 years ago | (#29661603)

Well, consider that they had a board CUSTOM MADE for them, which means custom BIOS fitments, custom feature implementations, custom BUGS. Then add the reality that is DRAM - an imperfect 'art' form of data storage and retrieval. No two chips are EXACTLY the same... though very close. Manufacturing defects may not manifest themselves under normal conditions, and require heating/cooling cycles or fluctuating voltages to break down. Running ECC performs a basic parity check, nothing more, and it's still possible to pass bad bits with ECC enabled, just much less likely. The idea is that you can't really test subcomponents individually and have them check out, and then assemble a system and expect it to just 'work'. Some RAM is pretty damn finicky. Standards are anything but.

Re:Percentage? (3, Informative)

poetmatt (793785) | about 5 years ago | (#29661961)

uh, article showed that temperature has nothing to do with it.

the rest is accurate.

Re:Percentage? (0, Interesting)

Anonymous Coward | about 5 years ago | (#29661983)

The mobos used by Google are the cheapest boards they can get made. There is NO testing until they hit the datacenter floor. Crap mobo plus poor environment (high heat and vibration + poor power controls) makes for a high failure rate. ECC RAM has an odd number of memory chips; the odd chip allows for the parity RAM. Google memory has even chip counts since non-ECC RAM is much MUCH less expensive. So the BIOS is custom and carves out ECC function from non-ECC RAM.

Re:Percentage? (1)

kimvette (919543) | about 5 years ago | (#29662195)

Google's big surprise: each server has its own 12-volt battery to supply power if there's a problem with the main source of electricity. The company also revealed for the first time that since 2005, its data centers have been composed of standard shipping containers--each with 1,160 servers and a power consumption that can reach 250 kilowatts.

I've actually been looking for a 12V power supply for a while. I wonder if they use power supplies off the shelf or if they are custom-manufactured just for Google?

In that case, are these results useful at all? (1)

pavon (30274) | about 5 years ago | (#29662237)

Which really makes me question whether these results have any validity outside of Google. The study found that the majority of errors appeared to be related to the motherboard, but didn't list any information about the motherboards in use. If they are all custom built for Google, then there is absolutely no way for any of us to know whether the error rate they exhibited is representative of what you'd get from average COTS server-grade motherboards currently on the market. Thus these results are meaningless to anyone who uses different motherboards, i.e. everyone but Google.

Re:Percentage? (0)

Anonymous Coward | about 5 years ago | (#29661225)

The author has 12 gigs on his mac...

Re:Percentage? (1)

evilviper (135110) | about 5 years ago | (#29662031)

Well haven't Google always used thousands and thousands of normal pc's in their server farms instead of powerful, actual premium server-grade hardware.

No, Google has always used servers. The trademark of Google, which you're misquoting, is the fact that they use clusters of x86 hardware, rather than big iron (mainframes).

Compared to proprietary hardware, x86 servers are dirt cheap.

Re:Percentage? (5, Informative)

gspear (1166721) | about 5 years ago | (#29661171)

From the study's abstract:

"We find that temperature, known to strongly impact DIMM error rates in lab conditions, has a suprisingly small effect on error behavior in the field, when taking all other factors into account."

NOT A PROBLEM HERE !! (0)

Anonymous Coward | about 5 years ago | (#29662005)

I run Linux so I am immune to this. Besides, I don't even use ECC so I have ZERO CE counts, so I am immune to that problem as well.

Re:Percentage? (1)

CAIMLAS (41445) | about 5 years ago | (#29661173)

Add to that the fact that Google (apparently) tends to run their data centers "hot" compared to what is commonly accepted, and use significantly cheaper components, and you've got a good explanation for why their error count is as high as it is.

Re:Percentage? (4, Insightful)

Tumbleweed (3706) | about 5 years ago | (#29661219)

Add to that the fact that Google (apparently) tends to run their data centers "hot" compared to what is commonly accepted, and use significantly cheaper components, and you've got a good explanation for why their error count is as high as it is.

Yeah, but let's look at the more common situation - a home. Variable temperatures, most likely QUITE variable power quality, a low-quality PSU, and almost certainly no UPS to make up for it. Add that to low-quality commodity components (mobo & RAM).

I'd not be surprised to find the problem much more prevalent in non-datacenter environments.

Switching to high-quality memory, PSU & UPS has made my systems unbelievably reliable the last several years. YMMV, but I doubt by much.

Re:Percentage? (4, Informative)

jasonwc (939262) | about 5 years ago | (#29661345)

The article suggests that errors are less likely on systems with few DIMMs and on those which are less heavily used, and that there was no significant difference among types of RAM or vendors, at least with regard to ECC RAM. Thus, laptop and desktop users, who likely only have 2 or 3 DIMMs and make only casual use of their systems, have a lower risk of errors. ECC RAM may in general be of much higher quality than non-ECC RAM; non-ECC RAM may thus be more prone to error, but its usage is also less mission-critical. In addition, ECC RAM is usually used in systems with many DIMMs that are run 24/7/365.

Good news
The study had several findings that are good news for consumers:

        * Temperature plays little role in errors - just as Google found with disk drives - so heroic cooling isn't necessary.
        * The problem isn't getting worse. The latest, most dense generations of DRAM perform as well, error wise, as previous generations.
        * Heavily used systems have more errors - meaning casual users have less to worry about.
        * No significant differences between vendors or DIMM types (DDR1, DDR2 or FB-DIMM). You can buy on price - at least for the ECC-type DIMMs they investigated.
        * Only 8% of DIMMs had errors per year on average. Fewer DIMMs = fewer error problems - good news for users of smaller systems.

Re:Percentage? (2, Interesting)

HornWumpus (783565) | about 5 years ago | (#29661513)

IIRC ECC ram has extra bits and hardware to fix any single bit error and record that it happened.

Regular ram only has parity which can tell the MB the data is suspect but not which bit flipped. Kernel panic, Blue Screen, Guru Meditation# whatever.

It's the same RAM, just arranged differently on the DIMM.

I once had a dual Pentium PRO that required ECC RAM. BIOS recorded 0 ECC errors in the three years or so that was my primary machine. Which is what the Google study would lead me to expect.
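
Roughly, the difference being described is that plain parity can only flag that something flipped, while a Hamming-style ECC syndrome points at the flipped bit so it can be corrected. A toy sketch, using Hamming(7,4) purely for illustration (real ECC DIMMs use a wider 72/64 SECDED code):

```python
# Toy illustration (my own sketch, not from TFA): parity vs. a Hamming code.

def parity(bits):
    """Even parity: detects an odd number of flipped bits, but can't say which one."""
    return sum(bits) % 2

def hamming74_encode(d1, d2, d3, d4):
    """Encode 4 data bits into a 7-bit Hamming(7,4) codeword (positions 1..7)."""
    p1 = d1 ^ d2 ^ d4          # covers positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4          # covers positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4          # covers positions 4,5,6,7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(cw):
    """Return (corrected codeword, 1-based position of a single-bit error, or 0)."""
    cw = list(cw)
    s1 = cw[0] ^ cw[2] ^ cw[4] ^ cw[6]
    s2 = cw[1] ^ cw[2] ^ cw[5] ^ cw[6]
    s3 = cw[3] ^ cw[4] ^ cw[5] ^ cw[6]
    pos = s1 + 2 * s2 + 4 * s3          # syndrome = position of the flipped bit
    if pos:
        cw[pos - 1] ^= 1                # flip it back
    return cw, pos

word = hamming74_encode(1, 0, 1, 1)
word[4] ^= 1                            # simulate a single-bit DRAM error
fixed, where = hamming74_correct(word)
print(where, fixed)                     # 5, and the original codeword is restored
```

With two flipped bits the syndrome points at the wrong position, which is why real SECDED adds an extra overall parity bit so double-bit errors are at least detected, even if they can't be fixed.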

Re:Percentage? (0, Redundant)

DigiShaman (671371) | about 5 years ago | (#29661753)

Parity is different than ECC. Parity allows the system to detect, but *NOT* correct, errors. ECC, however, detects and corrects errors. Unless specified, all consumer desktops and laptops contain standard memory (neither parity nor ECC).

Re:Percentage? (0, Flamebait)

antifoidulus (807088) | about 5 years ago | (#29661841)

Um, that's precisely what the GP said... so yeah, you are pretty representative of the intelligence level of a Rush Limbaugh listener.

Re:Percentage? (1)

poetmatt (793785) | about 5 years ago | (#29661997)

Eh? I was following this thread and I misread and followed the route of DigiShaman as well. I'm not defending him, just saying that sometimes people fail to read properly when multitasking, myself included.

Re:Percentage? (1)

DigiShaman (671371) | about 5 years ago | (#29662001)

Ok, first of all. You are a retard!!!

Second. Standard memory is ***NOT*** the same thing as parity!

What part of my previous post did you not get?

Re:Percentage? (0)

Anonymous Coward | about 5 years ago | (#29662073)

GP said "IIRC ECC has extra bits and hardware.." and then "It's the same RAM, just arranged differently on the DIMM.". If it is the same then why does it have extra bits and hardware? It appears as though the GP is contradicting themselves. Sounds to me like you are pretty representative of the intelligence level of a Glenn Beck listener.

Re:Percentage? (0)

Anonymous Coward | about 5 years ago | (#29662247)

Who is Glenn Beck?

Re:Percentage? (1)

Extide (1002782) | about 5 years ago | (#29662301)

It's the same RAM chips, but there are just 9 chips on each stick, instead of 8 (or 18 instead of 16). It works just like raid 5.

Re:Percentage? (1)

vadim_t (324782) | about 5 years ago | (#29661615)

* Temperature plays little role in errors - just as Google found with disk drives - so heroic cooling isn't necessary.

Talk about a misunderstanding.

First, the paper on hard drives did show that temperature was important. It did show though that too cold is worse than too hot. Also, the data wasn't perfect. Google doesn't have a whole lot of drives running at strange temperatures, since they're a datacenter. A consumer though, might well run a drive at 60C in a badly cooled desktop or laptop, and there's not a single datapoint on Google's graph for that.

In my experience, a drive cooled by a case intake fan runs at about 35C, which comes up as just perfect on Google's graph.

The memory paper finds an even bigger effect:

Figure 7 (left) shows that for all platforms higher temperatures are correlated with higher correctable error rates. In fact, for most platforms the correctable error rate increases by a factor of 3 or more when moving from the lowest to the highest temperature decile (corresponding to an increase in temperature by around 20C for Platforms B, C and D and an increase by slightly more than 10C for Platform A).

I believe 3x more errors is pretty damn significant, unless you want to adhere to the idea that a very rare event happening 3 times as often is still very rare, relatively speaking.

But "a mean of 3,751 correctable errors per DIMM per year." sounds rather big to me. Sure, it's a tiny part of 4GB of RAM. But a single bit wrong in exactly the right place could result in things like very unpleasant disk corruption, and most FSes won't like that because they're not designed to compensate for random disk corruption (yeah, I know about ZFS and its checksums, but not everybody runs it)

Re:Percentage? (1)

R2.0 (532027) | about 5 years ago | (#29661911)

I believe 3x more errors is pretty damn significant, unless you want to adhere to the idea of that a very rare event happening 3 times as often is still very rare, relatively speaking.

I believe it depends on scale. If I buy 3 Lotto tickets instead of one, my odds of winning are 3x as much, or 200% larger. But I don't believe anyone would see a reduction from 1:195,249,054 to 1:65,083,018 as "significant" - for all practical purposes, your odds are still "1:a really big number, so don't buy that boat quite yet".
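
For what it's worth, the arithmetic checks out; a quick sanity check (nothing here beyond the numbers quoted above):

```python
# Three tickets divide the denominator of the odds by three.
odds_one_ticket = 195_249_054
print(odds_one_ticket / 3)   # 65083018.0 -- matches the 1:65,083,018 figure
```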

Re:Percentage? (1)

vadim_t (324782) | about 5 years ago | (#29662061)

At 3,751 errors per DIMM/year it means that a system with 2 sticks (very common for dual channel) is getting 20 bits flipped per day. The question then is how long will it take for that to screw up something important.

Since a modern machine has plenty of RAM for disk cache, and in many workloads most memory would be dedicated to that, this would easily mean that every day some software operates on data that's not exactly what was on disk, and if you write any significant amount of data back, it's quite possible you're writing the wrong thing as well.

Since the data on disk persists, this means that your data is getting constantly more screwed up.
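
Quick check of that rate (assuming, as the quote says, that 3,751 is a per-DIMM, per-year mean, and picking two DIMMs and 365 days as the example):

```python
# Back-of-the-envelope check of the "20 bits flipped per day" claim.
errors_per_dimm_per_year = 3751
dimms = 2                                        # a typical dual-channel setup
per_day = errors_per_dimm_per_year * dimms / 365
print(round(per_day, 1))                         # ~20.6 correctable errors per day
```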

Re:Percentage? (1)

ByOhTek (1181381) | about 5 years ago | (#29661673)

Switching to high-quality memory, PSU & UPS has made my systems unbelievably reliable the last several years. YMMV, but I doubt by much.

I'll second this. Once or twice I skimped on mobo or memory in a pinch, and those have been the only machines of mine to have stability issues post-Windows 98. (Even in Windows 98 I could get about 3 weeks of uptime before needing a reboot. It sucked, but it wasn't as bad as what some people had to deal with.)

Re:Percentage? (2, Interesting)

Red Flayer (890720) | about 5 years ago | (#29661277)

Humorous ordering of replies to this article.

Your post:

Add to that the fact that Google (apparently) tends to run their data centers "hot" compared to what is commonly accepted, and use significantly cheaper components, and you've got a good explanation for why their error count is as high as it is.

Post before yours:

From the study's abstract:
"We find that temperature, known to strongly impact DIMM error rates in lab conditions, has a suprisingly small effect on error behavior in the field, when taking all other factors into account."

The 'components' bit of your post may be spot-on, but the juxtaposition of your temperature claim, along with the previous poster's quoting of the abstract FTA, is funny (to me, anyway).

Re:Percentage? (1)

Aneurysm (680045) | about 5 years ago | (#29661211)

I'm much too lazy to do the math. Let's round up - 4k errors per year has to be a vanishingly small percentage for a system that is up 24/7/365, or 5 nines. The fact that these DIMMs were "stressed" makes me wonder about the validity of the test. Heat stress, among other things, will multiply errors far beyond what you will see in normal service.

Except it depends on how the modules were originally tested. The study is saying that they break more than previously thought, rather than that they break a lot. If they were originally tested in a stressed system similar to Google's, and Google is finding that they have far more errors than they should, then the study is still valid.

Re:Percentage? (1)

SnarfQuest (469614) | about 5 years ago | (#29661611)

To make it easier to comprehend, 3,751 per DIMM per year means that you are getting about 10 errors per day per memory stick. Most machines have at least 2 sticks, so that is 20 errors per day. Since you probably don't have error correction built into your machine, that means those 20 errors actually cause something wrong to happen in your machine. You can hope it's causing the screensaver problems, but it could be doing something very bad to you.

Re:Percentage? (3, Insightful)

The Archon V2.0 (782634) | about 5 years ago | (#29661921)

"a mean of 3,751 correctable errors per DIMM per year."

Hey, the ECC did its job! Let's all go home.

I'm much too lazy to do the math.

I tried, based on the abstract. Wound up getting a figure of 8% of 2 gigabyte systems having 10 RAM failures per hour and the other 92% being just peachy. While a few bits going south is AFAIK the most common failure state for RAM, some of those RAM sticks must be complete no-POST duds and some are errors-up-the-wazoo massive swaths of RAM corrupted, so that throws my back of the envelope math WAY off....

In other words, big numbers make Gronk head hurt. Gronk go make fire. Gronk go make boat. Gronk go make fire-in-a-boat. Gronk no happy with fire-in-a-boat. Boat no work, and fire no work, all at same time.

Sorry, lost my thread there. So yeah, complex numbers, hard math, random assumptions that bugger our conclusions and maybe bugger theirs.

The fact that these DIMMs were "stressed" makes me wonder about the validity of the test. Heat stress, among other things, will multiply errors far beyond what you will see in normal service.

The problem with something like this is the assumption that Google world == real world.

This RAM is all running on custom Google boards that no one else has access to, with custom power supplies in custom cases in custom storage units. To the researchers' credit, they split things by platform later on, but that just means Google-custom-jobbie-1 and Google-custom-jobbie-2, not Intel board/Asus board/Gigabyte board. Without listing the platforms down to chipsets and CPU types (not gonna happen), it's hard to compare data and check methodology.

While Google is the only place you're going to find literal metric tons of RAM to play with, the common factor that it's all Google might be throwing the numbers off. At least some confirmation that these numbers hold at someone else's data center would be nice.

But then, I didn't RTWholeFA, so maybe I missed something.

Re:Percentage? (1)

clarkn0va (807617) | about 5 years ago | (#29661965)

It says in the article that the study found temperature not to be a factor.

Re:Percentage? (1)

Runaway1956 (1322357) | about 5 years ago | (#29662205)

Yes, I saw that, and it was also pointed out earlier in this discussion. I, for one, am not willing to accept that statement. It should be noted that a lot of "assumptions" were made in this study, and that those assumptions are referred to throughout the TFA and the PDF. Of all the hardware errors I've ever dealt with, heat was the most common problem.

Gentoo?? (1, Funny)

Anonymous Coward | about 5 years ago | (#29661091)

I use Gentoo; how does this affect me?

Re:Gentoo?? (4, Funny)

Runaway1956 (1322357) | about 5 years ago | (#29661331)

I would suspect that it has no bearing on you at all. Simply chanting "Gentoo Gentoo Gentoo" should cure any and all hardware errors. You're safe, AC.

I'll keep this fool occupied, someone go call the guys in white coats for me.

Re:Gentoo?? (0)

Anonymous Coward | about 5 years ago | (#29662065)

I think the idiot here is you and whoever voted troll.

Compiling is affected by HW errors.
So Gentoo is more at risk than others.

His question is to what degree.

Re:Gentoo?? (3, Funny)

Anonymous Coward | about 5 years ago | (#29661533)

If you use Gentoo, you'll have to make your own DRAM from the schematics.

Re:Gentoo?? (1)

K. S. Kyosuke (729550) | about 5 years ago | (#29661587)

It means that you have a custom kernel with a random bug exclusively designed for you. Limited series only!

ZFS (0)

Hatta (162192) | about 5 years ago | (#29661117)

This really makes me want to use ZFS.

Re:ZFS (4, Insightful)

jfengel (409917) | about 5 years ago | (#29661203)

Changing your file system solves RAM errors how?

Re:ZFS (1, Funny)

Anonymous Coward | about 5 years ago | (#29661233)

it reduces the effects of universal entropy, obviously.

Re:ZFS (1, Funny)

Anonymous Coward | about 5 years ago | (#29662055)

Are you kidding?! It's 100x greater than we thought!!

Re:ZFS (1)

The Archon V2.0 (782634) | about 5 years ago | (#29662067)

it reduces the effects of universal entropy, obviously.

Sorry, you're looking for the thread two doors over, "Universe Has 100x More Entropy Than We Thought"

Re:ZFS (0)

Anonymous Coward | about 5 years ago | (#29661273)

Corrupted data in memory gets written to disk files. A file system that will let you roll back files when an error is detected will mitigate that situation.

Re:ZFS (2, Informative)

fuzzyfuzzyfungus (1223518) | about 5 years ago | (#29661287)

Just as likely to crash, less likely to silently scribble bits of nonsense all over the filesystem before doing so...

Obviously, not having RAM errors would be even nicer; but if you can at least detect trouble when it arises, rather than well afterwards, you can avoid having it propagate further, and get away with using cheap redundancy instead of expensive perfection.

Re:ZFS (2, Interesting)

profplump (309017) | about 5 years ago | (#29661439)

Adding checksumming adds another place for errors to occur, though -- if data is written correctly but the checksum is miscalculated, either before it is stored or when the data is being verified, you'll end up throwing out perfectly good data. If you also have redundancy you're probably willing to live with that, but if you're running on a single disk, ZFS is just adding more opportunities for data corruption in RAM.

Re:ZFS (0)

Anonymous Coward | about 5 years ago | (#29661337)

Swap?

Re:ZFS (0)

Anonymous Coward | about 5 years ago | (#29661515)

And why would you choose ZFS anyway? ReiserFS makes errors disappear.

Re:ZFS (2, Funny)

K. S. Kyosuke (729550) | about 5 years ago | (#29661629)

It makes the problem magically go away by redirecting his attention to a catchy new gadget.

Yippee (-1, Troll)

Anonymous Coward | about 5 years ago | (#29661153)

More kdawson FUD! Yay!!

Not that many (0)

Anonymous Coward | about 5 years ago | (#29661159)

I nmver havm any DRIM error{ on my comp}ter6

Bus errors! (5, Informative)

redelm (54142) | about 5 years ago | (#29661251)

Hard DRAM errors are rather hard to explain if the cells are good -- maybe a bad write. After much DRAM testing (I use memtest86+ weeklong), I've yet to find bad cells.

What I have seen (and generated) is the occasional (2-3/day) bus error with specific (nasty) data patterns, usually at a few addresses. I write that off to mobo trace design and crosstalk between the signals. Failing to round the corners sufficiently, or leaving spurs, is the likely problem. I think HyperTransport is a balanced design (push-pull differential, like Ethernet) and should be less susceptible.
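
For anyone curious what a data-pattern test even looks like, here is a toy sketch of the idea (my own, not memtest86+'s actual algorithm; a user-space Python buffer only exercises whatever pages the interpreter happens to get, so treat it as illustration only):

```python
# Toy fixed-pattern memory test, very loosely in the spirit of memtest86+.
def pattern_test(size=1 << 20, patterns=(0x00, 0xFF, 0x55, 0xAA)):
    buf = bytearray(size)
    mismatches = []
    for p in patterns:
        for i in range(size):
            buf[i] = p                       # write the pattern
        for i in range(size):
            if buf[i] != p:                  # read it back and compare
                mismatches.append((i, p, buf[i]))
    return mismatches

print(pattern_test(1 << 16))                 # expect [] on healthy hardware
```

A real tester runs from bare metal, walks physical addresses directly, and cycles through far nastier patterns (walking ones, moving inversions, address-in-address), which is what tends to shake out the crosstalk-style failures described above.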

Re:Bus errors! (2, Informative)

marcansoft (727665) | about 5 years ago | (#29661327)

I had a RAM stick (256MB DDR I think) with a stuck bit once. At first I just noticed a few odd kernel panics, but then I got a syntax error in a system Perl script. One letter had changed from lowercase to uppercase. That's when I ran memtest86 and found the culprit.

At the time, a "mark pages of memory bad" patch for the kernel did the trick and I happily used that borked stick for a year or so.

Re:Bus errors! (2, Insightful)

CastrTroy (595695) | about 5 years ago | (#29661643)

I find that more often than not, when people get blue screens or frequent crashes, it's due to a bad RAM chip. I think it's kind of a bad thing that most motherboards don't really test the RAM when you boot up. Usually running the full RAM test will pick up on most memory errors; you don't even need to run memtest. Sure, you save a few seconds on boot-up, but it's often better to know there is a problem with your memory than to go on for months thinking there is some other problem.

Re:Bus errors! (1)

SuperQ (431) | about 5 years ago | (#29662387)

Yup, I'm really disappointed that consumer PCs are still lacking ECC ram. The support for it is in all the chipsets, but it adds $5 to the cost of the machines. Oh well.

Re:Bus errors! (2, Interesting)

dotgain (630123) | about 5 years ago | (#29661745)

I had one mobo, can't remember the brand/model exactly but the CPU was an AMD K6-2 450MHz, and back then we ran XFree86, which came as seven gzipped tarballs (if you compile from source). I think it was file number three that would never gunzip on my PC, "invalid compressed data - CRC error", but the MD5 checked out, so I tried it on another machine and it was indeed fine (and this is back when MD5 was thought secure).

This machine compiled a lot of source (it was a Gentoo box), so surely if errors like these had been happening frequently we'd have known from heaps of signal-elevens killing the compiles all the time, right?

~24 hours of Memtest86 revealed nothing. Googling at the time found someone with the exact same mobo+CPU having problems gunzipping the exact same file (with the correct MD5), and I wondered if there was some specific bit-pattern in the file (or gunzip's state) that b0rked on my mobo. In retrospect I should have tried Solaris x86 on the same machine to try gunzipping the file.

Re:Bus errors! (0)

Anonymous Coward | about 5 years ago | (#29661485)

You are more likely to have crosstalk and SI (Signal Integrity) problems with memory than with chip-level communication.
HyperTransport is not going to save you when the memory bus is single-ended.

- Point-to-point connections between the CPU and North Bridge (on older CPUs) are as good as it gets as far as SI is concerned.
- Memory modules are bussed together, i.e. point to multi-point. Signals cross connectors, where you can have crosstalk issues because of non-ideal grounding.
- Memory chips on modules are neither buffered nor routed point to point; they rely on a balanced branch topology. If the loading is unbalanced, then the topology is broken. ;P

What might matter is how the board layout is done: decoupling, and circuit design on termination and the on-board power supply. All of these would affect the eye opening on the signals and the noise that the system would tolerate.

Re:Bus errors! (1)

afidel (530433) | about 5 years ago | (#29661869)

That's why Nehalem server boards have ECC on the buses, just like real servers have had since forever.

Re:Bus errors! (1)

Cyberax (705495) | about 5 years ago | (#29661929)

Some hard errors occur because of natural alpha-decay - even one alpha particle can flip a bit. Also, energetic cosmic rays can cause problems.

An explanation (1, Funny)

Anonymous Coward | about 5 years ago | (#29661271)

Maybe this is explainable by today's story that the universe has 100x more entropy than we thought [slashdot.org]

Not posqible? (0)

Anonymous Coward | about 5 years ago | (#29661291)

Bad memory bits? How is thap posqible?

And here I thought people were swearing at me (%^&%$&) in email when it really was just bad bits. Whew... What a relief!

ECC on a home system? (4, Interesting)

eison (56778) | about 5 years ago | (#29661301)

I've always thought it would be a nice-to-have feature for my home system to have ECC - perhaps it might degrade over time and misbehave less if it could detect and fix some errors. But my normal sources don't seem to stock many choices. E.g. Newegg appears to have 2 motherboards to choose from, both for AMD CPUs, nothing for Intel. Fry's appears to have one, same, AMD only. Is this just the way things are, or do I need to be looking somewhere else? Would picking one of these motherboards end up not working out well for my gaming rig?

Re:ECC on a home system? (1)

binarylarry (1338699) | about 5 years ago | (#29661357)

ECC is slightly slower.

Re:ECC on a home system? (2, Informative)

vadim_t (324782) | about 5 years ago | (#29661719)

ECC is slower by something like 1%, which is completely unnoticeable since RAM contributes relatively little to the overall system performance. 2x faster RAM won't make things run twice as fast, because normally CPU caches get a > 90% hit ratio. Otherwise things would be incredibly slow, as the fastest RAM still is horribly slow and has a horrible latency compared to the cache.
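
To put that "1%" in context, here is a rough sketch of average memory access time with a cache in front of DRAM; the hit rate and latencies below are assumptions for illustration, not measurements:

```python
# Average memory access time (AMAT) estimate, all numbers assumed:
# 95% cache hit rate, 2 ns cache hit, 60 ns DRAM access.
hit_rate, t_cache, t_dram = 0.95, 2.0, 60.0
amat_plain = hit_rate * t_cache + (1 - hit_rate) * t_dram
amat_ecc = hit_rate * t_cache + (1 - hit_rate) * (t_dram * 1.02)   # ECC DIMM ~2% slower
print(amat_plain, amat_ecc)   # 4.9 vs ~4.96 ns: roughly a 1% difference overall
```

Because most accesses are served from cache, even a noticeably slower DIMM moves the average by only a percent or so, which is the point being made above.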

Re:ECC on a home system? (0)

Anonymous Coward | about 5 years ago | (#29661445)

I used to use ECC for my home systems, but they've segmented the market so completely it's now virtually impossible to get it unless you suck it up and pay "workstation" prices. Sorry, I don't need 8 cores and 16GB of RAM.

I long for the good ol' days of the trusty Intel 440BX. :(

Re:ECC on a home system? (1)

conureman (748753) | about 5 years ago | (#29662285)

I guess it's gettin' pretty long in the tooth, but my favorite home board is a one-socket Opteron. It's only got four gigs of RAM, though (and two empty slots).
What I learned from TFA is that I didn't do anything but piss everyone off with the "heroic" cooling I've been doing all these years. I've never lost a HDD, and I've always blamed the wind tunnel factor. Live and learn, eh?

Re:ECC on a home system? (1)

swb (14022) | about 5 years ago | (#29661467)

You'd probably have to look at server boards rather than desktop boards.

http://bit.ly/16EUiC [bit.ly]

Link to Newegg with filtered set of ECC compatible server boards.

But you'll pay a lot more and probably need a larger case and a bunch of other BS, although it looks like there are some ATX form factor boards.

Re:ECC on a home system? (0)

Anonymous Coward | about 5 years ago | (#29661491)

I am almost positive that the latest Intel desktop chips, the i7s, do not support ECC memory. The older Core 2 Quads do, though. Last time I checked, Newegg had a Supermicro board for the socket 775 Intel quad cores. The motherboard I am speaking of supports 1333-speed DDR3. Might be a nice way to build a low-cost Intel single-socket quad-core server with ECC unbuffered memory.

Re:ECC on a home system? (1)

GiMP (10923) | about 5 years ago | (#29662291)

You're right, the i7 does not support ECC. You need to instead run a Lynnfield or Bloomfield Xeon processor, which are, like the i7, based on Nehalem.

Re:ECC on a home system? (1)

Spoke (6112) | about 5 years ago | (#29661521)

Because AMD Athlon/Phenom CPUs have the memory controller integrated into the CPU, the CPU (not the motherboard) actually dictates what type of RAM you can use.

For all the desktop class AMD Athlon/Phenom CPUs, you can use un-buffered ECC memory. Just make sure it's not buffered or registered. You need an Opteron to use buffered or registered memory.

If you want an Intel processor, you have to use a Xeon (and the right mobo) to use ECC memory.

Re:ECC on a home system? (1)

UserChrisCanter4 (464072) | about 5 years ago | (#29661529)

ECC is a server-targeted feature. Newegg has 18 mainboards that support ECC listed in the Dual LGA 1366 category alone, and I'd imagine plenty more scattered throughout their server board categories.

As you've already discovered, though, it's not terribly common on home-targeted boards. You're welcome to use one of those boards for gaming, but you'll probably have to use a pricier Xeon or Opteron processor, more expensive ECC RAM, and suffer with slower PCI-E links for your videocards. Higher prices and similar or slower gaming performance is probably not what you're interested in.

You'll also have to assume that a bit will flip in an area of RAM that's actually holding information that's important at the moment that bit flips; it's a useless feature if nothing's in the bit of RAM that accidentally flips. It's extremely useful on servers that are on 24/7, always stressed, and likely to have the RAM completely filled with important information. For home users, it falls on the wrong side of the cost/benefit test.
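
A crude way to put a number on that cost/benefit intuition; every figure below is an assumption chosen just to show the shape of the estimate, not anything from TFA beyond the rough per-day rate discussed elsewhere in the thread:

```python
# Crude expected-impact estimate; all inputs are illustrative assumptions.
flips_per_day = 10          # per DIMM, roughly 3,751/year from the study's mean
fraction_hot = 0.25         # guess: share of RAM holding data you'd actually miss
print(flips_per_day * fraction_hot)   # ~2.5 flips/day landing somewhere that matters
```

Whether that's worth paying for comes down to how big you believe fraction_hot is on your own machine, which is exactly the judgment call being described above.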

Re:ECC on a home system? (0)

Anonymous Coward | about 5 years ago | (#29661541)

You need to look at Tyan motherboards: http://www.tyan.com/

They have been making fewer "desktop" boards over the years, but the lines between server, desktop, and workstation are awfully blurry these days.

Re:ECC on a home system? (4, Informative)

DAldredge (2353) | about 5 years ago | (#29661659)

A lot of the AMD boards support ECC RAM but Newegg doesn't show it. Most every AM2 motherboard supports it. My main workstation at home is a Phenom II with 8GB of ECC RAM mainly for that reason.

gulff (-1, Troll)

Anonymous Coward | about 5 years ago | (#29661377)

This article caused me to pour hot grits down my pants.

Re:gulff (0)

Anonymous Coward | about 5 years ago | (#29661441)

How sick is it that I contemplated moderating this as funny? The old troll-esque memes were so innocent; I really miss them.

... and Natalie Portman.

Re:gulff (0, Offtopic)

K. S. Kyosuke (729550) | about 5 years ago | (#29661693)

Well, hot grits have never been too rich, not even those you have been keeping in your pants all the time.

Dell (5, Interesting)

^_^x (178540) | about 5 years ago | (#29661389)

In my experience at work ordering Dell desktops and laptops, by far the most common defect is 1-3% of machines with bad RAM. Typically it's made by Hynix, occasionally Hyundai, and I've never seen other brands fail. On many occasions though, I've predicted Hynix, pulled it, and sure enough theirs was the piece causing the errors in Memtest86+...

Re:Dell (5, Interesting)

Jah-Wren Ryel (80510) | about 5 years ago | (#29661723)

Hyundai is Hynix, and they are the second-largest DRAM manufacturer by market share (roughly 20%, second to Samsung's 30%).

It's no surprise that you've only seen the Hynix brand fail in Dells; chances are they are in 90%+ of Dell (and HP and Apple) boxes because they primarily buy from Hynix in the first place. It's selection bias.

Re:Dell (1)

wiredlogic (135348) | about 5 years ago | (#29661791)

Hynix is the former Hyundai Electronics.

Re:Dell (1)

noidentity (188756) | about 5 years ago | (#29661797)

If Hynix is simply used in most RAM modules, then even with the same failure rate you're more likely to find that brand in failed modules. And if your sample is small enough, you could easily find no other brand in failed modules.

Google's system validation vs standard validation (0)

Anonymous Coward | about 5 years ago | (#29661403)

Isn't Google running many servers at unusually high ambient temperatures and in very uniform custom configurations? The results may not apply to anybody else.

I thought that an inability to recall events (3, Funny)

bugs2squash (1132591) | about 5 years ago | (#29661409)

was only a problem for government computers.

Re:I thought that an inability to recall events (1)

rrhal (88665) | about 5 years ago | (#29661527)

And all this time we blamed Microsoft ...

Misleading, to say the very least. (5, Interesting)

jhfry (829244) | about 5 years ago | (#29661453)

Read the article and remember they are talking averages here.

They give it away with this line:

Only 8% of DIMMs had errors per year on average. Fewer DIMMs = fewer error problems - good news for users of smaller systems

Essentially, only 8% of their ECC DIMMs reported ANY errors in a given year.

Also this was pretty telling:

Besides error rates much higher than expected - which is plenty bad - the study found that error rates were motherboard, not DIMM type or vendor, dependent.

And this:

For all platforms they found that 20% of the machines with errors make up more than 90% of all observed errors on that platform.

So essentially, they are saying that only 8% of DIMMs reported errors, 90% of which were on 20% of the machines that had errors, mostly because of motherboard issues... yet DIMMs are less reliable than previously thought.

I would imagine that if you removed all of the bad motherboards, power supplies, environmental, and other issues, DIMMs would actually turn out to be more reliable than I previously thought, not less! I wonder what percentage of CPU operations yield incorrect results. With billions of instructions per second, even an astronomically low rate of undetected CPU errors would guarantee an error at least as often as failed DIMMs.

What I did take from the article was that without ECC ram, you have no way of knowing that your RAM has errors. I guess I should rethink my belief that ECC was a waste of money.

Re:Misleading, to say the very least. (1)

PRMan (959735) | about 5 years ago | (#29662057)

I do remember reading an article where I was surprised that Google used such low-quality cheap hardware...

That being said, this isn't really that surprising. Like another poster said, once I started buying quality motherboards (Asus) and quality RAM brands, I really haven't had any problems.

"RAID"-style system for RAM... (4, Interesting)

MattRog (527508) | about 5 years ago | (#29661483)

RAM is dirt cheap and most server systems support significantly more RAM than most people bother to install. For critical systems, ECC works but that doesn't prevent everything (double bit errors etc.). Is it time for a Redundant Array of Inexpensive DIMMs? Many HA servers now support Memory Mirroring (aka RAID-1 http://www.rackaid.com/resources/rackaid-blog/server-dysfunction/memory_mirroring_to_the_rescue/ [rackaid.com] ) but should there be more research into different RAID levels for memory (RAID5-6, 10, etc?)

Want to confirm? Look at your bittorrent log. (4, Interesting)

sshir (623215) | about 5 years ago | (#29661627)

Seriously. If you download a lot, and I do, you see quite a few checksum mismatches in the log.
Especially if the torrent is old. Some of them may be sabotage activity, but I doubt that, considering the kind of files.

They are not transmission errors: TCP/IP checks for that. Not hard drive errors - again, checksums. They can be intra-system transmission errors, though.

I remember folks who did complete checkers wrote that they had a lot of them too.

Re:Want to confirm? Look at your bittorrent log. (4, Interesting)

rdebath (884132) | about 5 years ago | (#29662421)

The TCP/IP checksums are really weak, only 16 bits and rather a poor algorithm anyway. So more than one in 65 thousand errors will go undetected by a TCP/IP checksum. And that's not including buggy network adaptors and drivers that 'fix' or ignore the checksums.

If you're transferring gigabytes of data you really need something a lot better.

Still, that's probably not the most common source of errors. The same problem exists when data is transferred across an IDE or SCSI bus: if there's a checksum at all, it's very weak, and the amounts of data transferred across a disk bus are scary.
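
For reference, the checksum in question is the 16-bit ones'-complement sum; a minimal sketch in the style of RFC 1071 (my own, not lifted from any particular network stack):

```python
import struct

def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement checksum, RFC 1071 style (sketch only)."""
    if len(data) % 2:
        data += b"\x00"                           # pad odd-length input
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back into 16 bits
    return ~total & 0xFFFF

# With only 65,536 possible values, a random corruption has roughly a 1-in-65,536
# chance of leaving the checksum unchanged -- hence "one in 65 thousand" above.
print(hex(internet_checksum(b"hundreds of gigabytes of torrent data")))
```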

Re:Want to confirm? Look at your bittorrent log. (1)

pavon (30274) | about 5 years ago | (#29662425)

That's interesting. If you were checking with a newer version of uTorrent, you may have been using UDP, and not TCP. They added UDP capability about a year ago, and I assume others have as well. I don't know if they do error correction on a per-packet basis or rely on block checksums.

Corsair (0)

Anonymous Coward | about 5 years ago | (#29661641)

I had a pair of Corsair sticks that caused me months of grief. I would get kernel panics that gave absolutely no indication that memory was to blame, and memory tests and stress tests were never able to reproduce the problem. After 9 months I decided to try ignoring all indications that something else was wrong and bought new RAM. Sure enough, 12 months since then, and I haven't had a single problem. I suspect it's an issue to do with timing more than the storage medium itself, which supports Google's theory that it's often caused by bad motherboard design.

Radiation Effects (4, Interesting)

Maximum Prophet (716608) | about 5 years ago | (#29662023)

At Purdue, many years ago, one of the engineers mapped the ECC RAM errors in a room with hundreds of SPARCstations and found that they fell mostly in a cone shape pointed toward the window. That window looked out on a pile of coal, so the culprit was assumed to be low-level alpha radiation.

Re:Radiation Effects (1)

imsabbel (611519) | about 5 years ago | (#29662135)

Well.
Bullshit.

Sorry, but true. Look up alpha radiation if you want to know why.

clearly not a radiation engineer (5, Insightful)

SuperBanana (662181) | about 5 years ago | (#29662191)

That window looked out to a pile of coal, so the culprit was assumed to be low level alpha radiation.

Alpha radiation is stopped by a sheet of office paper. It certainly wouldn't make it through the window, through the machine case, electromagnetic shield, circuit board, chip case, and into the silicon. Even beta radiation would be unlikely to make it that far.

What is much more likely: thermal effects, i.e. infrared from the sun heating up machines near the window.

Mainboards (1)

conureman (748753) | about 5 years ago | (#29662047)

Alrighty then, which mainboards have the lowest error rates? TFA seems to have obfuscated that. That's MS's job; I thought Google was supposed to Do No Evile?

Error (0)

Anonymous Coward | about 5 years ago | (#29662119)

Um, what was the topic again? My memory isn't what it used to be.

Lessons learned from *Non* ECC RAM (3, Insightful)

Rashkae (59673) | about 5 years ago | (#29662395)

My takeaway from this paper is that maybe Google should hire more technicians who are experienced with non-ECC RAM systems. They even believed, prior to this study, that soft errors were the most common error type. I could have told you from the start that was bunk. In over 15 years of burn-in tests as part of PC maintenance, the number of soft errors I've observed is... 0. Either the hardware can make it through the test with no error, or there is a DIMM that will produce several errors over a 24-hour test. This doesn't mean that random soft errors never happen when I'm not looking/testing, but the 'conventional wisdom' that soft errors are the predominant memory error doesn't even pass the laugh test.

From looking at the numbers in this report, I get the feeling that hardware vendors are using ECC as an excuse to overlook flaws in flaky hardware. I would now be really interested in a study that compares the real-world reliability of ECC vs non-ECC hardware that has been properly QC'd. I'll wager the results would be very interesting, even if ECC still proves itself worth the extra money.
