Slashdot: News for Nerds

To ECC Or Not To ECC?

Cliff posted more than 12 years ago | from the buying-the-best-RAM-for-the-buck dept.

News 46

MetaHiro asks: "I'm going to be upgrading my system in a couple of weeks. I've been looking around the net for reviews and/or benchmarks for ECC vs. non-ECC in both speed and whether or not it's worth it to shell out the extra bucks for ECC. I'm also wondering whether or not i should buy PC2100 ECC instead of PC 2700 non-ECC ram or wait until PC2700 ECC becomes available."

46 comments

Depends (2, Informative)

linuxator (529956) | more than 12 years ago | (#3424134)

For what are you using your system? If it's just another gaming PC then ECC isn't worth it.

Re:Depends (1)

linzeal (197905) | more than 12 years ago | (#3424139)

Unless it is a mission critical server the speed penalty + money penalty is not worth it.

Re:Depends (1)

ClickNMix (218488) | more than 12 years ago | (#3424258)

I'd agree with that.

From what I know of ECC stuff, unless your business requires you to have 100% uptime or some such (and therefore need to get every last bit of protection you can), you're just throwing your money away.

But, if you're interested in getting every last inch(?) of power/speed you can, I suppose you need to look at the cost to speed ratio. But my gut instinct is to say get more rather than slightly better memory.

Re:Depends (0)

Anonymous Coward | more than 12 years ago | (#3424751)

Everyone seems to be acting on the assumption that ECC is expensive, but the last time I bought PC133 ECC SDRAM, it was like $2 more than the normal stuff. Seems worth it to me to have fewer crashes.

The real cost is probably in the mobos -- the cheap stuff doesn't support ECC, and there's some mid-level stuff that "supports" ECC by detecting it but not actually correcting memory errors (Abit?)

ECC apparently slows down your system 5% or so -- not enough to notice, but if you are 0vercl0x1ng your box, you probably wouldn't bother.

Re:Depends (0)

Anonymous Coward | more than 12 years ago | (#3424468)

The system will mainly be used to download divx movies, and Rocco and BangBus movies. Perhaps some Kobi Tai. But in any case, a whole lot of hedonistic un-Lutheran like activities.

I can't understand ECC (4, Informative)

EricLivingston (162103) | more than 12 years ago | (#3424252)

I've got PC2100 ECC in my server at home, and I've turned ECC checking off in the BIOS. What I can't fathom is this: In the past several months I've gotten a couple of parity errors in my memory. However, instead of warning me in some way and allowing me to gracefully shut down, the error raises a non-maskable interrupt which halts the machine in its tracks, giving me a Blue Screen of Death and requiring a hard reset.

How is this helpful? The philosophy behind that seems to be rather than allow my programs to continue with a corrupt bit of data, it's better to halt all operation and LOSE ALL MY DATA and perhaps corrupt my hard drive. That's "help" I don't need.

Is this universal, or just my OS (W2K), BIOS, or hardware? Is there a way for ECC to simply and calmly report a problem without locking up my machine in the process?

Re:I can't understand ECC (3, Informative)

geirt (55254) | more than 12 years ago | (#3424420)

Because you turned off the ECC checking ....

The idea with ECC is that the ECC controller (a piece of hardware between the CPU and the RAM) should detect the bit error and correct it "on the fly", so that the application is not affected by the bit error at all. You get a speed penalty by doing ECC, because the ECC controller has to calculate a check sum for every write, and check the check sum for every read. Even worse, the ECC check sum is block based, so the ECC controller has to read the whole block to calculate the check sum even if the CPU only reads a single byte. The same goes for writes. To use ECC you need special RAM which is 72 bits wide instead of 64 bits; the extra bits are used for the check sum. This also explains why ECC RAM is more expensive than non-ECC RAM.
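
The 72-versus-64 figure can be checked with a little arithmetic. The sketch below assumes the common "single error correction, double error detection" layout (a Hamming code plus one overall parity bit); the function name is made up for illustration, and real chipsets may use a different code with the same overhead.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits needed for a Hamming code extended with one overall parity bit."""
    r = 0
    # Hamming condition: r check bits must be able to name every bit
    # position (data + check) plus the "no error" case.
    while 2 ** r < data_bits + r + 1:
        r += 1
    # One extra overall parity bit turns single-error correction into
    # single-error correction plus double-error detection.
    return r + 1


for width in (8, 16, 32, 64):
    extra = secded_check_bits(width)
    print(f"{width} data bits + {extra} check bits = {width + extra} bits stored")
```

For a 64-bit data path this gives 8 check bits, which matches the 72-bit module width mentioned above.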

Re:I can't understand ECC (3, Informative)

tzanger (1575) | more than 12 years ago | (#3424512)

You get a speed penalty by doing ECC, because the ECC controller has to calculate a check sum for every write, and check the check sum for every read. Even worse, the ECC check sum is block based, so the ECC controller has to read the whole block to calculate the check sum even if the CPU only reads a single byte.

It's my understanding that ECC works on the smallest data width available, which is 64 bits on SDRAM anyway. When your P4/2.1G grabs a single byte and there's a cache miss, it fetches the entire row (8 bytes) from RAM anyway. Block-based, yes, but no bigger than normal.

As for speed performance, give me a break. The RAM is zillions of dollars more expensive not only because of the extra memory cells but also because (IIRC) the code generation is done on-chip. The checking is done by the chipset, IIRC, and that is all done in hardware and I'm willing to bet in the same amount of time that you can do a normal read in anyway. I haven't been able to google up performance benchmarks on it though.

The original poster was getting parity errors not only because he had ECC checking turned off in the BIOS, but also because ECC cannot correct all bit errors. Most ECC schemes can correct single-bit errors, but double-bit and larger errors cannot be corrected because not enough correction code is stored. It's just like hard drives, CD-ROMs and DVDs: they can spot and correct incredible amounts of relatively small errors without telling anybody about it, but if the error is just too large they have to pass back a "bad data" message.

In fact, one of the main reasons that we have such huge amounts of storage in formats that we can actually paw up with our hands is that there is such a vast amount of error correction coding on the media.

Re:I can't understand ECC (3, Informative)

geirt (55254) | more than 12 years ago | (#3424631)

tzanger wrote:
> It's my understanding that ECC works on the smallest data width available, which is 64 bits on SDRAM
> anyway. When your P4/2.1G grabs a single byte and there's a cache miss, it fetches the entire row
> (8 bytes) from RAM anyway. Block-based, yes, but no bigger than normal.

Wrong. SDRAM is 64 bit wide, but can be written in units of 8 bit. A read is always 64 bit wide. A single 8 bit write to ECC RAM becomes a 64 bit read, check ECC, update byte to be written, recalculate ECC, write. So you are going to get a small speed penalty compared to the non ECC case of just a single 8 bit wide write.
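
The read-modify-write sequence described here can be sketched as a toy model. Everything in it is an assumption for illustration: the class and method names are invented, and the "check value" is a plain XOR of the eight bytes standing in for real ECC check bits.

```python
class EccWordMemory:
    """Toy memory made of 64-bit words, each stored with a check value."""

    MASK64 = (1 << 64) - 1

    def __init__(self, n_words):
        self.words = [0] * n_words
        self.checks = [self._check(0)] * n_words
        self.reads = 0
        self.writes = 0

    @staticmethod
    def _check(word):
        # Stand-in for real check-bit generation: XOR of the eight bytes.
        c = 0
        for i in range(8):
            c ^= (word >> (8 * i)) & 0xFF
        return c

    def read_word(self, idx):
        self.reads += 1
        word = self.words[idx]
        if self._check(word) != self.checks[idx]:
            raise RuntimeError("check mismatch")
        return word

    def write_word(self, idx, word):
        self.writes += 1
        self.words[idx] = word & self.MASK64
        self.checks[idx] = self._check(self.words[idx])

    def write_byte(self, idx, byte_no, value):
        word = self.read_word(idx)                    # 1. read the full word and verify it
        shift = 8 * byte_no
        word = (word & ~(0xFF << shift)) | ((value & 0xFF) << shift)  # 2. merge the new byte
        self.write_word(idx, word)                    # 3. recompute the check and write back


mem = EccWordMemory(4)
mem.write_byte(0, 1, 0xAB)
print(mem.reads, mem.writes)  # one read and one write for a single byte store
```

The point is only that a partial-word store costs a read as well as a write, because the check value covers the whole word.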

The penalty is small? (1)

Futurepower(R) (558542) | more than 12 years ago | (#3424737)


My understanding is that the speed penalty is very small. Is that correct?

Re:I can't understand ECC (2, Informative)

Spamboi (179761) | more than 12 years ago | (#3425774)

The thing you're missing is that in a modern system with caching, single-byte writes to RAM almost never happen. The CPU/cache system gathers all writes in its caches, and only writes out data from the caches when it either needs to make room in the cache for other data or is explicitly told to do so by the operating system.

Since all writes are from the cache, all the writes to memory are of cachelines (usually 16 or 32 bytes), so the memory subsystem just does a write, not a read-modify-write.

Andy

Re:I can't understand ECC (1)

jhines (82154) | more than 12 years ago | (#3424449)

ECC is error-correcting memory; it has the ability to fix a (single-bit) memory error on the fly.

I worked in a VAX shop a while ago, which had ECC memory. One day we started logging memory errors. Other than the error count going up, we didn't notice it, and could wait till the end of the day to call service, so they would come in after hours.

ECC is the answer to the problem you describe.

Re:I can't understand ECC (0)

Anonymous Coward | more than 12 years ago | (#3424453)

Why do you have an ECC server, but not use the ECC? Do you let the air out of your tires before you drive?

Re:I can't understand ECC (5, Informative)

martyb (196687) | more than 12 years ago | (#3424485)

I've got PC2100 ECC in my server at home, and I've turned ECC checking off in the BIOS. What I can't fathom is this: In the past several months I've gotten a couple of parity errors in my memory. However, instead of warning me in some way and allowing me to gracefully shut down, the error raises a non-maskable interrupt which halts the machine in its tracks, giving me a Blue Screen of Death and requiring a hard reset.

If you turned off ECC, then when there's a single bit error, the parity can detect it, but the ECC is not there to correct it -- and your computer raises an interrupt to flag it. Microsoft takes that to be a Very Bad Thing and throws up a BSOD. Turn on the ECC and you'll be protected from single bit errors and keep on running. If you're interested, what follows is a brief summary of parity and ECC from my long-ago experience and memory (which does NOT have ECC; so if anything I've written is wrong, I'd appreciate corrections from those with more recent experience/knowledge!)

But. ECC as implemented on PCs can't fix everything. It was years and years ago, but I once had to write some ECC routines to validate programs read into a diagnostic computer for VAXes and DEC-10s. Like in most things, there's a tradeoff between price, speed, and reliability.

First off, memory with no parity. (For the sake of example, I'll refer to storage units as bytes, but this could just as easily be applied to larger units of storage; e.g. 16 or 32 bit words.) A byte is stored simply as 8 bits. If there is an error writing or reading a bit from memory, there's no indication that anything is wrong. Your programs just keep running with bad values which, if in an instruction, can rapidly cause a crash. If the error is in data, someone's paycheck may be way off. Very Not Good.

Next, let's consider memory with parity. Parity comes in two forms: even parity and odd parity. For the sake of example, say we have "even parity". So, for a byte that contains an odd number of one bits, the parity bit would be set to one. If the byte contains an even number of one bits, then the parity bit would be set to zero. When a byte is read from memory, the parity is computed again and compared against the parity that was written when the byte was originally stored in memory. If the stored parity matches the calculated parity, all is well. If there is a discrepancy, it would raise an interrupt and you get the BSOD. But what happens if there are TWO bits that are in error? They'd cancel out each other in the parity calculation and it would appear things are okay. No BSOD, but things are not right.
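
A minimal sketch of the even-parity scheme just described, using one parity bit per byte. The function names are invented for illustration; real hardware does this in logic, not software.

```python
def even_parity_bit(byte: int) -> int:
    """Return 1 if the byte has an odd number of one bits, else 0."""
    return bin(byte & 0xFF).count("1") % 2


def parity_ok(byte: int, stored_parity: int) -> bool:
    """Recompute parity on read and compare with what was stored on write."""
    return even_parity_bit(byte) == stored_parity


original = 0b10110010                  # four one bits, so the parity bit is 0
stored = even_parity_bit(original)

one_flip = original ^ 0b00000100       # a single bit goes bad
two_flips = original ^ 0b00010100      # two bits go bad in the same byte

print(parity_ok(original, stored))     # True: nothing wrong
print(parity_ok(one_flip, stored))     # False: single error is detected
print(parity_ok(two_flips, stored))    # True: double error slips through
```

The last line is the case described above: two flipped bits cancel out in the parity calculation and the bad byte is accepted.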

Finally, let's look at memory with ECC. Although there are various levels of ECC, generally what is implemented in PCs fits the mold of "single error correction, double error detection". So, if there is a single bit error in a memory access, the ECC can detect and fix it. This is done by storing even more bits in addition to the byte to be stored. These bits are computed in such a way that if there is a single bit error, the use of the extra bits can identify and fix the bit that is in error. If there are two bits that are wrong (which would go unnoticed in the parity scheme) the ECC bits can be used to identify that there are two bits in error.
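
To make "single error correction, double error detection" concrete, here is a toy version that protects only 4 data bits with a Hamming(7,4) code plus one overall parity bit. This is a sketch under stated assumptions, not what any particular chipset does: PC memory applies the same idea to 64 data bits with 8 check bits, and the function names here are invented.

```python
def encode(nibble: int) -> list:
    """Encode 4 data bits into 8 bits: index 0 is overall parity, 1..7 are Hamming positions."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d          # data lives at non-power-of-two positions
    for p in (1, 2, 4):                             # check bits live at positions 1, 2, 4
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    code[0] = sum(code[1:]) % 2                     # overall parity over the other 7 bits
    return code


def decode(code: list):
    """Return (status, nibble); nibble is None when the error cannot be corrected."""
    code = list(code)
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i                           # XOR of the positions holding a 1
    overall_odd = sum(code) % 2 == 1
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        code[syndrome] ^= 1                         # single error; syndrome 0 means bit 0 itself
        status = "corrected"
    else:
        return "uncorrectable", None                # nonzero syndrome, even parity: two errors
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)
single = list(word); single[6] ^= 1
double = list(word); double[6] ^= 1; double[2] ^= 1
print(decode(word))     # ('ok', 11)
print(decode(single))   # ('corrected', 11)
print(decode(double))   # ('uncorrectable', None)
```

The double-error case is the one that plain parity would have accepted silently; here it is at least reported.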

For the truly paranoid, or where uptime is absolutely mandatory, it's possible to construct ECCs such that any bit errors could be detected and corrected. But, the tradeoff is that it would take a lot more bits and it would take more time to perform these calculations -- on every single memory access. And, of course, it would cost a lot more to have all those extra bits around as well as the circuitry to perform the ECC calculations.

So, if a BSOD is just an inconvenience for you, ignore the ECC. You'll get better performance at a lower cost for the memory.
If you're developing accounting programs or some medical application, then the downtime from a BSOD would be a Majorly Bad Event. ECC would protect you from single bit errors and your application would keep running; definitely a Very Good Thing.

In short, unless you're doing something completely out of the ordinary for a home user, just stick with the usual parity-backed memory. Hope this helps!

Re:I can't understand ECC (0)

Anonymous Coward | more than 12 years ago | (#3424571)

> it's possible to construct ECCs such that any bit errors could be detected and corrected

No it's not. Are you assuming that your parity bits can never be in error or something? Take any legal memory configuration, wipe it, and replace it with any other legal memory configuration. You cannot detect this as an error. So the number of bits you can detect error for is limited by the Hamming distance between your legal memory configurations.
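
The Hamming distance argument can be illustrated with two tiny codes. This is a sketch with invented helper names; the standard result it relies on is that a code with minimum distance d can detect up to d - 1 flipped bits and correct up to (d - 1) // 2 of them.

```python
from itertools import combinations


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two words differ."""
    return bin(a ^ b).count("1")


def min_distance(codewords) -> int:
    """Smallest distance between any two distinct legal words."""
    return min(hamming_distance(a, b) for a, b in combinations(codewords, 2))


# 2 data bits plus an even parity bit: every legal word has an even number of ones.
parity_code = [w for w in range(8) if bin(w).count("1") % 2 == 0]
# One data bit stored three times.
repetition_code = [0b000, 0b111]

for name, code in (("parity", parity_code), ("repetition", repetition_code)):
    d = min_distance(code)
    print(f"{name}: distance {d}, detects {d - 1}, corrects {(d - 1) // 2}")
```

Flip as many bits as the minimum distance and one legal word turns into another, which no decoder can flag.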

Re:I can't understand ECC (0)

Anonymous Coward | more than 12 years ago | (#3425047)

idiot. if you have 128MB of memory on a stick and you store 3 x 128MB chips instead and detect errors which crop up by checking all 3 copies of the same value you can detect any error in any of the bits--including the parity ones. yes it's not efficient but it works.

Re:I can't understand ECC (0)

Anonymous Coward | more than 12 years ago | (#3425891)

any "single" error. But two bit errors in the same logical location will be "corrected" to an incorrect value.
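
Both sides of this exchange show up in a three-copy majority vote. The sketch below assumes the simplest possible scheme (three full copies, bitwise voting); the function name is invented for illustration.

```python
def vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three copies: each output bit is whatever two or more copies agree on."""
    return (a & b) | (a & c) | (b & c)


value = 0b1010

# One copy damaged: the other two outvote it.
print(bin(vote(value ^ 0b0001, value, value)))           # 0b1010

# The same bit damaged in two copies: the bad bit wins the vote.
print(bin(vote(value ^ 0b0001, value ^ 0b0001, value)))  # 0b1011
```

The second call is the "corrected to an incorrect value" case: the vote has no way to know that the majority is wrong.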

Re:I can't understand ECC (0)

Anonymous Coward | more than 12 years ago | (#3426602)

Wow, we're up to two people that understand how error correction works now!

Re:I can't understand ECC - specifics (3, Interesting)

EricLivingston (162103) | more than 12 years ago | (#3424584)

Thank you for that explanation - I think I understand it better, though I'm confused a bit because the two times I've gotten a BSOD were with ECC enabled, which is why I turned it off to see if they would go away...

I've got a Tyan S2460 MB w/ 1Gig PC2100 ECC RAM & 2 1.4GHz Athlon MPs. It uses PhoenixServer BIOS, and the BIOS gives me these options re: ECC

SERR Signal Condition: (ECC error conditions that SERR# be asserted [sic])

  • None
  • Single Bit
  • Multiple Bit
  • Both

ECC Config: (No ECC, Checking Only, Checking and Correction and Checking, Correction with Scrubbing)

  • Disabled
  • Checking Only
  • Checking and Correction
  • Checking, Correction w/ Scrubbing

So, my question would be: if this is basically a home machine with no mission-critical stuff running on it, but I'd like to get some benefits from my expensive ECC RAM without BSODing, what settings would be best in this scenario? Right now I've got everything turned off (Signal Condition: None, ECC Disabled). Oh, and what the heck is scrubbing?

And yes, I've attempted to RTFM but it's little more than a pamphlet and I can't find good, clear info about this.

Re:I can't understand ECC - specifics (4, Informative)

Detritus (11846) | more than 12 years ago | (#3424757)

You want "Checking, Correction w/ Scrubbing".

Scrubbing detects and corrects memory errors that are in memory addresses that are idle. This prevents correctable errors from turning into uncorrectable errors in sections of memory that are infrequently accessed by the CPU.
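
Why scrubbing matters can be shown with a toy model. Big assumption, stated loudly: the sketch uses three stored copies and a majority vote as a stand-in for real ECC, because it is the shortest scheme that can correct one error per cell but not two. The class and method names are invented.

```python
def majority(a: int, b: int, c: int) -> int:
    return (a & b) | (a & c) | (b & c)


class ScrubbedMemory:
    """Toy memory: every cell is kept as three copies (stand-in for ECC check bits)."""

    def __init__(self, size):
        self.cells = [[0, 0, 0] for _ in range(size)]

    def write(self, addr, value):
        self.cells[addr] = [value, value, value]

    def read(self, addr):
        return majority(*self.cells[addr])

    def flip(self, addr, copy, mask):
        """Simulate a random bit error in one stored copy."""
        self.cells[addr][copy] ^= mask

    def scrub(self):
        """Walk all cells, including idle ones, and rewrite any that hold a correctable error."""
        repaired = 0
        for addr, copies in enumerate(self.cells):
            good = majority(*copies)
            if any(c != good for c in copies):
                self.cells[addr] = [good, good, good]
                repaired += 1
        return repaired


with_scrub, without_scrub = ScrubbedMemory(4), ScrubbedMemory(4)
for mem in (with_scrub, without_scrub):
    mem.write(2, 0x5A)
    mem.flip(2, 0, 0x01)      # first error in a rarely read cell

with_scrub.scrub()            # background pass repairs it before anything else goes wrong

for mem in (with_scrub, without_scrub):
    mem.flip(2, 1, 0x01)      # second error, some time later

print(hex(with_scrub.read(2)), hex(without_scrub.read(2)))
```

With the scrub pass the second error is again a lone, correctable one; without it the two errors pile up in the same cell and the read comes back wrong.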

Re:I can't understand ECC - specifics (1)

zero-one (79216) | more than 12 years ago | (#3424769)

You might want to try a few searches on groups.google.com for S2460 and the brand of memory you are using. I have a S2460 and I learned the hard way that it is a very fussy board when it comes to what memory you use with it. In particular, Crucial memory can cause problems (although I have now got my board working OK with Crucial memory). The basic rule appears to be that if the memory is not on the Tyan approved list it will be problematic.

Re:I can't understand ECC - specifics (1)

zero-one (79216) | more than 12 years ago | (#3424782)

(Sorry to reply to my own post)

Some other things you might want to try:
- get the 1.4 version of the BIOS
- set 'maxmem' in boot.ini
- if you have an nVidia GeForce, make sure you have the most recent drivers - some earlier drivers cause a lot of problems

Re:I can't understand ECC - specifics (2)

FreakyGeeky (23009) | more than 12 years ago | (#3425005)

I've got a Tyan Tiger MP too. Funny thing is, I get all kinds of lockups, blue screens, and reboots when ECC is disabled. The only way I can keep my machine running is to turn "Both" and "ECC Scrubbing" on in the BIOS. Hope it helps...

Re:I can't understand ECC - specifics (0)

Anonymous Coward | more than 12 years ago | (#3434084)

Thanks for that tip FreakyGeeky! Before now I could only run linux or Win2K Server without the BSoD blues... But now my 2KPro is going again, and I can even play Ghost Recon without crashing! woo hoo!@

Re:I can't understand ECC - specifics (1, Informative)

Anonymous Coward | more than 12 years ago | (#3425063)

Set SERR to None, which won't BSOD the machine by raising an error NMI, and set ECC to Checking, Correction w/ Scrubbing.

Re:I can't understand ECC (0, Troll)

Anonymous Coward | more than 12 years ago | (#3424744)

My girlfriend is pregnant. Will ECC fix that?

Re:I can't understand ECC (1, Informative)

Anonymous Coward | more than 12 years ago | (#3428182)

No, you have to use a redundant array of condoms to protect yourself from that kind of error. There are some drastic measures that can be taken to correct the error, but it may increase your chances of burning in hell.

3400+ Slashdotters Can't Be Wrong... (-1, Offtopic)

cybrpnk (94636) | more than 12 years ago | (#3424924)

Do yourself a favor and check out a new month-old internet site called Serence [serence.com] . Since April 2002 they have had a free / no ads / no spyware download for your Windows desktop called Klipfolio [serence.com] and this thing is great. According to the site statistics 3400+ slashdotters have already downloaded the Slashdot Klip [klipfarm.com] and after joining them today, I can see why. (No, I don't have any personal vested interest in this, I just think it's cool.) The Slashdot Klip stays on your desktop and downloads the XML feed from Andover/Slashdot containing current article headlines, alerting you when there is a new one. Klips from a few dozen other founding news sources [klipfarm.com] with XML newsfeeds are also available in a scrolling, dockable, resizable, skinnable package. In the lower left corner of the previous link you can suggest a new klip feed to Serence you'd like to see - a great thing for you to do, the more sites that use this, the better for all of us. You can even start up your own personal Klip feed [serence.com] ! Rack up your favorite sites in one desktop package and you are really in control...click on a headline, up comes the article, click on the site symbol, up comes the home page. Like any dot-com, Serence's success depends on market penetration and this is one idea I think deserves to be slashdotted so it has a shot at succeeding...

Offtopic? Not really, the article I'm posting under went out as a Slashdot Klip headline. And what good is maxed out karma if not to risk it in spreading the word about a cool new Slashdot feature?

Re:3400+ Slashdotters Can't Be Wrong... (0)

Anonymous Coward | more than 12 years ago | (#3427710)

What the fuck are you smoking? 3400 wrong /.ers is a light day of comments.

Re:I can't understand ECC (5, Informative)

waytoomuchcoffee (263275) | more than 12 years ago | (#3424522)

ECC is NOT parity checking. Parity checking is able to tell if one bit is wrong, and if so, to send a parity error (it keeps an extra bit to check against). However, it can't tell which bit was flipped, so it can't correct it. ECC, on the other hand, CAN tell which bit is bad, and therefore can correct it. It can also detect a two-bit error, but has to send a parity error, because it can't correct them both.

Actually, it's your chipset that reads the data from the memory and sends the parity error, and/or makes the actual corrections in the case of ECC. Even though your ECC is turned off, parity is still active. Your chipset is reading the extra bit in your ECC memory, sees it doesn't add up to the rest of the bits, and sends out a parity error. The solution is to turn your ECC back on, and they should go away, as it will use the ECC info from your ECC memory to correct it instead (unless they are the much rarer two-bit kind -- if you get these often your memory is probably defective).

Also, to comment on someone else, the older ECC correction slowed your system down by around 5 percent. Recent changes will slow it down 1-2 percent, no big deal.

Re:I can't understand ECC (2)

Detritus (11846) | more than 12 years ago | (#3424914)

The idea behind forcing a BSOD or other fatal error is to prevent the user from processing corrupted data. It is telling you that your computer is broken and to get it fixed.

With ECC, and a properly written operating system, the damage can be handled in a more subtle way. Correctable errors are logged for later maintenance. Uncorrectable errors only kill the application they occur in, if the address of the error is in an application's address space. The operating system can keep a list of memory pages that have generated errors, and not allocate those pages for use in the future, sort of like a bad block list on a hard disk.
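
The policy described above can be sketched roughly as follows. This is an illustration only: the class, its method names, and the return strings are all invented, and the choice to halt on an uncorrectable error in kernel memory is an assumption that the post does not spell out.

```python
class MemoryErrorPolicy:
    """Toy model of how an OS might react to errors reported by ECC hardware."""

    def __init__(self):
        self.log = []                # every error is recorded for later maintenance
        self.retired_pages = set()   # pages that will not be handed out again

    def handle(self, page, correctable, owner):
        """owner is a process name, or None if the page belongs to the kernel."""
        self.log.append((page, correctable, owner))
        self.retired_pages.add(page)
        if correctable:
            return "logged"          # hardware already fixed the data, keep running
        if owner is not None:
            return f"kill {owner}"   # only the affected application dies
        return "halt"                # assumption: kernel data is corrupt, so stop the machine

    def allocate(self, free_pages):
        """Return the first free page that has never reported an error."""
        for page in free_pages:
            if page not in self.retired_pages:
                return page
        return None


policy = MemoryErrorPolicy()
print(policy.handle(7, True, "httpd"))    # logged
print(policy.handle(9, False, "httpd"))   # kill httpd
print(policy.allocate([7, 9, 12]))        # 12
```

The retired-page set plays the same role as the bad block list on a hard disk mentioned above.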

ECC where useful / speed compromise (2, Insightful)

j_dot_bomb (560211) | more than 12 years ago | (#3424805)

ECC is useless if you are running something like Windows 98, which crashes on its own so much anyway and would BSOD on an ECC error. ECC can't work with a Celeron, only with the PII, PIII, and PIV, which also have their own ECC caches. On something like Windows NT/2000/XP it is worth it because they have long uptimes. To the poster who got a BSOD with ECC off: turn it on like the other poster said.
ECC in the old PC133, PC100, etc. style memory changed the minimum number of wait states you can have (CAS3 instead of CAS2).

Re:ECC where useful / speed compromise (4, Informative)

Detritus (11846) | more than 12 years ago | (#3424858)

ECC protection of main memory is distinct from ECC protection of CPU cache memory. They are independent. You can have ECC main memory with or without ECC cache. On PCs, the ECC encoder/decoder for cache is on the CPU chip, the ECC encoder/decoder for main memory is part of the chipset.

ECC is worth having (3, Insightful)

Detritus (11846) | more than 12 years ago | (#3424823)

I believe ECC is worth having, even if you are not using the computer to run "mission critical" tasks. Memory problems on a computer without parity or ECC can be very difficult to diagnose. The symptoms may look like a flakey operating system or application, not a hardware problem. I had one computer that would only fail when someone ran the FORTRAN compiler. The symptoms looked exactly like a bug in the FORTRAN compiler. It turned out that one of the DRAM chips had a pattern sensitivity problem that was triggered by the image of the FORTRAN compiler. These kinds of problems can be difficult to detect and fix without hardware support. The memory diagnostics in the power-on self-test in the BIOS will detect hard errors, but not more subtle errors.

Re:ECC is worth having (2, Informative)

slackbp (450197) | more than 12 years ago | (#3425160)

D.J. Bernstein makes a case here [cr.yp.to] on the merits of ECC. And his description [cr.yp.to] of a "standard workstation" shows that ECC memory isn't that much more expensive.

MemTest86 Early, MemTest86 Often (5, Informative)

schmaltz (70977) | more than 12 years ago | (#3426057)

Whether you use parity, non-parity, or even ECC, you should ALWAYS test your RAM sticks with MemTest86 [teresaudio.com] .

Test them when newly purchased (I've received duds from brand-name online memory warehouses.) Test them every few months (they can and do go bad.) Especially test when your computer exhibits otherwise unexplainable behavior, like: Windows BSoD, kernel panics, characters changing themselves on disk willy-nilly, programs crashing for no good reason, or going bad on disk and needing reinstallation. Disk files that go corrupt. Any of the above, even (or especially) when it seems inconsistent, can be caused by a few bad blocks in a RAM stick.

MemTest86 is a program that boots and runs off floppy (has its own boot loader, no OS), and t-h-o-r-o-u-g-h-l-y tests your RAM. It even detects adjacent cell errors, where a 1 in cell n can threshold bias the 0 in cell n+1 or n-1 until it is considered a 1.

It even knows how to differentiate between cache memory errors and RAM errors. Just do it (after nightmare hardware problems, MemTest86 showed me what was broken; can't say enough good things about it.) Its user interface could be more informative, but when it spots an error, you'll know.

Re:MemTest86 Early, MemTest86 Often (1)

raptwithal (134137) | more than 12 years ago | (#3429150)

SuSE 7.3 has an option in its default LILO menu to run memtest.

Re:MemTest86 Early, MemTest86 Often (1)

dan_bethe (134253) | more than 12 years ago | (#3441057)

Check out Cerberus [sf.net] . Get the latest cvs version from that site, run './newburn', and let it go for 8-24 hours.

Invented at VA Linux, it became a de facto QA suite for SGI, Redhat, and some of the Linux kernel people. By default, it checks most of your subsystems concurrently which is more realistic and useful than just checking RAM or just checking hard drives, etc. It runs on Linux on ia32, ppc, and maybe others. Its strength can be set as high as you want, and its scripting engine can be set to do anything with any hardware combination and network. At high strength, it's been known to set machines on fire (sorry, no pics), but not at the default settings.

My experiences, memory problems... (-1)

Anonymous Coward | more than 12 years ago | (#3429010)


In my experiences, ECC is not worthwhile. There are too many ways the data can get corrupted before it ever hits the memory stick. ECC only helps if the information is accurately present on the memory data lines attached to the RAM module, and then only when the RAM module itself fails. Otherwise you are just recording, with error-correction, incorrect data. And let's face it: If the memory module itself is fried, ECC ain't going to help.

Testing: I had some rather painful experiences with a FIC-503+ motherboard. Turned out to have a design defect that caused problems when both DIMM slots were utilized, regardless of the RAM type.

To test it, under linux (of course), with a minimal boot, running as few processes as possible, I created a large file (${FILE}) of non-uniform data by cat'ing (combining) several arbitrary convenient large files. About 2x - 3x the total size of all my RAM. I then did:

repeat 100 cksum ${FILE} | uniq -c

Any problems showed up right away. (Cksum returned different numbers.)
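
For shells without a `repeat` builtin, roughly the same test can be written in Python. This is a sketch, not the poster's method: it uses `zlib.crc32`, which is a different CRC variant from the one `cksum` prints, so the numbers will not match `cksum` output. That does not matter here because only consistency between passes is being checked. The function names are invented.

```python
import zlib
from collections import Counter


def file_crc32(path, chunk_size=1 << 20):
    """CRC-32 of a file, read in chunks so the file can be larger than RAM."""
    crc = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return crc


def repeated_checksums(path, passes=100):
    """Checksum the same file many times and count how often each result appears."""
    return Counter(file_crc32(path) for _ in range(passes))


# Usage (path is a placeholder for the large test file):
#   results = repeated_checksums("/path/to/bigfile")
#   print(results)   # more than one distinct checksum means something is flaky
```

As in the original test, the file should be larger than RAM so that each pass has to pull the data through memory again instead of hitting the disk cache.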

This was a simpler approach, though not quite as good, as the general make 100 linux kernels and diff the make-logs.

You might also look at: http://www.bitwizard.nl/sig11/ [bitwizard.nl]


...Anonymous. Still too lazy to log in...

Re:My experiences, memory problems... (2, Informative)

mikehoskins (177074) | more than 12 years ago | (#3429891)

In my experiences, ECC is not worthwhile. There are too many ways the data can get corrupted before it ever hits the memory stick.


Huh? If ECC isn't worth it, then RAID 5 (the minimum-acceptable "poor man's" form of RAID) certainly isn't, for the same reasons.


If you're correct, then I can say this, using the same logic: "In my experiences, RAID-5 is not worthwhile. There are too many ways the data can get corrupted before it ever hits the disk."

Heck, all the built-in hard drive ECC, SMART technology, sector relocation, CRC-checking, etc. are useless, if we follow your argument to its logical conclusion.


Since ECC and RAID-5 are similar technology and perform similar roles in similar ways, and since RAM is always far more important than disk, at least once the OS is booted, ECC is more important than RAID, yet many data centers skimp on ECC and spend on RAID. What's silly is that if MEMORY IS CORRUPT, THEN DISK CERTAINLY WILL BE -- PERMANENTLY.


A 1-bit error is the most common kind of memory error and can crop up for a multitude of reasons, including static, voltage spikes, bad motherboard timings, cosmic rays, etc. And, you'll still catch the 2-bit errors, the second most common kind. I'd be willing to bet that 1 and 2 bit errors account for 99+% of all memory errors, unless you got a bad chip. ECC was NEVER designed to fix all errors, just the 99+% we actually encounter.


The thing about some /.'ers is that they couldn't care less about data integrity and care far too much about speed and price. I would never run a production machine, or a home machine left on for 24 hours a day, without backup power and ECC memory; I do this at home. A production box would also require some level of RAID and backup hardware/software.


If you're anti-ECC for ANY reason, then, to follow your logic, you should also be anti-RAID and anti-tape backup.

Re:My experiences, memory problems... (0)

Anonymous Coward | more than 12 years ago | (#3430335)

Funny, I actually remember reading specs for some no-name company's RAID 5 controller that was used in one of the Linux hardware companies' boxes. The thing used EDO DRAM without parity or ECC checking. I was like great, RAID 5 to protect my data and shitty cache RAM to corrupt it!

Re:My experiences, memory problems... (0)

Anonymous Coward | more than 12 years ago | (#3433894)

Errk? WTF? Your logic is, ahh, most confusing...

Harddrives fail early, frequently, and often. They are complex mechanical devices with moving parts. While they have an expected lifespan of hundreds of thousands of hours, the actual lifespan is random, guaranteeing a considerable number of premature failures. Manufacturing issues further limit the expected lifespan. Quite frankly, I have seen these things fail left and right. In a large installation, you average replacing a certain number of harddrives each day.

RAM, unless it's damaged by static during install, is pretty solid. Same can be said for the motherboards. Now, granted, some fly-by-night company might cut corners and sell you crap. But diagnostics will pick this up pretty quickly.

I seem to recall hearing something about google switching from harddrives to RAM-based drives... Factoring in the replacement costs, the RAM approach was cheaper and significantly more reliable...

Bottom line: RAID allows me to recover the last N years worth of data after my harddrive dies. I've had half a dozen harddrives fail on me, personally, so far. My RAM has never failed. Ever. There was that one issue with my motherboard being defective and not handling 2 DIMMS. ECC would not have made any difference. Even with that RAM problem, I still suffered no data-loss on my harddrive.

Now, in contrast to my RAM successes, I have had my CPU fan fail. The CPU incinerated itself. The Linux kernel actually panicked. But ya know something? After replacing my CPU, I still had all my data, intact, on the harddrive.

My point is: ECC only protects you against failures directly on the memory stick itself. These failures are extraordinarily rare. They are rare compared to your chances of being hit by floods, tornados, lightning, etc. IMHO, the cost exceeds the benefits. And there are far more important issues to worry about.

Re:My experiences, memory problems... (1)

mikehoskins (177074) | more than 12 years ago | (#3473585)

I'd agree this shouldn't have been modded down and they should have let us agree to disagree.


Granted, HD's are far more failure prone than RAM and can be shut off. But so?


I've had RAM die more than once in more than one machine, etc. I've even had ONE MEMORY CELL/WORD DIE! (I've had one of those "support 1400 end-user machines" and 50+ server support jobs, in addition to other places I've worked over the last 9 years.) So, I probably have seen things you've not seen and you've seen things I haven't. Your "in my experience" is anecdotal evidence, not scientific evidence.


No, ECC cannot correct everything. But you're only partially right when you say, "ECC only protects you against failures directly on the memory stick itself." What about power glitches, especially ones that get past the UPS? Often, this IS a 100% correctable error, with ECC. Why do the manufacturers of servers (especially high-end stuff) rely on ECC RAM -- for uptimes? This is especially true of mainframes and high-end Unix servers. Just go ask IBM or HP/COMPAQ....


The fact that maybe 1 in 1,000 or even 1 in 10,000 DIMMs ever fails outside of manufacturer testing vs. 1 in 10 HD's is irrelevant. RAM still fails, plain and simple.


I'd bet that Google uses ECC RAM....


These days, a 256M DDR DIMM w/ ECC is only a few bucks more, for an extra level of safety and only a modest performance loss. Buy ECC, love ECC.


(Use RAID, too -- on servers. Again, RAID 5 and ECC are very similar technologies. If RAID 5, on failure-prone HD's is considered "fault tolerant enough," then why not ECC RAM?)
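
To make the RAID 5 comparison concrete, here is a minimal Python sketch of the XOR-parity idea behind it. The function names are made up for illustration, and it is a toy: real RAID 5 stripes and rotates parity across drives, and ECC memory uses a Hamming-style code rather than plain XOR.

```python
from functools import reduce

def xor_parity(blocks):
    """Return the byte-wise XOR of equally sized blocks (hypothetical helper)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild_missing(surviving_blocks, parity):
    """Recover the one missing block from the survivors plus the parity block."""
    return xor_parity(list(surviving_blocks) + [parity])

# Three toy "drives" holding one 4-byte block each, plus a parity block.
data = [b"ECC!", b"RAID", b"5ive"]
parity = xor_parity(data)

# Pretend the second drive died and rebuild its block from the rest.
recovered = rebuild_missing([data[0], data[2]], parity)
print(recovered)
```

One difference worth keeping in mind: RAID 5 is told which drive failed, so plain XOR is enough to rebuild it. ECC memory has to work out which bit flipped on its own, which is why it needs several check bits per word instead of one.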


Again, it's ONLY A FEW BUCKS! Don't believe it's only a few bucks? Price it on http://www.pricewatch.com/ -- today's prices (2002-05-06):

256M RAM PC2100: $37

256M RAM PC2100 w/ ECC: $42

512M RAM PC2100: $100

512M RAM PC2100 w/ ECC: $113


Well, that about deflates that argument.
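
For anyone who wants the premium as a number, here is a quick Python sketch using only the four prices quoted above. The dictionary layout and the helper name are just for illustration.

```python
# Prices quoted above (2002-05-06), as (non-ECC, ECC) in US dollars.
prices = {
    "256M PC2100": (37, 42),
    "512M PC2100": (100, 113),
}

def ecc_premium(plain, ecc):
    """Return the ECC surcharge as (dollars, percent of the non-ECC price)."""
    return ecc - plain, 100.0 * (ecc - plain) / plain

premiums = {name: ecc_premium(*pair) for name, pair in prices.items()}
for name, (dollars, percent) in premiums.items():
    print(f"{name}: +${dollars} ({percent:.1f}%)")
```

That works out to roughly 13% in both cases, which is about what you would expect if an ECC module simply carries 72 bits of RAM for every 64 bits of data (12.5% more silicon).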


Yes, as far as more important issues:

UPSes

Backups

World Hunger

...

Censored: My experiences, memory problems... (0)

Anonymous Coward | more than 12 years ago | (#3440502)


Censorship appears alive and well on Slashdot. Some low-life decided to mark my post down (-1), despite the fact that it actually *IS* relevant to the topic at hand. So I'm reposting...

BTW: NASA uses computers with multiple, redundant CPUs to detect problems. Do you use multiple CPUs as real-time backups? Where do you think a bit is more likely to become corrupted? In the CPU or in the RAM? (Well, neither, unless the heatsink fails...)

---

In my experience, ECC is not worthwhile. There are too many ways the data can get corrupted before it ever hits the memory stick. ECC only helps if the information is accurately present on the memory data lines attached to the RAM module, and then only when the RAM module itself fails. Otherwise you are just recording, with error-correction, incorrect data. And let's face it: If the memory module itself is fried, ECC ain't going to help.

Testing: I had some rather painful experiences with a FIC-503+ motherboard. Turned out to have a design defect that caused problems when both DIMM slots were utilized, regardless of the RAM type.

To test it, under linux (of course), with a minimal boot, running as few processes as possible, I created a large file (${FILE}) of non-uniform data by cat'ing (combining) several arbitrary convenient large files. About 2x - 3x the total size of all my RAM. I then did:

repeat 100 cksum ${FILE} | uniq -c

Any problems showed up right away. (Cksum returned different numbers.)
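
The same idea can be sketched in Python for anyone without a shell that has `repeat`. This is an illustration, not the exact test described above: the function name is made up, and zlib's CRC-32 is used as a stand-in, so the numbers will not match what `cksum` prints.

```python
import collections
import zlib

def repeated_checksums(path, runs=100, chunk_size=1 << 20):
    """Checksum the same file `runs` times and count each distinct result."""
    counts = collections.Counter()
    for _ in range(runs):
        crc = 0
        with open(path, "rb") as handle:
            # Read in chunks so the file does not have to fit in memory at once.
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                crc = zlib.crc32(chunk, crc)
        counts[crc] += 1
    return counts
```

Call it as `repeated_checksums("/path/to/bigfile")` (path is a placeholder). A healthy machine should give a counter with exactly one key; more than one distinct checksum means something between the disk and the CPU is flipping bits. As above, the file needs to be a few times larger than RAM, otherwise the reads can be served from cache and most of the memory never gets exercised.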

This was a simpler approach, though not quite as good, as the general make 100 linux kernels and diff the make-logs.

You might also look at: http://www.bitwizard.nl/sig11/ [bitwizard.nl]


...Anonymous. Still too lazy to log in...

One or Two Years Ago ... (1)

Rolo Tomasi (538414) | more than 12 years ago | (#3433294)

... I read an article about ECC, in which the point was made that the RAM chips which hold the parity information have a smaller capacity than the main RAM chips, which means the parity chips are of an older design. Now, memory manufacturers have constantly reduced the number of spontaneous errors in each product generation by (IIRC) an order of magnitude. This means that there is a very high probability that the parity chips themselves will be the source of memory errors. This would then lead to the situation of people thinking the ECC had protected them from memory errors, where in reality it was the cause.

So, the conclusion was that parity RAM was justified in the original IBM PC, because RAM errors were really common back then, but such a technology would be obsolete today.

Anyone have more info regarding this?

On ECC and Parity (1, Informative)

Anonymous Coward | more than 12 years ago | (#3433309)

There seems to be a misunderstanding regarding ECC and parity memory, at least in relation to PCs.
PC memory either has some extra bits (one for every eight bits) for redundancy, or it doesn't. There is no dedicated ECC circuitry on PC memory (except maybe IBM Chipkill memory). The difference between parity memory and ECC memory lies in how the memory controller takes advantage of the extra bits. To get an idea of how ECC really works, see Hamming code [ee.unb.ca].
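
As a rough illustration of the difference, here is a small Python sketch of the textbook Hamming(7,4) code next to a plain parity bit. The function names are made up, and this is the classroom version only: real ECC memory works on much wider words (typically 64 data bits plus 8 check bits), and the exact code is up to the memory controller.

```python
def parity_bit(bits):
    """Even parity: can reveal a single flipped bit, but not which one."""
    return sum(bits) % 2

def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # checks positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # checks positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # checks positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(codeword):
    """Fix at most one flipped bit; return (data_bits, error_position)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # 0 means no error was detected
    if position:
        c[position - 1] ^= 1  # the syndrome is the 1-based index of the bad bit
    return [c[2], c[4], c[5], c[6]], position

word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1  # simulate a single bit flip at position 5
fixed, where = hamming74_decode(word)
print(fixed, where)
```

The parity bit only says "something is wrong"; the three Hamming check bits together spell out the position of the flipped bit, so the controller can flip it back. That is the whole trick: same kind of extra bits on the module, smarter use of them in the controller.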

Regards
Roberto de Iriarte
roberto at spock dot cl