
Everything You Know About Disks Is Wrong

kdawson posted more than 7 years ago | from the mean-time dept.

Data Storage | 330 comments

modapi writes "Google's wasn't the best storage paper at FAST '07. Another, more provocative paper looking at real-world results from 100,000 disk drives got the 'Best Paper' award. Bianca Schroeder, of CMU's Parallel Data Lab, submitted Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? The paper crushes a number of (what we now know to be) myths about disks such as vendor MTBF validity, 'consumer' vs. 'enterprise' drive reliability (spoiler: no difference), and RAID 5 assumptions. StorageMojo has a good summary of the paper's key points."


MTBF (5, Interesting)

seanadams.com (463190) | more than 7 years ago | (#18090970)

MT[TB]F has become a completely BS metric because it is so poorly understood. It only works if your failure rate is linear with respect to time. Even if you test for a stupendously huge period of time, it is still misleading because of the bathtub curve effect. You might get an MTBF of say, two years, when the reality is that the distribution has a big spike at one month, and the rest of the failures forming a wide bell curve centered at say, five years.

Suppose a tire manufacturer drove their tires around the block, and then observed that not one of the four tires had gone bald. Could they then claim an enormous MTBF? Of course not, but that is no less absurd than the testing being reported by hard drive manufacturers.
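
A rough sketch of that effect in code, with entirely made-up numbers (nothing here comes from the paper or any vendor): a fleet whose lifetimes have a small spike near one month plus a wide mode near five years, tested vendor-style for only 100 hours, still reports an MTBF near a million hours.

# Hypothetical failure-time distribution: a small spike near 1 month plus a
# wide mode near 5 years. A 100-hour "around the block" test still reports
# an enormous MTBF because almost nothing fails inside the test window.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
HOURS_PER_MONTH = 730

early = rng.normal(1.0, 0.3, int(0.05 * n))       # 5% fail around month 1
late = rng.normal(60.0, 18.0, n - len(early))     # rest centered near 5 years
fail_hours = np.clip(np.concatenate([early, late]), 0.01, None) * HOURS_PER_MONTH

true_mean = fail_hours.mean()                     # the "real" mean life

# Vendor-style short test: run every drive for 100 hours, count failures,
# report MTBF = accumulated drive-hours / number of failures.
test_hours = 100.0
failed = fail_hours <= test_hours
reported_mtbf = np.where(failed, fail_hours, test_hours).sum() / failed.sum()

print(f"true mean life     : {true_mean:>12,.0f} h (~{true_mean / 8760:.1f} years)")
print(f"100-hour-test MTBF : {reported_mtbf:>12,.0f} h (~{reported_mtbf / 8760:.0f} years)")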

Re:MTBF (5, Informative)

Wilson_6500 (896824) | more than 7 years ago | (#18091058)

Um, but doesn't the summary of the paper say that there is no infant mortality effect, and that failure rates increase with time, and thus the bathtub curve doesn't actually apply?

Infant Mortality and stuff (0, Troll)

jmorris42 (1458) | more than 7 years ago | (#18091362)

> Um, but doesn't the summary of the paper say that there is no infant mortality effect, and that
> failure rates increase with time, and thus the bathtub curve doesn't actually apply?

That may be the new 'theory' but we all know about theory vs reality. Here in reality if you put a couple of dozen new drives into service you have one or two spare hard drives to replace the ones that WILL fail in the first week. Especially with consumer grade drives typical in workstation deployment. If you only have one dud out of twenty it was a good rollout.

And as for some of the other assertions in this paper (well, the summary; haven't read this one yet, still wanting to reread the Google paper again, need more hours in a day... bah!)...

> Costly FC and SCSI drives are more reliable than cheap SATA drives.

Sorta. Again, real world vs theory. Try banging the hell out of an off the shelf consumer drive 24/7/365 and see how long it holds up. Yea, thought so. Hope you didn't have anything important on that paperweight.

> RAID 5 is safe because the odds of two drives failing in the same RAID set are so low.

This one should bother ya if you are overly relying on the 'infallibility' of RAID5. Remember kids, drives fail from two major groups of causes, internal and external. If a power event kills one drive in the array the odds are pretty low of only one being dead, you just might not KNOW about #2 yet. And filesystem corruption will be faithfully mirrored onto the array. Obey the 1st Commandment: "Thou Shalt Make Backups."

Re:Infant Mortality and stuff (5, Insightful)

Wilson_6500 (896824) | more than 7 years ago | (#18091420)

That may be the new 'theory' but we all know about theory vs reality.

Uh, but wasn't this data accumulated by testing actual drives? That's... kinda how science works -- by replacing anecdotal evidence with scientifically gathered data. What you're doing is basically condemning science in favor of anecdotes -- and the medical fields can tell you how well _that_ works.

Re:Infant Mortality and stuff (1)

diersing (679767) | more than 7 years ago | (#18091764)

But then he couldn't finish the day feeling .

Re:Infant Mortality and stuff (1)

pinkstuff (758732) | more than 7 years ago | (#18091460)

But Raid 1 or 5 is what I was going to use for my (home) backup box. Do I now need a backup of that as well, ouch, my wallet is starting to hurt :).

Re:Infant Mortality and stuff (2, Interesting)

DarkVader (121278) | more than 7 years ago | (#18091770)

1 in 20 drive failures? What are you using, Western Digital drives? I don't see anything close to that failure rate, more like 1 in 300.

I don't deploy "enterprise" drives, they're overpriced, and the few I did install years ago proved to be less reliable than "consumer" drives. My real world experience is that the "consumer" drives are generally reliable, I just plan on a 2-3 year replacement schedule.

I can't disagree with RAID being fallible depending on what takes out the drive, though.

I think you misunderstood what the GP said. (-1, Troll)

Anonymous Coward | more than 7 years ago | (#18091664)

...Let me rephrase it for you:

"I'm not particularly smart or knowledgeable, but I am a subscriber and I saw this story in the mysterious future, so I decided to go for +5 with the usual trick: if your post is early, long, and somewhat on topic, people will mod it up by the mere fact that it provides something to read instead of nothing to read."

Re:MTBF (1)

Hubec (28321) | more than 7 years ago | (#18091098)

I'm not saying you're wrong, but how does your statement about the one month infant mortality spike relate to the article's finding that no such spike is observable in the wild?

Re:MTBF? RTFA. (4, Informative)

Vellmont (569020) | more than 7 years ago | (#18091198)

You might get an MTBF of say, two years, when the reality is that the distribution has a big spike at one month, and the rest of the failures forming a wide bell curve centered at say, five years.


Well, the article actually says that drives don't have a spike of failures at the beginning. It also says failure rates increase with time. So you're right that MTBF shouldn't be taken for a single drive, since the failure rate at 5 years is going to be much higher than at one.

The other thing that the article claims is that the stated MTBF is simply just wrong. It mentioned a stated MTBF of 1,000,000 hours, and an observed MTBF of 300,000 hours. That's pretty bad. It's also quite interesting that the "enterprise" level drives aren't any better than the consumer level drives.

Re:MTBF (0)

Anonymous Coward | more than 7 years ago | (#18091204)

If on average one out of four drives goes bad immediately when you plug it in, it will decrease the MTBF by 25%, not make it ridiculously small or enormous.

I'm amazed by the number of people who set out to show that MTBF is easy to misunderstand, but end up proving it by example.

Re:MTBF (3, Interesting)

gvc (167165) | more than 7 years ago | (#18091252)

MT[TB]F has become a completely BS metric because it is so poorly understood. It only works if your failure rate is linear with respect to time. Even if you test for a stupendously huge period of time, it is still misleading because of the bathtub curve effect. You might get an MTBF of say, two years, when the reality is that the distribution has a big spike at one month, and the rest of the failures forming a wide bell curve centered at say, five years.

The simplest model for survival analysis is that the failure rate is constant. That yields an exponential distribution, which I would not characterize as a bell curve. The Weibull distribution more aptly models things (like people and disks) that eventually wear out; i.e. the failure rate increases with time (but not linearly).

With the right model, it is possible to extrapolate life expectancy from a short trial. It is just that the manufacturers have no incentive to tell the truth, so they don't. Vendors never tell the truth unless some standardized measurement is imposed on them.
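
For the curious, a minimal sketch of that kind of extrapolation, assuming a Weibull model and invented numbers (this is not the paper's data or method): fit the shape and scale by maximum likelihood to a short, heavily censored trial, then read off the implied mean life.

# Simulate a short trial of hypothetical drives, fit a censored Weibull by
# maximum likelihood, and extrapolate the mean time to failure (MTTF).
import numpy as np
from scipy.optimize import minimize
from scipy.special import gamma

rng = np.random.default_rng(1)
beta_true, eta_true = 1.5, 200_000.0      # shape > 1 means wear-out; hours scale
n, trial_hours = 100_000, 5_000.0         # watch 100k drives for ~7 months

life = eta_true * rng.weibull(beta_true, n)   # true (mostly unobserved) lifetimes
failed = life <= trial_hours
t_fail = life[failed]                         # exact failure times seen in the trial
n_cens = int((~failed).sum())                 # survivors, censored at trial end

def neg_log_lik(params):
    """Censored Weibull likelihood: density for failures, survival for survivors."""
    log_beta, log_eta = params                # work in log space to stay positive
    b, e = np.exp(log_beta), np.exp(log_eta)
    ll_fail = np.sum(np.log(b) - b * np.log(e) + (b - 1) * np.log(t_fail)
                     - (t_fail / e) ** b)
    ll_cens = -n_cens * (trial_hours / e) ** b
    return -(ll_fail + ll_cens)

res = minimize(neg_log_lik, x0=[0.0, np.log(trial_hours)], method="Nelder-Mead")
beta_hat, eta_hat = np.exp(res.x)
mttf_hat = eta_hat * gamma(1 + 1 / beta_hat)  # mean of a Weibull distribution

print(f"estimated shape {beta_hat:.2f}, scale {eta_hat:,.0f} h, MTTF ~{mttf_hat:,.0f} h")
print(f"true MTTF {eta_true * gamma(1 + 1 / beta_true):,.0f} h")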

Re:MTBF (2, Informative)

kidgenius (704962) | more than 7 years ago | (#18091902)

Well, I guess you don't really understand reliability then. You also don't understand MTBF/MTTF (hint: they aren't the same). What they have said is a big "no duh" to anyone in the field. MTTF works regardless of whether or not your failure rate is linear with time. Also, there are other failure distributions beyond just the exponential, such as the Weibull (the exponential is a special case of the Weibull). Using this distribution you can accurately calculate an MTTF. Now, the MTBF will not match the MTTF initially, but given enough time it will eventually match the MTTF. All of this information is very useful to anyone who actually knows what to do with those numbers.

Re:MTBF (0)

Anonymous Coward | more than 7 years ago | (#18091940)

The problem with the hard drive MTBF is the way the manufacturers measure and come up with those numbers.

I'm pretty lazy, but you can search around or go here http://forums.storagereview.net/index.php [storagereview.net] to find out how they exaggerate the numbers to ridiculous proportions.

Re:MTBF (2, Insightful)

kidgenius (704962) | more than 7 years ago | (#18091962)

I'm also going to add to my statement and mention that the authors of the article do not understand MTTF. They have calculated MTBF, not MTTF. They are not the same. In fact, they have assumed that the drives fail in a random way by doing a simple hours/failures. They really need to look at failures and suspensions and perform a Weibull analysis to see how close their numbers are to the manufacturers' stated values.

moving parts (5, Funny)

DogDude (805747) | more than 7 years ago | (#18091024)

Every single mechanism with moving parts will fail. It's just a matter of when. In a few years, when everybody is using solid state drives, people will look back and shake their heads, wondering why we were using spinning magnetic platters to hold all of our critical data for such a long time.

Re:moving parts (2, Interesting)

Nimloth (704789) | more than 7 years ago | (#18091188)

I thought flash memory had a lower read/write cycle expectancy before crapping out?

Re:moving parts (4, Informative)

NMerriam (15122) | more than 7 years ago | (#18091382)

I thought flash memory had a lower read/write cycle expectancy before crapping out?


They do have a limited read/write lifetime for each sector, BUT the controllers automatically distribute data over the least-used sectors (since there's no performance penalty to non-linear storage), and you wind up getting the maximum possible lifetime from well-built solid-state drives (assuming no other failures).

So in practice, the lifetime of modern solid state will be better than spinning disks as long as you aren't reading and writing every sector of the disk on a daily basis.
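
A toy illustration of the wear-leveling idea described above (not any real controller's algorithm): always place the next write on the least-worn erase block, so no block hits its cycle limit much before the rest.

# Minimal wear-leveling sketch: track erase counts per block in a min-heap
# and always program the least-worn block next.
import heapq

class ToyWearLeveler:
    def __init__(self, n_blocks: int, cycle_limit: int = 100_000):
        self.cycle_limit = cycle_limit
        # min-heap of (erase_count, block_id): the least-worn block is on top
        self.heap = [(0, b) for b in range(n_blocks)]
        heapq.heapify(self.heap)

    def write(self) -> int:
        """Erase-and-program one block; return the block that was chosen."""
        erases, block = heapq.heappop(self.heap)
        if erases >= self.cycle_limit:
            raise RuntimeError("device worn out")
        heapq.heappush(self.heap, (erases + 1, block))
        return block

ftl = ToyWearLeveler(n_blocks=4)
used = [ftl.write() for _ in range(8)]
print(used)  # writes rotate across blocks, e.g. [0, 1, 2, 3, 0, 1, 2, 3]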

Re:moving parts (2, Informative)

scoot80 (1017822) | more than 7 years ago | (#18091702)

Flash memory will take about 100,000 write cycles before you burn it out. As the parent mentioned, a controller will write that data to several different locations, at different times, thus increasing the lifetime. What this means, though, is that your flash disk will be considerably bigger than what it can actually hold.
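
Back-of-the-envelope on how far 100,000 cycles goes; the capacity, write rate, and write-amplification figures below are made-up examples, not any real product's numbers.

# Rough endurance estimate for a hypothetical flash device.
capacity_gb     = 32        # usable capacity (assumed)
cycle_limit     = 100_000   # erase cycles per block
host_writes_gbd = 20        # sustained host writes per day (assumed)
write_amp       = 2.0       # internal write amplification from remapping/GC (assumed)

lifetime_days = capacity_gb * cycle_limit / write_amp / host_writes_gbd
print(f"~{lifetime_days:,.0f} days (~{lifetime_days / 365:.0f} years)")  # ~80,000 days
# At desktop-like write rates, erase endurance is rarely the limiting factor.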

Re:moving parts (1)

tedgyz (515156) | more than 7 years ago | (#18091750)

So is there a MTBF for solid state drives? I'm serious.

Re:moving parts - Don't always wear out (1)

NFN_NLN (633283) | more than 7 years ago | (#18091228)

Machines built at the molecular level can't wear out. There's either enough energy to break the bond at the molecular level or there's not. Just run it within spec. and it'll never break.

Re:moving parts - Don't always wear out (-1, Troll)

Anonymous Coward | more than 7 years ago | (#18091260)

Really? How come you're gonna die? You're so touchingly naive, what are you, 12 years old?

Re:moving parts - Don't always wear out (1)

AndersOSU (873247) | more than 7 years ago | (#18091930)

Yeah I wouldn't worry about quantum effects on machines built at the quantum level either.

Re:moving parts (5, Funny)

theReal-Hp_Sauce (1030010) | more than 7 years ago | (#18091254)

Forget Solid State Drives, soon we'll have Isolinear Chips. It won't matter if they fail or not, because as long as the story line supports it Geordi can re-route the power through some other subsystem, Data can move the chips around really quickly, Picard can "make it so", and after it's all over Wesley can wear a horrible sweater and deliver a really cheesy line.

-C

Every single solid state drive will fail too... (2, Informative)

EmbeddedJanitor (597831) | more than 7 years ago | (#18091272)

It is just a matter of time. Depending on the technology (eg. flash) it might be a short to medium time or a long time.

If something has an MTBF of 1 million hours (that's 114 years or so), then you'll be a long time dead before it fails.

At this stage, the only reasonable non-volatile solid state alternative is NAND flash which costs approx 2 cents per MByte ($20/Gbyte) and dropping. NAND flash has far slower transfer speeds than HDD, but is far smaller, uses less power and is mechanically robust. NAND flash typically has a lifetime of 100k erasure cycles and needs special file systems to get robustness and long life.

Re:Every single solid state drive will fail too... (0)

Anonymous Coward | more than 7 years ago | (#18091510)

You fail at understanding what MTBF is.

(hint: an MTBF of 1 million hours does not mean the average drive will last 1 million hours)
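
A quick worked version of that point, under the constant-failure-rate assumption vendors use (a sketch, not the paper's analysis):

import math

mtbf_h = 1_000_000
hours_per_year = 8_760

afr = 1 - math.exp(-hours_per_year / mtbf_h)           # annual failure probability
p_survive_5y = math.exp(-5 * hours_per_year / mtbf_h)  # one drive, five years

print(f"annual failure rate ~{afr:.2%}")                # ~0.87% (the paper cites 0.88% nominal)
print(f"P(single drive survives 5 years) ~{p_survive_5y:.1%}")  # ~95.7%
# The figure describes a fleet's failure rate over its service life,
# not an expectation that any one drive runs for 114 years.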

Re:moving parts (1)

brarrr (99867) | more than 7 years ago | (#18091424)

Says you!

I'm going to live forever!

Re:moving parts (0)

Anonymous Coward | more than 7 years ago | (#18091896)

I'm gonna learn how to fly!

Re:moving parts (1)

CastrTroy (595695) | more than 7 years ago | (#18091426)

Unfortunately, we don't have solid state storage that doesn't fail either. I've had more RAM chips die than hard drives. And I know that you aren't suggesting that flash memory doesn't fail. Although I've never had flash memory fail, I've only ever used it for digicams and mp3 players, and not for the kind of usage pattern you would get from a hard drive.

Re:moving parts (1)

mightyQuin (1021045) | more than 7 years ago | (#18092038)

Cisco 6900 series routers: the flash RAM has been very reliable in my experience, under close to HD-type demand.

Re:moving parts (4, Informative)

wik (10258) | more than 7 years ago | (#18091610)

Not true. Transistors at really small dimensions (e.g., 32nm and 22nm processes) will experience soft breakdown during (what used to be) normal operational lifetimes. This will be a big problem in microprocessors because of gate oxide breakdown, NBTI, electromigration, and other processes. Even "solid-state" parts have to tolerate current, electric fields, and high thermal conditions and gradually break down, just like mechanical parts. Don't go believing that your storage will be much safer, either.

The Abstract (0)

Anonymous Coward | more than 7 years ago | (#18091026)

Component failure in large-scale IT installations is becoming an ever larger problem as the number of components in a single cluster approaches a million.

In this paper, we present and analyze field-gathered disk replacement data from a number of large production systems, including high-performance computing sites and internet services sites. About 100,000 disks are covered by this data, some for an entire lifetime of five years. The data include drives with SCSI and FC, as well as SATA interfaces. The mean time to failure (MTTF) of those drives, as specified in their datasheets, ranges from 1,000,000 to 1,500,000 hours, suggesting a nominal annual failure rate of at most 0.88%.

We find that in the field, annual disk replacement rates typically exceed 1%, with 2-4% common and up to 13% observed on some systems. This suggests that field replacement is a fairly different process than one might predict based on datasheet MTTF.

We also find evidence, based on records of disk replacements in the field, that failure rate is not constant with age, and that, rather than a significant infant mortality effect, we see a significant early onset of wear-out degradation. That is, replacement rates in our data grew constantly with age, an effect often assumed not to set in until after a nominal lifetime of 5 years.

Interestingly, we observe little difference in replacement rates between SCSI, FC and SATA drives, potentially an indication that disk-independent factors, such as operating conditions, affect replacement rates more than component specific factors. On the other hand, we see only one instance of a customer rejecting an entire population of disks as a bad batch, in this case because of media error rates, and this instance involved SATA disks.

Time between replacement, a proxy for time between failure, is not well modeled by an exponential distribution and exhibits significant levels of correlation, including autocorrelation and long-range dependence.

Human MTBF (4, Funny)

EmbeddedJanitor (597831) | more than 7 years ago | (#18091322)

MTBF of a human until gross catastrophic failure (i.e. death) is approx 50 years, which is approx 440,000 hours.

Of course if we count relatively minor failures (like forgetting to take out the trash or pick up dirty underwear), then MTBF is approx 27 minutes!

i'll tell you (2, Interesting)

User 956 (568564) | more than 7 years ago | (#18091042)

Bianca Schroeder, of CMU's Parallel Data Lab, submitted Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?

It means I should be storing my important, important data on a service like S3. [amazon.com]

Re:i'll tell you (1)

Taimat (944976) | more than 7 years ago | (#18091114)

I disagree.... I would much rather have control over the drives where my data resides, rather than upload it to a place where I have no idea what they are doing to protect my data. In my opinion, nothing beats a good RAID with current backups. Just because a drive isn't dead doesn't mean it can't be replaced. After the drive has been in commission a while, replace it and move it to a server that isn't as critical.

Re:i'll tell you (1)

udderly (890305) | more than 7 years ago | (#18091190)

I always tell my customers that if you don't have an off-site backup, you're really not backed up. Of course, we have an off-site backup service, so take that with a grain of salt. But, like they say with real estate, "location, location, location." Except in this case, it's "redundancy, redundancy, redundancy."

Re:i'll tell you (2)

karnal (22275) | more than 7 years ago | (#18091234)

"redundancy, redundancy, redundancy."

So that Department of Redundancy Department really does something after all!

Re:i'll tell you (2, Funny)

DarkVader (121278) | more than 7 years ago | (#18091852)

And could there be anything funnier that could happen to that comment than it being moderated "Redundant"?

Re:i'll tell you (1, Funny)

Anonymous Coward | more than 7 years ago | (#18092018)

In a perfectly humorous world, everyone would mod it as Underrated, (except the original Redundant mod,) so that it makes it to +5 Redundant.

Oh, I wish I didn't waste my mod points on the Valentine survey.

Re:i'll tell you (0)

Anonymous Coward | more than 7 years ago | (#18091696)

Against that, sometimes you're just not supposed to have the data. Consider the real-world case of a site that had no less than five sets of backups, three of them at distinct offsite storage facilities. Come time to recover the data, the two at the local data centre were found to be bad. So they called back the first offsite set. Exposed to magnetic fields; unreadable. They then called back the second set. Water damaged. Third set. Courier had a car crash, tapes were damaged beyond repair ...

"Everything You Know About Disks Is Wrong" (3, Funny)

cookieinc (975574) | more than 7 years ago | (#18091092)

Everything You Know About Disks Is Wrong
Finally, a paper which dispels the common myth that disks are made of boiled candy.

Exactly! (1)

plasmacutter (901737) | more than 7 years ago | (#18091164)

And here I thought disks stored data on a set of rotating platters; I guess it's really stored on the alien spacecraft hiding behind Hale-Bopp!

Re:"Everything You Know About Disks Is Wrong" (4, Funny)

egr (932620) | more than 7 years ago | (#18091596)

I read the article, then the title. Damn!

Re:"Everything You Know About Disks Is Wrong" (1)

cookieinc (975574) | more than 7 years ago | (#18091778)

I was going to read the article, but the overly assertive title intimidated me. Instead I wept quietly in my cubicle of impending anxiety.

Re:"Everything You Know About Disks Is Wrong" (1)

mrbcs (737902) | more than 7 years ago | (#18091710)

umm So Maxtor drives really aren't that good? /smartass

Everything You Know About Dupes Is Wrong (0, Offtopic)

Jeff DeMaagd (2015) | more than 7 years ago | (#18091996)

I suppose dupes are good!

Dr. Schroeder is pretty hot, too! (1, Offtopic)

yanyan (302849) | more than 7 years ago | (#18091104)

http://www.cs.cmu.edu/~bianca/ [cmu.edu]

I would love to give her my very large hard drive. For "performance evaluation and measurement", you understand. ;-P

Re:Dr. Schroeder is pretty hot, too! (5, Funny)

Anonymous Coward | more than 7 years ago | (#18091144)

Except she requires a MTBF of more than 3 seconds. Sorry dude.

Re:Dr. Schroeder is pretty hot, too! (2, Funny)

gardyloo (512791) | more than 7 years ago | (#18091184)

Except she requires a MTBF of more than 3 seconds. Sorry dude.

      You call that failure?!? I'd call it success.

Re:Dr. Schroeder is pretty hot, too! (2, Insightful)

Anonymous Coward | more than 7 years ago | (#18091236)

A quick look into her lectures/talks in the past:

June 2006 Microsoft Research, Mountain View, CA. Host: Chandu Thekkath. "Understanding failure at scale".

Its okay man.. She will understand..

Re:Dr. Schroeder is pretty hot, too! (1, Offtopic)

inviolet (797804) | more than 7 years ago | (#18091538)

Except she requires a MTBF of more than 3 seconds. Sorry dude.
You call that failure?!? I'd call it success.

MTBF, in this case, means Mean Time Between Farkings. So yeah, three seconds is an astoundingly short refractory period. :)

Re:Dr. Schroeder is pretty hot, too! (-1, Offtopic)

Anonymous Coward | more than 7 years ago | (#18091914)

This isn't Fark, it's Slashdot.

You can say "fuck" here.

Of course, that doesn't make this thread any more or less generally sexist, but hey, at least "fuck" is ok.

Re:Dr. Schroeder is pretty hot, too! (0)

Anonymous Coward | more than 7 years ago | (#18091786)

You know, if I were a hot woman research scientist, I'd be pretty pissed off at how everybody always commented on my looks. On the other hand, if I were a hot woman research scientist, I'd probably be unemployed, what with spending all day at home in front of the mirror.

Re:Dr. Schroeder is pretty hot, too! (1)

mgabrys_sf (951552) | more than 7 years ago | (#18091822)

I'm surprised that didn't make it into the summary. Let's rewrite the headline, please:

"Hot babe researcher has this to say about hard drives - oh momma!"

I think we'd see a bit more comments in the thread don't you think? MARKETING - you need to think about MARKETING!

Amazing! (2, Insightful)

Dr. Eggman (932300) | more than 7 years ago | (#18091118)

You mean to tell me these people have found hard drives that don't fail beyond repair by the end of the first year? I've never encountered a HD that has done this, much to the despair of my wallet. Now, I am serious, what is wrong with the hard drives I choose that kills them so quickly? Is Western Digital no longer a good manufacturer? Should I maybe not run a virus check nightly and a disk defrag weekly? Is 6.5GB of virtual memory too much to ask? Of course not, the manufacturers are just making crappier HDs. This article has told me one thing: it's time to get a RAID setup. I've been looking at RAID 5, but two things still trouble me: the price and the performance hit. Does anyone have any information on just how much of a performance hit I might experience if I have to access the HD a lot?

Re:Amazing! (1)

CastrTroy (595695) | more than 7 years ago | (#18091222)

If anything, RAID should make your hard disk access a lot faster. That is, unless you go for software RAID, which will put a hit on your processor. However, I think if you're going to make the investment to go with RAID 5, then buying a proper hardware controller won't add a significant amount to the cost of your set up.

Re:Amazing! (0)

Anonymous Coward | more than 7 years ago | (#18091582)

I MOSTLY agree with your point. But... make sure that your raid controller does not use proprietary methodology. Or if it does, make sure to get backup controller cards. Cause if the controller card goes, then so does all the data on the array.

Re:Amazing! (0)

obarthelemy (160321) | more than 7 years ago | (#18091620)

I'm still not sold on RAID at all, especially on desktops and small servers:
- it does NOT eliminate the need for backups
- the performance gains are not noticeable except in the most extreme cases, and even then RAID is not cost-efficient compared to RAM for caching, except in special cases (RAM already maxed out, extremely heavy sequential data access...)
- RAID hardware / firmware / software is nowhere near as reliable as plain old PATA/SATA/SCSI, which is something of a scandal
- RAID is rarely cost-efficient compared to putting your dollars into more servers, more RAM

I think RAID is a way to cover one's ass by shifting responsibility for crashes onto the RAID vendor, but IRL I keep hearing horror stories.

Re:Amazing! (1)

Rakishi (759894) | more than 7 years ago | (#18091280)

I'd be tempted to say that the problem may be partially on your end either due to having improper conditions (heat, etc.) or bad power/power supplies. Likewise if you get hard drives with a 1 or 3 year warranty then don't expect too much from them (I mean if they're dead in a year then you're not out much as the warranty should cover them... well unless you buy some dirt cheap refurbished 90-day warranty pos).

Personally I backup all my data to a server running raid 1 (hard drives are relatively cheap and raid 1 is simpler to deal with in case of failure imho) daily and plan to back up important stuff onto DVDs. If you really need the space then raid 5 is better and I'd assume that with a good controller the performance hit isn't large at all.

Re:Amazing! (1)

obarthelemy (160321) | more than 7 years ago | (#18091514)

I second that; power supplies especially have a very strong impact on a PC's, and especially its HD's, life. Power supplies are the single most important component of a PC, reliability-wise: crappy ones tend to fail and/or fry your components, especially if your mains is not too good.

Re:Amazing! (0)

Anonymous Coward | more than 7 years ago | (#18091346)

The RAID level depends on what you plan on using it for. I'd never recommend RAID 5 for a database server because of the performance hit you take during writes; for something like this you're better with a RAID 0+1 setup. RAID 5 is better suited for file servers or backup volumes. I'm also seeing RAID 6 become more popular for this purpose because of the double parity, but setups like this can be expensive.
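
Rough numbers for the small-write penalty being described (the per-spindle IOPS figure is an assumption, not a benchmark): a small RAID 5 write becomes two reads plus two writes, while RAID 0+1 costs two writes.

# Illustrative small-random-write throughput for a 4-drive set.
drive_iops = 150                   # random IOPS of one spindle (assumed)
n_drives   = 4

raid0_writes  = n_drives * drive_iops        # no redundancy overhead
raid5_writes  = n_drives * drive_iops // 4   # read old data + old parity, write both back
raid01_writes = n_drives * drive_iops // 2   # each host write lands on a mirror pair

print(raid0_writes, raid5_writes, raid01_writes)   # 600 150 300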

Re:Amazing! (1)

Simon Garlick (104721) | more than 7 years ago | (#18091472)

I, on the other hand, have personally experienced one HD failure -- a Western Digital drive, as it happens -- in my LIFE.

Re:Amazing! (1)

obarthelemy (160321) | more than 7 years ago | (#18091496)

I have ONE failed HD currently at my place, out of about 15 of various ages (the oldest one is 3 gigs), and that failed HD is my lone WD drive. So I have very strong statistical evidence that WD is crap :-p. I personally buy Seagate since none has failed on me yet... and, joking aside, that WD was my first in a long time. I bought it because it was a tad cheaper than Seagate, and it failed really quickly, inside 6 months IIRC.

Re:Amazing! (1)

Blakey Rat (99501) | more than 7 years ago | (#18091720)

... was Western Digital EVER a good manufacturer?

Seriously. The only dead drives I've ever seen are either IBM Deathstars (known by that name so completely that I don't know what the actual brand name is... 'disk star' perhaps?) or Western Digital drives. I generally buy Seagate or Hitachi drives, and I've never had a failure. Usually I run out of space and have to upgrade before the drives die. IBM drives other than the Deathstars seem to do ok as well.

Re:Amazing! (1)

jafiwam (310805) | more than 7 years ago | (#18091910)

"Desk Store" and "Serve Store" I believe.

I lost two RAID 5 setups to those because they failed faster than we could replace them (1 spare and several days for shipping). Out of the 7 we had in two servers, 5 of them failed, so the nickname is not undeserved in my opinion.

Re:Amazing! (1)

Planesdragon (210349) | more than 7 years ago | (#18091864)

Now, I am serious, what is wrong with the harddrives I choose that kills them so quickly?

First guess? Your system has a dirty power supply. (Unless you have a high-quality PSU and have a line-noise-filtering UPS, this is entirely possible.)

This article has told me one thing: it's time to get a RAID setup. I've been looking at RAID 5, but two things still trouble me, the price and the performance hit. Does anyone have any information on just how much a performance hit I might experience if I have to access the HD a lot?

RAID 5 is not a great deal more than RAID 0 with a fancy backup. You need to get yourself a good RAID controller (in hardware), and go from there. You should be able to do classic RAID 1 (two drives only) without any perceptible performance hit with a good hardware controller.

If you're still using IDE, switching to SATA would almost certainly eat up any performance hit you would otherwise experience.

And if that doesn't work, do what Windows Vista does: get yourself a large flash drive, and use that for short-term storage.

Question (1)

defiant1 (831834) | more than 7 years ago | (#18091130)

Didn't I read about this on Slashdot a few days ago or did some drives fail and the story was lost?? Must have been a drive failure cause it's unlike Slashdot to have dups :)

infant mortality (5, Insightful)

Anonymous Coward | more than 7 years ago | (#18091192)

I suspect that the 'infant mortality' syndrome really has to do with the drives being abused before they are installed in the machines (getting dropped during shipping, for example).

The large shops these studies are looking at get their drives in bulk directly from the manufacturer; the rest of us, who have to go through several middlemen before we get our drives, have more of a chance that something happened to them before we received them.

David Lang

Desktop vs Server usage. (3, Insightful)

DigiShaman (671371) | more than 7 years ago | (#18091208)

Key observations from Dr. Schroeder's research:
High-end "enterprise" drives versus "consumer" drives?

Interestingly, we observe little difference in replacement rates between SCSI, FC and SATA drives, potentially an indication that disk-independent factors, such as operating conditions, affect replacement rates more than component specific factors."

Maybe consumer stuff gets kicked around more. Who knows?


Or maybe powering up the drives off and on is more stressful to the components; say in a desktop environment. With servers racked up, the drives are always spinning with near constant thermal conditions.

Re:Desktop vs Server usage. (1, Interesting)

Anonymous Coward | more than 7 years ago | (#18091232)

Also, residential power is less clean than datacenter power. Bad power can take out the drive electronics.

Re:Desktop vs Server usage. (1)

anagama (611277) | more than 7 years ago | (#18091344)

That's reasonable. At my office in a building built in 1912, I had a computer with a power supply burn out in less than one year (decent $100 Antec case/supply). It was in a room where the lights flickered whenever the fax or printer powered up. Anyway, after I replaced the power supply, I put it on a UPS that protects against brownouts. I would imagine that bad electricity could easily be the culprit for a lot of failures.

Re:Desktop vs Server usage. (1)

complete loony (663508) | more than 7 years ago | (#18091308)

So... Server grade HD's have a longer average life simply because more of them are installed in servers?

Re:Desktop vs Server usage. (1)

pla (258480) | more than 7 years ago | (#18091368)

Or maybe powering up the drives off and on is more stressful to the components

You just posed the one question to which I'd actually have liked to know the answer... Turn it on and off as needed (minimize runtime), or leave it on all the time if you'll use it at least a few times per day (minimize power cycling).

I know that counts as something of a religious issue among geeks, but I'd still have liked a good solid answer on it... It even has implications for whether or not we should let our non-laptops spin drives down when idle.

Oh well, better luck next study (or I can find my own collection of 100k drives to test, I suppose).

Re:Desktop vs Server usage. (2, Informative)

markov_chain (202465) | more than 7 years ago | (#18091760)

I've never had a hard drive fail. I buy one new one a year and drop the smallest one. I run 4 at a time in a beige-box PC. They are a mix of all sorts of manufacturers (usually from a CompUSA sale for less than $0.30/GB).

- I never turn off the PC.
- The case has no cover.

Re:Desktop vs Server usage. (1)

StillAnonymous (595680) | more than 7 years ago | (#18091976)

I too, have never had a hard drive fail on me. I always leave my system running and I use a UPS. When I hear the drive start to make that tell-tale sound of a bearing approaching failure, I buy a new one and replace it before it dies.

Re:Desktop vs Server usage. (4, Interesting)

Lumpy (12016) | more than 7 years ago | (#18091516)

Or she forgot to put in the part that enterprise drives are replaced on a schedule BEFORE they fail. At Comcast I used to have 30-some servers with 25-50 drives each scattered about the state. Every hard drive was replaced every 3 years to avoid failures. These servers (TV ad insertion servers) made us between $4,500 and $13,000 a minute while they were in operation, in spurts of 15 minutes down, 3-5 minutes inserting ads. Downtime was not acceptable, so we replaced them on a regular basis.

Most enterprise-level operations that rely on their data replace drives before they fail. In fact, the replacement rate was increased to every 2 years, not for failure prevention but for capacity increases.

Re:Desktop vs Server usage. (1)

Cthefuture (665326) | more than 7 years ago | (#18091594)

Do people actually shut their desktops off?

The concept is bizarre to me. I haven't shut my desktop off on a daily basis in probably 15 years (or about as long as I've been running Linux as my desktop).

This has nothing to do with the OS though. I don't power cycle any of my important electronics more than needed because I do believe it stresses them. My (PC) computers have always run 24/7 unless there is an electrical storm passing over or I don't have power.

The last time I power cycled on a daily basis was back when I had "console" type computers (C64, Amiga, etc.). Even then they often ran 24/7 serving BBS duty.

Cyrus IMAP (2, Interesting)

More Trouble (211162) | more than 7 years ago | (#18091258)

From StorageMojo's article: Further, these results validate the Google File System's central redundancy concept: forget RAID, just replicate the data three times. If I'm an IT architect, the idea that I can spend less money and get higher reliability from simple cluster storage file replication should be very attractive.

For best-of-breed open source IMAP, that means Cyrus IMAP replication.
:w

seems like she could make her own job... (0, Offtopic)

Anonymous Coward | more than 7 years ago | (#18091268)

as head of an independent testing lab. That would probably be a heckuva lot more interesting, and lucrative, than some random gig with Google, IBM, or MS Research.

No "infant mortality" effect? (0)

pla (258480) | more than 7 years ago | (#18091310)

Everything else in there, I think most of us already knew... Except the "infant mortality" one really surprised me.

I have to wonder, though, did she include DOAs in that, or did she only include drives that worked at least for a few minutes/hours/days? I have to strongly suspect the latter - I can't argue with the statistics from 100k drives, but my personal experience with a few dozen drives has shown that they have a strong bias toward either never working, or working for at least a year.

Love the RAID5 stat, though... Perhaps this study will finally convince people to only use RAID for performance or huge-JBOD reasons, never for (the illusion of) reliability.

That's wrong (2, Informative)

ArbitraryConstant (763964) | more than 7 years ago | (#18091398)

It didn't conclude RAID 5 doesn't help, it concludes RAID 5 doesn't help as much as people think, because people think the probability of another failure before the rebuild is complete is negligible and they're wrong.

It helps, and distributing the data more helps more. Someone concerned about multi-drive failures can, for example, use a 3-way RAID 1 array, or a RAID 6 array (which can tolerate the loss of any 2 drives).
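
To put rough numbers on the rebuild-window risk (independent failures assumed; the paper's finding that failures are correlated makes the real risk worse than this):

import math

def p_extra_failure(n_remaining, afr, rebuild_hours):
    """P(at least one surviving drive fails during the rebuild window)."""
    rate_per_hour = -math.log(1 - afr) / 8_760          # constant-rate approximation
    p_one = 1 - math.exp(-rate_per_hour * rebuild_hours)
    return 1 - (1 - p_one) ** n_remaining

# Hypothetical 8-drive set, 24-hour rebuild: datasheet-like ~0.88% AFR vs
# a field-like 3% AFR.
for afr in (0.0088, 0.03):
    print(f"AFR {afr:.2%}: "
          f"{p_extra_failure(n_remaining=7, afr=afr, rebuild_hours=24):.3%}")
# RAID 6 survives that second failure in the window; RAID 5 does not.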

Re:That's wrong (2, Insightful)

jcgf (688310) | more than 7 years ago | (#18091498)

In my humble opinion it also helps to use different brands of drives in your RAID array; that way the chance of them failing at the same time for the same reason is lower, and you should have longer to do your rebuild.

Re:That's wrong (1)

twiddlingbits (707452) | more than 7 years ago | (#18091956)

The probability of failure for another drive is the same as it was for the drive that failed (assuming the same type and the same manufacturer). However, that failure point is reached at a random point within the distribution, so while the probability of another failure at any point in time is not zero, it is pretty small. MTBF can be influenced by environment and usage patterns. Rebuilding a RAID array isn't a very lengthy process, perhaps a day or two at most for a huge array. Plus you SHOULD have backups and snapshots if it is critical data. If you are really paranoid you could also mirror the full set of RAID drives (RAID 51 or RAID 15); that design, while costly, can handle any THREE drives failing. I'm rusty on my stats, but I believe you would multiply the probabilities of one drive failing three times (X*X*X), which gets pretty small.

This paper and the Google paper are complementary (4, Informative)

Thagg (9904) | more than 7 years ago | (#18091312)

What's interesting about both of these papers is that previously-believed myths are shown to be, in fact, myths.

The Google paper shows that relatively high temperatures and high usage rates don't affect disk life. The current paper shows that interface (SCSI, FC vs ATA) had no effect either. The Google paper shows a significant infant mortality that the CMU paper didn't, and the Google paper shows some years of flat reliability where the current paper shows decreasing reliability from year one.

They both show that the failure rate is far higher than the manufacturers specify, which shouldn't come as a surprise to anybody with a few hundred disks.

I'm particularly pleased to see a stake driven through the heart of "SCSI disks are more reliable." Manufacturers have been pushing that principle for years, saying that "oh, we bin out the SCSI disks after testing" or some other horseshit, but it's not true and it's never been true. The disks are sometimes faster, but they're not "better".

Thad

Re:This paper and the Google paper are complementa (1, Interesting)

Anonymous Coward | more than 7 years ago | (#18091652)

I'm particularly pleased to see a stake driven through the heart of "SCSI disks are more reliable."

I have been saying that for at least 10 years. Back then I worked at a large government contractor and we set up what was then a very large 2 TB array of SCSI drives (about 100 drives). Those damn things were "industrial grade", certified by a large well-known server vendor, yet we were losing 2 or 3 drives per day for several months. Totally ridiculous, because I extrapolated the failure rates of IDE drives from another government setup and found they were actually much better than the SCSI drives, and they weren't even rated for heavy-duty usage.

Of course prior to this article the group-think Slashweenies would moderate me into oblivion (probably will anyway, but meh).

Re:This paper and the Google paper are complementa (1)

StikyPad (445176) | more than 7 years ago | (#18091884)

Almost makes some of these posts [slashdot.org] look like these [slashdot.org] in retrospect.

We need a better file system... (1)

complete loony (663508) | more than 7 years ago | (#18091342)

Further, these results validate the Google File System's central redundancy concept: forget RAID, just replicate the data three times. If I'm an IT architect, the idea that I can spend less money and get higher reliability from simple cluster storage file replication should be very attractive.

Someone needs to hurry up and write a good cross-platform clustering file system. Something that encourages a company to buy bigger, better-value HDs for their desktops so they can be used as redundant storage.
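
For a feel of why triple replication is attractive, a toy calculation with made-up failure and re-replication numbers, assuming independent failures (correlated rack or power failures are the real worry in practice):

import math

afr             = 0.03        # per-disk annual failure rate (field-like, assumed)
rerep_hours     = 6           # time to copy the affected chunks elsewhere (assumed)
rate            = -math.log(1 - afr) / 8_760
p_die_in_window = 1 - math.exp(-rate * rerep_hours)

# Data is lost only if BOTH remaining replicas die before re-replication finishes.
p_loss_given_first_failure = p_die_in_window ** 2
print(f"P(lose both remaining copies during re-replication) ~ "
      f"{p_loss_given_first_failure:.2e}")   # ~4e-10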

100,000 Disk Drives? (1)

MindStalker (22827) | more than 7 years ago | (#18091380)

No, it's 131,072. Does no one care about base-2 anymore? /To the sarcasm-disabled: it's a joke.

Re:100,000 Disk Drives? (0)

Anonymous Coward | more than 7 years ago | (#18091486)

Or maybe it's 32.

Re:100,000 Disk Drives? (1)

dreddnott (555950) | more than 7 years ago | (#18091542)

100K disk drives would actually be 102,400.

Yay (1)

Daishiman (698845) | more than 7 years ago | (#18091388)

Software RAID FTW!!

In all seriousness, for truly critical storage you save your stuff on a RAID 1. RAID 5 is simply too unreliable for the task (not to mention that those controllers aren't exactly cheap).

So save yourself trouble, money, and grief, and just use logical volume management to replicate drives.

So SSD's are not only faster, but more reliable? (3, Interesting)

gelfling (6534) | more than 7 years ago | (#18091474)

I wonder if anyone looked at what actually failed in the drives? An arm, a platter, an actuator, a board, an MPU?

Would an analysis tell us that SSDs are not only faster but more reliable and if so by how much?

forget RAID? (2, Informative)

juventasone (517959) | more than 7 years ago | (#18091520)

Translation: one array drive failure means a much higher likelihood of another drive failure ... Further, these results validate the Google File System's central redundancy concept: forget RAID, just replicate the data three times.

The fact that another drive in an array is more likely to fail if one has already failed makes a lot of sense, but the conclusion to forget RAIDs doesn't. Arrays are normally composed of the same drive model, even the same manufacturing batch, and are in the same operating environment. If something is "wrong" with any of these three variables, and it causes a drive to fail, it's common sense the other drives have a good chance at following. I've seen real-world examples of this.

In my real-world situations, the RAID still did its job, the drive was replaced, and nothing was lost, despite subsequent failure of other drives in the array. Sure you can get similar reliability at a lower price by replicating data, but I think that's always been understood as the case. Furthermore, as someone else in the forum mentioned, enterprise-class RAIDs are often used primarily for performance reasons. A modern hardware RAID controller (with a dedicated processor and RAM) can create storage performance unattainable outside of a RAID.

Schroeder's disk... (2, Funny)

Anonymous Coward | more than 7 years ago | (#18091566)

is neither working nor broken... Unless you look at it of course ;)

How much does handling matter? (5, Interesting)

RebornData (25811) | more than 7 years ago | (#18091588)

What's interesting to me is that neither of these papers mentions the issue of pre-installation handling. The good folks over at Storage Review [storagereview.com] seem to be of the opinion [storagereview.com] that the shocks and bumps that happen to a drive between the factory and the final installation are the most significant factor in drive reliability (much more than brand, for example).

The Google paper talks a bit about certain drive "vintages" being problematic, but I wonder if they buy drives in large lots, and perhaps some lots were handled roughly during shipping. If they could trace each hard drive back to the original order, perhaps they could see whether there's a correlation between failure and shipping lot.

-R

Lemon or not (1)

Bullfish (858648) | more than 7 years ago | (#18091592)

I doubt MTBF fits into anyone's thoughts when buying a drive, unless they are buying bulk or such for a business and have to justify the choice. I am only talking about home use here.

Personally I have only ever had one drive go on me (a Quantum Scirocco) in 10 years. For myself, and most home users, that's a great track record. On the other hand, I have had friends and relatives whose drives just up and quit. New ones, old ones, many brands. As long as you buy a major brand, they seem to be more or less equal in practice.

That said, with drives spinning at 10K rpm, the heat, etc., there are going to be lemons. I suspect that will always be the case as long as we use mechanical drives. I am not surprised warranty periods dropped about the time drives began to exceed 7200 rpm. Always remember to back up data that's important, and keep those receipts.

all this is moot (3, Insightful)

billcopc (196330) | more than 7 years ago | (#18091698)

Hard drives die often because the manufacturers build them cheaply, the same as every other component in a PC. Why would they ever make a bulletproof hard drive? They'd go out of business!

Sure, some of them end up being replaced under warranty, but a lot of them don't, and so Maxtor/IBM/Hitachi make another buck off your sorry ass. There isn't a sane server admin who doesn't keep a set of spares in his desk drawer, because it's not a question of "if" it dies but WHEN. Hell, most decently geared techies have a whole box of hard drives, pre-mounted in hotswap bays, ready to rock. And if it weren't for the fact that I was just laid off a month ago, I'd be buying a couple of spare SATA drives myself; I just have a funny feeling something's going to go tits up in my media server. I haven't had any warnings or hiccups, but I just know the Seagate devil's planning his move, waiting for 2 drives to start straying so he can kill my RAID 5 nice and fast. Hard drives are little more than Murphy's Law in a box.

Exponential with time (2, Informative)

tedgyz (515156) | more than 7 years ago | (#18091726)

All the hard drives I installed in my family's computers have failed in the last 5 years - including mine. :-(

"Waaaah!" they cry, when I tell them there is no hope for the family photos, barring a media reclamation service == $$$

I tell everyone: "Assume your hard drive will fail at any moment, starting now! What is on your hard drive that you would be upset if you never saw it again?"

No mention of the co-author? (1)

Petro123 (833232) | more than 7 years ago | (#18091742)

This paper was co-authored by Garth Gibson!

Nothing I knew about hard drives was mentioned (3, Insightful)

AllParadox (979193) | more than 7 years ago | (#18091944)

As mechanical devices, hard drives are appallingly reliable.

The electronics on the hard drive rank as major players in heat generation in the boxen.

Heat kills transistorized components.

"Hard Drive Data Recovery" companies often have nothing more sophisticated than a hard drive buying program, and very competent techs soldering and unsoldering drive electronics. They buy a few each of most available hard drives, as the drives appear on the market. When a customer sends them a hard drive for "recovery", the techs find a matching drive in inventory, disconnect the electronics, and replace the electronics in the drive. The percentage of drive failures due to mechanical failure is very low.

When I bought a desktop computer for an unsophisticated family member, I also purchased and installed a drive cooler - a special fan that blows directly on the drive electronics.

I was very concerned about MTBF. I just assumed that the manufacturer's information was totally irrelevant to my situation - a hard drive in a corner of the tower, covered with dust, and no air circulation.

I occasionally pick up used equipment from family and friends. Usually, it is broken. Often, it is the hard drive. What is amazing is not that they failed, but that they lasted so long with a 1.5 inch coating of insulating dust.

I suspect this would also explain the rising failure rate with time. Nobody seems to clean the darned things. They just sit and run 24/7/365, until they fail.