
Disk Failure Rates More Myth Than Metric

Zonk posted more than 6 years ago | from the like-the-loch-ness-hard-drive dept.

Data Storage

Lucas123 writes "Mean time between failure ratings suggest that disks can last from 1 million to 1.5 million hours, or 114 to 170 years, but study after study shows that those metrics are inaccurate for determining hard drive life. One study found that some disk drive replacement rates were greater than one in 10, nearly 15 times what vendors claim, and all of these studies show failure rates growing steadily with the age of the hardware. One former EMC employee turned consultant said, 'I don't think [disk array manufacturers are] going to be forthright with giving people that data because it would reduce the opportunity for them to add value by 'interpreting' the numbers.'"
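To put the submission's numbers side by side, here is a minimal back-of-the-envelope sketch in Python. It assumes the constant-failure-rate model that a single MTBF figure implies, and uses the roughly one-in-ten replacement rate the studies report as the "observed" value; both are illustrative, not vendor data.

<ecode>
# Back-of-the-envelope comparison of vendor MTBF claims vs. observed replacement rates.
# Assumes the constant-failure-rate model implied by quoting a single MTBF number.

HOURS_PER_YEAR = 24 * 365.25   # ~8766 hours
OBSERVED = 0.10                # studies report replacement rates above one in ten

def annualized_failure_rate(mtbf_hours):
    """Fraction of a large, always-on drive population expected to fail per year."""
    return HOURS_PER_YEAR / mtbf_hours

for mtbf in (1_000_000, 1_500_000):
    afr = annualized_failure_rate(mtbf)
    print(f"MTBF {mtbf:>9,} h -> {mtbf / HOURS_PER_YEAR:4.0f} 'years' "
          f"-> vendor-implied AFR {afr:.2%} -> observed rate is ~{OBSERVED / afr:.0f}x higher")
</ecode>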



Never had a drive fail (4, Interesting)

Jafafa Hots (580169) | more than 6 years ago | (#22974524)

I've gone through many over the years, replacing them as they became too small - still using some small ones many years old for minor tasks, etc. - and the only drive I've ever had partially fail is the one I accidentally launched across a room.

I don't understand how people are always complaining about their hard drives failing. In 30 years it hasn't happened to me yet.

I'm about to lug a huge Wang hard drive out to the trash pickup on Monday - weighs over 100 pounds... still runs. Actually it uses removable platters but still...

Never had a drive *not* fail. (4, Informative)

Murphy Murph (833008) | more than 6 years ago | (#22974564)

I've gone through many over the years, replacing them as they became too small - still using some small ones many years old for minor tasks, etc. - and the only drive I've ever had partially fail is the one I accidentally launched across a room.

My anecdotal converse is I have never had a hard drive not fail. I am a bit on the cheap side of the spectrum, I'll admit, but having lost my last 40GB drives this winter I now claim a pair of 120s as my smallest.
I always seem to have a use for a drive, so I run them until failure.

Re:Never had a drive *not* fail. (1)

neumayr (819083) | more than 6 years ago | (#22974718)

Wow, you had 40GB drives last this long?
Impressive. All mine were of the IBM Deathstar type :-/

Re:Never had a drive *not* fail. (1)

kesuki (321456) | more than 6 years ago | (#22974954)

Well, see, I read Slashdot when I bought my 80 gig drives, right about the time they started calling them IBM Deathstars, so I have two really nice 80 gig IBM drives that came from their OTHER plant, and they're doing real nice. FWIW, my smallest working HDD is a 3.5" 4.3GB Maxtor. I've researched every HD I ever bought after losing almost 2 GB of irreplaceable data to my first-ever HD failure (it was a Maxtor; ironically, I RMAed it and got the 4.3 GB that is currently my smallest, since they were no longer selling the 2 GB models -- a free upgrade).

I'm also fairly good about using optical media for backup (used to use Zip drives, then tape, now optical discs), but that didn't stop me from losing about 120 GB* of data this year to five, count 'em, five HD failures. Only one of them was electrocuted, by me buying a cheap PSU to try to get my long-unused 'buggy Asus' dual AMD MP 2000+ system to run. I figured it needed a 500-watt ATX 1.4 PSU, something I ran across online last year. It had shipped with a 400-watt PSU, because that was the best ATX 1.4 unit the company I bought from had; that was actually a really nice PSU, but 400 watts didn't cut it for those watt-hungry Athlon MPs, not even with one hard drive, no optical drive, and minimal fans, and the damn system locked up every 45 minutes of run time, no matter the OS, because of under-voltage.

Plus, to complicate things, 1 TB of my optical media is infected with a nasty Windows rootkit (sigh), but I found a solution to cleanse my files safely: a Linux machine + a 'clean' Windows USB drive enclosure. Sadly I'm missing part 3, a GOOD (Google-mail good) Windows rootkit scanner for Linux; the scanner people recommend for cleaning email sucks and can't detect the rootkit. I've yet to find a Windows solution that can detect it either. So far only diff can tell between a system that has been infected by exposure and one that hasn't, based on files that update for no reason when the optical media is inserted, and detecting the files it modifies != detecting the rootkit. All I know is Gmail detects it and no virus/rootkit scanner I've tried has, and I don't have time to send 1 TB of data through Gmail, not even on cable.

* Not total capacity; total capacity was 210 GB or so, but I only lost around 120 GB (max). One drive I was pulling anyway, two were test drives in test setups (no data loss), one was electrocuted in my Asus POS workstation from hell (no loss), and one was already backed up and formatted when it started corrupting data on a test system during configuration (no loss). I'm not sure exactly what was on the 120 GB I lost, so I might have had most of it backed up, with the exception of about 20-30 GB (I was backing up those files when it failed; ahh, sigh for not backing up my old server drives before they got 8+ years old).

Re:Never had a drive *not* fail. (2, Insightful)

neumayr (819083) | more than 6 years ago | (#22975158)

*blink*

Okay, when I think of backup, it's data backup.
I wouldn't back up applications or operating systems, just their configuration files.
Anyway, what I'd try doing is diff(1)ing all those backed up system files with the originals.

Or am I missing something completely, and it's some weird rootkit that's embedded in some wm* media file?

Re:Never had a drive *not* fail. (-1, Troll)

Anonymous Coward | more than 6 years ago | (#22975160)

Your post would be much easier to read if you used capital letters at the beginning of each sentence.

Re:Never had a drive *not* fail. (1)

autocracy (192714) | more than 6 years ago | (#22974986)

I've heard stories of the Deathstars. Thus far, I've only lost iPod hard drives (though those failures started a week after I mentioned "I've never lost a drive before."). I've been running twin 18gig IBM De(ath|sk)star SCSIs for something on the order of 6 years now. Right now I have the laptop (80g), external (close to 500g), and the twin SCSI system. Going to be turning up a 6 spindle set of 18 gig drives soon. Anyway, I suppose I should double-check my backups now.

Re:Never had a drive *not* fail. (3, Interesting)

Depili (749436) | more than 6 years ago | (#22975112)

The Deathstars were all 80GB PATA disks manufactured by a single plant. I had 8 of them; all failed.

Re:Never had a drive *not* fail. (2, Insightful)

KillerBob (217953) | more than 6 years ago | (#22975200)

I still have a working 10MB hard drive from an IBM 8088... >.> (and yes, that system still works too, complete with the Hercules monochrome graphics and orange scale CRT)

Re:Never had a drive *not* fail. (0)

Anonymous Coward | more than 6 years ago | (#22975248)

But does it run Linux?

Re:Never had a drive *not* fail. (1)

hcmtnbiker (925661) | more than 6 years ago | (#22974866)

My anecdotal converse is I have never had a hard drive not fail. I am a bit on the cheap side of the spectrum, I'll admit, but having lost my last 40GB drives this winter I now claim a pair of 120s as my smallest. I always seem to have a use for a drive, so I run them until failure.

If this was the case I would seriously consider looking for a problem that's not directly related to the hard drives themselves. Around 80% of HDD failures are controller board failures, I wonder if maybe your setup is experiencing electrical problems, brownouts or surges that might mess with the controller boards. I myself have never had an HDD fail on me before even with constant abuse.

Re:Never had a drive *not* fail. (1)

Murphy Murph (833008) | more than 6 years ago | (#22974902)

I have very clean power, and use UPSs to boot. I believe I simply use them much longer than average. How many drives have you had running 24x7x365 for seven years?

Re:Never had a drive *not* fail. (0)

Anonymous Coward | more than 6 years ago | (#22974944)

I CAN'T type today!
24x7x52

Re:Never had a drive fail (5, Funny)

Anonymous Coward | more than 6 years ago | (#22974578)

Wait. You've got a huge Wang, and you're throwing it out? D00d, that's just uncool. Give it to someone else at least. It would be fun to ask people "wanna come see my huge Wang?" just to see their reaction! :)

hah. captcha word: largest

Recycle, don't just dump it! (1, Interesting)

Anonymous Coward | more than 6 years ago | (#22974702)

He should look at the escalating price of gold too. The older the computer component, the more gold in the connectors and the thicker the gold on the traces, etc. Not to mention other precious metals involved in some of the components, such as platinum, palladium, etc. Perhaps the greatest consideration should be given to the fact that it would increase the heavy metal pollution at the dump it goes to.

Probably some nice magnets inside to play with too. :P

Re:Never had a drive fail (3, Insightful)

Anonymous Coward | more than 6 years ago | (#22974594)

Drive failures are actually fairly common, but usually the failures are due to cooling issues. Given that most PCs aren't really set up to ensure decent hard drive cooling, it is probable that the failure ratings are inflated due to operation outside of the expected operational parameters (which are probably not conservative enough for real usage). In my opinion, if you have more than a single hard drive closely stacked in your case you should have some sort of hard drive fan.

Re:Never had a drive fail (3, Informative)

hedwards (940851) | more than 6 years ago | (#22974742)

I think cooling issues are somewhat less common than most people think, but they are definitely significant. And I wouldn't care to suggest that people neglect to handle heat dissipation on general principle.

Dirty, spikey power is a much larger problem. A few years back I had 3 or 4 nearly identical WD 80gig drives die within a couple of months of each other. They were replaced with identical drives that are still chugging along fine all this time later. The only major difference is that I gave each system a cheapo UPS.

Being somewhat cheap, I tend to use disks until they wear out completely. After a few years I shift the disks to storing things which are permanently archived elsewhere, or to swap. Seems to work out fine; the only problem is what happens if the swap goes bad while I'm using it.

Re:Never had a drive fail (4, Informative)

afidel (530433) | more than 6 years ago | (#22975070)

I would tend to agree with that. I run a datacenter that's cooled to 74 degrees and has good clean power from the online UPSes, and I've had 6 drive failures out of about 500 drives over the last 22 months. Three were from older servers that weren't always properly cooled (the company had a crappy AC unit in their old data closet.) The other three all died in their first month or two after installation. So properly treated server class drives are dying at a rate of about .5% per year for me; I'd say that jibes with manufacturer MTBF.

Re:Never had a drive fail (3, Insightful)

GIL_Dude (850471) | more than 6 years ago | (#22974748)

I'd agree with you there; I have had probably 8 or 9 hard drives fail over the years (I currently have 10 running in the house right now and I have 8 running at my desk at work, so I do have a lot of drives). I am sure that I have caused some of the failures by just what you are talking about - I've maxed out the cases (for example my server has 4 drives in it, but was designed for 2 - I had to make my own bracket to jam the 4th in there, the 3rd went in place of a floppy). But I've never done anything about cooling and I probably caused this myself. Although to hear the noises coming from some of the platters when they failed I'm sure at least a couple weren't just heat. For example at work I have had 2 drives fail in just bog standard HP Compaq dc7700 desktops (without cramming in extra stuff). Sometimes they just up and die, other times I must have helped them along with heat.

Re:Never had a drive fail (3, Informative)

Depili (749436) | more than 6 years ago | (#22975136)

Excess heat can cause a drive's lubricant to go bad, which causes weird noises; logic board failures and head-positioning failures also make quite a racket.

In my experience most drives fail without any indication from SMART tests, i.e. logic board failures; bad sectors are quite rare nowadays.

Re:Never had a drive fail (1)

mpeskett (1221084) | more than 6 years ago | (#22974870)

Well, that's somewhat reassuring - I have 3 drives, but they have at least one drive space on either side and a fan blowing air into the case directly over/between them. Ought to be nice and cool.

Never had a failure myself. I thought a portable drive had gone bad once but it turned out to be the USB lead... a bit annoying, but I got a bigger one to replace it, meaning I now have more space, which is good.

Re:Never had a drive fail (-1, Offtopic)

Anonymous Coward | more than 6 years ago | (#22974600)

"Huge WANG hard drive" wrong forum?

Re:Never had a drive fail (5, Funny)

serviscope_minor (664417) | more than 6 years ago | (#22974726)

I'm about to lug a huge Wang hard drive out to the trash pickup on Monday - weighs over 100 pounds... still runs. Actually it uses removable platters but still...

<Indiana Jones> IT BELONGS IN A MUSEUM!</Indiana Jones>

It belongs in a museum (1)

Cowclops (630818) | more than 6 years ago | (#22975084)

Panama Hat: SO DO YOU!

Re:Never had a drive fail (1)

danwat1234 (942579) | more than 6 years ago | (#22974770)

Dude don't chuck it! "It Belongs in a museum!"

Re:Never had a drive fail (3, Informative)

kesuki (321456) | more than 6 years ago | (#22974776)

And I had 5 fail this year; welcome to the law of averages. Note I own about 15 hard drives, including the 5 that failed.

Re:Never had a drive fail (1)

Kibblet (754565) | more than 6 years ago | (#22974794)

I wish I had your luck.

Re:Never had a drive fail (2, Informative)

Kjella (173770) | more than 6 years ago | (#22974846)

1.6GB drive: failed
3.8GB drive: failed
45GB drive: failed
2x500GB drive: failed

Still working:
9GB
27GB
100GB
120GB
2x160GB
2x250GB
3x500GB
2x750GB
3x500GB external

However, in every case it's been the worst possible drive. The 45GB drive was my primary drive at the time with all my recent stuff. The 2x500GB were in a RAID5 - you know what happens in a RAID5 when two drives fail? Yep. Right now I'm running 3xRAID1 for the important stuff (+ backup), JBOD on everything else.

Re:Never had a drive fail (1)

Thought1 (1132989) | more than 6 years ago | (#22974854)

I've only had two drives fail, but that was due to a tree falling on the power lines close to my house and shorting them (watching the bright showers of sparks into the road in the dark was fun, though). I think I've bought somewhere around 50 of them over the last 15 years (since they actually got inexpensive enough to be worth buying).

Re:Never had a drive fail (0)

Anonymous Coward | more than 6 years ago | (#22974858)

I had a friend who said the same thing once, but he lived to regret it. You can email him yourself and maybe he'll want to share the details -- jaredj@aieranco.com

Re:Never had a drive fail (4, Funny)

STrinity (723872) | more than 6 years ago | (#22974862)

I'm about to lug a huge Wang
There needs to be a -1 "Too Easy" moderation option.

At work (0)

Anonymous Coward | more than 6 years ago | (#22974888)

I'm at work right now... copying terabytes of data from an array that has failing drives and cannot rebuild itself due to the number of simultaneous drive failures. I have been here for 32 hours. So, please don't give me this "hard drives never fail" crap!

Re:Never had a drive fail (0)

Anonymous Coward | more than 6 years ago | (#22974890)

I don't know how you've done that, but good for you.

I've been around tech for 20 years and in the industry for 15. I've personally seen at least a dozen drives fail, across the following brands: Western Digital, Seagate, IBM, Hitachi and especially Maxtor (these WILL fail, without fail).

We currently have a failing drive in one server, and I've just replaced failing drives in two workstations, this week alone. These drives are all 3-5 years old, on 24/7, and hammered by an application around the clock.

At home I've never had one fail, but then either the workstation or the drive gets replaced within 3-5 years.

Re:Never had a drive fail (1)

tdelaney (458893) | more than 6 years ago | (#22974894)

For an opposing anecdote, my family had 3 fairly new drives fail within 3 months of each other - 1 Seagate (approx 1 year old), 1 Samsung (approx 6 months old) and 1 Western Digital (3 weeks old).

During this period, I learned not to buy WD drives in Australia again - whereas Seagate and Samsung handle warranty returns locally, and each took about 3 days to get a new drive to me, WD wanted me to send the drive to Singapore, and estimated a 4-week turnaround. Fortunately, I was able to convince the retailer to take it back (for a restocking fee), and was able to buy the same-sized Samsung for less than the modified refund.

OTOH, I've still got 8GB drives that work just fine.

OTGH, I bought 2 IBM Deathstars (75GXP) several years ago (at the same time, presumably the same batch) - one died very quickly, but the other is still in use today.

Re:Never had a drive fail (1)

cheater512 (783349) | more than 6 years ago | (#22975134)

I make a point of buying WD drives in Australia.
Never had one fail yet. Very impressed. :)

There are only two kind of peeps... (5, Insightful)

**loki969** (880141) | more than 6 years ago | (#22974538)

...those that make backups and those that never had a hard drive fail.

Re:There are only two kind of peeps... (5, Insightful)

Raineer (1002750) | more than 6 years ago | (#22974590)

I see it the other way... Once I start taking backups my HDDs never fail; it's when I forget that they crash.

Re:There are only two kind of peeps... (2, Funny)

BSAtHome (455370) | more than 6 years ago | (#22975060)

Real men don't make backups; they cry.

Re:There are only two kind of peeps... (0)

Anonymous Coward | more than 6 years ago | (#22974604)

What about those of us who are both? Or some people I know who have had 3 failures yet still don't back up (it has become a joke among their friends).

Re:There are only two kind of peeps... (1)

gparent (1242548) | more than 6 years ago | (#22974690)

And I'm part of those who've never had a hard drive fail :)

Re:There are only two kind of peeps... (1)

Metasquares (555685) | more than 6 years ago | (#22974754)

Don't forget about those of us who just keep their important work checked into a remote version control system.

Re:There are only two kind of peeps... (1)

johannesg (664142) | more than 6 years ago | (#22974914)

That counts as backup... No one says it *has* to be tape, you know.

Re:There are only two kind of peeps... (2, Insightful)

OS24Ever (245667) | more than 6 years ago | (#22974766)

More like 'those that never owned an IBM Deskstar drive'

Re:There are only two kind of peeps... (1)

BSAtHome (455370) | more than 6 years ago | (#22975072)

I remember those 75G IBM drives. Had an array of them, totaling 16 drives, 14 of them failed within 12 months.

Re:There are only two kind of peeps... (0)

Anonymous Coward | more than 6 years ago | (#22974868)

I use a RAID 1 mirror.
I've had many HDs fail, and I don't make backups.

Am I of the "those that make backups" type, since the mirror can be seen as a continual backup, or am I of the "those that never had a hard disk fail" type, because I've never had a RAID 1 mirror fail?

Marketplace can't function without good data (5, Insightful)

dpbsmith (263124) | more than 6 years ago | (#22974554)

If everyone knows how much a disk drive costs, and nobody can find out how long a disk drive really will last, there is no way the marketplace can reward the vendors of durable and reliable products.

The inevitable result is a race to the bottom. Buyers will reason they might as well buy cheap, because they at least know they're saving money, rather than paying for quality and likely not getting it.

Re:Marketplace can't function without good data (0)

Anonymous Coward | more than 6 years ago | (#22974572)

The problem is that the consumers are idiots. How many of them do you suppose understand what Mean Time Between Failure statistics indicate? Example:

Mean time between failure ratings suggest that disks can last from 1 million to 1.5 million hours, or 114 to 170 years
That is obviously wrong.

Re:Marketplace can't function without good data (1)

piojo (995934) | more than 6 years ago | (#22975022)

The inevitable result is a race to the bottom. Buyers will reason they might as well buy cheap, because they at least know they're saving money, rather than paying for quality and likely not getting it.
That's the description of a lemon market. However, I don't think it applies here, because brands gain reputations in this realm. If one brand of hard drives becomes known as flaky, people (and OEMs) will stop buying it.

Re:Marketplace can't function without good data (3, Interesting)

commodoresloat (172735) | more than 6 years ago | (#22975046)

If everyone knows how much a disk drive costs, and nobody can find out how long a disk drive really will last, there is no way the marketplace can reward the vendors of durable and reliable products.
And that may be the exact reason why the vendors are providing bad data. On the flip side, however, if people knew how often drives failed, perhaps we'd buy more of them in order to always have backups.

MTBF For Unused Drive? (1)

sarahbau (692647) | more than 6 years ago | (#22974558)

Maybe they mean the MTBF for drives that are just on, but not being used. I've never put any stock into those numbers, because I've had too many drives fail to believe that they're supposed to be lasting 100 years. I've had 3 die in the last 3 years alone (all in my server, so probably getting more than average use, but still...)

Re:MTBF For Unused Drive? (4, Interesting)

zappepcs (820751) | more than 6 years ago | (#22974664)

The problem is that the MTBF is calculated on an accelerated lifecycle test schedule. Life in general does not actually act like the accelerated test expanded out to 1day=1day. It is an approximation, and prone to errors because of the aggregated averages created by the test.

On average, a disk drive can last as long as the MTBF number. What are the chances that you have an average drive? They are slim. Each component in the drive, every resistor, every capacitor, every part has an MTBF. They also have tolerance values: that is to say they are manufactured to a value with a given tolerance of accuracy. Each tolerance has to be calculated as one component out of tolerance could cause failure of complete sections of the drive itself. When you start calculating that kind of thing it becomes similar to an exercise in calculating safety on the space shuttle... damned complex in nature.

The tests remain valid because of a simple fact. In large data centers where you have large quantities of the same drive spinning in the same lifecycles, you will find that a percentage of them fail within days of each other. That means that there is a valid measurement of the parts in the drive, and how they will stand the test of life in a data center.

Is your data center an 'average' life for a drive? The accelerated lifecycle tests cannot tell you. All the testing does is look for failures of any given part over a number of power cycles, hours of use etc. It is quite improbable that your use of the drive will match that of the expanded testing life cycle.

The MTBF is a good estimation of when you can be certain of a failure of one part or another in your drive. There is ALWAYS room for it to fail prior to that number. ALWAYS.

Like any electronic device for consumers, if it doesn't fail in the first year, it's likely to last as long as you are likely to be using it. Replacement rates of consumer societies mean that manufacturers don't have to worry too much about MTBF as long as it's longer than the replacement/upgrade cycle.

If you are worried about data loss, implement a good data backup program and quit worrying about drive MTBFs.

Re:MTBF For Unused Drive? (2, Insightful)

WaltBusterkeys (1156557) | more than 6 years ago | (#22974684)

Great post above. It also depends on how you count "failure." I've had external drives fail where the disk would still spin up, but the interface was the failure point. I took the disk out of the external enclosure and it worked just fine with a direct IDE (I know, who uses that anymore?) connection.

If I were running a data-based business I'd count that as a "failure" since I had to go deal with the drive, but the HD company probably wouldn't since no data was permanently lost.

Re:MTBF For Unused Drive? (2, Insightful)

BSAtHome (455370) | more than 6 years ago | (#22975142)

There is another failure rate that you have to take into account: the unrecoverable bit-read error rate. This shows up as an error on the upstream connection, which can cause the controller to fail the drive. An unrecoverable read fails the ECC mechanism and can, under some circumstances, be recovered by re-reading the sector.

The error rate is on the order of one unrecoverable error per 10^14 bits read. Calculating this for a busy system reading 1 MByte/s gives you approximately 10^7 seconds per unrecoverable read failure, or roughly three occurrences per year on average. So forget MTBF on busy systems and hope that your controller is able to do re-reads on a disk. Otherwise, your busy system/array is not going to last very long.
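A quick check of that arithmetic, under the same assumptions the comment states (one unrecoverable error per 10^14 bits read, and a sustained read load of 1 MByte/s):

<ecode>
# Unrecoverable read error (URE) arithmetic, assuming 1 URE per 1e14 bits read
# and a sustained read rate of 1 MByte/s, as in the comment above.

BITS_PER_URE     = 1e14
READ_RATE_BPS    = 1e6 * 8            # 1 MByte/s in bits per second
SECONDS_PER_YEAR = 365.25 * 24 * 3600

seconds_per_ure = BITS_PER_URE / READ_RATE_BPS
print(f"Seconds between UREs: {seconds_per_ure:.2e} (~{seconds_per_ure / 86400:.0f} days)")
print(f"UREs per year at this load: {SECONDS_PER_YEAR / seconds_per_ure:.1f}")
# Works out to roughly two to three unrecoverable reads per year of continuous load.
</ecode>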

Re:MTBF For Unused Drive? (0)

Anonymous Coward | more than 6 years ago | (#22975218)

No, that's not how MTBF for hard drives is calculated. And it's also not the "average" number of hours someone can expect from an "average" drive.

See here for an extensive discussion, http://forums.storagereview.net/index.php?showtopic=18811 [storagereview.net] I'm sure there's more info in that forum.

Re:MTBF For Unused Drive? (1)

NovaSupreme (996633) | more than 6 years ago | (#22974694)

MTBF has *nothing* to do with life expectancy. It's the failure rate of good drives, ones that are not expected to fail. More precisely, it's the failure rate of drives under conditions where their failure rate is constant (which excludes the high failure rates at the very beginning and at the end of life). I could make hard drives that are guaranteed to work for exactly 1 day, tell that to my customers, and their MTBF would be infinite, since there would never be unexpected failures! Another example: the MTBF of a healthy adult in the USA may be 10,000 years (the chances of a fatal road accident are small), but life expectancy is only 80 years!

Failure rates != warranty period. (0)

Kenja (541830) | more than 6 years ago | (#22974582)

As drive sizes have gone up, warranty periods have generally gone down. With few exceptions (Seagate does three years), drives have a one-year expected lifetime.

Re:Failure rates != warranty period. (5, Informative)

ABasketOfPups (1004562) | more than 6 years ago | (#22974670)

Warranty periods for 750GB and 1TB drives from Western Digital [zipzoomfly.com], Samsung [zipzoomfly.com], and Hitachi [zipzoomfly.com] are 3 to 5 years, according to the info on zipzoomfly.com.

A one-year warranty doesn't seem that common. External drives seem to have one-year warranties, but even SATA drives at Best Buy mostly have 3-year warranties.

warranties (4, Insightful)

qw0ntum (831414) | more than 6 years ago | (#22974602)

The best metric is probably going to be the length of warranty the manufacturer offers. They have financial incentive to find out the REAL mean time until failure in calculating the warranty.

Re:warranties (1)

dh003i (203189) | more than 6 years ago | (#22974686)

The best metric is probably going to be the length of warranty the manufacturer offers. They have financial incentive to find out the REAL mean time until failure in calculating the warranty.
They do provide "real" MTBF numbers. It's just MTBF isn't for what you think it's for. See my post explaining this.

Re:warranties (1)

qw0ntum (831414) | more than 6 years ago | (#22974780)

Yes... we say the same thing (last paragraph). I know very well what MTBF means and how it's calculated. In your words, I put my stock in the warranty, because "that's what they're willing to put their money behind." The warranty is set so that most devices don't stop working until after the warranty period ends. This more accurately reflects the amount of time a drive lasts under normal use.

I'm not saying that MTBF is a completely unreliable number; I'd imagine there is a correlation between higher MTBF numbers and longer warranties.

Great post, by the way. Very informative and well worded. :)

Re:warranties (1)

afidel (530433) | more than 6 years ago | (#22975108)

It's worse than your post implies because the manufacturers actually specify that drives be replaced every so often to get the MTBF rating. Basically the only thing an MTBF rating is good for is figuring out statistically what the chances are of a given RAID configuration losing data before a rebuild can be completed.
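As a rough illustration of that last use of MTBF, here is a sketch of the chance that a RAID 5 set loses a second drive while the first is being rebuilt. It assumes independent drives with a constant failure rate; the array size, MTBF and rebuild time are made-up example values, not figures for any particular product.

<ecode>
import math

# Chance that a RAID 5 set loses a second member during a rebuild window.
# Assumes independent drives with a constant failure rate (exponential model);
# all numbers are illustrative.

MTBF_HOURS    = 1_000_000   # per-drive MTBF
REBUILD_HOURS = 24          # time to rebuild onto the spare
DRIVES_LEFT   = 7           # surviving members of an 8-drive RAID 5 set

rate = 1 / MTBF_HOURS
p_one = 1 - math.exp(-rate * REBUILD_HOURS)       # a given survivor fails in the window
p_any = 1 - (1 - p_one) ** DRIVES_LEFT            # any survivor fails in the window

print(f"P(second failure during rebuild) ~ {p_any:.4%}")
# Small per rebuild, but it adds up across many arrays and many rebuilds.
</ecode>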

Re:warranties (1)

Murphy Murph (833008) | more than 6 years ago | (#22974788)

The best metric is probably going to be the length of warranty the manufacturer offers. They have financial incentive to find out the REAL mean time until failure in calculating the warranty.

ASSuming anything approaching a significant fraction of the drives that fail during the warranty period are actually claimed. Otherwise a warranty is nothing more than advertising.
I strongly suspect this is not the case, and that you are simply replacing one false metric with another.

Re:warranties (3, Insightful)

ooloogi (313154) | more than 6 years ago | (#22975256)

Warranties beyond about two years become largely meaningless for this purpose, because once a drive gets older, people often won't bother claiming warranty on what is by then such a small drive. The cost of shipping/transport is likely to be more than the marginal $/GB on a new drive.

So in this way a manufacturer can get away with a long warranty, without necessarily incurring a cost for unreliability.

Easy to get the quoted figures ... (1)

Alain Williams (2972) | more than 6 years ago | (#22974606)

put the 500GB drive into your bottom drawer ... the unused disk will break when thrown out by your great great grand kids - who will simultaneously wonder if you really did use storage of such tiny capacity.

What MTBF is for. (5, Insightful)

sakusha (441986) | more than 6 years ago | (#22974640)

I remember back in the mid 1980s when I received a service management manual from DEC, it had some information that really opened my eyes about what MTBF was really intended for. It had a calculation (I have long since forgotten the details) that allowed you to estimate how many service spares you would need to keep in stock to service any installed base of hardware, based on MTBF. This was intended for internal use in calculating spares inventory level for DEC service agents. High MTBF products needed fewer replacement parts in inventory, low MTBF parts needed lots of parts in stock. Presumably internal MTBF ratings were more accurate than those released to end users.

So anyway.. MTBF is not intended as an indicator of a specific unit's reliability. It is a statistical measurement to calculate how many spares are needed to keep a large population of machines working. It cannot be applied to a single unit in the way it can be applied to a large population of units.

Perhaps the classic example is the old tube-based computers like ENIAC: if a single tube has an MTBF of one year, but the computer has 10,000 tubes, you'd be changing tubes (on average) more than once an hour, so you'd rarely even get an hour of uptime. (I hope I got that calculation vaguely correct.)
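The tube arithmetic checks out; a short sketch of both it and the spares-stocking use described above (the installed-base and part figures in the second half are invented for illustration):

<ecode>
# Sanity check of the ENIAC tube example, plus the spares-planning use of MTBF.

HOURS_PER_YEAR = 24 * 365.25

# 10,000 tubes, each with a one-year MTBF: how often does *something* fail?
tube_mtbf_hours = HOURS_PER_YEAR
tube_count = 10_000
print(f"Machine-level mean time between tube failures: "
      f"{tube_mtbf_hours / tube_count:.2f} hours")   # ~0.88 h, i.e. more than once an hour

# Spares planning for an installed base (DEC-style), with invented example numbers.
installed_units = 500
part_mtbf_hours = 50_000
replacements_per_year = installed_units * HOURS_PER_YEAR / part_mtbf_hours
print(f"Expected part replacements per year across the base: {replacements_per_year:.0f}")
</ecode>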

Re:What MTBF is for. (1)

dh003i (203189) | more than 6 years ago | (#22974676)

Good post, I think we were on the same wavelength, as I posted something very similar to that below.

Re:What MTBF is for. (3, Informative)

sakusha (441986) | more than 6 years ago | (#22974740)

Thanks. I read your comment and got to thinking about it a bit more. I vaguely recall that in those olden days, MTBF was not an estimate, it was calculated from the service reports of failed parts. The calculations were released in monthly reports so we could increase our spares inventory to cover parts that were proving to be less reliable than estimated. But then, those were the days when every installed CPU was serviced by authorized agents, so data gathering was 100% accurate.

Re:What MTBF is for. (4, Informative)

davelee (134151) | more than 6 years ago | (#22974830)

MTBFs are designed to specify a RATE of failure, not the expected lifetime. This is because disk manufacturers don't test MTBF by running 100 drives until they die, but rather by running, say, 10,000 drives and counting the number that fail over a period of some months. As drives age, the failure rate will clearly increase and thus the "MTBF" will shrink.

Long story short -- a 3-year-old drive will not have the same MTBF as a brand new drive. And an MTBF of 1 million hours doesn't mean that the median drive will live to 1 million hours.

Re:What MTBF is for. (2, Informative)

flyingfsck (986395) | more than 6 years ago | (#22974912)

That is an urban legend. Colossus and Eniac were far more reliable than that. The old tube based computers seldom failed, because the tubes were run at very low power levels and tubes degrade slowly, they don't pop like a light bulb (which is run at a very high power level to make a little visible light). Colossus for example was built largely from Plessey telephone exchange registers and telex machines. These registers were in use in phone exchanges for decades after the war. I saw some tube based exchanges in the early 80s that were still going strong.

Misunderstanding MTBF (4, Interesting)

dh003i (203189) | more than 6 years ago | (#22974654)

I think that a lot of people are misunderstanding MTBF. A HD might have an MTBF of 100 years. This doesn't mean that the company expects the vast majority of consumers to have that HD running for 100 years without problems.

MTBF numbers are generated by running, say, thousands of hard drives of the same model and batch/lot and seeing how long it takes before one fails. This may be a day or so. You then figure out how many total HD running hours it took before that failure. If you have 1,000 HDs running, and it takes 40 hours before one fails, that's a 40,000-hour MTBF. This number isn't generated by running, say, 10 hard drives, waiting for all of them to fail, and averaging that number.
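A minimal sketch of that drive-hours-per-failure arithmetic, using the same 1,000-drive, 40-hour example (real qualification tests run longer and count more than one failure, but the arithmetic is the same):

<ecode>
# MTBF as "accumulated drive-hours divided by observed failures", per the example above:
# 1,000 drives on a test stand, first failure after 40 hours.

drives_on_test   = 1_000
hours_until_fail = 40
failures         = 1

mtbf_hours = drives_on_test * hours_until_fail / failures
print(f"Estimated MTBF: {mtbf_hours:,.0f} hours (~{mtbf_hours / 8766:.1f} 'years')")
# The estimate says nothing about wear-out beyond the 40 hours actually observed.
</ecode>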

Thus, because of the way MTBF numbers are generated, they may or may not reflect hard-drive reliability beyond a few weeks. It depends on our assumptions about hard-drive stress and usage beyond the length of time before the 1st HD of the 1,000 or so they were testing failed. Most likely, it says less and less about hard-drive reliability beyond that initial point of failure (which is on the order of tens or hundreds of hours, not hundreds of thousands of hours or millions of hours!).

To be sure, all-else equal, a higher MTBF is better than a lower one. But as far as I'm concerned, those numbers are more useful for predicting DOA, duds, or quick-failure; and are more useful to professionals who might be employing large arrays of HD's. They are not particularly useful for getting a good idea of how long your HD will actually last.

HD manufacturers also publish an expected life cycle for their HDs, but I usually put the most stock in the length of the warranty. That's what they're willing to put their money behind. Albeit, it's possible their strategy is just to warranty less than how long they expect 90% of HDs to last, so they can sell them cheaper. But if you've had an HD longer than the manufacturer's published expected life, what they're saying is that you've basically gotten good value, and you'll probably want to have a replacement on hand and be backed up.

Re:Misunderstanding MTBF (1)

flyingfsck (986395) | more than 6 years ago | (#22974950)

Nope, MTBF is usually *calculated*, and the number is just that - a number - it means fuck-all in real time. The numbers are used comparatively, to show the designers which potentially stressed components need to be looked at during the design phase. Eventually the numbers are misused by the marketing department to mislead the customers, but that is not the intent of the designers and is not the purpose of the MTBF calculations.

Re:Misunderstanding MTBF (2, Insightful)

scsirob (246572) | more than 6 years ago | (#22975018)

"A HD might have a MTBF of 100 years"

That's not how it works. A certain type of HD may have a specified MTBF, a single drive never does. It's all about quantities. A drive may be designed for 5 years of economic life. That's 43800 hours.

If that type of drive is specified for 1 million hours MTBF, approximately one in every 23 drives will fail within those 5 years.

If you run a disk array with about 115 of these drives, you will see on average about one drive failure per year. Run a data centre with 3,500 of them and you will see a failure roughly every two weeks.
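Checking that arithmetic with the exponential model a single MTBF figure implies (the drive counts and the 43,800-hour service life are the ones above):

<ecode>
import math

# Fleet failure arithmetic for a drive type specified at 1,000,000 hours MTBF.

MTBF_HOURS     = 1_000_000
SERVICE_HOURS  = 5 * 8760     # the 43,800-hour (5-year) economic life above
HOURS_PER_YEAR = 8766

p_fail = 1 - math.exp(-SERVICE_HOURS / MTBF_HOURS)
print(f"Chance a given drive fails within 5 years: {p_fail:.2%} (about 1 in {1 / p_fail:.0f})")

for fleet in (115, 3_500):
    per_year = fleet * HOURS_PER_YEAR / MTBF_HOURS
    print(f"{fleet:>5} drives -> ~{per_year:.1f} failures/year (one every ~{365 / per_year:.0f} days)")
</ecode>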

Temperature is the key (4, Interesting)

arivanov (12034) | more than 6 years ago | (#22974660)

Disk MTBF is quoted for 20C.

Here is an example from my server. At 18°C ambient, in a well cooled and well designed case with dedicated hard drive fans, the Maxtors I use for RAID1 run at 29°C. My media server, which is in the loft with sub-16°C ambient, runs them at 24-34°C depending on the position in the case (once again, a proper high end case with dedicated hard drive fans).

Very few hard disk enclosures can bring the temperature down to 24-25C.

SANs or high density servers usually end up running disks at 30C+ while at 18C ambient. In fact I have seen disks run at 40C or more in "enterprise hardware".

Given that, it is not surprising that they fail at a rate different from the quoted one. In fact, I would have been very surprised if they failed at the quoted rate.

Re:Temperature is the key (1)

0123456 (636235) | more than 6 years ago | (#22974710)

From what I remember, the Google study showed that temperature made far less difference than had previously been believed (of course my memory may be past its MTBF).

Re:Temperature is the key (5, Interesting)

ABasketOfPups (1004562) | more than 6 years ago | (#22974756)

Google says that's just not what they've seen [google.com] . "The figure shows that failures do not increase when the average temperature increases. In fact, there is a clear trend showing that lower temperatures are associated with higher failure rates. Only at the very high temperatures is there a slight reversal of this trend."

On the graph it's clear that 30-35C is best at three years. But up until then, 35-40C has lower failure rates, and both have much lower rates than the 15-30C range.

Re:Temperature is the key (2, Insightful)

drsmithy (35869) | more than 6 years ago | (#22975166)

However, Google's data doesn't appear to have a lot of points when temperatures get over 45 degrees or so (as to be expected, since most of their drives are in a climate controlled machine room).

The average drive temperature in the typical home PC would be *at least* 40 degrees, if not higher. While it's been some time since I checked, I seem to recall the drive in my mum's G5 iMac was around 50 degrees when the machine was _idle_.

Google's data is useful for server room environments, but I'd be hesitant to extrapolate it to drives that aren't kept in a server room with a ~20 degrees C ambient temperature and active cooling.

Re:Temperature is the key (3, Informative)

Jugalator (259273) | more than 6 years ago | (#22974772)

I agree. I had a Maxtor disk that ran at something like 50-60 C and wondered when it was going to fail; I never really treated it as my safest drive. And lo and behold, after ~3-4 years the first warnings about bad sectors started cropping up, and a year later Windows panicked and told me to back it up immediately if I hadn't already, because I guess the number of SMART errors was building up.

On the other hand, I had a Samsung disk that ran at 40 C tops, in a worse drive bay too! The Maxtor one had free air passage in the middle bay (no drives nearby), where the Samsung was side-by-side with the metal casing.

So I'm thinking there can be some measurable differences between drive brands, and a study of this, along with its relationship to brand failure rates, would be most interesting!

Re:Temperature is the key (1)

ViperAFK (960095) | more than 6 years ago | (#22974856)

The problem is you had a maxtor... No wonder it failed.

Re:Temperature is the key (1)

0123456 (636235) | more than 6 years ago | (#22975086)

"The problem is you had a maxtor... No wonder it failed."

I've had over a dozen Maxtors in the last decade and none have failed. Of course I replace them every 3-4 years because by then they've got too damn small.

The only drive I've had that did fail was an IBM, and even then I had plenty of advance warning so I could replace it before it was unusable.

Re:Temperature is the key (2, Informative)

drsmithy (35869) | more than 6 years ago | (#22975176)

On the other hand, I had a Samsung disk that ran at 40 C tops, in a worse drive bay too! The Maxtor one had free air passage in the middle bay (no drives nearby), where the Samsung was side-by-side with the metal casing.

Air is a much better insulator than metal.

Re:Temperature is the key (1)

afidel (530433) | more than 6 years ago | (#22975162)

Yeah, my datacenter is 23-24C and the hottest disk bays in my SAN average about 37C. I don't care, because my SAN is designed to lose an entire bay without losing data and the manufacturer is responsible for warranty replacement parts. So far in 22 months of operation it's lost three drives out of 160, and two of those were basically DOA, with the other dying at about two months.

Well, duh! (1)

Jurily (900488) | more than 6 years ago | (#22974672)

Everyone who's ever had a hard drive already knows that.

Bring Me the Drivemaker (0)

KermodeBear (738243) | more than 6 years ago | (#22974674)

From a wonderful satire site Married to the Sea [marriedtothesea.com] , comes this little gem [marriedtothesea.com] .

Drive makers have always relied on questionable statistics and outright misrepresentation to make sales, and as we all know, statistics are worse than even damned lies.

I am not a supporter of industry regulation or class action lawsuits; I think that both are used far too much these days. But it would be nice if these companies were given a hard kick in the pants. They've gotten away with this for far too long.

My drives work great ... (0, Offtopic)

buchner.johannes (1139593) | more than 6 years ago | (#22974712)

My drives work great ... until someone comes along and puts stickers on other drives that say they are more "ready" than my drives.

Chicken/egg problem...sort of (1)

jmpeax (936370) | more than 6 years ago | (#22974722)

I don't think [disk array manufacturers are] going to be forthright with giving people that data because it would reduce the opportunity for them to add value by 'interpreting' the numbers.
By the same token, by acknowledging the realistic lifespan of their products, they could get customers to replace hard disks more often and therefore give them more business.

However, I strongly suspect that the problem lies in the fact that one manufacturer would have to be the first to change the documented lifespan of their products, and the danger is that unless their competitors follow, their products could be interpreted as inferior and they could lose a lot of business.

What about google? (1)

MMC Monster (602931) | more than 6 years ago | (#22974738)

Didn't Google present data on their disk failure rates? How about other large purchasers? Who cares if the manufacturers don't report them; if a few very large purchasers report them, that may be more useful information anyway.

Add value & Interpreting (1)

Nicolay77 (258497) | more than 6 years ago | (#22974744)

I would put the quotation marks around "add value" instead of adding them around "interpreting".

They are obviously interpreting the numbers.

How the hell they can be adding value is beyond me.

Adding price, maybe, but VALUE????

Re:Add value & Interpreting (1)

drsmithy (35869) | more than 6 years ago | (#22975186)

How the hell can they be adding value is way beyond me.

By having larger amounts of data and more skill in interpreting it.

Build your own USB drives (3, Informative)

omnirealm (244599) | more than 6 years ago | (#22974786)

While we are on the topic of failing drives, I think it would be appropriate to include a warning about USB drives and warranties.

I purchased a 500GB Western Digital My Book about a year and a half ago. I figured that a pre-fab USB enclosed drive would somehow be more reliable than building one myself with a regular 3.5" internal drive and my own separately purchased USB enclosure (you may dock me points for irrational thinking there). Of course, I started getting the click-of-death about a month ago, and I was unpleasantly surprised to discover that the warranty on the drive was only for 1 year, rather than the 3 year warranty that I would have gotten for a regular 3.5" 500GB Western Digital drive at the time. Meanwhile, my 750GB Seagate drive in a AMS VENUS enclosure has been chugging along just fine, and if it fails sometime in the next four years, I will still be able to exchange it under warranty.

The moral of the story is that, when there is a difference in the warranty periods (i.e., 1 year vs. 5 years), it makes a lot more sense to build your own USB enclosed drive rather than order a pre-fab USB enclosed drive.

Re:Build your own USB drives (0)

Anonymous Coward | more than 6 years ago | (#22974924)

Agree 100% always build your own!

Yet another reason is that you void the warranty on a pre-built drive enclosure if you open it up. I had a 2.5" hammer enclosure that went bad. The drive itself was fine, but the usb bridge died.

I needed the data off of it so I took the drive out and popped it into another enclosure I had laying around, got my data and was happy.

When I called Hammer up to try and get a warranty repair, or even to PAY for a replacement for the USB controller portion, they told me that my warranty was void outright, and they wouldn't even sell me the part that broke.

Meh... a stand alone enclosure is like $10 anyway.

Re:Build your own USB drives (1)

line-bundle (235965) | more than 6 years ago | (#22974940)

Did you check/confirm with Western Digital? I bought the My Book World edition. It was clearly written "3 year warranty" on the box, but when I registered it it only said 1 year. After raising a stink they changed my online registration warranty to 3 years.

Needless to say, it's my last WD drive. Their service sucks.

I HATE Those Google Ads (1)

Fat Wang (1230914) | more than 6 years ago | (#22974798)

SLASHDOT. Get rid of those god damn google ads.

MTBF rate calculation method is flawed (2, Insightful)

DonChron (939995) | more than 6 years ago | (#22974826)

Drive manufacturers take a new hard drive model, run a hundred drives or so for some number of weeks, and measure the failure rate. Then they extrapolate that failure rate out to thousands of hours. So, let's say one in 100 drives fails in a 1,000-hour test (just under six weeks): MTBF = 100,000 hours, or 11.4 years!

To make this sort of test work, it must be run over a much longer period of time. But in the process of designing, building, testing and refining disk drive hardware and firmware (software), there isn't that much extra time to test drive failure rates. Want to wait an extra 9 months before releasing that new drive, to get accurate MTBF numbers? Didn't think so. How many different disk controllers do they use in the MTBF tests, to approximate different real-world behaviors? Probably not that many.

Could they run longer tests, and revise MTBF numbers after the initial release of a drive? Sure, and many of them do, but that revised MTBF would almost always be lower, making it harder to sell the drives. On the other hand, newer drives are certainly available every quarter, so it may not be a bad idea to lower the apparent value of older drive models.

So, it's better to assume a drive will fail before you're done using it. They're mechanical devices with high-speed moving parts and very narrow tolerable ranges of operation (the drive head has to be far enough away from the platters not to hit them, but close enough to read smaller and smaller areas of data). Anyone who's worked in a data center, or even a small server room, knows that drives fail. When I've had around two hundred drives of varying ages, sizes and manufacturers in a data center, I observed a failure rate of five to ten drives per year. That implies an MTBF well below the rated figures for enterprise disk array drives (SCSI, FC, SAS, whatever), but drives fail. That's why we have RAID. Storage Review has a good overview of how to interpret MTBF values from drive manufacturers [storagereview.com].
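To illustrate the point that a short qualification test cannot reveal wear-out, here is a sketch comparing two hypothetical failure distributions that produce the same 1% failure fraction in a 1,000-hour test yet diverge badly over five years: a constant-rate (exponential) model versus a Weibull model with an increasing hazard. All parameters are invented for illustration.

<ecode>
import math

# Two hypothetical drive populations calibrated so that exactly 1% fail during a
# 1,000-hour qualification test, i.e. the test alone cannot tell them apart.
# Weibull CDF: F(t) = 1 - exp(-(t/eta)**beta); beta > 1 means wear-out.

TEST_HOURS    = 1_000
TEST_FRACTION = 0.01          # 1% observed to fail during the test
FIVE_YEARS    = 5 * 8760

def eta_for(beta):
    """Scale parameter chosen so F(TEST_HOURS) == TEST_FRACTION for this shape."""
    return TEST_HOURS / (-math.log(1 - TEST_FRACTION)) ** (1 / beta)

def fraction_failed(t, beta):
    return 1 - math.exp(-(t / eta_for(beta)) ** beta)

for beta, label in ((1.0, "constant rate (no wear-out)"), (3.0, "wear-out (beta = 3)")):
    print(f"{label:28s}: {fraction_failed(TEST_HOURS, beta):.1%} fail in the test, "
          f"{fraction_failed(FIVE_YEARS, beta):.1%} fail within 5 years")
</ecode>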

I don't know what you people do to your drives (2, Interesting)

gelfling (6534) | more than 6 years ago | (#22974834)

But since 1981 I have had exactly zero catastrophic PC drive crashes. That's not to say I haven't seen some bad/relocated sectors, but hard failures? None. Granted that's only 20 drives. But in fact in my experience in PC's, midranges and mainframes in almost 30 years I have seen zero hard drive crashes.

All my hard drives eventually failed (1)

dfcamara (1268174) | more than 6 years ago | (#22974934)

From my first 200MB Seagate (bought in 1993) to a 20GB Maxtor that failed last year. Fortunately they fail when they're no longer my primary drive. I would say they last something about 5-6 years...

MTBF is a useful statistical measure (3, Insightful)

Kupfernigk (1190345) | more than 6 years ago | (#22974942)

which many people confuse with MTTF (mean time to failure) - which is relevant in predicting the life of equipment. It needs to be stated clearly that MTBF applies to populations; if I have 1000 hard drives with a MTBF of 1 million hours, I would on average expect one failure every thousand hours. These are failures rather than wearouts, which are a completely different phenomenon.

Anecdotal reports of failures also need to consider the operating environment. If I have a server rack, and most servers in the rack have a drive failure in the first year, is it the drive design or the server design? Given the relative effort that usually goes into HDD design and box design, it's more likely to be due to poor thermal management in the drive enclosure. Back in the day when Apple made computers (yes, they did once, before they outsourced it) their thermal management was notoriously better than that of many of the vanilla PC boxes, and properly designed PC-format servers like the HP Kayaks were just as expensive as Macs. The same, of course, went for Sun, and that was one reason why elderly Mac and Sparc boxes would often keep chugging along as mail servers until there were just too many people sending big attachments.

One possibly related oddity that does interest me is laptop prices. The very cheap laptops are often advertised with optional 3 year warranties that cost as much as the laptop. Upmarket ones may have three year warranties for very little. I find myself wondering if the difference in price really does reflect better standards of manufacture so that the chance of a claim is much less, whether cheap laptops get abused and are so much more likely to fail, or whether the warranty cost is just built into the price of the more expensive models because most failures in fact occur in the first year.

RAID, If You Really Care (2, Insightful)

crunchy_one (1047426) | more than 6 years ago | (#22974982)

Hard drives have been becoming less and less reliable as densities increase. Seagate, WD, Hitachi, Maxtor, Toshiba, heck, they all die, often sooner than their warranties are up. They're mechanical devices, for crying out loud. So here's a bit of good advice: If you really care about your data, use a RAID array with redundancy (RAID 1 or 5). It will cost a bit more, but you'll sleep better at night. Thank you all for your kind attention. That is all.

Typical misleading title (and bad article) (2, Insightful)

oren (78897) | more than 6 years ago | (#22975198)

Disk reliability metrics are much more science than myth. Like all science, this means you actually need to put some minimal effort into understanding them. Unlike myths :-)

Disks have two separate reliability metrics. The first is their expected lifetime. In general, disk failure follows a "bathtub" distribution: drives are much more likely to fail in the first few weeks of operation. If they make it past this phase, they become very reliable - for a while anyway. Once their expected lifetime is reached, their failure rate starts climbing steeply.

The often quoted MTBF numbers express the disk reliability during the "safe" part of this probability distribution. Therefore, a disk with an expected lifetime of, say, 4 years can have an MTBF of 100 years. This sounds theoretical until you consider that if you have 200 such disks, you can expect that on average two of them will fail each year.

People running large data warehouses are painfully aware of these two separate numbers. They need to replace all "expired" disks, and also have enough redundancy to survive disk failures in the duration.

The article goes so far as to state this:

"When the vendor specs a 300,000-hour MTBF -- which is common for consumer-level SATA drives -- they're saying that for a large population of drives, half will fail in the first 300,000 hours of operation," he says on his blog. "MTBF, therefore, says nothing about how long any particular drive will last."

However, this obviously flew over the head of the author:

The study also found that replacement rates grew constantly with age, which counters the usual common understanding that drive degradation sets in after a nominal lifetime of five years, Schroeder says.

Common understanding is that 5 years is a bloody long life expectancy for a hard disk! It would take divine intervention to stop failures from rising after such a long time!

MTBF assumes drives are replaced every few years (3, Informative)

AySz88 (1151141) | more than 6 years ago | (#22975254)

MTBF is only valid during the "lifetime" of a drive. (For example, "lifetime" might mean the five years during which a drive is under warranty.) The MTBF is then the mean time before failure if you replace the drive every five years with other drives of identical MTBF. So a 100-some-year MTBF doesn't mean that an individual drive will last 100+ years; it means that your scheme of replacing every 5 years will work for an average time of 100+ years.
Of course, I think this is another deceptive definition from the hard drive industry... To me, the drive's lifetime ends when it fails, not "5 years".
Source: http://www.rpi.edu/~sofkam/fileserverdisks.html [rpi.edu]
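Here is a small Monte Carlo sketch of that reading of MTBF: every individual drive is retired at five years, yet the time until the replacement scheme first hits an in-service failure averages out near the quoted MTBF. It assumes a constant failure rate for each drive; the numbers are illustrative only.

<ecode>
import random

# Drives with a 1,000,000-hour MTBF (constant failure rate) are retired and
# replaced every 5 years. How long does the *scheme* run before the first
# in-service failure? Under this model, close to the quoted MTBF.

MTBF_HOURS     = 1_000_000
SERVICE_HOURS  = 5 * 8760
HOURS_PER_YEAR = 8766
TRIALS         = 20_000

random.seed(42)

def hours_until_first_failure():
    elapsed = 0.0
    while True:
        lifetime = random.expovariate(1 / MTBF_HOURS)
        if lifetime < SERVICE_HOURS:        # this drive fails while in service
            return elapsed + lifetime
        elapsed += SERVICE_HOURS            # drive retired healthy; swap in a new one

mean_hours = sum(hours_until_first_failure() for _ in range(TRIALS)) / TRIALS
print(f"Mean time to first in-service failure: ~{mean_hours / HOURS_PER_YEAR:.0f} years "
      f"(quoted MTBF is {MTBF_HOURS / HOURS_PER_YEAR:.0f} years)")
</ecode>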