Ask Slashdot: How Do You Test Storage Media?

timothy posted more than 2 years ago | from the give-her-some-storage-tarot-cards dept.

Data Storage

First time accepted submitter g7a writes "I've been given the task of testing new hardware for use in our servers. For memory, I can run it through things such as memtest for a few days to ascertain whether there are any issues with the new memory. However, I've hit a bit of a brick wall when it comes to testing hard disks; there seems to be no definitive method for doing so. Aside from the obvious S.M.A.R.T. tests (i.e. the long offline test), are there any systems out there for testing hard disks to a level similar to that of memtest? Or any tried and tested methods for testing storage media?"

SpinRite (2, Informative)

alphax45 (675119) | more than 2 years ago | (#39562185)

Re:SpinRite (2, Informative)

Anonymous Coward | more than 2 years ago | (#39562297)

Does their product actually do anything these days? Seems like the last time I used it was when you had the choice of an ARLL or RLL disk controller, haha...

Anyway, I always stress test my drives with IOMeter. Leave them burning on lots of random IOPs to stress test the head positioners, and don't forget to power cycle them a good number of times. Spin-up is when most drive motors will fail and when the largest in-rush current occurs.

Re:SpinRite (2, Informative)

alphax45 (675119) | more than 2 years ago | (#39562317)

Still works 100%, as HDD tech is still the same - just don't use it on SSDs.

Re:SpinRite (4, Informative)

SuperTechnoNerd (964528) | more than 2 years ago | (#39562855)

SpinRite is not that meaningful these days, since drives don't give you the low-level access to the media that they did in the days of old. Since you can't low-level format drives, which was one of SpinRite's strong points, save the money and use badblocks. Use badblocks in read/write mode with a random or worst-case test pattern a few times, then do a SMART long self-test. Keep an eye on the pending sector count and the reallocated sector count; a drive can remap a difficult sector without you ever knowing unless you look there. Also keep an eye on drive temperature, since even a new drive can act flaky if it gets too hot.
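
A minimal sketch of that badblocks-plus-SMART routine, assuming smartmontools and e2fsprogs are installed. The device path /dev/sdX is a placeholder, and the -w write test destroys everything on the drive:

#!/bin/bash
# Destructive burn-in: badblocks write pass with a random pattern,
# then a SMART long self-test. WIPES the target drive.
set -eu
DEV=/dev/sdX

echo "== SMART counters before =="
smartctl -A "$DEV" | grep -E 'Reallocated_Sector|Current_Pending_Sector' || true

badblocks -wsv -t random "$DEV"      # -w is the destructive write-mode test

smartctl -t long "$DEV"              # start the long offline self-test
sleep 4h                             # rough guess; 'smartctl -c' shows the drive's own estimate
smartctl -l selftest "$DEV"          # self-test results

echo "== SMART counters after =="
smartctl -A "$DEV" | grep -E 'Reallocated_Sector|Current_Pending_Sector' || true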

Re:SpinRite (2)

yacc143 (975862) | more than 2 years ago | (#39563159)

LOL, RLL hard discs had capacities that, by today's standards, sit somewhere between the CPU cache size and the RAM size of an average smartphone.
(My first PC had a 20MB HDD)

Put simply, a modern HDD is about as similar to an RLL HDD as a Cadillac is to a trike.

Re:SpinRite (1)

rickb928 (945187) | more than 2 years ago | (#39562591)

SpinRite is excellent for testing. If your drives run as hot as the old Hitachi drives did, it doubles as a space heater or makeshift stove.

Seriously, SpinRite will exercise a drive very well indeed. And it will tell you more than the manufacturer wanted you to know.

Re:SpinRite (1)

NIN1385 (760712) | more than 2 years ago | (#39562311)

SpinRite is a good one; I've used it.

Re:SpinRite (0)

Anonymous Coward | more than 2 years ago | (#39562459)

Isn't that a data recovery program? I recommend GWScan, based on WD Data Lifeguard Tools; it supports all drives.

Re:SpinRite (2)

0p7imu5_P2im3 (973979) | more than 2 years ago | (#39562675)

While SpinRite does a good job of recovering data temporarily on bad drives, its intended use is to exercise the drive's SMART controller so that it will check the drive for problems more often, thus moving data off bad sectors before they fail completely. This has the fortunate side effect of reporting whether a drive is past its stable use lifetime, as well as other basic statistics regarding normal drive use.

Re:SpinRite (1)

cpu6502 (1960974) | more than 2 years ago | (#39562503)

What about motor failure? My last drive became inaccessible when the motor stopped spinning (6 months continuous spin, followed by power failure, followed by no spin).

Re:SpinRite (5, Funny)

NEDHead (1651195) | more than 2 years ago | (#39562681)

Did you restore power?

Re:SpinRite (1)

intellitech (1912116) | more than 2 years ago | (#39562631)

SpinRite is good. I've also used Hitachi's DFT (extensively) and the PC-Check Suite (used while working for Geek Squad), which has a really nice stress testing routine for hard disks.

Re:SpinRite (1)

CanHasDIY (1672858) | more than 2 years ago | (#39562801)

Anything coming from GRC is, IMO, awesome by default.

cheese (-1)

Anonymous Coward | more than 2 years ago | (#39562189)

first?

got it in one (4, Funny)

Quiet_Desperation (858215) | more than 2 years ago | (#39562219)

I've hit a bit of a brick wall when it comes to testing hard disks

Have you tried throwing them against the brick wall?

Re:got it in one (1)

K. S. Kyosuke (729550) | more than 2 years ago | (#39562263)

I think he means that he found so many bricks between the purchased disks that he can build a wall out of them.

mhdd (0)

Anonymous Coward | more than 2 years ago | (#39562231)

MHDD will test each sector and the time it takes to access it, and you can blacklist weak/slow sectors. About the best I know of for disk integrity.

Why? (5, Insightful)

headhot (137860) | more than 2 years ago | (#39562233)

Even if your storage passes the test, it could fail the next day. What you should be doing is designing your storage to gracefully handle failure, like RAID 5 with spares.

Re:Why? (1)

Shikaku (1129753) | more than 2 years ago | (#39562261)

The point is to know whether it's faulty now, at the time of arrival, rather than 2 weeks down the line when it becomes a problem.

Re:Why? (4, Insightful)

Shagg (99693) | more than 2 years ago | (#39562319)

No, the point is to design your system so that if it fails 2 weeks down the line... it isn't a problem.

Re:Why? (2)

na1led (1030470) | more than 2 years ago | (#39562423)

I agree. You can't test when a piece of hardware is going to fail. I've purchased many Hard Drives for our servers, sometimes they last years, sometimes they fail after a few weeks. There is no way to tell.

Re:Why? (1)

Sir_Sri (199544) | more than 2 years ago | (#39562537)

That isn't really true. Lots of hard drives have various states of failure, and you might still be able to write data to a drive even if it has SMART errors. There isn't a universal way to tell if a drive is going to permanently die.

A classic example is a hard drive 'clicking'. The read head is contacting something intermittently, but the drive may still appear to work. You want to get that data off and onto another drive ASAP. Now, if you get a drive out of the box like that, there's no point in even putting it into a machine only to have to deal with it later. Unfortunately, lots of problems can't be noticed with a visual or audio inspection.

You should still be prepared to recover when, inevitably, drives will fail without warning, but that's not the same as verifying equipment before it fails. It's also the same problem as trying to figure out if a drive in a raid 5 has actually failed (or otherwise has a physical problem) or if there is something wrong with your software.

Re:Why? (1)

ByOhTek (1181381) | more than 2 years ago | (#39562433)

I think both are important.

If you have the time to test now, it will save you the hassle of swapping it out later.

Re:Why? (4, Insightful)

Joce640k (829181) | more than 2 years ago | (#39562533)

Point is: You can't 'test'.

You can only tell if it's working, not when it's about to fail.

  If people could predict when hard drives were going to fail we wouldn't need RAID or backups.

Re:Why? (2, Informative)

jeffmeden (135043) | more than 2 years ago | (#39562511)

Hard drives, amazingly, are tested pretty effectively before leaving the factory. During tests in a controlled environment it was demonstrated that hard drives show no "failure curve" at onset, but follow a very boring, linear progression throughout their lifespan. The result: if you don't screw up when you install it, you have little to worry about on day 1 that is different from day 1000, beyond the cold reality that all mechanical devices will eventually fail.

Cue the "but I have seen so many DOA drives from XYZcorp..." and to that I will pre-retort with this: if you buy a quality drive (i.e. not a refurb or one specifically designed as a consumer throwaway) from a vendor that takes some care in shipping and handling, then no, you did not stumble on "the conspiracy of XYZcorp's bad drives". The weakest link was you. Try wearing a static strap next time.

Re:Why? (1)

TheRealMindChild (743925) | more than 2 years ago | (#39562645)

A plastic strap won't save you from the drive head failing to move. I've seen this happen when a bunch of unemployed temp workers unload the truck. This is why it seems "batches" of similar drives fail if you are getting them from the same source... some asshole was throwing and kicking the boxes around.

Re:Why? (2, Insightful)

jeffmeden (135043) | more than 2 years ago | (#39562749)

A plastic strap won't save you from the drive head failing to move. I've seen this happen when a bunch of unemployed temp workers unload the truck. This is why it seems "batches" of similar drives fail if you are getting them from the same source... some asshole was throwing and kicking the boxes around.

If your static strap is made of (all) plastic, then you will have issues beyond shipping and handling woes...

Re:Why? (2)

K. S. Kyosuke (729550) | more than 2 years ago | (#39562933)

During tests in a controlled environment it was demonstrated that hard drives show no "failure curve" at onset, but follow a very boring, linear progression throughout their lifespan.

Interesting. Didn't the Google study on disk reliability [slashdot.org] show a distinct infant-mortality spike at the beginning, with the lowest failure rate between 1 and 2 years of age, followed by a sharp increase after 2 years that quickly reaches a plateau? What you describe seems to be quite different.

Re:Why? (1)

NIN1385 (760712) | more than 2 years ago | (#39562359)

Yeah, best advice for any data storage:

BACKUP, then back that up... then back that up... then back that up offsite.

Re:Why? (1)

ColdWetDog (752185) | more than 2 years ago | (#39562409)

Yo Dawg....

No, no. Won't do that.

Re:Why? (1)

NIN1385 (760712) | more than 2 years ago | (#39562441)

Haha, I was thinking the same shit while I was typing that. That and an older rap song about backing that ass up.

Re:Why? (5, Insightful)

gregmac (629064) | more than 2 years ago | (#39562379)

Even if your storage passes the test, it could fail the next day. What you should be doing is designing your storage to gracefully handle failure, like RAID 5 with spares.

And then what you should test is that it actually notifies you when something does fail, so you know about it and can fix it. You can also test how long it takes to rebuild the array after replacing a disk, and how much performance degradation there is while that is happening.
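
For Linux md RAID, a hedged sketch of that kind of failure drill might look like the following (array and member device names are placeholders; run it against a test array or during a maintenance window, not against unprotected production data):

# 1. Simulate a failure and confirm monitoring/alerting actually fires.
mdadm /dev/md0 --fail /dev/sdc1

# 2. Swap the "failed" member back in and time the rebuild.
mdadm /dev/md0 --remove /dev/sdc1
start=$(date +%s)
mdadm /dev/md0 --add /dev/sdc1
sleep 5
while grep -q recovery /proc/mdstat; do sleep 60; done
echo "rebuild took $(( $(date +%s) - start )) seconds"

# 3. Run your usual benchmark against the array during step 2 to see
#    how much performance degrades while the rebuild is in progress.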

Re:Why? (1)

Copperhamster (1031604) | more than 2 years ago | (#39563051)

Even better: RAID 6, with a hot spare, a cold spare on the shelf, and a unit that supports regular self-consistency checking and automatic failure notification. My primary NAS even has a wireable relay trigger, which is hooked up to turn on a $20 spinning red light (old cop car style) sitting on top of the cabinet when there's an alert.

It's also powered by two UPSes (one for each power supply) and supports network-controlled shutdown on both.

If you can, order the drive packs (we got 2 packs of 4) from different vendors to minimize the chance of getting the same 'lot' of drives. Look at the amount of storage you need and get the minimum-size drives; because they rebuild faster, you are at less risk of a multi-drive failure. I'd much rather have 12x 1TB drives than 4x 3TB drives.

(And if you scoff at Raid 6, I've had a second drive fail hard during the rebuild when the system detected a probable failure on a drive and started to rebuild with the hot spare at 3 am...)

Also, backup backup backup.

If you need speed, of course, you want RAID 1+0. That's fine; my rule of thumb is:
Start with one hot spare, one cold spare.
After each 3rd mirror pair, add another hot spare. (so 6 total in use drives needs 2 hot spares)
Add another cold spare after every other hot spare.

Cold spares should be testable and tested. I will swap them out with the hot spares once a month.
But I'm paranoid.
Also:
Backup Backup Backup. RINB (RAID is not backup).

Also: HARDWARE RAID CARDS.

I can't stress that enough. Software and semi-software RAID is a joke.

Re:Why? (0)

Anonymous Coward | more than 2 years ago | (#39562429)

I worked for a 911 emergency response centre. We had redundant RAID arrays, one configured as RAID 0+1 and the other configured as RAID 5. Whichever array's wear pattern wears its drives out first will fail first (and we will have graceful failover). Then, when the other array fails, the first will already be fixed and able to fail gracefully there too.

Re:Why? (0)

Anonymous Coward | more than 2 years ago | (#39562485)

RAID5 is toast. This was true in 2009 and is definitely more true now with 2TB-4TB size disks.

https://www.zdnet.com/blog/storage/why-raid-5-stops-working-in-2009/162

Use RAID6 with one hotspare. And a proper backup solution. RAID is not backup.

Re:Why? (1)

CSMoran (1577071) | more than 2 years ago | (#39562523)

Even if your storage passes the test, it could fail the next day.

Similarly with medical checkups. Why bother, when you can get cancer the next day.

Sarcasm aside, screening is not meant to guarantee lack of failure, but rather allow you to sort out clearly defective hardware.

Re:Why? (1)

Anonymous Coward | more than 2 years ago | (#39562633)

That's a decent textbook answer but in reality it's not quite that simple. The failure rate of hardware over time is not linear. There's a higher probability of failure in the beginning of a device's average lifetime than in the middle.

For example, railway systems are highly failsafe and redundant by design. Yet they "burn in" equipment like light bulbs for signals, i.e. they let them run for a few hundred hours in some warehouse before they are put into signals on the tracks. By doing so, they weed out parts that are more likely to cause maintenance overhead later on.

It's better to identify defective hardware before you put it into production systems.

Re:Why? (3, Insightful)

windcask (1795642) | more than 2 years ago | (#39562817)

Even if your storage passes the test, it could fail the next day. What you should be doing is designing your storage to gracefully handle failure, like RAID 5 with spares.

All RAID 5 does is move the single point of failure from the disk itself to the RAID controller, which could also fail at any time. This is why a truly effective solution is virtual machine redundancy with seamless failover and a rigorous backup schedule.

Re:Why? (0)

Anonymous Coward | more than 2 years ago | (#39563145)

A lot of responses to this short-sighted posting assume we're all doing the same job. I send hardware to remote locations where servicing is difficult. Yes, we use RAID 6 and we have a full-up spare (and LTO backups, etc.), but we still test drives before shipping, because we're not idiots. We do find drives with trouble (sometimes before they go bad) and replace them this way. Proper design can certainly include testing, even if a bunch of Slashdot know-it-alls think otherwise.

Hard Drive Testing (2, Informative)

Anonymous Coward | more than 2 years ago | (#39562247)

In previous jobs, I've used this system:
Full format, verify, erase, then a drive fitness test.
If there are errors in the media, the format, verify, or erase will pick them up; the fitness test then checks the hardware.
Hitachi has a Drive Fitness Test program.
I have also used hddllf (hddguru.com).

S.M.A.R.T. (1)

NIN1385 (760712) | more than 2 years ago | (#39562249)

It's a joke. I've seen drives work fine for years with it showing imminent drive failure and I've seen drives die instantly with no warning given whatsoever.

There is no one perfect tool I could point to; each drive manufacturer makes their own, and there are numerous third-party tools out there as well. My best advice is to have them all and have them handy. One I use quite a bit is HDD Regenerator; it's a pretty thorough utility, but it takes some time to run.

Re:S.M.A.R.T. (1)

NIN1385 (760712) | more than 2 years ago | (#39562287)

Edit: HDD Regen is aimed more at repair, just FYI.

Re:S.M.A.R.T. (4, Informative)

DigiShaman (671371) | more than 2 years ago | (#39562445)

S.M.A.R.T. is a joke, but not in implementation. It's a joke because most HDD failures occur on the logic board. It's a known fix in data recovery services to simply swap out the PCB for another of the same vintage make/model/firmware rev. That said, I have run tools such as HD Tune to view out-of-spec metrics and benchmarks. For example, I once had a user report that her workstation was running extremely slowly. I suspected the drive was at fault and the graphs proved it, but technically it wasn't a failure. S.M.A.R.T. would have flagged it if it had been mechanical, but it wouldn't have if it was a controller issue. Now that may have changed with newer drives, but that's been my overall experience.

Re:S.M.A.R.T. (1)

greed (112493) | more than 2 years ago | (#39562643)

I've seen masses of cabling issues that won't be reported by SMART, either.

The symptom, at least on Linux, is logs full of stuff like this:

kernel: ata12.01: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x6
kernel: ata12.01: irq_stat 0x03060002, device error via SDB FIS
kernel: ata12.01: failed command: READ FPDMA QUEUED
kernel: ata12.01: cmd 60/40:00:92:6d:06/00:00:07:00:00/40 tag 0 ncq 32768 in
kernel: res 41/84:40:92:6d:06/00:00:07:00:00/00 Emask 0x410 (ATA bus error)
kernel: ata12.01: status: { DRDY ERR }
kernel: ata12.01: error: { ICRC ABRT }

That one is actually a dodgy port replicator board--the drives never see the garbled command packets, so their CRC error count never moves.

A consistent comm problem to the drive itself should result in at least some of the SMART counters moving, but they will NOT fail out the drive, because there is no reliable evidence it is a drive problem. For those, re-seat the SATA/SAS cables, re-seat the HBA in the PCI/PCIe slot, replace the SATA/SAS cables, replace the HBA, replace the drive, in about that order. There are a lot of crappy cables on the market, and quality is independent of retail price.

(I'd recommend having a test rig with a couple of different HBAs so you can determine which part is giving you grief; motherboard and a cheap PCI card is usually enough variety.)

Re:S.M.A.R.T. (1)

ethan0 (746390) | more than 2 years ago | (#39562477)

SMART is good for telling you when your drives do have problems that need addressing. It's not so great for giving you assurance that your drives do not have problems; consider a positive SMART result to be more of an "I don't know" than a "good". You should generally assume your drives can fail at any time. I don't think there's any way to reliably predict the sudden death of a drive.

scsi (2)

rs79 (71822) | more than 2 years ago | (#39562255)

don't use consumer drives if you're concerned.

see also http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/archive/disk_failures.pdf [googleusercontent.com]

The Goog wrote a nice paper on hard drives.

Re:scsi (0)

na1led (1030470) | more than 2 years ago | (#39562465)

That's total BS. SCSI hard drives can fail just as often as a cheap IDE drive. Anything mechanical is prone to failure.

Re:scsi (1)

pak9rabid (1011935) | more than 2 years ago | (#39562881)

Yes, but consumer-level drives are more prone to failure than their enterprise counterparts. It's a known fact that enterprise-level hard drives are built more reliably. If you don't believe me, then check this out [intel.com] .

However, with proper redundancy one can still get away with using consumer-level drives with an acceptable level of risk.

Re:scsi (1)

bacon.frankfurter (2584789) | more than 2 years ago | (#39562495)

Error 404 (Not Found)!!1

Google

404. That’s an error.

The requested URL /external_content/untrusted_dlcp/research.google.com/en/archive/disk_failures.pdf was not found on this server. That’s all we know.

Re:scsi (3, Insightful)

Galactic Dominator (944134) | more than 2 years ago | (#39562627)

Perhaps it was an honest mistake, but the link is broken. Second, evidence has shown SATA drives are more reliable than commercial/enterprise-grade drives. Only buy those if you don't like your money, or if there is some clear advantage. That supposed advantage is not reliability, unless there is some sort of rapid replacement mechanism coming with the drive. Although replacement isn't reliability in my book.

http://lwn.net/Articles/237924/ [lwn.net]

Re:scsi (1)

Trubacca (941152) | more than 2 years ago | (#39562971)

Seems like sound advice. Thank you for tracking down and providing a functional link, it was a good read. Your post would have received a mod point if I had any!

standard tools (1)

Anonymous Coward | more than 2 years ago | (#39562265)

I look at the SMART data, then I run "fsck -f -c" to test all blocks on the drive, then I look at the SMART data again to see if there have been any read errors or remapped sectors. Next, I run dd if=/dev/zero of=/dev/sdx (where sdx is the new drive) to write all sectors. I look at the SMART data again, and repeat the fsck/dd commands as many times as I need. This can easily be scripted, and you can do some random writing as well to exercise the drive's seek characteristics.
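
A hedged sketch of that loop (device names are placeholders, the whole thing is destructive, and the order is shuffled slightly so each pass has a scratch filesystem for fsck to scan):

#!/bin/bash
# Destructive: repeatedly zero the drive, build a scratch ext filesystem,
# run a badblocks scan via fsck, and watch the SMART error counters.
set -eu
DEV=/dev/sdX     # placeholder -- the new drive
PASSES=3

for i in $(seq "$PASSES"); do
    echo "== pass $i: SMART counters =="
    smartctl -A "$DEV" | grep -E 'Reallocated|Pending|Uncorrectable' || true

    dd if=/dev/zero of="$DEV" bs=1M || true   # dd exits non-zero when the device fills up
    mke2fs -F "$DEV"                          # scratch filesystem so fsck has something to check
    fsck -f -c "$DEV"                         # -c runs a read-only badblocks scan
done

smartctl -A "$DEV"                            # final report to compare against the first pass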

ZFS and a stress test (0)

Anonymous Coward | more than 2 years ago | (#39562273)

ZFS is focused mainly on integrity, so just set copies to 10, with checksumming, and stress it by filling it up with files, an occasional scrub, and so forth. If there's a problem, ZFS will report it.
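
A rough sketch of that approach (pool and dataset names are placeholders; note that the ZFS copies property actually tops out at 3):

zpool create testpool /dev/sdX                    # scratch pool on the drive under test
zfs create -o copies=3 -o checksum=on testpool/burnin

# Fill it with data, then scrub and look for checksum errors.
dd if=/dev/urandom of=/testpool/burnin/fill bs=1M || true   # runs until the pool is full
zpool scrub testpool
zpool status -v testpool                          # READ/WRITE/CKSUM columns should stay at 0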

Memtest is fine (1)

StoutFiles (2471680) | more than 2 years ago | (#39562289)

Then have ghosting software auto backup periodically.

Testing it for what exactly (0)

jeffmeden (135043) | more than 2 years ago | (#39562295)

You can test that the drive works pretty easily, put it in a PC, copy a bunch of files to it (perhaps enough to fill it up), then run MD5 on those files vs the originals. That would be the "pedantic" way to test it, for "turbo-pedantic" (a bit like running memtest for 72 hours) you can test this way for your entire MP3 collection, then test again for your entire Quantum Leap upscaled 720p dvdrips collection.

For more practical testing, most drive manufacturers offer "validation" software tools for RMA purposes to test low-level operations and performance, and most of them are generic to the extent that you can actually test any make of drive the same way. It's free and it works, what more could you ask for?
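
The "pedantic" copy-and-compare pass described above might look something like this in shell (paths are placeholders):

SRC=/data/reference      # known-good copy of the files
DST=/mnt/newdisk         # mount point of the drive under test

cp -a "$SRC"/. "$DST"/
( cd "$SRC" && find . -type f -exec md5sum {} + ) > /tmp/src.md5
( cd "$DST" && md5sum -c /tmp/src.md5 ) | grep -v ': OK$' || echo "all checksums match"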

Re:Testing it for what exactly (0)

Anonymous Coward | more than 2 years ago | (#39563069)

That's crazy. You don't upscale while ripping, you do it during playback.

Jet Stress (2)

jader3rd (2222716) | more than 2 years ago | (#39562299)

Jet Stress [microsoft.com] does a good job of running the storage media through a lot of work.

Don't waste your time (1)

colin_faber (1083673) | more than 2 years ago | (#39562303)

If you're concerned about drive performance and reliability don't waste your time on off-the-shelf junk. Buy actual enterprise class drives from distributors which pay many dollars to have each and every drive tested for both performance and reliability in varying environmental conditions.

Try Hitachi Drive Fitnes Test (DFT) (0)

Anonymous Coward | more than 2 years ago | (#39562307)

It has several levels of testing, including a full-blown exerciser. I've found it very effective for detecting the slightest drive problems. It's available for download from multiple sources.

The usual (5, Informative)

macemoneta (154740) | more than 2 years ago | (#39562309)

All I usually do is:

1. smartctl -AH
Get an initial baseline report.

2. mke2fs -c -c
Perform a read/write test on the drive.

3. smartctl -AH
Get a final report to compare to the initial report.

If the drive remains healthy, and error counters aren't incrementing between the smartctl reports, it's good to go.
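
Spelled out against a concrete (placeholder) device, that sequence is just the following; note that mke2fs wipes whatever is on the partition:

smartctl -AH /dev/sdX           # 1. baseline health and attribute report
mke2fs -c -c /dev/sdX1          # 2. create a filesystem with the slower read/write badblocks test
smartctl -AH /dev/sdX           # 3. final report -- error counters should not have moved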

endless loop of bonnie++ (-1, Flamebait)

vlm (69642) | more than 2 years ago | (#39562313)

Endless loop of bonnie++

I suppose not knowing bonnie++ means it's newbie sysadmin day here. So an endless loop script looks a heck of a lot like this:

rm -Rf /

Oh no wait just kidding about that one. I meant to type (hopefully no typos) :

#!/bin/bash
# the name of this file is endlessbonnietest.sh and it had chmod a+x endlessbonnietest.sh run on it
while [ 42 = 42 ]
do
    bonnie++ "whole bunch of interesting b++ options go here without quotes obviously"
done

Obviously this works a hell of a lot better after running apt-get install bonnie++ and if for some godforsaken reason you are bashless, apt-get install bash

thats all folks!

Use it (0)

W1sdOm_tOOth (1152881) | more than 2 years ago | (#39562321)

That's it...

Take a hint (0)

Anonymous Coward | more than 2 years ago | (#39562331)

Disappear for 4 months and come back and say they are good. Even if you test there is no reason that hardware can't fail at any point after the test. That's why we buy redundancy and support.

Hitachi DFT (0)

Anonymous Coward | more than 2 years ago | (#39562347)

I've always had good results with the Hitachi Drive Fitness Test. Works fine with non-Hitachi drives too.

This is what I use (3, Interesting)

Wolfrider (856) | more than 2 years ago | (#39562363)

root ~/bin # cat scandisk
#!/bin/bash

# RW scan of HD
argg='/dev/'$1

# if IDE (old kernels)
hdparm -c1 -d1 -u1 $argg

# Speedup I/O - also good for USB disks
blockdev --setra 16384 $argg
blockdev --getra $argg

#time badblocks -f -c 20480 -n -s -v $argg
#time badblocks -f -c 16384 -n -s -v $argg
time badblocks -f -c 10240 -n -s -v $argg

exit;

---------

Note that this reads existing content on the drive, writes a randomized pattern, reads it back, and writes the original content back. With modern high-capacity over-500GB drives, you should plan on leaving this running overnight. You can do this from pretty much any linux livecd, AFAIK. If running your own distro, you can monitor the disk I/O with ' iostat -k 5 '.

From ' man badblocks '
-n Use non-destructive read-write mode. By default only a non-destructive read-only test is done. This option must not be combined with the -w option, as they are mutually exclusive.
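
For reference, the script above takes the bare device name as its only argument and prepends /dev/ itself, so a run against a hypothetical second disk would look like:

./scandisk sdb      # device name only, no /dev/ prefix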

QuickTech Pro (0)

Anonymous Coward | more than 2 years ago | (#39562365)

http://www.uxd.com/qtpro.shtml

This is, by far, the best hard drive (and hardware) testing suite I've ever used.

Do a Surface Scan (1)

na1led (1030470) | more than 2 years ago | (#39562391)

There are many free tools for doing a surface scan of a hard drive that test for bad sectors. Usually it's bad sectors that cause hard drives to fail, and that's all you can really test for anyway. Hard drives will fail; it's just a matter of time, and that's why you need redundancy. Other than testing, keeping the system cool and dust-free is all you can do.

It isn't worth it. (0)

Anonymous Coward | more than 2 years ago | (#39562405)

Modern failure modes tend to be catastrophic; you won't find bad sectors on a hard drive these days. The drives have so much error correction and sector re-mapping that the very act of writing to a bad portion of the platter will silently correct and remap the sector. The main way you see failures is to write data, not read it for a *long* time, and then get a read failure. Plus, the initial part of the bathtub curve is measured in months, not days, so testing for reliability is really not something you can do.

Hard Disk Sentinel (3, Insightful)

prestonmichaelh (773400) | more than 2 years ago | (#39562413)

Hard Disk Sentinel: http://www.hdsentinel.com/ [hdsentinel.com] is a great tool. They even have a free Linux client. What it does over SMART is take the SMART data and weight it according to indications of failure, then give you a score of 0-100 (100 being great, 0 being dead) for how healthy the drive is. We use this extensively and have created NAGIOS scripts that monitor the output. If a drive has a score of 65 or higher, I will generally continue using it (pretty much all my setups are RAID 10 or RAID 6). If the score starts dropping rapidly (a few points every day, even if it started high) or gets below 65 or so, I go ahead and replace it. It has helped out a bunch.

Even with that, using the SMART data in a smart way still only predicts about 30% of failures. The other 70% will come out of nowhere. That is why it is best to assume all drives are suspect and can die at any time, and to never allow a single drive to be the sole copy of anything.

Think Performance - IOZONE (1)

humphrm (18130) | more than 2 years ago | (#39562419)

When it comes to media, even with SMART your drives will work 'till they die, and there's no way to predict that with a test.

Given that, your best option is to ensure that the drives are performing as expected. I've found many a faulty drive with IOZONE.

http://www.iozone.org/ [iozone.org]
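
As a hedged one-liner (mount point and size cap are placeholders; check the iozone man page before relying on the exact flags):

iozone -a -g 4g -f /mnt/newdisk/iozone.tmp      # -a auto mode, -g caps the max file size, -f sets the test file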

old timers look here (1, Interesting)

vlm (69642) | more than 2 years ago | (#39562427)

OK so that was the noob version of the question.

I have a question for the old timers. has anyone ever implemented something like:
1) log the time and temp
2) do a run of bonnie++ or a huge dd command
3) log the time and temp
4) Repeat above about ten times
5) numerical differentiation of time and temp and also any "overtemps"

In theory, run from a cold or lukewarm start, that could detect a drive drawing "too much" current or otherwise being F'd up, or a cooling fan malfunction.
I'm specifically looking for rate of temp increase as in watts expended, not just static workload temp.
In practice it might be a complete waste of time.

Another one might be something like a smart reported temp vs iostat reported usage plotted on a scatterplot graph.

So the old timer question is has anyone ever bothered to implement this, and if so, did it do anything useful other than pad your billable hours?
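
To answer with a sketch rather than billable hours: something like the following would log the temperature deltas around each run (device, run count, and load size are placeholders; it assumes the drive exposes a Temperature_Celsius attribute via smartctl, which not every drive does):

#!/bin/bash
# Log time and drive temperature before/after each load run, for later
# numerical differentiation (spreadsheet, gnuplot, etc.).
DEV=/dev/sdX
LOG=/tmp/temp_vs_load.csv

temp() { smartctl -A "$DEV" | awk '/Temperature_Celsius/ {print $10}'; }

echo "epoch,temp_c,phase" > "$LOG"
for run in $(seq 10); do
    echo "$(date +%s),$(temp),before" >> "$LOG"
    dd if="$DEV" of=/dev/null bs=1M count=20000    # ~20 GB sequential read as the workload
    echo "$(date +%s),$(temp),after" >> "$LOG"
done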

Re:old timers look here (1)

na1led (1030470) | more than 2 years ago | (#39562637)

Seems like a big waste of time to me.

Re:old timers look here (1)

prestonmichaelh (773400) | more than 2 years ago | (#39562913)

If the temps are in "operating ranges" which run higher than you might think (check with the hard drive manufacturer for specs), temperature doesn't correlate to drive failure:

http://blog.backblaze.com/2011/07/20/petabytes-on-a-budget-v2-0revealing-more-secrets/#more-337 [backblaze.com]
Look for the "lessons learned" section in that link.

Re:old timers look here (1)

Mister Liberty (769145) | more than 2 years ago | (#39563097)

Sounds interesting. For the oldtimer you claim to be, you really should have performed said tests and analyses a long time ago already.

Are you testing an array or individual drives? (4, Insightful)

HockeyPuck (141947) | more than 2 years ago | (#39562473)

I manage a team that oversees PB of disk, both within enterprise arrays and internal to the server. For testing the arrays, since there's GB of cache in front of the disks, I can only rely on the vendor to do the appropriate post-installation testing to make sure there are no DOA disks. For internal disks, as others have mentioned, you could run IOMeter for days without a problem and then the very next day the drive is dead. Unlike memory, disks have moving parts that can fail much more easily than chips. However, with proper precautions like RAID, single disk failures can be tolerated.

The bigger problem is having a double disk failure. This is due to the amount of time required to rebuild the failed disk. Back when disks were 100GB this was a "relatively" quick process. However, in some of my arrays with 3TB drives in them, it can take much longer to replace the drive, even to the point where having hot spares has been considered not worth it, as my array vendor will have a new disk in the array within 4 hours. With what an enterprise disk costs from the array vendor (not Fry's), it can start to add up.

Re:Are you testing an array or individual drives? (1)

na1led (1030470) | more than 2 years ago | (#39562983)

It depends on how critical your data is. There are many different types of RAID, like RAID 6, 10, or 50, and if you have a good storage unit with redundant controllers and a couple of hot spares, you should be all set. The chances of a total failure of multiple drives/controllers are very slim, and that's what nightly backups are for anyway. We use a Dell PowerVault MD3220 with dual controllers, dual power supplies, and RAID 50 with 2 hot spares. The chances of losing data from hard drive failure are like winning the lottery.

badblocks (0)

Anonymous Coward | more than 2 years ago | (#39562483)

I generally run 'badblocks' (included in most Linux distributions).

Reliability and fault-tolerance (5, Informative)

Mondragon (3537) | more than 2 years ago | (#39562493)

Not completely related to how to test, but...

In 2007 Google reported that, for a sample of 100k drives, only 60% of their drives with failures had ever encountered any SMART errors. Also, NetApp has reported a significant number of drives with temporary failures, such that they can be placed back into a pool after being taken offline for a period of time and wiped. Google also had a lot of other interesting things to say (such as that heat has no noticeable effect on hard drive life under 45C, that load is unrelated to failure rates, and that if a drive doesn't fail in the first 3 months, it's very unlikely to fail until the 2-3 year timeframe).

You can find the google paper here: http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/disk_failures.pdf [googleusercontent.com]

A few other notes that you can find from storage vendor tech notes if you own their arrays:
  * Enterprise-level SAS drives aren't any more reliable than consumer SATA drives
    - But they do have considerably different firmwares that assume they will be placed in an array, and thus have a completely different self-healing scheme than consumer-level drives (generally resulting in higher performance in failure scenarios)
  * RAID 5 is a really bad idea - correlated failures are much more likely than the math would indicate, especially with the rebuild times involved with today's huge drives
  * You have a lot more filesystem options that might not even make sense to use with a RAID system, like ZFS, as well as other mechanisms for distributing your data at a layer higher than the filesystem

Ultimately the reality is that regardless of the testing you put them under, hard drives will fail, and you need to design your production system around this fact. You *should* burn them in with constant read/write cycles for a couple days in order to identify those drives which are essentially DOA, but you shouldn't assume any drive that passes that process won't die tomorrow.
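
fio isn't mentioned in the thread, but as one possible way to script that kind of multi-day burn-in on Linux (the device path is a placeholder and the run is destructive to anything on it):

# Two days of 50/50 random read/write against the raw device, then a SMART check.
fio --name=burnin --filename=/dev/sdX --direct=1 --ioengine=libaio \
    --rw=randrw --rwmixread=50 --bs=4k --iodepth=32 \
    --time_based --runtime=172800 --group_reporting

smartctl -A /dev/sdX     # compare the error counters against a pre-test baseline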

StorageMojo (0)

Anonymous Coward | more than 2 years ago | (#39562531)

Read this for more info on disk storage

http://storagemojo.com/2007/02/26/netapp-weighs-in-on-disks/

rsync -nc (1)

patniemeyer (444913) | more than 2 years ago | (#39562545)

I mirror data and test it periodically with rsync using the dry-run (-n) and checksum options (-c) to do a full comparison. I usually have more confidence in a new disk after I've done this a few times.
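
For reference, such a dry-run checksum comparison of a mirror against the original might look like this (paths are placeholders):

rsync -ancv /data/original/ /mnt/mirror/    # -a archive, -n dry-run, -c checksum; lists anything that differs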

Lithophobia (1)

macraig (621737) | more than 2 years ago | (#39562561)

I have a favorite boulder that has served my burn-in testing needs pretty well. Would you like a photo so you can chisel your own? I added some LED bling to mine.

Can't even believe this made it to the front page. (1)

Ironhandx (1762146) | more than 2 years ago | (#39562567)

I mean, really, someone working at slashdot doesn't know this? This is about as basic a question as it gets when it comes to hardware.

Raid 5, a solid backup scheme, and a storage closet full of replacement drives. There is no good way to test HDDs.

Due entirely to the fact that they are a WEAR item it is only possible to decide which brand you trust the most.

Other than that, if its a big job and a lot of HDDs are going to be bought you could take 10 of each candidate drive and run them through spinrite till they fail. One that fails last wins.

SSDs are still too expensive to be using for mass storage, and should be treated the exact same way even if you ARE using them for mass storage, as they are also a wear item and will fail without warning after so many read/write operations, the same as an HDD.

Re:Can't even believe this made it to the front pa (0)

Anonymous Coward | more than 2 years ago | (#39562711)

Don't use RAID 5 (write hole).
Use RAID 10 or simple mirroring.
(RAID-Z should be OK.)
Use the money saved on two expensive RAID controllers (if one fails you are stuffed if you cannot get another, which is by no means guaranteed) to buy more disks.

ZFS (0)

Anonymous Coward | more than 2 years ago | (#39562585)

IIRC, ZFS supports online "data scrubbing": http://en.wikipedia.org/wiki/Data_scrubbing#In_RAID . This, combined with the other features of ZFS, can help you prevent data loss.

IOmeter (1)

jd2112 (1535857) | more than 2 years ago | (#39562597)

The storage team where I work uses a program called IOmeter. It runs on Windows and Linux (and I think other platforms as well) and is open source. They were using it for stress-testing SAN storage, but I think it would work for locally attached storage as well.

testing on linux: use spew (1)

David Muir Sharnoff (73602) | more than 2 years ago | (#39562609)

http://linux.die.net/man/1/spew [die.net]

While you can't predict against future failures, if you want to make sure that your drive media is okay today, there is a tool that will fill your disk with garbage and then verify that your disk has the right garbage on it: spew. Spew isn't the friendliest tool, but it does the job.

As a side effect, it stresses your I/O subsystem and memory. Years ago, I discovered that some Dell 2550s I had couldn't pass this test with the SATA controller I had shoved into them, which otherwise seemed to work fine.

UnRAID Preclear Script (3, Informative)

Jumperalex (185007) | more than 2 years ago | (#39562617)

http://lime-technology.com/forum/index.php?topic=2817.0 [lime-technology.com] ... the main feature of the script is
1. gets a SMART report
2. pre-reads the entire disk
3. writes zeros to the entire disk
4. sets the special signature recognized by unRAID
5. verifies the signature
6. post-reads the entire disk
7. optionally repeats the process for additional cycles (if you specified the "-c NN" option, where NN = a number from 1 to 20, default is to run 1 cycle)
8. gets a final SMART report
9. compares the SMART reports alerting you of differences.

Check it out. Its "original" purpose was to set the drive to all "0's" for easy insertion into a parity array (read: parity drive does not need to be updated if the new drive is all zeros) but it has also shown great utility as a stress test / burn-in tool to detect infant mortality and "force the issue" as far as satisfying the criteria needed for an RMA (read: sufficient reallocated block count)

If your skill level is enough to adapt the script to your own environment, then great; otherwise, unRAID Basic is free and allows 3 drives in the array, which should let you simultaneously pre-clear three drives. You might even be able to pre-clear more than that (up to available hardware slots), since you aren't technically dealing with the array at that point, but with enumerated hardware that the script has access to, which should be every disk in the system. Hardware requirements are minimal and it runs from flash.
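
Stripped of the unRAID-specific signature handling, the core of that preclear idea boils down to something like this hedged sketch (device path is a placeholder; the zero pass is destructive):

DEV=/dev/sdX
smartctl -A "$DEV" > /tmp/smart_before.txt                # baseline report

dd if=/dev/zero of="$DEV" bs=1M || true                   # write zeros to every sector
BYTES=$(blockdev --getsize64 "$DEV")
cmp -n "$BYTES" "$DEV" /dev/zero && echo "post-read OK: drive is all zeros"

smartctl -A "$DEV" > /tmp/smart_after.txt
diff /tmp/smart_before.txt /tmp/smart_after.txt           # watch reallocated/pending sector counts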

Storage Unit is more important than the Drives (1)

na1led (1030470) | more than 2 years ago | (#39562619)

A good storage unit will do a good job of maintaining the hard drives you purchased and keeping them safe. It can also handle problems with drives to prevent data loss, and notify you when a drive is about to fail. A good SAN or DAS is what I would purchase. We purchased a Dell PowerVault DAS and have been very happy with it; I never worry about hard drives failing because I know the DAS will take care of it. Some companies, like Dell and EMC, will know if you have a bad hard drive and ship you a new one before you realize it.

always good to do a full write with read verify (0)

Anonymous Coward | more than 2 years ago | (#39562687)

It's always good to do a full write with read verify on new media. For my own peace of mind, I wrote a Java application that fills a drive with pseudo-random data and then reads it back to make sure (1) the data is correct, and (2) the entire drive capacity can be accessed. Use this in addition to the many good hardware diagnostic tools (see other comments). As has been pointed out, this only tells you that the drive is working now, but can't predict when it will fail.

BLATANT ADVERTISEMENT: The Java program has been released under the GPL and can be found here (Linux, MacOS, Windows, etc): http://linux.softpedia.com/get/System/System-Administration/Erase-Disk-46749.shtml

vendors solved this problem years ago (1)

alen (225700) | more than 2 years ago | (#39562699)

We use HP servers, and HP ships a suite of software to install on the server along with the OS. It monitors the hardware and warns you of any problems. Unless you like doing things the hard way, this was solved years and years ago.

If I have a bad hard drive, I call HP, send them a log file, and in 2 hours I have a new one delivered.

Testing is expensive R.A.I.D. is cheap (0)

Anonymous Coward | more than 2 years ago | (#39562715)

Just RAID your storage, or better, connect to a SAN and be done with it.

Hitachi DFT (1)

mr.bri (886912) | more than 2 years ago | (#39562721)

Hitachi's (previously IBM's) Drive Fitness Test is the most thorough disk test I've used. It works on all makes, and has a "drive exerciser" that can loop a test sequence.

I've seen it find problems with drives that the manufacturer's own tools don't expose.

My policy is that if a drive survives 20 loops of the exerciser and then a full extended test, it's fit for production service.

Testing hardware by excercising (1)

cvtan (752695) | more than 2 years ago | (#39562831)

Testing hardware by exercising it is like testing matches to see if they are good.

H2TestW - in particular for (often fake) USB media (1)

D4C5CE (578304) | more than 2 years ago | (#39562917)

http://www.heise.de/download/h2testw.html [heise.de] - switchable to English of course.
While it is primarily advertised for flash media these days (and indispensable since there have been numerous forgeries or DOAs at least on the European market lately), it evolved as an HDD tester in the first place.

On Linux in particular, a combination of dd and smartctl (before and after writing the entire disk, as well as for self-tests) may come in handy too, of course.

Testing Drives. (1)

Rashkae (59673) | more than 2 years ago | (#39562937)

It takes a while, but here's what to do if you really want to be sure of your hardware (as sure as you can be, at least):

Check the SMART status. If there are any re-allocated sectors, make note of the number.

Run badblocks with the -w switch against the drive (from a Linux live cd of your choice, for example)

That should completely read/write test the drive 4 times with multiple patterns. There should be no errors reported. This test will take longer than overnight on modern drives.

Check the SMART data again. Be wary if there has been an increase in re-allocated sectors. This is considered normal and does not constitute drive failure. However, most drives should not have any reported re-allocations so early in life, and this may indicate you have a drive of marginal quality.

Do not try this on SSD drives.

hire someone who knows what they are doing.. (1)

issicus (2031176) | more than 2 years ago | (#39563067)

or read a book

Ears (3, Informative)

Maximum Prophet (716608) | more than 2 years ago | (#39563075)

Most everything above is good, but don't overlook the obvious. Spin the drive up in a quiet room and listen to it. If it sounds different from all the other drives like it, there's a good chance something is wrong.

I replaced the drive in my TiVo. The 1st replacement was so much louder, I swapped the original back, then put the new drive in a test rig. It started getting bad sectors in a few days. RMA'd it to Seagate, and the new one was much quieter.

Just sent it to me (1)

damn_registrars (1103043) | more than 2 years ago | (#39563167)

Send your drives to me, postage paid. I'll test them for you for no charge, and send them back to you before the warranty expires.