
Ask Slashdot: How Do SSDs Die?

timothy posted about 2 years ago | from the whimpery-bang dept.

Data Storage | 510 comments

First time accepted submitter kfsone writes "I've experienced, first-hand, some of the ways in which spindle disks die, but either I've yet to see an SSD die or I'm not looking in the right places. Most of my admin-type friends have theories on how an SSD dies but admit none of them has actually seen commercial grade drives die or deteriorate. In particular, the failure process seems like it should be more clinical than with spindle drives. If you have X of the same SSD model, none of them suffer manufacturing defects, and you repeat the same series of operations on them, they should all die around the same time. If that's correct, then what happens to SSDs in RAID? Either all your drives will start to fail together, or at some point your drives will become out of sync in terms of volume sizing. So, have you had to deliberately EOL corporate grade SSDs? Do they die with dignity or go out with a bang?"


CRC Errors (5, Informative)

Anonymous Coward | about 2 years ago | (#41670125)

I had 2 out of 5 SSDs (OCZ) fail with CRC errors; I'm guessing faulty cells.

Re:CRC Errors (3, Interesting)

Quakeulf (2650167) | about 2 years ago | (#41670255)

How big in terms of gigabytes were they? I have two disks from OCZ myself and they have been pretty fine so far. The biggest is 64 GB, the smallest is 32 GB. I was thinking of upgrading to a 256 GB SSD at some point, but not knowing what might kill it is something I honestly have not thought about, and I would like some input on it. My theory is that heat and a faulty power supply would play major roles in this, but I'm not so sure about physical impact, although to some extent that would break it too.

Re:CRC Errors (5, Informative)

Anonymous Coward | about 2 years ago | (#41670415)

OCZ has some pretty notorious QA issues with a few lines of their SSDs, especially if your firmware isn't brand spanking new at all times.

I'd google your drive info to see if yours are on death row. They seem a little small (old) for that, since I only know of problems with their more recent, bigger drives.

Re:CRC Errors (4, Interesting)

Synerg1y (2169962) | about 2 years ago | (#41670647)

OCZ makes several different product lines of SSDs, and each line has its own quirks, so generalizing about OCZ's QA issues isn't accurate. I've always had good luck with the Vertex 3s, both for myself and the people I install them for. I've had an SSD die once, and it looked identical to a spinning-disk failure from a chkdsk point of view. I can't remember what kind it was (either an OCZ or a Corsair), but I can name a ton of drives from both of those brands that are still going after 2-3+ years.

Re:CRC Errors (1)

Synerg1y (2169962) | about 2 years ago | (#41670667)

I actually meant the Agility 3. I can't say either way on the Vertexes, besides that they sometimes get fewer eggs on Newegg from reviews.

Re:CRC Errors (2)

AmiMoJo (196126) | about 2 years ago | (#41670571)

You could be more specific. Errors on reading or writing?

I have had a couple of SSDs die. The first was an Intel and ran out of spare capacity after about 18 months, resulting in write failures and occasional CRC errors on read. The other was an Adata and just died completely one day, made the BIOS hang for about five minutes before deciding nothing was connected.

Re:CRC Errors (2)

Anonymous Coward | about 2 years ago | (#41670575)

I've also had two OCZ drives die on me. The first to die was a 60 GB OCZ Agility, which actually didn't see that much usage compared to the 32 GB Agility that I had; it died a little after a year. The Vertex, which was older than the Agility, died about 6 months later. Both were RMA'd and now I have new ones; let's see how long they last. The reason I've heard for why they die is that the controllers crap out, not that the cells reach their write limit, which makes sense to me because when both of mine died I couldn't access them at all and the computer had problems detecting them.

Re:CRC Errors (0)

Anonymous Coward | about 2 years ago | (#41670703)

I'm guessing you're a Windows user. ERROR_CRC is what NTFS returns when it decides the disk is bad. I've also seen ERROR_FILE_CORRUPT in similar circumstances.

Die! (-1, Offtopic)

el_jake (22335) | about 2 years ago | (#41670135)

Death by tray it is!

Re:Die! (0, Offtopic)

Anonymous Coward | about 2 years ago | (#41670183)

I miss the days when people actually had something useful to add rather than constant lame attempts at humor.

Re:Die! (5, Funny)

Anonymous Coward | about 2 years ago | (#41670219)

Wow - you've been here a long long time then

Re:Die! (1, Insightful)

Quakeulf (2650167) | about 2 years ago | (#41670281)

I am new to commenting on /. and I think lame attempts at humor belong to 9GAG and Reddit.

Re:Die! (0)

Anonymous Coward | about 2 years ago | (#41670455)

You must be new ... oh, wait.

Re:Die! (-1)

dstyle5 (702493) | about 2 years ago | (#41670697)

Why so serious?!?!

Re:Die! (1)

Vrekais (1889284) | about 2 years ago | (#41670463)

Wish I had mod points so bad; Izzard references are always worthy of modding up in my book. If people don't want to read the humorous posts, that's what the mod system is for :D

Umm (4, Insightful)

The MAZZTer (911996) | about 2 years ago | (#41670151)

It was my understanding that for traditional drives in a RAID you don't want to get all the same type of drive all made around the same time since they will fail around the same time too. Same would apply to SSDs.

Re:Umm (-1)

Anonymous Coward | about 2 years ago | (#41670199)

yeah, sounds like submitter may be mildly deficient

Re:Umm (5, Insightful)

Anonymous Coward | about 2 years ago | (#41670293)

yeah, sounds like submitter may be mildly deficient

Which is why he's asking.

Fuck people who ask questions when they don't know something, right?

Re:Umm (5, Informative)

kelemvor4 (1980226) | about 2 years ago | (#41670315)

It was my understanding that for traditional drives in a RAID you don't want to get all the same type of drive all made around the same time since they will fail around the same time too. Same would apply to SSDs.

Never heard of that. I've got about 450 servers, each with a RAID 1 and RAID 10 array of physical disks. We always buy everything together, including all the disks. If one fails we get alerts from the monitoring software and get a technician to the site that night for a disk replacement. I think I've seen one incident in the past 14 years I've been in this department where more than one disk failed at a time.

My thought on buying them separately is that you run the risk of getting devices with different firmware levels or other manufacturer revisions which would be less than ideal when raided together. Not to mention you have a mess for warranty management. We replace systems (disks included) when the 4 year warranty expires.

Re:Umm (4, Informative)

StoneyMahoney (1488261) | about 2 years ago | (#41670347)

The rationale behind splitting the hard drives in a RAID across a number of manufacturing batches, even for identical drives, is to prevent a problem with an entire batch that slipped past QA from taking out an entire array of drives simultaneously.

I'm paranoid, but am I paranoid enough....?

Re:Umm (5, Insightful)

statusbar (314703) | about 2 years ago | (#41670431)

I've seen two instances where a drive failed. Each time there were no handy replacement drives. Within a week a second drive died the same way as the first! Back to backup tapes! Better to have replacement drives in boxes waiting.

Re:Umm (4, Insightful)

ByOhTek (1181381) | about 2 years ago | (#41670459)

In general, if you get such an issue, it will happen early on in the life of the drives. One coworker had what he called the 30-day thrash rule: he would plan ahead and get a huge number of drives (the cheapest available meeting requirements, avoiding manufacturers we'd had issues with previously), take a handful, and thrash 'em for 30 days. If nothing bad happened, he'd either keep up 30-day thrashes on sets of hard drives, pulling out the duds, or just return the whole lot.

Re:Umm (4, Interesting)

MightyMartian (840721) | about 2 years ago | (#41670465)

Too true. Years ago we bought batches of Seagate Atlas drives, and all of them pretty much started dying within weeks of each other. They were still under warranty, so we got a bunch more of the same drives, and lo and behold within nine months they were crapping out again. It absolutely amazed me how closely together the drives crapped out.

Re:Umm (3, Interesting)

Anonymous Coward | about 2 years ago | (#41670615)

Warranty replacement drives are refurbished, meaning they've already failed once. I've never had a refurb drive last a full year without failing. It's gotten bad enough that I don't bother sending them back for warranty replacement anymore.

Re:Umm (2)

Spazmania (174582) | about 2 years ago | (#41670509)

I lost a server once where the drive batch had a 60% failure rate after 6 months. Unless you're intentionally building the raid for performance (vice reliability), you definitely want to pull drives from as many different manufacturers and batches as you can.

Re:Umm (1)

DarwinSurvivor (1752106) | about 2 years ago | (#41670521)

I'm paranoid, but am I paranoid enough....?

Depends if you have backups.

Re:Umm (1)

Anonymous Coward | about 2 years ago | (#41670569)

I've seen it a few times myself over the past decade. The greatest example of this problem I saw was with a Sun StorEdge A3500 array which had sixty disks. There were three brands of disks used in that array: Seagate, IBM, and one other (can't recall at the moment). It was an almost even three way split. The array was used by a two node Sun Cluster running Oracle and SAP.

Out of nowhere the IBM drives started dying, almost one right after another. We would pull a drive, replace it, and almost as soon as the LUN it was part of finished rebuilding, another IBM drive failed, usually in a different LUN. Over the course of 48 hours we had close to 20 IBM drives fail.

It was determined that there was a bug in the firmware that caused those drives to die after a set amount of time. Any IBM drives that didn't die in that 48 hour window were replaced to avoid any more failures. Any other similar IBM drives we had in our datacenter had their firmware updated just as soon as a fix was released.

Fortunately due to good planning of how disks were spaced out between LUNs we didn't suffer an outage, but the cluster connected to that array suffered horrible performance that weekend. If the entire array used those drives we would have been screwed.

(note: this happened a long time ago, if my memory serves me correctly it was the IBM drives failing, but I could be wrong about that)

Re:Umm (1)

nine-times (778537) | about 2 years ago | (#41670623)

The idea is that, though you can't control all the variables of manufacturing, if you have a bunch of disks made with the same machinery at the same time, many of those variables will be the same and so they have an increased chance of failing at the same time-- especially since the increased activity of rebuilding a failed array can sometimes trigger additional failures.

So if you want to be safe, buying from different batches is preferable, even if you buy the same brand of functionally identical drives. I don't know if anyone does it anymore, but some manufacturers used to do this for you when you bought a new server with a big RAID.
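
To put rough numbers on that reasoning, here is a toy Python calculation comparing a RAID-1 pair from independent batches with a pair sharing a batch defect; the probabilities and the defect share are invented purely for illustration:

    # Toy illustration of why same-batch drives are riskier in a RAID-1 pair.
    # All numbers are invented; the point is the shape of the comparison.
    p_fail_in_window = 0.02   # hypothetical chance a drive dies during the rebuild window
    batch_defect_share = 0.5  # hypothetical fraction of failures caused by a defect shared by the batch

    # Independent drives: the second failure is just another roll of the dice.
    independent = p_fail_in_window

    # Same-batch drives: if the first failure came from the shared defect,
    # the surviving drive is assumed to follow during the rebuild.
    same_batch = batch_defect_share + (1 - batch_defect_share) * p_fail_in_window

    print(f"chance the mirror also dies during rebuild, independent batches: {independent:.1%}")
    print(f"chance the mirror also dies during rebuild, same batch:          {same_batch:.1%}")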

Re:Umm (1)

interval1066 (668936) | about 2 years ago | (#41670689)

Seems like if you buy and commission two identical drives they would fail at about the same time, but that's not my experience either. I guess the second law expresses itself in these scenarios with differing expiration times.

Re:Umm (1)

hawguy (1600213) | about 2 years ago | (#41670395)

It was my understanding that for traditional drives in a RAID you don't want to get all the same type of drive all made around the same time since they will fail around the same time too. Same would apply to SSDs.

I've heard the same, but judging from the serial numbers on the disks in our major-vendor storage array, they seem to use same-lot disks. Here are a few serial numbers from one disk shelf (partially obscured):

xx-xxxxxxxx4406
xx-xxxxxxxx4409
xx-xxxxxxxx4419
xx-xxxxxxxx4435
xx-xxxxxxxx4448
xx-xxxxxxxx4460
xx-xxxxxxxx4468

They look close enough to be from the same manufacturing lot. Unless the disk manufacturer randomizes disks before applying serial numbers when selling to this storage vendor. We do lose disks occasionally, but they seem to be spread out across different shelves (purchased at different times).

Re:Umm (0)

Anonymous Coward | about 2 years ago | (#41670559)

It was my understanding that for traditional drives in a RAID you don't want to get all the same type of drive all made around the same time since they will fail around the same time too. Same would apply to SSDs.

I imagine the portion of the wafer from which your chips are cut is more important than precisely when they were fabbed.

Re:Umm (0)

Anonymous Coward | about 2 years ago | (#41670589)

I tried that theory with tapes once. I had my purchasing person track down DLT's from four different vendors. When they arrived they had clearly all come off the same manufacturing line.

I suppose they could have used identical shells but different media, but I kind of doubt it.

Oh well, at least they were probably from different batches.

Re:Umm (1)

na1led (1030470) | about 2 years ago | (#41670641)

Defective hard drives usually fail within the first few months. When we purchase new disk storage units, it's almost a guarantee that one or two drives will fail in the first six months.

They shrink (2, Informative)

Anonymous Coward | about 2 years ago | (#41670193)

The drives will shrink down to nothing. I believe that the drive controller considers a sector dead after 100,000 writes.

Re:They shrink (4, Informative)

tgd (2822) | about 2 years ago | (#41670307)

The drives will shrink down to nothing. I believe that the drive controller considers a sector dead after 100,000 writes.

Filesystems, generally speaking, aren't resilient to the underlying disk geometry changing after they've been laid down. There's reserved space to replace bad cells as they start to die, but the disk won't shrink. Eventually, though, you get parts of the disk dying in an unrecoverable way and the drive is toast.

Re:They shrink (1)

Auroch (1403671) | about 2 years ago | (#41670513)

The drives will shrink down to nothing. I believe that the drive controller considers a sector dead after 100,000 writes.

Filesystems, generally speaking, aren't resilient to the underlying disk geometry changing after they've been laid down. There's reserved space to replace bad cells as they start to die, but the disk won't shrink. Eventually, though, you get parts of the disk dying in an unrecoverable way and the drive is toast.

Yup, I had a 2nd gen Kingston die. Ever had a flash drive go bad? Unless you buy one with a decent controller (SandForce, Intel), you'll have the same experience when your SSD dies.

Re:They shrink (5, Informative)

v1 (525388) | about 2 years ago | (#41670669)

The sectors you are talking about are often referred to as "remaps" (or "spares"), a term which is also used to describe the number of blocks that have been remapped. Strategies vary, but an off-the-cuff average would be around one available spare per 1000 allocatable blocks. Some firmware will only use a spare from the same track; other firmware will pull the next nearest available spare (allowing an entire track to go south).

The more blocks they reserve for spares, the lower the total capacity count they can list, so they don't tend to be too generous. Besides, if your drive is burning through its spares at any substantial rate, doubling the number of spares on the drive won't actually end up buying you much time, and certainly won't save any data.

But with the hundreds of failing disks I've dealt with, when more than ~5 blocks have gone bad, the drive is heading out the door fast. Remaps only hide the problem at that point. If your drive has a single block fail when trying to write, it will be remapped silently and you won't ever see the problem unless you check the remap counter in SMART. If it gets an unreadable block on a read operation, you will probably see an I/O error, however. Some drives will immediately remap it, but most don't, and will conduct the remap when you next try to write to that cell (otherwise they'd have to return fictitious data, like all zeros).

So I don't particularly like automatic silent remaps. I'd rather know when the drive first looks at me funny so I can make sure my backups are current, get a replacement on order, and swap it out before it can even think about getting worse. I prefer to replace a drive on MY terms, on MY schedule, not when it croaks and triggers some grade of crisis. There are legitimate excuses for downtime, but a slowly failing drive shouldn't be one of them.

All that said, on multiple occasions I've tried to cleanse a drive of I/O errors by doing a full zero-it format. All decent OBCCs on drives should verify all writes, so in theory this should purge the drive of all I/O errors, provided all available spares have not already been used. The last time I did this on a 1TB Hitachi that had ONE bad block on it, it still had one bad block (via read verify) when the format was done. The write operation did not trigger a remap (and I presume it wasn't verified, as the format didn't fail), and I don't understand that. If it were out of remaps, the odds of it being ONE short of what it needed are essentially zero. So I wonder how many drive manufacturers in reality aren't even bothering with remapping bad blocks. All I can attribute this to is crappy product / firmware design.
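
As an aside, checking that remap counter is easy to script. Below is a minimal Python sketch that shells out to smartctl from smartmontools and reads Reallocated_Sector_Ct; the device path, the ">5 blocks" threshold borrowed from the comment above, and the raw-value parsing are assumptions, since attribute names and formats vary by vendor (and smartctl generally needs root):

    #!/usr/bin/env python3
    # Sketch: read the SMART remap counter discussed above via smartctl.
    # Assumes smartmontools is installed and sufficient privileges;
    # /dev/sda is a placeholder and attribute names vary by vendor.
    import re
    import subprocess
    import sys
    from typing import Optional

    DEVICE = "/dev/sda"  # placeholder device path

    def reallocated_sectors(device: str) -> Optional[int]:
        """Return the raw Reallocated_Sector_Ct value, or None if not reported."""
        out = subprocess.run(["smartctl", "-A", device],
                             capture_output=True, text=True, check=False).stdout
        for line in out.splitlines():
            # Typical attribute row ends with the raw value, e.g.
            #   5 Reallocated_Sector_Ct  0x0033  100  100  036  Pre-fail  Always  -  0
            if "Reallocated_Sector_Ct" in line:
                digits = re.sub(r"\D", "", line.split()[-1])
                return int(digits) if digits else 0
        return None

    if __name__ == "__main__":
        count = reallocated_sectors(DEVICE)
        if count is None:
            print(f"{DEVICE}: no Reallocated_Sector_Ct attribute reported")
        elif count > 5:  # the rough "heading out the door" threshold from the comment above
            print(f"{DEVICE}: {count} remapped blocks; check backups and order a replacement")
            sys.exit(1)
        else:
            print(f"{DEVICE}: {count} remapped blocks; looks OK for now")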

Re:They shrink (2)

klui (457783) | about 2 years ago | (#41670481)

Newer disks' cells aren't rated for more than approximately 5000 writes due to process shrink. You're basically hoping the manufacturer's wear-leveling firmware is enough to compensate.

Re:They shrink (1)

Spazmania (174582) | about 2 years ago | (#41670551)

I can't speak for the accuracy of this, but what I read is that when the SSD runs out of reserved space as a result of reallocation, it switches itself to read-only.

How do SSD's die (5, Funny)

AwesomeMcgee (2437070) | about 2 years ago | (#41670201)

Screaming in agony, hissing bits and bleeding jumperless in the night

Re:How do SSD's die (0)

Anonymous Coward | about 2 years ago | (#41670701)

Placed with all its treasures on its boat, which is then set on fire, and sent out to sea to... oh, wait, no, I'm thinking of vikings. My mistake.

When you're nearing maximum write limit (0)

Anonymous Coward | about 2 years ago | (#41670225)

is when you will see a degradation of performance and possible corruption.

Re:When you're nearing maximum write limit (3, Interesting)

theNetImp (190602) | about 2 years ago | (#41670319)

So by that reasoning, if you have a RAID of 15 drives for storage of images, and these images never change (they are written and never overwritten), then the SSDs should theoretically never die because they are only reading those bits now?

Re:When you're nearing maximum write limit (2, Interesting)

Baloroth (2370816) | about 2 years ago | (#41670523)

So by that reasoning, if you have a RAID of 15 drives for storage of images, and these images never change (they are written and never overwritten), then the SSDs should theoretically never die because they are only reading those bits now?

Reading flash is not 100% non-destructive: if you never do a re-write, cells near each read cell (which is all of them, probably) will degrade over time. I believe the stored data will degrade over long periods of time in any case, but I'm not sure. But if you re-write the data every year or so, they could probably last decades.
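
A crude Python sketch of that "re-write every year or so" idea: walk an archive directory and rewrite anything untouched for about a year so the controller lays the data down on freshly programmed pages. The path is hypothetical, whether this actually refreshes cells depends on the drive's flash translation layer, and a production version should write to a temporary file and rename rather than rewriting in place:

    #!/usr/bin/env python3
    # Crude refresh sketch: rewrite files not modified in ~a year so the SSD's
    # controller copies their data onto newly programmed flash pages.
    # ARCHIVE_ROOT is a hypothetical path; a safer version would write a temp
    # file and rename it instead of rewriting in place.
    import os
    import time

    ARCHIVE_ROOT = "/data/images"      # hypothetical archive location
    MAX_AGE = 365 * 24 * 3600          # roughly one year, in seconds

    def refresh_old_files(root: str) -> None:
        now = time.time()
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if now - os.path.getmtime(path) < MAX_AGE:
                    continue
                with open(path, "rb") as f:
                    data = f.read()    # naive: whole file in memory
                with open(path, "wb") as f:
                    f.write(data)      # same bytes, but freshly written pages

    if __name__ == "__main__":
        refresh_old_files(ARCHIVE_ROOT)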

Re:When you're nearing maximum write limit (3, Informative)

SydShamino (547793) | about 2 years ago | (#41670633)

In theory, yes. In flashROM devices the erase process is the aging action. Your write-once-never-erase-read-only flash should last until A) enough charge manages to leak out of gates that you get bit errors, or B) the part fails due to corrosion or other long-term aging issue, similar to any piece of electronics.

If you have raw access to the flashROM you could in theory write the same data into the same unerased bytes to recover from bit errors (if you had an uncorrupted copy), so only aging failures would occur. But of course you can't do this with an SSD as you have no direct access to the memory, and the controller A) wouldn't let you write into unerased space, and B) wouldn't write the data into the exact same place again anyway.

They die... (1)

Anonymous Coward | about 2 years ago | (#41670233)

Spectacularly and without warning.

Firmware bugs (2)

Anonymous Coward | about 2 years ago | (#41670243)

Didn't happen to me, but a number of people with the same Intel SSD reported that they booted up and the SSD claimed to be 8MB and required a secure wipe before it could be reused. Supposedly it's fixed in the new firmware, but I'm still crossing my fingers every time I reboot that machine.

Re:Firmware bugs (0)

Anonymous Coward | about 2 years ago | (#41670355)

I can 100% confirm this. Happened to my Intel 320.

Re:Firmware bugs (1)

greg1104 (461138) | about 2 years ago | (#41670385)

That's the Intel 320 series drives. They didn't release a version of those drives claimed suitable for commercial work until the "8MB bug" was sorted out; that version became the much more expensive 710 series.

Flash SSD has Write Limitations so... (2, Informative)

Anonymous Coward | about 2 years ago | (#41670259)

From what I understand, SSDs die because of "write burnout" if they are flash based, and from what I understand the majority of SSDs are flash based now. So while I haven't actually had a drive fail on me, I assume that I would still be able to read data off a failing drive and restore it, making it an ideal failure path. I did a Google search and found a good article on the issue: http://www.makeuseof.com/tag/data-recovered-failed-ssd/

Re:Flash SSD has Write Limitations so... (2)

Auroch (1403671) | about 2 years ago | (#41670531)

From what I understand, SSDs die because of "write burnout" if they are flash based, and from what I understand the majority of SSDs are flash based now. So while I haven't actually had a drive fail on me, I assume that I would still be able to read data off a failing drive and restore it, making it an ideal failure path. I did a Google search and found a good article on the issue: http://www.makeuseof.com/tag/data-recovered-failed-ssd/ [makeuseof.com]

Which is why you can do the same from a failed usb flash drive?

It's a nice theory, but it's highly dependent on the controller.

wear leveling (2, Informative)

Anonymous Coward | about 2 years ago | (#41670261)

SSDs use wear-leveling algorithms to optimize each memory cell's lifespan; the drive keeps track of how many times each cell has been written and ensures that all cells are utilized evenly. When cells fail, they are tracked and the drive no longer attempts to write to them. When enough cells have failed, the capacity of the drive will shrink noticeably; at that point it is probably wise to replace it. For a RAID configuration the wear-leveling algorithm would presumably still work, as the RAID algorithm pumps even amounts of data to each drive (whether it is mirrored or striped). When any of the drives are shrinking in size, it is presumably time to replace the array.
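
For a feel of how that plays out, here is a toy Python model of the idea: every write goes to the least-worn live block, worn-out blocks are retired, and usable capacity shrinks until nothing is left. The block count and write limit are made-up numbers, and real controllers are far more sophisticated (they also juggle static data, spares, and error correction):

    # Toy wear-leveling model: always write the least-worn live block and
    # retire blocks that hit their (made-up) write limit.
    import heapq

    class ToyWearLeveler:
        def __init__(self, num_blocks, write_limit):
            self.write_limit = write_limit
            self.retired = 0
            # Min-heap of (writes_so_far, block_id): least-worn block first.
            self.heap = [(0, blk) for blk in range(num_blocks)]
            heapq.heapify(self.heap)

        def write(self):
            """Send one write to the least-worn live block; return its id,
            or None once every block has worn out."""
            if not self.heap:
                return None                    # no live blocks left: drive is dead
            count, blk = heapq.heappop(self.heap)
            count += 1
            if count >= self.write_limit:
                self.retired += 1              # worn out: drop it from the pool
            else:
                heapq.heappush(self.heap, (count, blk))
            return blk

    if __name__ == "__main__":
        ssd = ToyWearLeveler(num_blocks=200, write_limit=1000)
        total = 0
        while ssd.write() is not None:
            total += 1
        print(f"toy drive absorbed {total} writes; {ssd.retired} blocks retired")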

if they die at the same time repeatably.. (1)

gl4ss (559668) | about 2 years ago | (#41670263)

by performing the same set of actions, in an unreasonably short time, then with 99.999% probability (the more drives, add 9's) it's a bug in the firmware/controller. AFAIK there shouldn't be such drives on the market anymore.

Otherwise the NAND shouldn't die at the same time. Shitty NAND, I suppose, will die faster (a bad batch is shitty).

Some drive controllers have counters for NAND use, but they shouldn't all blow up when the counter hits 0, which is the point at which you're recommended to replace them.

I haven't had one die, though I do have a Vertex 2 in daily thrashing use.

ask someone who tries to salvage data (0)

Anonymous Coward | about 2 years ago | (#41670265)

Ask someone who tries to salvage data from a dead SSD drive.
Who?
Him:
http://www.youtube.com/watch?v=vLoYduckmuo

They usually die gracefully... (5, Informative)

dublin (31215) | about 2 years ago | (#41670275)

In general, if the SSD in question has a well-designed controller (Intel, SandForce), then write performance will begin to drop off as bad blocks start to accumulate on the drive. Eventually, wear levelling and write cycles have taken their toll, and the disk can no longer write at all. At this point, the controller does all it can: it effectively becomes a read-only disk. It should operate in this mode until something else catastrophic (tin migration, capacitor failure, etc.) keeps the entire drive from working.

BTW - I haven't seen this either, but that's the degradation profile that's been presented to me in several presentations by the folks making SSD drives and controllers. (Intel had a great one a few years back - don't have a link to it handy, though...)

Re:They usually die gracefully... (0)

Anonymous Coward | about 2 years ago | (#41670503)

thank you.

X-25M Death: Firmware bug too? (5, Interesting)

Anonymous Coward | about 2 years ago | (#41670595)

I had an 80G Intel X-25M fail in an interesting manner. Windows machine, formatted NTFS, Cygwin environment. The drive had been in use for about a year, and the "wear indicator" still read 100% fine. The only thing wrong with it is that it had been mostly filled (70 out of 80G), but wear leveling should have mitigated that. It had barely a terabyte written to it over its short life.

Total time from system operational to BSOD was about ten minutes. I first noticed difficulties when I invoked a script that called a second script, and the second script was missing. "ls -l" on the missing script confirmed that the other script wasn't present. While scratching my head about $PATH settings and knowing damn well I hadn't changed anything, a few minutes later, I discovered I also couldn't find /bin/ls.exe. In a DOS prompt that was already open, I could DIR C:\cygwin\bin - the directory was present, ls.exe was present, but it wasn't anything that the OS was capable of executing. Sensing imminent data loss, and panic mounting, I did an XCOPY /S /E... etc to salvage what I could from the failing SSD.

Of the files I recovered by copying them from the then-mortally-wounded system, I was able to diff them against a valid backup. Most of the recovered files were OK, but several had 65536-byte blocks consisting of nothing but zeroes.

Around this point, the system (unsurprisingly, as executables and swap and heaven knows what else were being riddled with 64K blocks of zeroes) crashed. On reboot, Windows attempted (and predictably failed) to recover (asinine that Windows tries to write to itself on boot, but also asinine of me to not power the thing down and yank the drive, LOL). The system did recognize it as an 80G drive and attempted to boot itself - Windows logo, recovery console, and all.

On an attempt to mount the drive from another boot disk, the drive still appeared as an 80G drive once; unfortunately, it couldn't remain mounted long enough for me to attempt further file recovery or forensics.

A second attempt - and all subsequent attempts - to mount the drive showed it as an 8MB (yes, eight megabytes) drive.

I'll bet most of the data's still there. (The early X-25Ms didn't use encryption). What's interesting is that the newer drives have a similar failure mode [intel.com] that's widely recognized as a firmware bug. If there were a way to talk to the drive over its embedded debugging port (like the Seagate Barracuda fix from a few years ago), I'll bet I could recover most of the data.

(I don't actually need the data, as I got it all back from backups, but it's an interesting data recovery project for a rainy day. I'll probably just desolder the chips and read the raw data off 'em. Won't work for encrypted drives, but it might work for this one.)

Re:They usually die gracefully... (4, Interesting)

AmiMoJo (196126) | about 2 years ago | (#41670625)

I had an Intel SSD run out of spare capacity and it was not fun. Windows kept forgetting parts of my profile and resetting things to default or reverting back to backup copies. The drive didn't report a SMART failure either, even with Intel's own SSD monitoring tool. I had to run a full SMART "surface scan" before it figured it out.

That sums up the problem. The controller doesn't start reporting failures early enough and the OS just tries to deal with it as best as possible, leaving the user to figure out what is happening.
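
That "surface scan" can be kicked off from a script. A minimal Python sketch using smartctl's long self-test follows; /dev/sda is a placeholder, the command needs sufficient privileges, and the "in progress" string match is only illustrative since self-test log wording varies between drives:

    #!/usr/bin/env python3
    # Sketch: start a SMART extended (long) self-test and poll the self-test
    # log until it finishes. Device path and polling interval are placeholders.
    import subprocess
    import time

    DEVICE = "/dev/sda"  # placeholder: point at the suspect SSD

    def smartctl(*args):
        return subprocess.run(["smartctl", *args, DEVICE],
                              capture_output=True, text=True, check=False).stdout

    if __name__ == "__main__":
        print(smartctl("-t", "long"))          # start the extended self-test
        while True:
            log = smartctl("-l", "selftest")   # read the self-test log
            if "in progress" not in log.lower():
                break                          # log wording varies by drive; illustrative only
            time.sleep(300)                    # poll every five minutes
        print(log)                             # final log lists pass/fail per test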

It's because of the noise they make! (1)

SternisheFan (2529412) | about 2 years ago | (#41670283)

They die because the people living near Kennedy Airport complain and demonstrate about the noise... What?... Not SST's..., SS"D" 's...... Oh, well, that's different then.

Never mind.

Re:It's because of the noise they make! (1)

Steelwings (549043) | about 2 years ago | (#41670343)

Not to be confused with STD's

Re:It's because of the noise they make! (2)

SternisheFan (2529412) | about 2 years ago | (#41670387)

Not to be confused with STD's

Personally, I wouldn't really mind all that much if my STD's died, I can see an upside...

Re:It's because of the noise they make! (1)

neokushan (932374) | about 2 years ago | (#41670475)

Nobody on Slashdot will ever have to worry about those.

Re:It's because of the noise they make! (1)

SternisheFan (2529412) | about 2 years ago | (#41670527)

Nobody on Slashdot will ever have to worry about those.

Oooooh... (lol).. that's one of those "it's funny 'cause it's true" jokes. :-)

They die without warning and without recourse (3, Informative)

PeeAitchPee (712652) | about 2 years ago | (#41670287)

With traditional mechanical drives, you usually get a clicking noise accompanied by a time period where you can offload data from the drive before it fails completely. In my experience, though SSDs don't fail as often, when they do, it's sudden and catastrophic. Having said that, I've only seen one fail out of the ~10 we've deployed here (and it was in a laptop versus traditional desktop / workstation). So BACK IT UP. Just my $0.02.

Re:They die without warning and without recourse (4, Informative)

PRMan (959735) | about 2 years ago | (#41670583)

I have had two SSD crashes. One was on a very cheap Zalman 32GB drive which never really worked (OK, about twice). The other was on a Kingston 64GB that I have in my server. When it gets really hot in the room (over 100 F, so probably over 120 F for the drive itself in the case), it will crash. But when it cools down, it works perfectly well.

Re:They die without warning and without recourse (5, Interesting)

cellocgw (617879) | about 2 years ago | (#41670653)

With traditional mechanical drives, you usually get a clicking noise accompanied by a time period where you can offload data from the drive before it fails completely.

OK, so I'm sure some enterprising /.-er can write a script that watches the SSD controller and issues some clicks to the sound card when cells are marked as failed.

Re:They die without warning and without recourse (5, Informative)

dougmc (70836) | about 2 years ago | (#41670657)

With traditional mechanical drives, you usually get a clicking noise accompanied by a time period where you can offload data from the drive before it fails completely.

Usually? No.

This does happen sometimes, but it certainly doesn't happen "usually". There are enough different failure mechanisms for hard drives that there isn't any one "usual" mode:

1- drive starts reporting read and/or write errors occasionally, but otherwise seems to keep working
2- drive just suddenly stops working completely all at once
3- drive starts making noise (and performance usually drops massively), but the drive still works.
4- drive seems to keep working, but smart data starts reporting all sorts of problems.

Personally, I've had #1 happen more often than anything else, usually with a healthy serving of #4 at about the same time or shortly before. #2 is the next most common failure mode, at least in my experience.

SSDs do fail (1, Interesting)

Anonymous Coward | about 2 years ago | (#41670309)

Pretty much all SSDs have more than 8 chips in a configuration similar to RAID 0. If any single chip has a problem, the entire drive is useless. I've seen SSDs fail from the cheap 40GB Patriots all the way up to the high-end Fusion-io drives. *Most* of them died after power cycles; I guess if they are going to fail, that will usually be the time it happens. At least with mechanical disks you can spend some cash and have the data recovered after a failure.

How do SSD's die? (0)

Anonymous Coward | about 2 years ago | (#41670311)

Suddenly. I've had 2 SSDs fail on me and they both died a sudden and unexpected death.

Re:How do SSD's die? (1)

SternisheFan (2529412) | about 2 years ago | (#41670429)

Suddenly. I've had 2 SSDs fail on me and they both died a sudden and unexpected death.

How long did they last, if you don't mind me asking. Or is it "too soon"...

Bang! (4, Informative)

greg1104 (461138) | about 2 years ago | (#41670325)

All three of the commercial grade SSD failures I've cleaned up after (I do PostgreSQL data recovery) just died. No warning, no degrading SMART attributes; works one minute, slag heap the next. Presumably some sort of controller-level failure. My standard recommendation here is to consider them no more or less reliable than traditional disks and always put them in RAID-1 pairs. Two of the drives were Intel X25 models, the other was some terrible OCZ thing.

Out of more current drives, I was early to recommend Intel's 320 series as a cheap consumer solution reliable for database use. The majority of those I heard about failing died due to firmware bugs, typically destroying things during the rare (and therefore not well tested) unclean shutdown / recovery cases. The "Enterprise" drive built on the same platform after they tortured consumers with those bugs for a while is their 710 series, and I haven't seen one of those fail yet. That's not across a very large installation base nor for very long yet though.

Re:Bang! (5, Funny)

ColdWetDog (752185) | about 2 years ago | (#41670567)

Does anyone else find this sort of thing upsetting? I grew up during that period of time when tech failed dramatically on TV and in movies. Sparks, flames, explosions - crew running around randomly spraying everything with fire extinguishers. Klaxons going off. Orders given and received. Damage control reports.

None of this 'oh snap, the hard drive died'.

Personally, I think the HD (and motherboard) manufacturers ought to climb back on the horse. Make failure modes exciting again. Give us a run for the money. It can't be hard - there still must be plenty of bad electrolytic capacitors out there.

How about a little love?

Re:Bang! (1)

greg1104 (461138) | about 2 years ago | (#41670685)

There are more bad electrolytic capacitors out there than ever before. Problem is they don't blow in an exciting way anymore. The stupid things just bow out at the top [wikipedia.org], with the case completely able to contain the explosion. So lame.

More importantly (1)

Anonymous Coward | about 2 years ago | (#41670333)

How do they get to Silicon Heaven?

Data corruption, then fails e2fsck upon boot (3, Informative)

vlm (69642) | about 2 years ago | (#41670357)

My experience was a system crash due to corruption of loaded executables; then at the hard reboot it failed the e2fsck, because the "drive" was basically unwritable so the e2fsck couldn't complete.

It takes a long time to kill a modern SSD... this failure was from back when a CF card plugged into a PATA-to-CF adapter was exotic even by /. standards.

Bad blocks (1)

Anonymous Coward | about 2 years ago | (#41670363)

I've had SSDs die... Basically just got an increasing number of bad blocks due to worn out flash cells.

Dunno about how, but I do know WHEN (2)

davidwr (791652) | about 2 years ago | (#41670365)

Like spinning drives, silicon drives always die when it will do the most damage [wikipedia.org].

Like right before you find out all your backups are bad.

Re:Dunno about how, but I do know WHEN (0)

Anonymous Coward | about 2 years ago | (#41670535)

Strictly speaking, if you don't check your backups and they're bad, your drives will always fail right before you check your backups.

Check with Intel (-1)

Anonymous Coward | about 2 years ago | (#41670373)

You should Google Intel's site for detailed information. I can't believe you asked the snobbish Slashdotters for answers to this question. There are better ways than asking cocky ubergeeks to find out why SLC drives are long-lived and MLC drives vary in durability based on design parameters.

I have seen SSD death (5, Informative)

MRGB (2743757) | about 2 years ago | (#41670389)

I have seen SSD death many times and it is a strange sight indeed. What is interesting about it compared to normal drives is that when normal drives fail it is, mostly, an all-or-nothing ordeal. A bad spot on a drive is a bad spot on a drive. With SSDs you can have a bad spot in one place, reboot, and get a bad spot in another place. Windows loaded on an SSD will exhibit all kinds of bizarre behaviour. Sometimes it will hang, sometimes it will blue-screen, sometimes it will boot normally until it tries to read or write to that random bad spot. Rebooting is like rolling the dice to see what it will do next - that is, until it fails completely.

RAID 0 drive crashed (0)

Anonymous Coward | about 2 years ago | (#41670397)

I had a couple of Vertex 2 60s in RAID 0 running Windows 7.
1. At first my Windows would reboot in the middle of the night.
2. It kept getting worse; eventually it got to the point where it would only boot for a few minutes before crashing. Sometimes the drive wasn't recognized at POST.
3. Eventually OCZ replaced it. I had to tell them the red LED was blinking on the drive, indicating it was dead.

I assume that if a few cells had gone bad it could recover, but to not show up at POST there must have been some bigger issue.
Only one of the identical drives crashed; the other has been running fine for months (as a single drive now).

Hopefully, like my grandfather (0)

Anonymous Coward | about 2 years ago | (#41670405)

Who died peacefully, in his sleep. Unlike his train passengers who died painfully while screaming..

1 failed SSD experienced... (1)

StoneyMahoney (1488261) | about 2 years ago | (#41670425)

Only seen a single SSD fail. It was a Mini PCIe unit in a Dell Mini 9. I suspect the actual failure may have been atypical, as it seems it failed in just the right place to render the filesystem unwritable, although you could read from fairly hefty sections of it. It was immediate and irreparable, although I suspect SSD manufacturers use better quality parts than that built-to-a-price (possibly counterfeit) POS.

Had one die twice (3)

bstrobl (1805978) | about 2 years ago | (#41670437)

Had an aftermarket SSD for a MacBook Air fail twice in 2 years (threw it out and put an original HDD back in after that). Both times the system decided not to boot and could not find the SSD.

In both cases I have suspected that the Indilinx controller gave way. This seems to mirror the experience of quite a few others who had drives with these chips in them.

In an ideal scenario the controller should be able to handle the eventual wearout of the disk by finding other memory cells to write to. Any cells that have been used up should still be readable as well, since the floating gates have basically been filled up with electrons and will not allow further erasing.

I guess the main issue right now is that SSDs can't notify the user once things get a bit too worn out. Eventually the controller won't be able to keep up with the useless cells and then might simply no longer respond. Things will only get worse as write cycles go down due to smaller manufacturing processes, so that useless controllers in cheap SSDs are more likely to fail.

I had one fail (2)

kelemvor4 (1980226) | about 2 years ago | (#41670445)

I had a FusionIO IODrive fail a few weeks ago. It was running a data array on a Windows 2008 R2 server. It manifested itself by giving errors in the Windows event log and causing long boot times (even though it was not a boot device). The device was still accessible, but slower than normal. I think the answer to your question will probably vary greatly both by manufacturer and by what part of the device failed. The SSDs I've used generally come with a fairly large amount of "backup" memory on them, so that if a cell begins to fail, the card marks the cell bad and uses one from one of the backup chips, much like how hard drives deal with bad sectors. As I understand it, the SSD is somehow able to detect the failure before data is lost and begin using the backup chips transparently and automatically, versus having to do a scandisk or similar to do the same on a physical disk. That may very well vary by manufacturer as well.

Re:I had one fail (1)

greg1104 (461138) | about 2 years ago | (#41670601)

The FusionIO devices are provisioned with a fair amount of redundancy at the storage cell level. But if a part in the main controller goes boom, so does the whole device. I've seen that once so far, wasn't fun since the most critical parts of the data were stored there--trying to get the most out of the device's expense. Some of these units are just expensive enough that I've seen a depressing number of people buy just one (rather than a mirrored pair) after buying the sales pitch on the cell redundancy. If you're going to do that, make sure you have some sort of real-time replication over to a cheaper server going on, too.

Peacefully, with their loved ones at their bedside (2)

Revotron (1115029) | about 2 years ago | (#41670451)

as the disk controller reads them their last rites before they integrate with the great RAID array in the sky.

Oblig: T. S. Eliot (4, Funny)

stevegee58 (1179505) | about 2 years ago | (#41670489)

Not with a bang but a whimper.

die (1)

SuperRenaissanceMan (1027668) | about 2 years ago | (#41670511)

in a fire

Yes they do fail (2)

AnalogDiehard (199128) | about 2 years ago | (#41670519)

We use SSDs in a few Windows machines at work, running 24/7/365 production. We were replacing them every couple of years.

Robot Odyssey (0)

Anonymous Coward | about 2 years ago | (#41670539)

That is all

My SSD is bad! (2)

dittbub (2425592) | about 2 years ago | (#41670541)

I have a G.Skill Falcon 64GB SSD that is failing on me. Windows chkdsk started seeing "bad sectors" (whatever that means for an SSD... I think it's really slow parts of the SSD) and kept seeing more and more, and Windows would not boot. A fresh install of Windows would immediately crash in a day or two. I had done a "secure erase" and that seemed to do the job; a chkdsk found no "bad sectors". But a few weeks later chkdsk found 4 bad sectors. It's going on a month now, though, and I have yet to have Windows fail.

Re:My SSD is bad! (1)

dittbub (2425592) | about 2 years ago | (#41670557)

I have also updated the drive firmware since then; that may have helped stability!

SSD wear cliff (4, Informative)

RichMan (8097) | about 2 years ago | (#41670609)

SSDs have an advertised capacity N and an actual capacity M, where M > N. In general, the bigger M is relative to N, the better the performance and lifetime of the drive. As it wears, the drive will "silently" assign bad blocks and reduce M, and your write performance will degrade. If you have good analysis tools, they will tell you when the drive starts getting a lot of blocks near end of life and when M is getting reduced.

Blocks near end of life are also more likely to get read errors. The drive firmware is supposed to juggle things around so all of the blocks reach end of life at about the same time. With a soft read error, the block will be moved to a more reliable portion of the SSD. That means increased wear.

1. Watch write performance and the spare block count.
2. If you get any read errors, do a block life audit.
3. When you get into life-limiting events, things accelerate toward failure due to the mitigation behaviors.

Be careful: depending on the sensitivity of the firmware, it may let you get closer to catastrophe before warning you. That margin is likely to be thinner in consumer-grade drives.
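
To turn that advice into a number you can watch, a back-of-envelope Python calculation like the one below compares rated endurance against observed writes; every figure here is a made-up example, not a vendor specification, and the host-writes number has to come from your own monitoring or SMART data:

    # Back-of-envelope remaining-endurance estimate. All numbers are examples.
    RATED_ENDURANCE_TB = 150.0   # hypothetical rated total bytes written (TBW)
    WRITTEN_SO_FAR_TB = 90.0     # e.g. derived from a SMART host-writes counter
    DAILY_WRITES_GB = 40.0       # observed average daily write volume

    remaining_tb = RATED_ENDURANCE_TB - WRITTEN_SO_FAR_TB
    days_left = remaining_tb * 1024.0 / DAILY_WRITES_GB

    print(f"{remaining_tb:.0f} TB of rated endurance left: "
          f"about {days_left:.0f} days ({days_left / 365:.1f} years) at the current rate")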

OCZ Vertex 4 fails out of the box (0)

Anonymous Coward | about 2 years ago | (#41670635)

Hi,

We bought over 70 OCZ Vertex 4 drives, and after 1 month we had over 20 failures. About 5 of them were DOA, and the rest died in production. They would crash Windows and would not reboot.

So my experience with SSDs is: BACK UP anything critical to a regular HDD.

Wanna see them die? Just get an OCZ (0)

Anonymous Coward | about 2 years ago | (#41670661)

OCZ makes the worst SSDs in the world, and it's not even a flash wear issue. For them, it's firmware. And, FFS, you have to update firmware on the goddamn things practically daily, and you can only do it by moving the drive to another machine, or with a hokey linux bootable CD, and while the planets are in a specific alignment and while holding the rabbit ears just a little to the left, except on Tuesdays when you have to hold them just a little to the right.

They just die inexplicably, and with no warning, and all of your data is just GONE.

No, they don't all age the same. (3, Informative)

YesIAmAScript (886271) | about 2 years ago | (#41670673)

It's statistical, not fixed rate. Some cells wear faster than others due to process variations, and the failures don't show up to you until there are uncorrectable errors. If one chip gets 150 errors spread out across the chip, and another gets 150 in critical positions (near to each other), then the latter one will show failures while the first one keeps going.

So yeah, when one goes, you should replace them all. But they won't all go at once.

Also note most people who have seen SSD failures have probably seen them fail due to software bugs in their controllers, not inherent inability to store data due to wear.
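
A tiny Monte Carlo sketch in Python makes the same point: give each cell a slightly different endurance and "identical" drives hit their first uncorrectable failure at noticeably different write counts. Every parameter here is invented for illustration:

    # Monte Carlo illustration: identical drives with per-cell endurance spread
    # fail at different times. All parameters are invented.
    import random

    def first_failure_cycle(cells=10_000, mean_endurance=5_000,
                            spread=0.10, correctable_budget=150):
        """Return the write cycle at which worn-out cells exceed what
        ECC/spares can hide, for one simulated drive."""
        endurances = sorted(max(1, int(random.gauss(mean_endurance,
                                                    spread * mean_endurance)))
                            for _ in range(cells))
        # The drive "fails" once more cells have worn out than the budget covers.
        return endurances[correctable_budget]

    if __name__ == "__main__":
        random.seed(1)
        drives = [first_failure_cycle() for _ in range(8)]
        print("first-failure cycle for 8 'identical' drives:", drives)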

Usually the firmware or the NAND (1)

Anonymous Coward | about 2 years ago | (#41670683)

I'm an SSD firmware engineer, so I know this all in depth. If an SSD suddenly fails, the most likely cause is a firmware bug putting the drive into a bad state or a catastrophic NAND failure. It all depends on how well the firmware and NAND are tested. The trickiest part of the firmware to get right is the unsolicited power cycle, so make sure to shut down the system properly. As for the NAND, it might be good to do a burn-in write of random data over the full drive capacity between 3 and 30 times to scrub out the early NAND block failures. A good manufacturer would already do this.
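
A rough sketch of that burn-in idea, for anyone who wants to do it themselves before trusting a new drive: repeatedly fill the drive's free space with random data and then delete it. The mount point is hypothetical, this version writes a throwaway file on a filesystem rather than the raw device, and it does not verify reads, so treat it as illustrative only:

    # Burn-in sketch: fill free space with random data a few times to shake out
    # early block failures. TARGET is a hypothetical path on the drive under test.
    import errno
    import os

    TARGET = "/mnt/ssd_under_test/burnin.bin"
    PASSES = 3
    CHUNK = 4 * 1024 * 1024               # 4 MiB of fresh random data per write

    for n in range(PASSES):
        with open(TARGET, "wb") as f:
            try:
                while True:
                    f.write(os.urandom(CHUNK))   # write until the filesystem is full
            except OSError as e:
                if e.errno != errno.ENOSPC:
                    raise                        # anything other than "disk full" is a real problem
        os.remove(TARGET)                        # free the space before the next pass
        print(f"pass {n + 1}/{PASSES} complete")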

giyfs (0)

Anonymous Coward | about 2 years ago | (#41670687)

http://computer.howstuffworks.com/solid-state-drive5.htm

Old SSDs never die... (1)

LeDopore (898286) | about 2 years ago | (#41670707)

Old SSDs never die. They just lose their bits.
