Why do SSD drives fail? - Darwin's Data

SSD (solid state drive) technology has become very popular in recent years due to its faster speeds and lower power consumption compared to traditional HDDs (hard disk drives). However, SSDs have a limited lifespan and will eventually fail. In this article, we examine the various factors that can cause an SSD to fail.

Wear and Tear on NAND Flash Memory Cells

The storage medium inside SSDs is NAND flash memory. This is different from the spinning magnetic platters used in HDDs. NAND flash memory stores data in memory cells made up of floating gate transistors. To write data to a cell, a high voltage is applied to inject electrons into the floating gate, changing its state from 1 to 0. To erase the cell, a high negative voltage is applied to remove electrons from the floating gate, changing its state back from 0 to 1.

Each memory cell in an SSD can endure only a limited number of program/erase (P/E) cycles before it becomes unreliable. Typical ratings are on the order of 50,000-100,000 cycles for SLC NAND flash, 3,000-10,000 for MLC NAND flash, and 1,000-3,000 for TLC NAND flash, with exact figures varying by manufacturer and process generation. Once a cell reaches its P/E cycle limit, it can no longer reliably store data, and as more cells wear out the SSD eventually fails.

Write Amplification

One factor that accelerates the wearing out of NAND flash cells in an SSD is write amplification. SSDs use a technique called wear leveling to distribute writes evenly across all the cells in the drive so that some cells do not wear out much faster than others. However, the process of wear leveling itself can actually increase the total number of writes to the drive.
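The idea behind wear leveling can be shown with a minimal sketch. It assumes a toy model in which every write lands on one whole block and costs one erase; the function names and the "pick the least-erased block" policy are illustrative, not any particular controller's algorithm.

```python
def pick_block(erase_counts, free_blocks):
    """Return the free block with the fewest erases (ties go to the lowest id)."""
    return min(free_blocks, key=lambda b: (erase_counts[b], b))


def write_with_wear_leveling(erase_counts, free_blocks, num_writes):
    """Simulate block-sized writes, each erasing its target block once."""
    for _ in range(num_writes):
        block = pick_block(erase_counts, free_blocks)
        erase_counts[block] += 1
    return erase_counts


# Block 0 starts out heavily used; the policy steers new writes away from it.
counts = write_with_wear_leveling({0: 5, 1: 0, 2: 2, 3: 0}, [0, 1, 2, 3], 9)
print(counts)
```

After nine writes, the previously idle blocks have caught up while block 0 has not been touched again, which is the evening-out effect described above.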

For example, when a block holds a mix of valid and stale data, or when the controller decides a block has absorbed a disproportionate share of writes, it copies the valid data from that block to a new block. The old block is then erased before being made available again for new writes. This process helps distribute writes evenly but results in additional writes to the NAND flash beyond the original host write. Depending on the controller algorithm and the workload, write amplification can range from about 1.1x to over 20x.

Higher write amplification reduces the lifespan of an SSD proportionally. SSDs with lower write amplification specifications can better withstand intensive write-heavy workloads and last longer before reaching the write endurance limits of NAND flash.
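The proportional relationship can be made concrete with a rough back-of-the-envelope sketch. The function names and the simplified formula (capacity times rated P/E cycles, divided by write amplification) are assumptions for illustration; real endurance ratings also depend on over-provisioning, data retention targets, and workload.

```python
def write_amplification(nand_bytes_written, host_bytes_written):
    """Ratio of data physically written to NAND versus data the host asked to write."""
    return nand_bytes_written / host_bytes_written


def estimated_host_writes_tb(capacity_gb, pe_cycles, waf):
    """Approximate total host writes (in TB) before cells reach their rated P/E limit."""
    return capacity_gb * pe_cycles / waf / 1000


# Same hypothetical 1000 GB drive rated for 3000 P/E cycles, two workloads.
light = estimated_host_writes_tb(1000, 3000, waf=2.0)
heavy = estimated_host_writes_tb(1000, 3000, waf=4.0)
print(light, heavy)
```

Doubling the write amplification factor halves the estimated amount of host data the drive can absorb.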

Read Disturb Errors

While writing to NAND flash cells causes wear, reading from the cells can also eventually lead to data errors. The act of reading data from a NAND flash cell requires applying a voltage to detect the state of the cell. However, applying read voltages thousands of times can cause electrons to leak onto the floating gate of adjacent cells, changing their state. This phenomenon is known as read disturb.

Read disturb rates increase as NAND flash process geometry shrinks. With every NAND die shrink, cell walls become thinner, causing more interference between cells. Read disturb was not much of an issue with older 50nm-90nm NAND flash. But with modern 2D NAND at 19nm-25nm and 3D NAND at 50+ layers, read disturb has become more prevalent, causing background read errors.

Fortunately, modern SSDs deploy read disturb management techniques, such as periodic read scrubbing and error correction code (ECC). Still, accumulated read disturbs can eventually overwhelm these protections, leading to uncorrectable errors during normal reads. Workloads with very heavy continuous reads can accelerate read disturb rates.
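One simple way to picture read disturb management is a per-block read counter that triggers a refresh (rewriting the data elsewhere) once a threshold is reached. The sketch below is a hypothetical model; the class name, the threshold value, and the reset-on-refresh behavior are assumptions, and real firmware uses vendor-specific heuristics.

```python
class ReadDisturbTracker:
    """Counts reads per block and flags blocks that are due for a refresh."""

    def __init__(self, refresh_threshold):
        self.refresh_threshold = refresh_threshold
        self.read_counts = {}

    def record_read(self, block):
        """Record one read; return True if the block should now be refreshed."""
        count = self.read_counts.get(block, 0) + 1
        if count >= self.refresh_threshold:
            # Data would be relocated here, so the counter starts over.
            self.read_counts[block] = 0
            return True
        self.read_counts[block] = count
        return False


tracker = ReadDisturbTracker(refresh_threshold=3)  # tiny threshold for the demo
results = [tracker.record_read(7) for _ in range(4)]
print(results)
```

A read-heavy workload reaches the threshold sooner, which means more background refresh writes and, indirectly, more wear.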

Write Fatigue

In addition to P/E cycles wearing out NAND flash cells, writes can also “fatigue” the oxide layer inside the cells. Electrons tunneling through the oxide during program and erase operations gradually damage the thin insulating layer. Over time, a cell’s threshold voltage windows get distorted, making it harder to detect state. The cell may not be worn out in terms of P/E cycles, but thousands of writes can fatigue and weaken its charge storage ability.

This write fatigue effect varies depending on the NAND flash technology. Planar 2D NAND tends to exhibit more severe write fatigue issues compared to newer 3D NAND. The different architecture of 3D NAND stacks cells vertically, allowing for better durability. However, no NAND flash technology is completely immune to cumulative write fatigue effects.

Read Only Memory (ROM) Failure

Alongside the NAND flash storage, SSDs also contain onboard firmware and logic in read-only memory (ROM) chips. Typically, ROM chips are used rather than rewritable flash as they are more resilient against wear. However, ROM chips are still susceptible to failures over time.

ROM is classified as either fuses or anti-fuses. Fuses are links that connect logic gates. Anti-fuses create links when current is applied. Failures can happen when fusible links disconnect or anti-fuses get shorted and form faulty connections. Cosmic radiation can also randomly flip bits in ROM chips.

If critical boot data or firmware code bits stored in ROM get corrupted, the SSD may become undetectable or inaccessible by the host computer. Thankfully, complete ROM failures are quite rare in modern SSDs due to advanced error detection and redundancy mechanisms.

Internal Data Corruption

In addition to the storage media wearing out, SSDs are also vulnerable to internal data corruption. This can happen when stray writes occur during read-modify-write operations, corrupting page or block mapping tables.

Garbage collection routines that relocate user data to new blocks present another avenue for possible data corruption. Bugs in firmware code can also lead to the unintended alteration of data. Power failures during write operations likewise pose a risk of partial page data being written.

To guard against internal data corruption, SSD controllers employ checksum validations on critical metadata structures. ECC also protects the actual user data stored in NAND flash. Together, these mechanisms allow most corrupted data to be detected and, where redundancy exists, recovered. However, errors that accumulate beyond what these mechanisms can correct will eventually spell the end of reliable access to stored data.
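A checksum on a metadata record can be sketched in a few lines. This example uses CRC32 from Python's standard zlib module as a stand-in; the record layout and the helper names `seal` and `verify` are hypothetical, and actual controllers use their own formats and checksum choices.

```python
import zlib


def seal(payload: bytes) -> bytes:
    """Append a 4-byte CRC32 of the payload to form a stored record."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def verify(record: bytes) -> bool:
    """Recompute the CRC32 and compare it with the stored checksum."""
    payload, stored = record[:-4], record[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == stored


record = seal(b"lba=42->block=7,page=3")  # made-up mapping table entry
corrupted = bytes([record[0] ^ 0x01]) + record[1:]  # flip a single bit
print(verify(record), verify(corrupted))
```

Note that a checksum only detects the corruption; recovering the entry requires a redundant copy or some other means of rebuilding it.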

External Physical Damage

Like all electronic devices, SSDs are vulnerable to physical damage from external environmental factors. Dropping an SSD, power surges, strong magnetic fields, static electricity discharge – all can potentially damage the drive at the hardware level.

Vibration is one particular hazard for SSDs. Since they have no moving parts, SSDs can withstand much stronger vibration compared to HDDs. However, extended long-term vibration can still disrupt component solder joints and internal connections. Server environments are especially prone to vibration issues that can accelerate SSD failure.

Extreme cold temperatures can also harm SSDs by contracting the circuit board and components, possibly fracturing solder joints. High temperatures are similarly detrimental, causing dry joints as board laminates expand. Thermal cycling through a wide temperature range compounds these thermal stresses over time.

Insufficient Error Correction Capabilities

SSD controllers deploy a number of error detection and correction mechanisms to mitigate the inherent vulnerabilities of NAND flash memory. However, not all SSDs are created equal when it comes to the strength of ECC capabilities.

Low-end consumer SSDs may use relatively weak ECC based on BCH or Reed-Solomon (RS) codes. In the weakest case, a 1 bit ECC corrects single bit flips but cannot fix errors of 2 or more bits. High-end enterprise SSDs use much stronger ECC, correcting up to around 70 bits per codeword with LDPC engines, to cope with heavily degraded pages.

If the SSD controller ECC capabilities are exceeded by the actual number of errors accumulating in the NAND flash, uncorrectable errors will be returned when accessing data. No amount of redundancy or re-reads can recover data if the ECC is too weak to fix the errors.
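What happens when errors exceed the correction capability can be demonstrated with a toy code. The sketch below uses Hamming(7,4), which corrects exactly one flipped bit per 7-bit codeword. It is chosen purely for brevity and is an assumption of this example; it is not the BCH, RS, or LDPC code an SSD would actually use.

```python
def encode(data):
    """Encode 4 data bits as a 7-bit Hamming codeword: p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]


def decode(codeword):
    """Correct at most one flipped bit, then return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # 1-based index of the suspected bad bit
    if position:
        c[position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


def flip(codeword, *indexes):
    """Return a copy of the codeword with the given 0-based bits inverted."""
    c = list(codeword)
    for i in indexes:
        c[i] ^= 1
    return c


data = [1, 0, 1, 1]
stored = encode(data)
one_error = decode(flip(stored, 4))      # within the code's capability
two_errors = decode(flip(stored, 4, 6))  # beyond the code's capability
print(one_error == data, two_errors == data)
```

With one flipped bit the original data comes back intact. With two, the decoder "corrects" the wrong bit and returns bad data, which is why production controllers add further detection on top of ECC so that such cases are reported as uncorrectable errors.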

Shrinking ECC margin is one reason why SSDs typically fail gradually rather than suddenly. The number of damaged bits per page creeps closer over time to the limit of what the ECC can handle. Eventually, that limit is breached, resulting in SSD failure.

Component Failures

Like all electronics, the individual integrated circuits (ICs) that make up an SSD can suffer age-related failures. The NAND flash, SSD controller, DRAM cache, and other components slowly degrade over time.

For example, the thin gate oxide layer inside transistors grows more defective traps over time. Junction leakage currents increase, amplifying power consumption. Electromigration wears down metal interconnect traces. Thermal stresses crack silicon die and package materials. Even the capacitors on voltage regulation modules can dry out.

Chip reliability engineering has come a long way in knowing how to model and mitigate such failure mechanisms. However, aggregate wear-out effects inevitably limit the useful lifespan of SSD components. Once a critical path component fails, the SSD will stop working.

Early Life Failures

In addition to intrinsic wear-out issues, components can also fail early on due to manufacturing defects. Due to the extremely small process geometries involved in fabricating NAND flash and controller ICs, there is always a risk of wafers containing defective die. Insufficient burn-in and stress testing can also allow infant mortalities to enter production.

Early life failures account for dead-on-arrival (DOA) drives as well as for a good portion of SSDs that stop working much earlier than they statistically should based on wear-out rates. Such randomness factors into why SSD lifespan predictions are modeled as probability distributions with a mean and long tails.
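Reliability engineers often describe this with a failure-rate (hazard) curve. As a hedged illustration, the sketch below uses the Weibull distribution, a common choice in reliability modeling but an assumption here, since the article does not name a specific distribution. A shape parameter below 1 gives a falling failure rate (early life failures), and a shape above 1 gives a rising one (wear-out). The parameter values are made up.

```python
def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate at time t for a Weibull(shape, scale) lifetime."""
    return (shape / scale) * (t / scale) ** (shape - 1)


# Hypothetical parameters, with time in years.
infant_early = weibull_hazard(1, shape=0.5, scale=5)
infant_late = weibull_hazard(4, shape=0.5, scale=5)
wearout_early = weibull_hazard(1, shape=3, scale=5)
wearout_late = weibull_hazard(4, shape=3, scale=5)
print(infant_early > infant_late, wearout_early < wearout_late)
```

Combining a falling early-life term with a rising wear-out term produces the familiar "bathtub" shape of failure rate over a product's life.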

Controller Failures

The SSD controller chip lies at the heart of the drive’s operation. It manages all the essential functions like error correction, wear leveling, bad block management, garbage collection, I/O scheduling, etc. If the controller fails, the SSD is bricked regardless of the condition of the NAND flash.

Some factors that can cause SSD controllers to fail include:

Electrostatic discharge (ESD) damage
Electrical overstressing
Latch-up from high-energy particles
Time-dependent dielectric breakdown
Hot carrier injection wearing out gate oxide
Thermal runaway due to firmware bugs
Component overheating and failure

Enterprise SSDs designed for data centers may utilize redundant controller chips and mapping tables distributed across multiple channels. This provides redundancy against individual controller failures. Consumer SSDs usually rely on just a single controller chip, however.

Excessive Bad Blocks

During the NAND flash manufacturing process, it is inevitable that a small percentage of flash dies end up being defective or develop unprogrammable blocks. SSDs are designed with spare area to remap such bad blocks out of the logical address space so they are no longer used.

However, if the number of bad blocks exceeds the spare area set aside, the SSD will run out of remapping capacity. The spare area may also have been consumed prematurely by early wear-out of NAND flash. In either case, data writes are blocked once the SSD has no more good blocks to remap to.

Sophisticated controllers can apply advanced signal processing techniques to recover some marginal blocks initially deemed bad. However, there is a physical limit to the number of truly defective blocks that can be tolerated before the spare area is exhausted.
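The remapping behavior described in this section can be modeled with a small sketch. The class and method names are hypothetical, and the choice to switch to read-only once spares run out is a simplifying assumption that mirrors the "writes are blocked" behavior described above.

```python
class BlockRemapper:
    """Maps retired blocks to spares until the spare pool is exhausted."""

    def __init__(self, spare_blocks):
        self.spares = list(spare_blocks)
        self.remap = {}
        self.read_only = False

    def retire(self, bad_block):
        """Remap a bad block to a spare; block further writes if none remain."""
        if not self.spares:
            self.read_only = True
            return None
        replacement = self.spares.pop(0)
        self.remap[bad_block] = replacement
        return replacement

    def resolve(self, block):
        """Translate a block number to its current physical location."""
        return self.remap.get(block, block)


remapper = BlockRemapper(spare_blocks=[100, 101])
first = remapper.retire(3)
second = remapper.retire(9)
third = remapper.retire(12)  # no spares left at this point
print(first, second, third, remapper.read_only)
```

The third retirement finds the pool empty, so the model flags the drive as read-only.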

Component Supply Shortages

While not a literal cause of SSD failure, shortages in supplies of NAND flash, DRAM, and other components threaten the manufacturing pipeline for new replacement SSDs. In 2022, several disruptive economic and geopolitical factors constrained supplies of crucial semiconductor inputs.

The COVID-19 pandemic and Ukraine war disrupted many established supply chains. Fabs also shut down production temporarily during lockdowns. Rising inflation and interest rates also hampered demand. These combined factors sent the SSD market into a disequilibrium with shortages of all components.

As a result, SSD pricing shot up in 2022. Lead times for delivery of new enterprise SSD orders stretched to over a year. Data center operators faced difficulties procuring sufficient SSD spares to maintain proper redundancy. While not literal SSD failures, supply shortages exacerbate availability issues in replacing failed drives.

Insufficient Over-provisioning

One way SSD manufacturers build endurance into their drives is by over-provisioning extra spare area beyond what is disclosed to the host system. This surplus NAND flash handles remapping bad blocks out of user capacity and enables wear leveling.

However, this over-provisioning capacity is finite. Once an SSD has consumed its spare area by repeatedly remapping bad blocks, its write endurance will rapidly degrade. If the original over-provisioning level was insufficient, the SSD will hit the write cliff much sooner in its lifespan.

Cheap low-end consumer SSDs often specify only ~7% over-provisioning, versus high-end enterprise models with up to 70% OP to handle hefty write workloads. Insufficient over-provisioning directly accelerates the rate of SSD wear-out and eventual failure.
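Over-provisioning is commonly expressed as spare capacity relative to user-visible capacity. The sketch below applies that definition; the function name is illustrative, and the example drive (256 GiB of raw flash exposed as 256 GB) is a hypothetical configuration that shows where a figure of roughly 7% can come from.

```python
def over_provisioning_pct(physical_bytes, user_bytes):
    """Spare capacity as a percentage of the capacity exposed to the host."""
    return (physical_bytes - user_bytes) * 100 / user_bytes


GIB = 2**30   # binary gigabyte, how raw flash capacity is typically counted
GB = 10**9    # decimal gigabyte, how drive capacity is typically marketed

consumer = over_provisioning_pct(256 * GIB, 256 * GB)
enterprise = over_provisioning_pct(128 * GB, 100 * GB)
print(round(consumer, 2), round(enterprise, 2))
```

The first case yields a little over 7% from the GiB versus GB difference alone, while the second reserves 28% of user capacity as spare area.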

Random Failures

In addition to the steadily accumulating damage of writes exacerbating component wear-out, SSDs are also susceptible to random unpredictable failures. Such failures may exhibit no signs of progressive warning and take the SSD down abruptly.

For example, stray alpha particle hits from background radiation can flip data bits stored in NAND flash cells. Damage from electrostatic discharge or power surges can destroy SSD components randomly. Electrical noise and crosstalk issues may sporadically corrupt data.

Hidden manufacturing defects can also cause random failures. Latent defects may evade initial factory testing only to surface later under certain use conditions. Thermal cycling and vibration stresses over time can also dislodge marginal solder joints.

So even if an SSD is lightly used well within write endurance limits, random failures can still abruptly end its life prematurely.

End of Life Wear-Out

Assuming no catastrophic random failure, the typical eventual path to SSD demise is end of life wear-out. NAND flash cells and onboard components alike have finite lifespans dictated by physics and engineering margins.

Repeated use of the SSD accumulates damage from multiple degradation mechanisms that gradually build up over time. Cycling through daily power on-off stresses circuitry. Background drive maintenance operations consume some of the P/E cycle and data retention budgets. Unavoidable environmental vibration and thermal stresses take their toll.

Permanent faults accumulate until the SSD controller can no longer maintain data integrity and performance. The drive begins manifesting more frequent uncorrectable errors when accessing data. Read and write latency increase as damage accumulates. Bad blocks consume spare capacity needed for wear leveling.

This gradual process culminates in the SSD transitioning from mostly healthy to completely failed. It may cross slowly through a “danger zone” warning state first via S.M.A.R.T. telemetry feedback. But eventually, the SSD will reach the end of the road and cease functioning.
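A monitoring script might turn such telemetry into a simple status, as in the sketch below. The dictionary keys are loosely modeled on fields in the NVMe SMART/health log (percentage used, available spare, media errors), but the key names, thresholds, and status labels are all assumptions for illustration, and reading the values from a real drive is out of scope here.

```python
def classify_health(log):
    """Map a few health counters to 'ok', 'warning', or 'replace'."""
    spare_low = log["available_spare"] < log["available_spare_threshold"]
    if spare_low or log["percentage_used"] >= 100:
        return "replace"
    if log["media_errors"] > 0 or log["percentage_used"] >= 80:
        return "warning"
    return "ok"


# Hypothetical readings for three drives at different points in their lives.
new_drive = {"percentage_used": 3, "available_spare": 100,
             "available_spare_threshold": 10, "media_errors": 0}
aging_drive = {"percentage_used": 85, "available_spare": 60,
               "available_spare_threshold": 10, "media_errors": 0}
worn_drive = {"percentage_used": 97, "available_spare": 5,
              "available_spare_threshold": 10, "media_errors": 12}
print(classify_health(new_drive), classify_health(aging_drive),
      classify_health(worn_drive))
```

The middle "warning" state corresponds to the danger zone described above, when there is still time to copy data off and schedule a replacement.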

Mitigating SSD Failure

While SSD failure is inevitable long-term, steps can be taken to minimize risks and extend usable lifespan:

Purchase enterprise-grade SSDs for mission-critical data – More robust components and ECC tolerate damage
Proactively replace aging SSDs rather than waiting for failure
Use capabilities like TRIM and garbage collection to optimize performance
Heed warning signs like SMART attributes and uncorrectable errors
Manage workloads to spread writes across all cells
Maintain proper cooling, voltages, physical handling
Use RAID configurations to tolerate individual SSD failures

With care and maintenance, SSDs can reliably sustain many years of service. But inevitably, failures will occur, highlighting the importance of backups, redundancy, and replacement parts.

Conclusion

SSD reliability has improved enormously over the years, but still comes up short compared to time-tested HDDs. The very properties that enable SSDs to be fast, low-power storage – placing bits in silicon chips rather than on magnetic media – subject them to a finite lifespan. While technological innovations continue to push the limits, NAND flash cells and integrated circuits remain susceptible to unavoidable wear-out and random failure mechanisms.

IT organizations must account for the statistical inevitability of SSD failures. Through smart provisioning practices, workload management, and redundancy schemes, storage admins can maximize the value delivered by SSDs within the constraints of limited endurance and longevity.