Why does an SSD stop working?

Solid state drives (SSDs) have become a popular storage device in computers due to their faster speeds and lack of moving parts compared to traditional hard disk drives (HDDs). However, SSDs can and do fail over time. There are several reasons an SSD may stop working properly or completely fail to function.

Wear Out of NAND Flash Memory Cells

The NAND flash memory cells that make up an SSD have a limited lifespan and can only withstand a certain number of program/erase cycles before beginning to wear out. Most SSDs carry an endurance rating expressed in terabytes written (TBW), such as 100 TBW for a consumer model and up to 1 petabyte for enterprise drives. The rating is a warranty figure rather than a hard limit, and many drives keep working past it. Beyond that point, however, cells become increasingly likely to fail and the drive can no longer be relied on to store data.
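As a rough worked example of what a TBW rating means in practice, the sketch below converts a rating into an estimated number of years at a steady daily write volume. The function name and the 30 GB per day figure are illustrative assumptions, and the estimate ignores everything except host writes.

```python
def years_until_tbw(rated_tbw: float, host_gb_per_day: float) -> float:
    """Estimate the years until host writes reach the rated TBW.

    Assumes a constant daily write volume and decimal units
    (1 TB = 1000 GB). This is a simplification, not a prediction.
    """
    if host_gb_per_day <= 0:
        raise ValueError("host_gb_per_day must be positive")
    total_gb = rated_tbw * 1000
    return total_gb / host_gb_per_day / 365


# Hypothetical example: a 100 TBW drive written at 30 GB per day.
print(round(years_until_tbw(100, 30), 1))  # prints 9.1
```

Under these assumptions, a light consumer workload would take roughly nine years to reach a 100 TBW rating, while a workload writing ten times as much per day would reach it in under a year.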

How NAND Flash Memory Works

To understand why the cells wear out, it helps to understand how NAND flash memory works. Each cell consists of a floating gate transistor that stores data based on the presence or absence of electric charge on the floating gate. To write data, a high voltage is applied to inject electrons onto the floating gate, changing the cell’s threshold voltage. Erasing data involves removing those electrons to bring the voltage back down.

Each program/erase cycle stresses the insulating oxide layer and leaves some electrons trapped in it. After thousands of cycles, this oxide wear can lead to permanent errors and cell failures.

Write Amplification Factor

The total endurance of an SSD is also affected by the write amplification factor: the ratio of data actually written to the NAND flash to data written by the host system. For example, a write amplification factor of 2x means the SSD’s flash memory performs 2 writes for every 1 write sent by the host. Amplification arises because flash is written in pages but erased in much larger blocks, so background processes like garbage collection and wear leveling must relocate valid data before a block can be erased. Over-provisioned spare space gives the controller more room to work and helps keep amplification low.

A higher write amplification factor causes extra stress on the NAND flash cells, wearing them out faster. SSD controllers are designed to minimize this, but it still contributes to reduced endurance compared to the raw PE cycle rating of the NAND flash chips alone.
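The relationship between write amplification and endurance can be sketched with simple arithmetic. The function names and the example figures (a 500 GB drive with NAND rated for 3,000 P/E cycles) are hypothetical, and the second function assumes ideal wear leveling, so treat the result as an upper-bound estimate.

```python
def write_amplification(nand_bytes_written: float, host_bytes_written: float) -> float:
    """Ratio of data written to NAND to data written by the host."""
    if host_bytes_written <= 0:
        raise ValueError("host_bytes_written must be positive")
    return nand_bytes_written / host_bytes_written


def host_write_budget_tb(capacity_gb: float, pe_cycles: int, waf: float) -> float:
    """Approximate host writes (in TB) before the NAND reaches its rated
    P/E cycles, assuming every cell is worn evenly (ideal wear leveling)."""
    if waf <= 0:
        raise ValueError("waf must be positive")
    return capacity_gb * pe_cycles / waf / 1000


# Hypothetical example: the host wrote 10 TB while the NAND absorbed 20 TB.
waf = write_amplification(20e12, 10e12)
print(waf)                                    # prints 2.0
print(host_write_budget_tb(500, 3000, waf))   # prints 750.0
print(host_write_budget_tb(500, 3000, 1.0))   # prints 1500.0
```

In this example, doubling the write amplification factor halves the amount of host data the drive can accept before the NAND reaches its rated cycle count.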

Failure of SSD Controller or Other Components

In addition to wear on the NAND flash memory itself, SSDs contain many other components that can fail over time, including:

SSD Controller – The controller manages all operations on the SSD, including reading/writing data, wear leveling, and garbage collection. It can fail because of a manufacturing defect or age-related degradation.
DRAM Cache – Stores mapping tables and speeds up data access. A sudden loss of power can leave mapping data in DRAM unsaved, causing data corruption.
Power Supply Components – Provide stable voltage for SSD operation. Capacitor failure can cause data loss.
Interconnects – Internal bus interfaces that connect the controller, NAND dies, and other components.

If any of these parts fail or degrade beyond reliable functioning, it can render an SSD inoperable even if the NAND flash memory itself still functions properly.

Logical or File System Errors

In some cases, an SSD may seem to have stopped working when the issue is not actually a hardware failure but a logical error or file system issue. Some examples include:

Corrupted Firmware – Bugs or a power failure during a firmware update can corrupt the SSD’s firmware.
File System Errors – If critical file system structures get corrupted, the OS may be unable to mount the volume even though the drive itself is still detected.
TRIM Command Errors – The TRIM command tells the SSD which blocks no longer hold valid data, which helps maintain performance. If it is disabled or not working, the SSD may perform poorly.
Encryption Errors – On encrypted drives, lost or corrupted encryption keys can make data inaccessible.

These types of logical errors may be repairable by reflashing firmware, running disk repair tools, or reformatting, without requiring replacement of failed hardware.

Factors That Accelerate SSD Aging

There are several usage conditions that can accelerate the aging and wear out of an SSD, shortening its lifespan compared to the rated endurance:

Excessive Writes – Heavy write activity works the NAND cells harder and wears them out faster. Reads cause far less wear.
Sustained Workloads – Heat speeds up aging, and long sustained workloads generate more heat than intermittent disk activity.
High Ambient Temperatures – High ambient air temperatures around the SSD also increase internal temperatures.
Low Over-Provisioning – Less spare area leads to more write amplification and wear.
Encryption – Encrypted data is effectively incompressible, so on controllers that rely on compression to reduce writes, software encryption can increase write amplification.
File System Fragmentation – A heavily fragmented file system can require more, smaller writes, although the effect on an SSD is much smaller than on an HDD.

For consumers, write-heavy activities such as frequent large downloads and game installs, video editing, and database applications can all contribute to earlier wear. Routine defragmentation is not a remedy: it adds writes and is generally unnecessary on SSDs.

Detecting and Preventing SSD Failure

To help avoid sudden SSD failure and data loss, there are some key indicators of aging or problematic SSDs that can be watched for:

S.M.A.R.T. attributes – Tools like CrystalDiskInfo can monitor SSD health parameters like wear leveling count, erase/program cycles, and total data written.
Performance changes – As NAND wears out, SSDs generally show increasing latency and reduced sequential write speeds.
Bad blocks – The SSD controller retires blocks that go bad and replaces them from the spare area. A growing count of retired or reallocated blocks can indicate problems.
Program failures – Attempts to write data start generating more errors, so the drive has to retry and rely more heavily on error correction.
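The health parameters above can also be read programmatically. The sketch below is one hedged example: it swaps in `smartctl` from smartmontools (not a tool named in this article) and assumes the JSON layout that `smartctl` emits for NVMe drives when run with its `--json` option. The sample values are made up, and the field names should be verified against the output on your own system.

```python
import json

# Per the NVMe specification, "data units" are reported in thousands of
# 512-byte blocks, so one unit is 512,000 bytes.
NVME_DATA_UNIT_BYTES = 512 * 1000


def summarize_nvme_health(smartctl_json: str) -> dict:
    """Extract a few wear indicators from smartctl JSON output.

    Assumes the report contains an "nvme_smart_health_information_log"
    object; SATA drives report a different structure.
    """
    report = json.loads(smartctl_json)
    log = report["nvme_smart_health_information_log"]
    return {
        "percentage_used": log["percentage_used"],
        "available_spare": log["available_spare"],
        "media_errors": log["media_errors"],
        "tb_written": log["data_units_written"] * NVME_DATA_UNIT_BYTES / 1e12,
    }


# Hypothetical sample, trimmed to the fields used above.
sample = json.dumps({
    "nvme_smart_health_information_log": {
        "percentage_used": 12,
        "available_spare": 100,
        "media_errors": 0,
        "data_units_written": 58593750,
    }
})

summary = summarize_nvme_health(sample)
print(summary["tb_written"])       # prints 30.0
print(summary["percentage_used"])  # prints 12
```

On a real system the JSON would come from a command along the lines of `smartctl -a -j /dev/nvme0`, where the device path is a placeholder. Parsing a saved report, as done here, keeps the example runnable without root access or a physical drive.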

To prolong SSD lifespan:

Minimize unnecessary disk writes, and avoid routine defragmentation, which adds writes without a meaningful benefit on SSDs.
Use SSDs designed for heavier workloads and with higher endurance ratings for demanding applications.
Maintain good ventilation and ambient temperatures.
Keep TRIM enabled and leave some free space on the drive. Wear leveling is built into the SSD controller, and spare capacity helps it work efficiently.

Recovering Data from Failed SSDs

When an SSD has failed completely and can no longer be mounted or accessed, recovering data from it requires specialized tools and techniques. Here are some options:

Repair Shops

Data recovery specialists have equipment to repair SSDs by replacing failed components like controllers. They also have tools to bypass damaged areas of the flash memory and extract the remaining data from failing drives.

DIY Methods

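For advanced users, self-repair methods like soldering on a new controller or transplanting flash memory chips onto a working board are sometimes attempted. This requires expertise, specialized tools, and usually an identical donor board, and it may not work at all if the drive encrypts data with keys tied to the original controller.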

Data Recovery Software

If the SSD failure is logical (such as file system corruption) rather than physical, recovery software can sometimes read past errors and restore the data. It is not effective for permanently failed memory chips.

SSD Failure Rate Statistics

Industry studies have looked at large fleets of SSDs to analyze real-world annualized failure rates (AFR) in different applications. Some example failure rate statistics:

Use Case	Avg. AFR
Laptop and Consumer SSDs	1.5%
Enterprise Server SSDs	0.7% – 0.9%
High Performance Computing SSDs	1% – 3%

Failure rates are highest early in life (infant mortality) and toward the end of life as wear out occurs. Enterprise and server-grade SSDs tend to be higher quality than consumer drives and exhibit lower failure rates on average.
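To put an annualized failure rate in perspective, the sketch below converts it into the chance of at least one failure over several years. It assumes a constant AFR and independent years, which is a deliberate simplification given that real failure rates are higher early and late in a drive's life; the function name is illustrative.

```python
def cumulative_failure_probability(afr: float, years: int) -> float:
    """Probability that a drive fails at least once within `years`,
    assuming the same annualized failure rate every year."""
    if not 0 <= afr <= 1:
        raise ValueError("afr must be between 0 and 1")
    if years < 0:
        raise ValueError("years must be non-negative")
    return 1 - (1 - afr) ** years


# Using the 1.5% consumer figure from the table over five years.
print(f"{cumulative_failure_probability(0.015, 5):.3f}")  # prints 0.073
```

Under these assumptions, a 1.5% AFR corresponds to roughly a 7% chance of failure within five years, which is one reason backups matter even for drives with low annual failure rates.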

SSD Lifespan Considerations

In general, modern SSDs can reasonably last from 3 to 10 years under typical consumer workloads before reaching the end of their usable lives. However, many variables affect lifespan:

Amount of data written – Drives wear out faster with heavier usage.
Quality of NAND flash – NAND that stores fewer bits per cell (SLC, MLC) endures more program/erase cycles than TLC or QLC, and enterprise SSDs typically use higher-endurance flash.
Level of over-provisioning – More spare area reduces wear.
Operating conditions – A cool and clean environment extends life.
Controller and firmware – More advanced controllers do better wear leveling.

For typical office uses, a consumer-grade SSD should operate reliably for 5+ years before failures become likely. In write-intensive server applications, SSD replacement every 3-4 years is common. Backup practices are essential with any type of storage media to protect against sudden failures.

Conclusion

SSDs eventually wear out or fail for several reasons: NAND flash wear, controller or component failures, physical damage, overheating, firmware bugs, poor maintenance practices, and ordinary statistical failure rates. However, SSD lifespan has increased dramatically in modern devices and is likely to keep improving with new technologies. Keeping good backups, monitoring SSD health metrics, and maintaining the drive properly all help minimize the chance of a catastrophic failure that leads to data loss.