What causes NVMe SSDs to fail?

NVMe (Non-Volatile Memory Express) SSDs are a newer generation of solid state drives that are becoming increasingly popular for high performance applications like data centers, gaming PCs, and enterprise servers. Unlike traditional SATA SSDs, NVMe SSDs connect directly to the PCIe bus, which allows for much higher bandwidth and lower latency. As a result, NVMe SSDs are significantly faster than older SATA SSDs, with sequential read speeds of up to about 3,500 MB/s on PCIe 3.0 drives (and higher on newer PCIe generations) compared to roughly 540 MB/s for SATA SSDs (Source).

The high performance and declining cost of NVMe SSDs have fueled rapid adoption in recent years. According to one report, the NVMe market is projected to grow at a CAGR of 29.7% from 2020 to 2025 as NVMe becomes the interface of choice for enterprise and data center applications that require storage with low latency and high IOPS (Source). NVMe SSDs are poised to eventually replace SATA SSDs and even SAS SSDs in many use cases going forward.

Physical Damage

Physical damage to the drive itself is a frequent cause of NVMe SSD failure. SSDs contain delicate components like flash memory chips, controllers, and capacitors that can break if exposed to shock, vibration, or impact forces. For example, dropping an SSD or laptop can crack solder joints or damage internal chips. Vibration during shipping or operation can also loosen connections over time, and even the small vibrations of normal use can contribute to wear on solder joints over the years. Heavier impacts, like bumping or knocking over a computer, can damage an SSD instantly.

Since they have no moving parts, SSDs are less prone to shock damage than traditional hard disk drives, but they still contain fragile silicon chips and solder connections. Enterprise-class SSDs designed for data centers and servers feature more robust components engineered to withstand higher shock and vibration levels, while consumer SSDs found in laptops and desktops have less protection. The circuit board can crack if flexed, and solder joints can weaken over time with vibration. Physical damage is therefore a very real failure risk.

Overheating

High temperatures can degrade the components of an SSD over time and lead to premature failure. Most SSDs are rated to operate at temperatures between 0°C and 70°C (32°F – 158°F) (Source). However, temperatures above 50°C for extended periods can accelerate wear and reduce the lifespan of an SSD.

High temps affect the NAND flash memory chips, controller chip, and other components on the SSD’s circuit board. Heat causes the electrical connections and silicon structures within the chips to degrade over time. This gradual degradation can eventually lead to read/write errors and failure.

Factors that can contribute to overheating include inadequate airflow, high ambient temps, insufficient cooling on devices like laptops, heavy sustained workloads, and direct sunlight exposure. Using an external fan or heatsink can help lower SSD temps in desktop PCs.

Monitoring your SSD’s temperature, ensuring adequate ventilation, and avoiding extended high load situations can help mitigate overheating risks. SSDs used within their rated temp range will have the highest reliability and lifespan.
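The temperature guidance above can be turned into a simple check. The following is a minimal sketch, not a vendor-specified rule: the function name and the status labels are hypothetical, and the default thresholds simply reuse the 0°C to 70°C rated range and the 50°C sustained-use figure quoted in this section. Check your own drive's datasheet before relying on them.

```python
def classify_ssd_temperature(temp_c: float,
                             rated_min_c: float = 0.0,
                             rated_max_c: float = 70.0,
                             sustained_warn_c: float = 50.0) -> str:
    """Classify a temperature reading against the ranges discussed above.

    The defaults are the figures quoted in this article, not values
    read from any particular drive (assumption).
    """
    if temp_c < rated_min_c or temp_c > rated_max_c:
        # Outside the rated operating range entirely.
        return "out_of_range"
    if temp_c > sustained_warn_c:
        # Within the rated range, but high enough that long periods
        # at this level may accelerate wear.
        return "warm"
    return "ok"


if __name__ == "__main__":
    for reading in (35, 55, 75):
        print(reading, classify_ssd_temperature(reading))
```

A single "warm" reading is not a problem by itself; the concern described above is temperature that stays elevated for extended periods.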

Write Endurance

One of the main factors that impacts SSD lifespan is write endurance. SSDs use NAND flash memory cells to store data, and each cell can only be programmed and erased a limited number of times before it wears out. Commonly cited figures range from roughly 1,000 to 100,000 program/erase (P/E) cycles depending on the type of NAND flash. SLC NAND can endure around 50,000 to 100,000 cycles, MLC NAND is typically rated for around 3,000 to 10,000 cycles, and the TLC NAND used in most consumer drives is often rated lower still. Once a cell reaches its limit, it can no longer reliably store data.
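To get a feel for what a cycle limit means in practice, here is a rough back-of-the-envelope sketch. It assumes ideal wear leveling (every cell is cycled equally) and uses a hypothetical `write_amplification` parameter to account for the controller writing more to flash than the host requested. Real drives will deviate from this idealized estimate.

```python
def estimated_host_writes_tb(capacity_tb: float,
                             pe_cycles: int,
                             write_amplification: float = 1.0) -> float:
    """Idealized total host writes (in TB) before the NAND reaches its cycle limit.

    Assumes perfect wear leveling; write_amplification is an illustrative
    parameter (NAND bytes written / host bytes written), not a measured value.
    """
    if write_amplification < 1.0:
        raise ValueError("write amplification cannot be below 1.0 in this model")
    return capacity_tb * pe_cycles / write_amplification


if __name__ == "__main__":
    # Example: a 1 TB drive with 3,000-cycle NAND and an assumed amplification of 2.
    print(estimated_host_writes_tb(1.0, 3000, write_amplification=2.0))
```

Manufacturers' published endurance ratings are usually far more conservative than this kind of raw estimate, so treat it as an upper bound at best.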

Frequent writes to the SSD, especially small random writes, cause higher write amplification, meaning the controller ends up writing more data to the NAND flash than the host actually requested, which wears out the cells faster. Most consumer SSDs are rated for a certain number of terabytes written (TBW) over the warranty period, which often works out to around 0.3-0.5 drive writes per day (DWPD). Exceeding this endurance rating is likely to shorten the usable lifespan of the SSD.
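The relationship between a TBW rating and drive writes per day is simple arithmetic: TBW = DWPD × capacity × 365 × warranty years. The sketch below works through it. The function names and the example figures (a 1 TB drive rated for 600 TBW over 5 years, 50 GB written per day) are illustrative assumptions, not taken from any specific product.

```python
def tbw_from_dwpd(dwpd: float, capacity_tb: float, warranty_years: float) -> float:
    """Terabytes written implied by a drive-writes-per-day rating."""
    return dwpd * capacity_tb * 365 * warranty_years


def dwpd_from_tbw(tbw: float, capacity_tb: float, warranty_years: float) -> float:
    """Drive writes per day implied by a TBW rating."""
    return tbw / (capacity_tb * 365 * warranty_years)


def years_to_reach_tbw(tbw: float, host_writes_gb_per_day: float) -> float:
    """Years until the TBW rating is reached at a steady daily write volume."""
    return (tbw * 1000) / host_writes_gb_per_day / 365


if __name__ == "__main__":
    # Hypothetical 1 TB drive rated for 600 TBW with a 5-year warranty.
    print(round(dwpd_from_tbw(600, 1.0, 5), 2))
    # The same drive under an assumed 50 GB of host writes per day.
    print(round(years_to_reach_tbw(600, 50), 1))
```

For typical desktop use, the daily write volume is often small enough that the rating would take far longer than the warranty period to exhaust; write-heavy workloads are where the rating starts to matter.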

Read Disturb Errors

Read disturb refers to a phenomenon where repeatedly reading data in an SSD without writing new data can eventually cause bit errors and data corruption (Li et al., 2020). When one page is read, a pass-through voltage is also applied to the other, unselected cells in the same block, and over many reads this can slightly shift the charge stored in those neighboring cells. The result is incorrect readings of the stored voltage levels, essentially flipping bits from 1s to 0s or vice versa.

According to research by Delkin Devices, read disturb happens because “NAND flash memory stores data by trapping electrons on a floating gate. Applying voltage to the floating gate enables reads and writes. With read commands, voltage is applied to the gates of both selected and unselected cells. Eventually, the applied voltage can alter the amount of trapped electrons in unselected cells – changing cell values from a 1 to a 0 or vice versa.”

Thus, frequent SSD read operations without intervening write operations to refresh the data can eventually cause bit errors (Gerofi et al., 2020). SSD controllers attempt to mitigate this issue through read reclaim/refresh operations that rewrite data after a certain number of reads. However, read disturb still contributes to progressive SSD failures, especially for files that are frequently read but rarely updated.
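The read reclaim idea described above can be illustrated with a toy model. This is a simplified sketch, not how any particular controller firmware works: the class name, the per-block counter, and the `read_limit` value are all hypothetical, and a real controller would copy the block's valid pages to a fresh block rather than just reset a counter.

```python
from collections import defaultdict


class ReadReclaimTracker:
    """Toy model: count reads per block and refresh a block at a threshold."""

    def __init__(self, read_limit: int = 100_000):
        # read_limit is an illustrative value, not a real firmware setting.
        self.read_limit = read_limit
        self.read_counts = defaultdict(int)
        self.refreshes = defaultdict(int)

    def record_read(self, block_id: int) -> bool:
        """Record one read of a block; return True if it triggered a refresh."""
        self.read_counts[block_id] += 1
        if self.read_counts[block_id] >= self.read_limit:
            self._refresh(block_id)
            return True
        return False

    def _refresh(self, block_id: int) -> None:
        # Stand-in for rewriting the block's data elsewhere, which
        # restores clean charge levels and resets the disturb count.
        self.read_counts[block_id] = 0
        self.refreshes[block_id] += 1


if __name__ == "__main__":
    tracker = ReadReclaimTracker(read_limit=3)
    print([tracker.record_read(7) for _ in range(7)])
```

Note that each refresh is itself a write, so heavily read data still consumes a small amount of write endurance over time.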

Early Wearout

Some NVMe SSDs fail soon after deployment, often within the first year of use, because of manufacturing defects rather than accumulated wear. This pattern is often called infant mortality. According to Backblaze’s 2022 HDD and SSD stats, SSDs exhibited a noticeable “bathtub curve” failure pattern, with higher failure rates early on that declined after the first year of use.

Backblaze found that Crucial MX500 SSDs in particular had high early failure rates, over 17% in the first year compared to under 2% for Samsung drives (Ars Technica). They posit this was due to a bad production batch of drives. Other sources of early defects include contamination, damage during shipping and handling, and problems with firmware or the SSD controller.

Overall, Backblaze found SSDs to have a 0.89% annual failure rate, declining after year one (ExtremeTech). While most drives make it past initial wearout issues, manufacturing defects can still cause a significant number of early SSD failures.
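An annual failure rate like the one above is normally computed from drive days rather than a simple drive count, so that drives added or removed partway through the period are weighted fairly. The sketch below shows that calculation; the function name and the example fleet are hypothetical, and the figures are not Backblaze's data.

```python
def annualized_failure_rate(failures: int, drive_days: float) -> float:
    """Annualized failure rate as a percentage.

    drive_days is the total number of days all drives were in service
    during the observation period.
    """
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    drive_years = drive_days / 365
    return failures / drive_years * 100


if __name__ == "__main__":
    # Hypothetical fleet: 1,000 drives in service for a full year, 10 failures.
    print(annualized_failure_rate(10, 1000 * 365))
```

Small fleets produce noisy numbers: a single failure among a few dozen drives can swing the rate by several percentage points, which is worth keeping in mind when comparing models.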

Power Loss

Unexpected power loss during write operations is another frequent source of NVMe SSD problems and can lead to data corruption or loss (Source). Many NVMe SSDs use volatile RAM to buffer incoming writes before committing them to permanent storage. If power is lost before the write cache is flushed, any data held there is lost. In some cases, a badly timed power failure can also corrupt the drive’s internal metadata or firmware state and render it unusable.

To mitigate this risk, enterprise NVMe SSDs implement power loss protection technologies like tantalum capacitors to provide power for a short time after an outage, allowing buffered writes to complete. However, consumer NVMe SSDs often lack adequate power loss protection. Using a UPS can reduce the likelihood of corruption due to power failure.
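On the software side, applications can reduce their exposure by explicitly asking the operating system to flush data before treating a write as complete. The following is a minimal sketch of a common pattern (write to a temporary file, flush, then atomically rename); the `durable_write` name is hypothetical. Whether the flush actually reaches the flash depends on the operating system, the filesystem, and the drive honoring flush commands, so this lowers the risk rather than eliminating it.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Write data so the file is either fully old or fully new after a crash."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()              # push Python's buffer to the OS
            os.fsync(tmp.fileno())   # ask the OS to push it to the device
        os.replace(tmp_path, path)   # atomic rename over the old file
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    if os.name == "posix":
        # Also flush the directory entry so the rename itself is persisted.
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        target = os.path.join(workdir, "settings.json")
        durable_write(target, b'{"example": true}')
        print(open(target, "rb").read())
```

The trade-off is speed: forcing a flush on every write is much slower than letting the cache absorb it, so this pattern is usually reserved for data that must not be lost.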

Controller Failure

Another cause of NVMe SSD failure is a malfunctioning controller chip (Source). The controller chip is responsible for managing all of the storage operations on the SSD, including reading, writing, erasing, and correcting errors. If the controller develops a flaw or fails entirely, it can render the SSD completely unusable and inaccessible to the host system.

Some of the potential causes of controller failure include:

  • Manufacturing defects – Imperfections in the silicon can cause components to fail over time.
  • Overheating – Excessive heat buildup can damage the controller.
  • Firmware bugs – Errors in the controller’s programming can lead to crashes or lockups.
  • Power surges – Electrical spikes can fry the sensitive controller electronics.

When the SSD controller fails outright, the drive often becomes undetectable by the computer BIOS and operating system. This essentially “bricks” the device, making data recovery extremely difficult, if not impossible, without specialized tools and techniques. Replacing the failed controller chip would be cost-prohibitive in most cases. Controller failure underscores the importance of having backups of important data stored on SSDs.

Firmware Bugs

Firmware bugs refer to errors or defects in the firmware code of an SSD controller. These bugs can cause the SSD to malfunction or fail prematurely. For example, bugs may prevent the SSD from initializing properly, corrupt data, or cause the drive to freeze or lock up. Two recent examples of firmware bugs impacting major SSD models include:

Samsung acknowledged a firmware issue affecting 980 Pro SSDs that could cause the drives to unexpectedly enter read-only mode, leaving them unable to write data. Affected drives were effectively unusable for normal work, and Samsung issued a firmware update to address the problem (source).

Multiple SanDisk Extreme and Extreme Pro portable SSD models have been plagued by a firmware defect causing data loss or corruption. The bug can permanently disable the SSDs or wipe all data. SanDisk has promised a firmware fix, but only for some affected models so far (source).

These examples demonstrate how firmware bugs can catastrophically damage SSDs. Software defects that escape testing can lead to file system corruption, permanent unrecoverable errors, or total SSD failure. Firmware updates may fix bugs, but not all models receive fixes. Bugs also highlight the risks of firmware complexity in modern SSDs.

Prevention Tips

There are several things you can do to help prevent failure and prolong the life of your NVMe SSD:

Monitoring – Keep an eye on your SSD’s health using disk monitoring tools like CrystalDiskInfo. This can alert you to potential problems before failure occurs.

Backups – Regularly back up your important data stored on the SSD. This ensures you won’t lose data if the drive fails unexpectedly.

Cooling – Ensure proper cooling and airflow around the SSD to prevent overheating. Consider a dedicated cooling solution like a heatsink or fan.

Quality Parts – Choose a high-quality SSD from a reputable brand. Lower-end models may be more prone to early failure.
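For readers who prefer to script the monitoring tip rather than use a GUI tool, the sketch below checks a few fields from an NVMe drive's SMART health log. It assumes the log has already been loaded into a dictionary using the key names that smartmontools emits in its JSON output (`critical_warning`, `available_spare`, `available_spare_threshold`, `percentage_used`, `media_errors`). The function name and the 90% wear warning level are hypothetical choices, not official thresholds.

```python
from typing import Dict, List


def assess_nvme_health(smart_log: Dict[str, int],
                       wear_warn_pct: int = 90) -> List[str]:
    """Return human-readable warnings for an NVMe SMART health log."""
    findings: List[str] = []
    if smart_log.get("critical_warning", 0) != 0:
        findings.append("critical_warning is non-zero")
    spare = smart_log.get("available_spare")
    spare_threshold = smart_log.get("available_spare_threshold")
    # Conservative check: warn once spare capacity reaches the threshold.
    if spare is not None and spare_threshold is not None and spare <= spare_threshold:
        findings.append("available spare at or below threshold")
    # wear_warn_pct is an illustrative cutoff (assumption).
    if smart_log.get("percentage_used", 0) >= wear_warn_pct:
        findings.append("rated endurance nearly consumed")
    if smart_log.get("media_errors", 0) > 0:
        findings.append("media errors recorded")
    return findings


if __name__ == "__main__":
    sample = {"critical_warning": 0, "available_spare": 100,
              "available_spare_threshold": 10, "percentage_used": 3,
              "media_errors": 0}
    print(assess_nvme_health(sample) or "no warnings")
```

On a system with smartmontools installed, the dictionary would typically come from the `nvme_smart_health_information_log` section of `smartctl -j -a` output for the drive; the exact device path varies by system.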

Properly maintaining your NVMe SSD through monitoring health, preventing overheating, backing up data, and using quality components can help maximize lifespan and avoid premature failure.