What makes an SSD fail? - Darwin's Data

Solid state drives (SSDs) have become increasingly popular in computers over the past decade, replacing traditional hard disk drives (HDDs) in many applications due to advantages like faster read/write speeds, lower latency, better reliability, and the absence of moving parts. However, SSDs are still susceptible to failure through various means. Understanding the main factors that cause SSD failure can help users avoid data loss and extend the usable lifespan of these drives.

How Do SSDs Work?

Before diving into the ways an SSD can fail, it helps to understand what’s going on under the hood. SSDs consist of a controller and non-volatile flash memory chips. The controller manages all the data writing, reading, and erasing operations while the flash chips store the actual data.

When a file is saved or updated on an SSD, the data is written to empty pages in the flash memory. The controller maintains a map of used and available blocks, and to optimize performance it writes data to different locations across multiple flash chips in parallel. It also spreads writes evenly across blocks so that no single block is overused, a process known as wear leveling.
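The mapping and wear leveling idea can be sketched in a few lines of Python. This is a toy model, not how any real controller is implemented: the class name TinyFTL, the one-page-per-block simplification, and the immediate erase of stale blocks are all assumptions made for illustration.

```python
class TinyFTL:
    """Toy flash translation layer: one page per block, for illustration only."""

    def __init__(self, num_blocks):
        self.erase_counts = [0] * num_blocks       # wear per physical block
        self.free_blocks = set(range(num_blocks))  # blocks ready to be written
        self.mapping = {}                          # logical address -> physical block
        self.data = {}                             # physical block -> stored payload

    def write(self, logical_addr, payload):
        # Wear leveling: pick the free block that has been erased the least.
        # Assumes a free block always exists, which holds in the demo below.
        target = min(self.free_blocks, key=lambda b: (self.erase_counts[b], b))
        self.free_blocks.remove(target)
        self.data[target] = payload

        # The update goes to a new block; the old copy is erased and freed.
        old = self.mapping.get(logical_addr)
        if old is not None:
            del self.data[old]
            self.erase_counts[old] += 1
            self.free_blocks.add(old)

        self.mapping[logical_addr] = target

    def read(self, logical_addr):
        return self.data[self.mapping[logical_addr]]


ftl = TinyFTL(num_blocks=4)
for version in range(12):
    ftl.write(0, f"v{version}")  # rewrite the same logical address repeatedly

print(ftl.read(0))       # v11
print(ftl.erase_counts)  # [3, 3, 3, 2]
```

Although the host rewrote a single address twelve times, the erases end up spread almost evenly over all four physical blocks instead of wearing out one of them.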

SSDs have no moving mechanical parts, which removes the risks of mechanical failure you see in traditional hard drives. But the complex data management happening through the SSD controller and the properties of NAND flash memory bring about other potential modes of failure.

Write/Erase Cycles

One of the main limits on an SSD's lifespan is the number of write/erase cycles each memory cell can endure. This is typically in the range of 3,000 to 100,000 cycles depending on the type and quality of NAND flash, with cells that store fewer bits generally rated toward the high end.

Each cell can only be erased and rewritten a finite number of times before becoming unable to reliably hold data. After exceeding this limit, that block of cells would be marked as bad and no longer usable. The SSD would reallocate data to other blocks.

Filling up the full capacity of the drive more frequently, saving temporary files, and running demanding write-heavy workloads can more quickly consume these erase cycles. The controller’s wear leveling helps distribute writes, but certain cells may still reach the limit earlier than others.
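As a rough back-of-the-envelope illustration, the cycle limit can be turned into a lifetime estimate. The sketch below assumes perfect wear leveling and a constant daily workload, and it uses a write amplification parameter to account for the extra internal writes the controller performs. The function name and the example figures are hypothetical, not taken from any datasheet.

```python
def estimated_lifetime_years(capacity_gb, pe_cycles, host_writes_gb_per_day,
                             write_amplification=1.0):
    """Rough endurance estimate, assuming writes are spread perfectly evenly."""
    # Total data the flash can absorb before every cell hits its cycle limit.
    total_nand_writes_gb = capacity_gb * pe_cycles
    # Internal overhead means only part of that budget is available to the host.
    host_budget_gb = total_nand_writes_gb / write_amplification
    return host_budget_gb / host_writes_gb_per_day / 365


# Hypothetical drive: 500 GB, 3,000 cycles, 50 GB written per day, amplification of 2.
years = estimated_lifetime_years(500, 3000, 50, write_amplification=2.0)
print(round(years, 1))  # 41.1
```

Real drives rarely reach the idealized figure, since wear is never perfectly even and other failure modes usually appear first.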

Wear Leveling Effectiveness

More advanced SSD controllers implementing better wear leveling algorithms and using higher-endurance NAND flash can extend the endurance and life of the drive. Enterprise and server-grade SSDs designed for heavy workloads generally have higher write endurance ratings than consumer models.

Read Disturb Errors

While write/erase cycles reflect an SSD’s endurance, data retention is another important reliability factor. The charge stored in NAND flash cells that represents data slowly leaks over time. How long a cell can hold its data before it becomes unreadable is known as the data retention time.

Reading data can also cause errors. The act of reading a page applies voltage to the other pages in the same block, which may unintentionally alter the voltage levels in neighboring cells and cause data errors before the expected retention time has passed. This is called read disturb.

SSD controllers employ error correction code (ECC) to detect and recover from such errors. However, as cell charge levels drift further from their target values over time, the error rate may exceed what ECC can handle, leading to data loss.
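To make the limits of ECC concrete, here is a minimal sketch using a Hamming(7,4) code, which can correct one flipped bit per 7-bit codeword. Real SSD controllers use far stronger codes (such as BCH or LDPC) over much larger pages, so this is only a small-scale stand-in for the principle. The function names are illustrative.

```python
def hamming74_encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    # Bit positions 1..7 hold: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(c):
    """Return (data_bits, error_position); error_position is 0 if none was found."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]  # parity over positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]  # parity over positions 2, 3, 6, 7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]  # parity over positions 4, 5, 6, 7
    error_position = s1 + 2 * s2 + 4 * s3
    if error_position:
        c[error_position - 1] ^= 1  # flip the bit the syndrome points at
    return [c[2], c[4], c[5], c[6]], error_position


data = [1, 0, 1, 1]
codeword = hamming74_encode(data)

# One drifted cell: the code locates and repairs it.
one_error = list(codeword)
one_error[4] ^= 1
recovered, position = hamming74_decode(one_error)
print(recovered == data, position)  # True 5

# Two drifted cells: more than the code can handle, so the result is wrong.
two_errors = list(codeword)
two_errors[0] ^= 1
two_errors[1] ^= 1
miscorrected, _ = hamming74_decode(two_errors)
print(miscorrected == data)  # False
```

The second case mirrors what happens on a worn or long-unrefreshed drive: once the number of bad bits in a codeword passes the code's correction capability, the data can no longer be recovered.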

Temperature Effects

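Higher temperatures also reduce data retention time, especially when the drive is left powered off. Retention ratings depend on the drive class: under the JEDEC endurance standards, consumer (client) SSDs are typically specified to retain data for about one year unpowered at 30 °C, while enterprise SSDs are specified for about three months at 40 °C, on the expectation that they stay powered in a server. Proper cooling is important to help minimize retention errors.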
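The temperature dependence is commonly modeled with the Arrhenius equation. The sketch below computes how much faster charge loss proceeds at a higher temperature relative to a reference temperature. The activation energy of 1.1 eV is an assumption often used for NAND retention modeling; the real value depends on the flash technology and failure mechanism, so treat the result as an order-of-magnitude guide.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333e-5  # Boltzmann constant in eV/K


def retention_acceleration(ref_temp_c, hot_temp_c, activation_energy_ev=1.1):
    """Arrhenius acceleration factor between two temperatures given in Celsius."""
    ref_k = ref_temp_c + 273.15
    hot_k = hot_temp_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1 / ref_k - 1 / hot_k)
    return math.exp(exponent)


# How much faster does retention degrade at 40 C than at 30 C (assumed Ea = 1.1 eV)?
factor = retention_acceleration(30, 40)
print(round(factor, 1))
```

Under these assumptions a 10 °C rise shortens retention time by a factor of roughly four, which is why storage temperature matters for drives kept unpowered.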

Write Amplification

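Flash memory cannot be overwritten in place: cells must be erased before they are rewritten, and erasure happens a whole block at a time. The controller therefore writes updated data to a new location (an out-of-place write) and later reclaims the stale copies. Combined with the controller’s garbage collection and wear leveling processes, this can result in write amplification.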

This means the actual amount of data written to the physical chips is a multiple of the write workload received from the host system. The multiplication factor is called the write amplification factor.

Higher write amplification wears the SSD faster. The controller’s algorithms help minimize this, but certain workloads or configurations may still lead to excessive write amplification.
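A simplified model shows where the extra writes come from. When garbage collection reclaims a block, it must first copy the pages that are still valid to a new location; only the remaining space becomes available for new host data. The function below is a sketch of that single-block model, with hypothetical names and page counts, and it ignores wear leveling moves and caching.

```python
def gc_write_amplification(pages_per_block, valid_pages):
    """Write amplification when reclaiming a block that still holds valid pages."""
    reclaimed_pages = pages_per_block - valid_pages
    if reclaimed_pages <= 0:
        raise ValueError("block has no stale pages, so nothing can be reclaimed")
    # Flash writes = relocated valid pages + new host pages filling the freed space.
    nand_writes = valid_pages + reclaimed_pages
    host_writes = reclaimed_pages
    return nand_writes / host_writes


print(gc_write_amplification(256, 192))  # 4.0
print(gc_write_amplification(256, 0))    # 1.0
```

The fuller the reclaimed blocks are with valid data, the higher the factor, which is why nearly full drives and random small writes tend to wear flash faster.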

Over-provisioning

Having additional spare capacity that is not visible to the host OS helps reduce write amplification by giving the controller more free space to work with for its internal data management.
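Over-provisioning is usually quoted as spare capacity relative to the capacity the user can see. A minimal calculation, using hypothetical capacities:

```python
def over_provisioning_percent(physical_gb, user_gb):
    """Spare capacity as a percentage of the user-visible capacity."""
    return (physical_gb - user_gb) / user_gb * 100


# Hypothetical drives built from 512 GB of raw flash.
print(round(over_provisioning_percent(512, 480), 1))  # 6.7
print(round(over_provisioning_percent(512, 400), 1))  # 28.0
```

The second configuration gives up more visible capacity in exchange for more room for garbage collection, a trade often made on drives aimed at write-heavy workloads.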

Die Failure

The NAND flash memory in SSDs is organized into multilevel cell (MLC) or single-level cell (SLC) architectures. In MLC designs, each cell holds multiple bits of data using different voltage levels.

SLC flash stores just 1 bit per cell and is more reliable and higher performing, but more expensive. Within a NAND die, one or more cells may fail permanently due to manufacturing defects or degradation over time.

SSDs have spare area set aside to replace failed cells. However, if too many cells within a die fail, the entire die may become inoperable, reducing capacity. Die failure rates increase towards the end of an SSD’s usable life as write endurance limits are reached.

Die Failure Effects

The arrangement of flash memory dies in the SSD architecture affects the impact of a die failure. Controllers spread data across several dies on multiple channels, so losing an entire die typically takes its capacity, and the data stored on it, out of service. Some drives store parity across dies so that the controller can rebuild the lost data and keep operating, at reduced capacity or performance, by avoiding the failed die.

Controller Failure

The SSD controller is a critical component that manages all the core functions of reading, writing, erasing, wear leveling, error correction, and interfacing with the host. If the controller fails completely, the SSD will become entirely inoperable. The firmware on the controller can also become corrupted and lead to bugs or instability.

Causes of Controller Failure

Power surges, voltage spikes, lightning strikes, static electricity, overheating, and physical damage from drops or impacts can all potentially damage the controller. As with the NAND flash memory, controller failure rates tend to increase as the drive ages.

Interconnect Failure

The interconnects between the SSD controller and the NAND flash packages can break over time due to material fatigue, corrosion, or damage from external vibration/shock.

This can disrupt communication between the controller and memory, making data in some dies inaccessible and effectively reducing capacity.

Avoiding Interconnect Failure

High quality interconnect materials and architecture will maximize longevity. Care in handling SSDs will minimize interconnect failure by avoiding damage from drops or vibration during transportation or installation.

Firmware Bugs

Firmware is software programmed onto the SSD controller to handle all the data management operations and communicate with the host system. Bugs or compatibility issues in firmware can lead to problems such as:

– Instability and unexpected crashes/resets
– Data corruption or inaccessibility
– Slower performance
– Failure to initialize at boot

Firmware bugs are most common shortly after a new SSD model or controller platform is released, before extensive compatibility testing is done. Users can mitigate issues by avoiding being early adopters of new SSD models, performing OS and driver updates, and installing firmware patches released by the manufacturer.

Mitigating Firmware Bugs

Updating to the latest SSD firmware version improves stability by fixing release bugs. For mission-critical data, enterprise SSDs with extensive internal testing and validation offer better firmware maturity and compatibility.

File System Corruption

If critical file system data like the partition table, directory entries, or journal get corrupted, the file system may become unreadable or unstable. Causes include:

– Unexpected power loss during writes
– Firmware bugs
– Deteriorating NAND flash cells
– Excessive bad blocks

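The SSD may become inaccessible despite otherwise functional hardware. Minor corruption can sometimes be fixed with the operating system's file system repair tools (such as chkdsk or fsck); more severe cases require professional data recovery or reformatting the drive.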

File System Robustness

Using a resilient file system like NTFS, ReFS, or EXT4 reduces the chances of file system corruption compared to FAT32. Journaling, atomic writes, redundancy, and checksums provide higher protection. Proper shutdowns and backups are also wise precautions.
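Applications can also protect their own files against power loss during writes. A common pattern, sketched below, is to write the new contents to a temporary file, flush it to the drive, and then rename it over the original, so that a reader sees either the old version or the new one but never a half-written file. The function name atomic_write and the file name are illustrative, and the sketch omits syncing the parent directory, which stricter implementations on POSIX systems also do.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace the file at `path` with `data` (bytes) without exposing partial writes."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same file system for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # ask the OS to push the data to the drive
        os.replace(tmp_path, path)  # swap the new file into place in one step
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "settings.json")
    atomic_write(target, b'{"version": 1}')
    atomic_write(target, b'{"version": 2}')
    with open(target, "rb") as f:
        result = f.read()

print(result)  # b'{"version": 2}'
```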

Encryption Errors

Some SSDs support full-drive encryption using AES, managed through standards like TCG Opal and eDrive. If the stored passwords, keys, or antiforensic measures become corrupted, data on the SSD becomes permanently inaccessible, because it cannot be decrypted without the correct credentials.

Encryption should not be relied upon as the sole means of security. Maintaining backups of encryption keys externally provides recovery capability in case of errors.

Avoiding Encryption Errors

Using simpler encryption without antiforensic features reduces the possibility of irrecoverable data loss. Software encryption offers more control compared to sole reliance on hardware encryption built into the SSD controller.

Design Flaws

In rare cases, certain SSD models may have faulty architecture vulnerable to issues like:

– Read/write disturbances between densely packed 3D NAND layers
– Excessive write amplification
– Premature wear from insufficient over-provisioning
– Overheating due to inadequate thermal dissipation

These are design flaws that go undetected until the drive is commercially available and tested in diverse real-world environments. Avoiding SSDs known to have major design flaws minimizes the failure risk.

Catching Design Flaws

Technical in-depth reviews from sources like StorageReview can help uncover SSD models prone to early failures or other design issues before purchasing and deploying widely.

Counterfeit Components

Some gray market SSDs may use fake or low-quality NAND dies or controllers that have not undergone proper qualifications. Counterfeit components tend to have higher failure rates and fewer safeguards against data loss. Reputable SSD vendors selling through authorized channels offer assurance of authentic, tested components.

Avoiding Counterfeits

Checking for certificates of authenticity, verifying serial numbers, and performing parametric testing can help detect SSDs with counterfeit NAND flash or controllers. Brand name products from vetted retailers minimize the chances of bogus components.

Insufficient Validation

Extensive validation testing is necessary to ensure SSDs are compatible with operating systems, platforms, system configurations, and workloads. Some newer SSD models or manufacturers with poor quality control may have compatibility issues leading to blue screens, unexpected freezes, or data corruption.

Choosing Validated SSDs

Reputable vendors with a proven history of reliability testing tend to offer broader compatibility. Reviewer testing with diverse hardware also helps validate quality. Conservative adoption of new SSDs after release allows time for firmware and driver updates resolving initial bugs.

Excessive Temperature

The NAND flash and controller components within SSDs can be damaged by excessive sustained heat. Poor airflow, inadequate heat sinks, or densely packed drives can cause some SSDs to exceed safe operating temperatures. The drive may throttle its performance to cool down, and in severe cases overheating can cause sudden failure with data loss or corruption.

Monitoring SSD Temperature

Tools like SSD Observer allow monitoring internal temperature during operation to detect dangerous heat buildup. Improving case airflow or reducing drive density mitigates overheating. Enterprise SSDs built for datacenters have higher thermal maximums than consumer models.
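Temperature checks can also be scripted. The sketch below assumes the drive's SMART data has been exported as JSON with smartmontools (for example with smartctl --json -a on the device) and that the report contains a temperature object with a current field; verify the field names against the output of your own smartctl version. The 70 °C limit and the sample report are hypothetical, so check the drive's datasheet for its actual rated maximum.

```python
import json


def drive_temperature_c(smartctl_json_text):
    """Extract the current temperature in Celsius, or None if it is not reported."""
    report = json.loads(smartctl_json_text)
    return report.get("temperature", {}).get("current")


def is_overheating(temp_c, limit_c=70):
    """True when a reported temperature meets or exceeds the assumed limit."""
    return temp_c is not None and temp_c >= limit_c


# Hand-written sample in the assumed format, not real output from a drive.
sample_report = '{"temperature": {"current": 41}}'

temp = drive_temperature_c(sample_report)
print(temp, is_overheating(temp))  # 41 False
```

Running a check like this on a schedule and logging the result makes sustained heat buildup visible before it leads to throttling or failure.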

Conclusion

SSDs can fail due to write/erase cycle exhaustion, read disturbs, controller malfunction, design flaws, firmware bugs, or environmental factors like overheating. Using enterprise-class drives designed for reliability, monitoring SSD health, providing adequate cooling, and following best practices for deployment and data protection will maximize SSD lifespan and data integrity.

Choosing an SSD model whose balance of performance, endurance, and storage density fits the use case improves results. Newer technologies like 3D NAND can offer higher reliability and longevity because stacking layers vertically increases capacity without shrinking individual cells as aggressively as the densest planar designs. Continued evolution of SSD technology including controllers and flash memory will enhance reliability while driving costs down.