What degrades an SSD?

Solid state drives, also known as SSDs, are a type of storage device that uses flash memory rather than mechanical platters like traditional hard disk drives (HDDs). SSDs have become increasingly popular in personal computers and data centers thanks to their faster speeds, lower latency, lower power consumption, and lack of moving parts compared to HDDs. However, SSDs aren’t without their downsides – one of the main ones being that they can wear out and degrade over time with heavy usage. In this article, we’ll look at what factors can degrade an SSD and shorten its lifespan.

Write Amplification

One of the key factors that affects SSD lifespan is write amplification. Write amplification refers to the amount of data actually written to an SSD compared to the logical data written by the host system. For example, if the host system needs to overwrite 4KB of data, the SSD may need to write 12KB physically due to garbage collection, wear leveling, and other processes. This amplified write load contributes to faster wear on the NAND flash cells.
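The ratio described above is usually called the write amplification factor (WAF). As a minimal sketch, the helper below (a hypothetical name, not part of any drive's tooling) computes it for the 4KB/12KB example:

```python
def write_amplification_factor(nand_bytes_written, host_bytes_written):
    """Ratio of physical NAND writes to logical host writes (1.0 is the ideal)."""
    if host_bytes_written <= 0:
        raise ValueError("host_bytes_written must be positive")
    return nand_bytes_written / host_bytes_written


KIB = 1024
# The host overwrites 4 KB, but the drive physically writes 12 KB.
waf = write_amplification_factor(12 * KIB, 4 * KIB)
print(f"WAF = {waf:.1f}")  # prints "WAF = 3.0"
```

A WAF of 3.0 means the NAND wears three times faster than the host's write volume alone would suggest.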

Some key causes of write amplification include:

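– Small random writes – Small random writes are highly amplified because flash is written in pages but erased in whole blocks. The SSD writes the new data to a new location and marks the old copy stale; before the old block can be erased, any still-valid data in it must be copied elsewhere. Sequential writes have much lower amplification.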

– Garbage collection – Garbage collection consolidates data and frees up blocks, but requires extra writing to relocate valid data. More frequent garbage collection increases amplification.

– Wear leveling – Wear leveling ensures all cells wear evenly, but requires data to be rewritten to underutilized blocks. More aggressive wear leveling increases writes.

– Over-provisioning – Extra spare area can reduce write amplification but uses up storage capacity. Insufficient over-provisioning increases amplification.

– File system – Some file systems are optimized to reduce write amplification more than others.

Mitigating write amplification is critical for extending SSD lifespan. Highly sequential write patterns on the host, an optimized file system, sufficient over-provisioning, and an advanced SSD controller can all help minimize unnecessary writes and amplification.
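To see how write pattern, garbage collection, and over-provisioning interact, here is a toy simulation. It is only a sketch under simplifying assumptions: a page-mapped drive, a greedy collector that always picks the full block with the fewest valid pages, and uniformly random writes. Real controllers are far more sophisticated, and every name and parameter here is made up for illustration.

```python
import random


def simulate_waf(num_blocks, pages_per_block, logical_pages, host_writes,
                 sequential, seed=0):
    """Return NAND page writes / host page writes for a toy flash layer."""
    # The greedy collector needs a few spare blocks to make progress.
    assert logical_pages <= (num_blocks - 3) * pages_per_block
    rng = random.Random(seed)
    valid = [set() for _ in range(num_blocks)]  # live logical pages per block
    used = [0] * num_blocks                     # programmed pages per block
    location = {}                               # logical page -> block holding it
    free = list(range(1, num_blocks))           # erased blocks
    open_block = 0                              # block currently being filled
    nand_writes = 0

    def program(lpn):
        nonlocal open_block, nand_writes
        if used[open_block] == pages_per_block:
            open_block = free.pop()             # open a fresh erased block
        if lpn in location:
            valid[location[lpn]].discard(lpn)   # the old copy becomes stale
        valid[open_block].add(lpn)
        location[lpn] = open_block
        used[open_block] += 1
        nand_writes += 1

    def garbage_collect():
        full = [b for b in range(num_blocks)
                if b != open_block and used[b] == pages_per_block]
        victim = min(full, key=lambda b: len(valid[b]))
        for lpn in list(valid[victim]):
            program(lpn)                        # relocation costs extra writes
        used[victim] = 0                        # erase the victim
        free.append(victim)

    for i in range(host_writes):
        while len(free) < 2:
            garbage_collect()
        lpn = i % logical_pages if sequential else rng.randrange(logical_pages)
        program(lpn)
    return nand_writes / host_writes


# 64 blocks of 32 pages; the host sees 56 blocks' worth (little spare area)
# or 44 blocks' worth (generous spare area).
print("sequential, low OP :", simulate_waf(64, 32, 56 * 32, 20000, True))
print("random, low OP     :", simulate_waf(64, 32, 56 * 32, 20000, False))
print("random, high OP    :", simulate_waf(64, 32, 44 * 32, 20000, False))
```

In this model, sequential overwrites leave whole blocks stale, so the collector never has to relocate anything, while random overwrites force it to copy valid pages. Shrinking the spare area makes that copying more frequent.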

Read Disturb Errors

Read disturb errors can also degrade SSD health over time. A NAND block can tolerate only a limited number of reads between erases before read disturb effects set in. Reading a page applies a pass voltage to the other pages in the same block, which can gradually shift the threshold voltages of those neighboring cells and eventually cause read errors or data loss. The likelihood of read disturb increases as flash cells wear out from program/erase cycles.

Read disturb has a cumulative effect – the more data is read over time, the higher the probability that read disturb will occur. Some ways SSD controllers mitigate read disturb include:

– Read scrubbing – Proactively reading and rewriting data across the SSD to refresh voltages and avoid disturb errors.

– Wear leveling – Ensuring all cells wear evenly to prevent excessive reads on a small number of cells.

– Error correction – ECC and advanced signal processing help recover data in marginal cells.

– Caching – Caching frequently read data in SLC cache regions helps reduce reads on the main storage array.

For most consumer workloads, read disturb is rarely an issue. But in extremely read-intensive enterprise environments that serve petabytes of reads, read disturb can degrade SSDs. Carefully modeling and testing endurance with realistic read patterns is necessary.
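One way to picture read scrubbing is a per-block read counter that triggers a refresh once a limit is reached. The sketch below is illustrative only: the class name is hypothetical, and the threshold is a made-up number, since real limits depend on the NAND and are tuned by the controller vendor.

```python
from collections import defaultdict


class ReadDisturbTracker:
    """Counts reads per block since its last erase (toy model)."""

    def __init__(self, refresh_threshold):
        self.refresh_threshold = refresh_threshold
        self.reads_since_erase = defaultdict(int)

    def record_read(self, block):
        """Count one read; return True when the block is due for a refresh."""
        self.reads_since_erase[block] += 1
        return self.reads_since_erase[block] >= self.refresh_threshold

    def record_refresh(self, block):
        """The data was rewritten elsewhere and the block erased: reset."""
        self.reads_since_erase[block] = 0


tracker = ReadDisturbTracker(refresh_threshold=100_000)  # assumed limit
for _ in range(100_000):
    due = tracker.record_read(block=7)
if due:
    # A real controller would copy the block's valid data to a fresh block here.
    tracker.record_refresh(block=7)
print(tracker.reads_since_erase[7])  # prints 0
```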

Write Endurance

The NAND flash cells in SSDs can only withstand a finite number of program/erase (P/E) cycles before wear raises error rates and the cells become unreliable. Most SSDs are rated for a certain number of terabytes written (TBW) over their lifetime – for example, 100 TBW, 300 TBW, or more for enterprise models.

The P/E cycle endurance depends heavily on the type of NAND flash:

NAND Type    P/E Cycles
SLC          100,000
MLC          3,000-10,000
TLC          1,000-3,000
QLC          100-1,000

As the table shows, higher density NAND such as TLC and QLC wear out much faster. Consumer SSDs most often use TLC NAND today. Enterprise SSDs may use MLC, but higher endurance SLC NAND is sometimes used for write caching.

The SSD controller uses wear leveling to spread writes across all available cells to maximize the lifetime TBW. When the first cells begin to wear out, the controller retires the affected blocks and substitutes blocks from its reserve NAND capacity, so the capacity visible to the user normally stays the same while performance may gradually decline. The drive reaches end of life once that reserve has been depleted to the manufacturer's end-of-life threshold.
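The core idea of wear leveling can be sketched with a priority queue that always hands out the least-erased free block. This is a deliberately simplified model (the class and method names are hypothetical); real controllers also move rarely changed data off lightly worn blocks, which this sketch does not attempt.

```python
import heapq


class WearLevelingAllocator:
    """Hands out the free block with the lowest erase count (toy model)."""

    def __init__(self, num_blocks):
        # Heap entries are (erase_count, block_id), so the least-worn block
        # is always at the top.
        self._heap = [(0, block) for block in range(num_blocks)]
        heapq.heapify(self._heap)

    def allocate(self):
        """Remove and return (erase_count, block_id) of the least-worn block."""
        return heapq.heappop(self._heap)

    def release(self, erase_count, block):
        """Erase the block once more and return it to the free pool."""
        heapq.heappush(self._heap, (erase_count + 1, block))

    def erase_counts(self):
        return sorted(count for count, _ in self._heap)


allocator = WearLevelingAllocator(num_blocks=8)
for _ in range(1000):
    count, block = allocator.allocate()
    allocator.release(count, block)
print(allocator.erase_counts())  # prints [125, 125, 125, 125, 125, 125, 125, 125]
```

Because the least-worn block is always chosen first, the 1,000 erases end up spread evenly instead of concentrating on a few blocks.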

Heavy write workloads, insufficient over-provisioning, and weaker wear leveling algorithms can wear out a drive more rapidly. Sustained writes that exceed the SSD's target workload use up the rated TBW sooner and shorten its service life. For most consumer workloads, endurance is unlikely to be a problem, but enterprises need to match SSD TBW ratings to workload requirements.
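Matching a TBW rating to a workload is mostly arithmetic. The sketch below uses a common back-of-the-envelope relationship (capacity times P/E cycles, divided by write amplification) and a simple daily-write estimate. Both functions are hypothetical helpers, the input numbers are examples rather than data for any specific drive, and manufacturer ratings may account for additional factors.

```python
def estimated_tbw(capacity_tb, pe_cycles, waf):
    """Rough host terabytes writable before the NAND's P/E budget is spent."""
    return capacity_tb * pe_cycles / waf


def years_until_tbw(tbw_rating, host_gb_per_day):
    """Years until the rated TBW is reached at a steady daily write volume."""
    tb_per_year = host_gb_per_day * 365 / 1000
    return tbw_rating / tb_per_year


# Example: a 1 TB drive, 1,000 P/E cycles, write amplification of 3.
print(f"{estimated_tbw(1, 1000, 3):.0f} TBW")      # prints "333 TBW"
# Example: a 300 TBW drive written at 50 GB per day.
print(f"{years_until_tbw(300, 50):.1f} years")     # prints "16.4 years"
```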

Retention Errors

NAND flash cells store charge as the presence or absence of electrons on the floating gate. Over time, the charge slowly leaks away, leading to retention errors as the threshold voltage shifts. If the voltage shift is large enough, the SSD controller can no longer reliably read the cell state – an uncorrectable bit error occurs.

SSD retention is rated for a certain amount of time at a specific temperature when the drive is powered off, typically 1 year at 30°C for consumer drives and 3 months at 40°C for enterprise drives. At higher temperatures or over longer durations, bit errors increase steadily.

The rate of charge leakage, and thus retention errors, approximately doubles for every 10°C increase. Server environments often run drives at higher temperatures, exacerbating retention issues. Retention degradation is also worse for TLC and QLC NAND compared to SLC/MLC.
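The doubling rule of thumb can be turned into a quick estimate of how storage temperature changes power-off retention. This is only a sketch of that rule (the function names are hypothetical), not a datasheet model:

```python
def retention_acceleration(temp_c, ref_temp_c):
    """Leakage speed-up relative to the reference temperature (2x per 10 °C)."""
    return 2 ** ((temp_c - ref_temp_c) / 10)


def estimated_retention_months(rated_months, ref_temp_c, storage_temp_c):
    """Scale a rated retention period to a different storage temperature."""
    return rated_months / retention_acceleration(storage_temp_c, ref_temp_c)


# A drive rated for 12 months powered off at 30 °C:
for temp in (20, 30, 40, 50):
    months = estimated_retention_months(12, 30, temp)
    print(f"{temp} °C: about {months:.0f} months")
# prints 24, 12, 6, and 3 months respectively
```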

To mitigate retention issues, SSD controllers use periodic read scrubbing to detect marginal cells and rewrite them to full charge levels. More advanced LDPC error correction also helps recover data in cells with some voltage drift. Even with these techniques, all NAND flash wears out after thousands of program/erase cycles that degrade the insulating layer separating the floating gate.

Physical Damage

As with hard disk drives, physical damage can rapidly degrade SSD reliability:

– Connector damage – Damaged pins or solder joints can cause drive/controller malfunctions.

– Contamination – Particles entering the drive can lead to electrical shorts.

– Overheating – Excessive heat can damage the drive electronics or NAND chips.

– Shock/Vibration – Excess force damages solder joints or internal structures.

– Humidity – Condensation can short electrical contacts.

– Corrosion – Corrosion on connectors or PCB traces degrades electrical connections.

Most physical damage requires the drive to be repaired or replaced. The SSD controller may be able to detect and contain some failure modes, for example by isolating a failed NAND die or switching the drive to read-only operation. But issues like electrical shorts will likely cause complete failure.

Careful handling, mounting, and environmental control are necessary to protect SSDs from physical degradation, especially in mobile or ruggedized use cases.

Conclusion

In summary, the main factors that degrade SSD health and longevity include:

– Write amplification wearing out the NAND cells

– Read disturb errors accumulating over time

– Program/erase cycles exceeding drive endurance rating

– Charge leakage and retention issues

– Physical damage to the components

SSD controllers employ various techniques to mitigate wear, such as wear leveling, read scrubbing, caching, and advanced error correction. But inherent NAND flash limitations mean SSDs will eventually wear out. For most consumer workloads, SSD lifespan is ample, but understanding the factors that affect drive health helps when planning demanding use cases like enterprise databases. Monitoring usage, environmental conditions, and SMART attributes can provide early warning of degradation issues.
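As a sketch of what SMART monitoring might look like in practice, the function below summarizes wear indicators from a health report that has already been loaded into a Python dictionary, for instance from the JSON output of a tool such as smartctl. The key names are an assumption based on how smartctl reports NVMe health logs, so verify them against your own tool's output; SATA drives expose vendor-specific attributes instead. The sample values are invented for illustration.

```python
def summarize_nvme_health(report):
    """Pick out wear-related fields from an NVMe health report dictionary.

    Key names are assumed, not guaranteed; missing fields come back as None.
    """
    log = report.get("nvme_smart_health_information_log", {})
    units = log.get("data_units_written")
    return {
        "percentage_used": log.get("percentage_used"),
        "available_spare": log.get("available_spare"),
        "media_errors": log.get("media_errors"),
        # NVMe counts data units of 1,000 x 512 bytes = 512,000 bytes each.
        "tb_written": None if units is None else units * 512_000 / 1e12,
    }


# Invented sample report, not real drive data.
sample = {
    "nvme_smart_health_information_log": {
        "percentage_used": 12,
        "available_spare": 100,
        "media_errors": 0,
        "data_units_written": 39_062_500,
    }
}
summary = summarize_nvme_health(sample)
print(summary["tb_written"])  # prints 20.0
```

A rising percentage used, a falling available spare, or a growing media error count are the kinds of trends worth alerting on.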
