What happens when an SSD runs out of writes?

Solid state drives (SSDs) have become extremely popular in computers and other devices over the past decade, largely replacing traditional hard disk drives (HDDs) due to their faster speeds, lower power consumption, and lack of moving parts. However, SSDs have a finite number of write/erase cycles before they can no longer reliably store data. So what exactly happens when an SSD reaches its write endurance limit?

How SSDs Work

To understand what happens when an SSD runs out of writes, it helps to first understand how SSDs work. An SSD contains a grid of memory cells made up of floating gate transistors. When data is written to a cell, an electrical charge is applied to the floating gate, which shifts the threshold voltage of the transistor. The presence or absence of charge on the floating gate represents the binary 1s and 0s of data.

To erase a cell, the charge is removed from the floating gate, resetting the cell to a base voltage threshold. This write and erase mechanism allows data to be stored, read, and rewritten to cells.

Write Amplification and Wear Leveling

When data is rewritten on an SSD, the drive cannot overwrite existing cells in place the way an HDD overwrites magnetic disk sectors. Flash is written in pages but erased only in much larger blocks, so a block must be erased before its pages can be written again. Each erase-write cycle contributes to wear on SSD cells. In addition, SSDs are subject to write amplification: a single host write often forces the drive to copy still-valid data elsewhere and erase whole blocks in order to reorganize and consolidate data, so more data is written to flash than the host requested. This amplification increases the number of erase cycles.
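Write amplification is commonly expressed as a ratio of data written to flash to data written by the host. The sketch below illustrates that ratio; the function name and the page counts in the scenario are made up for illustration and do not describe any particular drive.

```python
def write_amplification_factor(host_bytes: int, flash_bytes: int) -> float:
    """Return flash writes divided by host writes (1.0 means no amplification)."""
    if host_bytes <= 0:
        raise ValueError("host_bytes must be positive")
    return flash_bytes / host_bytes


# Hypothetical scenario: the host updates one 4 KiB page, but to free the
# block the drive must also relocate 3 still-valid 4 KiB pages.
PAGE = 4 * 1024
host_written = 1 * PAGE
flash_written = (1 + 3) * PAGE

waf = write_amplification_factor(host_written, flash_written)
print(waf)  # 4.0
```

A factor of 4 here means the flash absorbs four times the wear that the host's write volume alone would suggest.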

To extend SSD lifespan, wear leveling algorithms are used to distribute writes as evenly as possible across all cells in the drive. This prevents any single block from wearing out prematurely compared to others. The drive firmware tracks how many erase cycles each block has undergone and attempts to direct writes to blocks with the lowest cycle counts.
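The idea of directing writes to the least-worn block can be sketched in a few lines. This is a toy model under simplifying assumptions (every write costs exactly one erase, and any block can accept any write); real firmware is proprietary and considerably more involved. The function names are hypothetical.

```python
def pick_block(erase_counts: list[int]) -> int:
    """Return the index of the block with the fewest erase cycles."""
    return min(range(len(erase_counts)), key=erase_counts.__getitem__)


def simulate(num_blocks: int, num_writes: int, leveled: bool) -> list[int]:
    """Simulate block rewrites; each rewrite costs the chosen block one erase."""
    erase_counts = [0] * num_blocks
    for _ in range(num_writes):
        # Without leveling, a "hot" logical address always hits physical block 0.
        target = pick_block(erase_counts) if leveled else 0
        erase_counts[target] += 1
    return erase_counts


print(simulate(4, 100, leveled=False))  # [100, 0, 0, 0]
print(simulate(4, 100, leveled=True))   # [25, 25, 25, 25]
```

With leveling, no block accumulates more than a quarter of the erases, so the first block to reach its cycle limit does so four times later in this example.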

Program/Erase Cycles

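All SSDs have a maximum number of program/erase (PE) cycles that blocks can sustain before wear prevents reliable storage. The rating depends on the flash type: roughly 1,000-3,000 cycles for the TLC flash common in consumer drives, up to around 100,000 for SLC flash. Note that one write operation may result in multiple erase cycles on the drive. The total lifetime writes for an SSD can be estimated as the drive’s capacity multiplied by the PE cycle rating, divided by the write amplification factor (the ratio of data written to flash to data written by the host).

For example, a 256GB SSD rated for 3,000 PE cycles would be capable of 256GB x 3,000 = 768TB total writes over its usable lifespan under ideal conditions. Actual lifetime writes will be lower due to write amplification: with a factor of 2, the same drive absorbs about 384TB of host writes. Manufacturer TBW ratings are usually more conservative still.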
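A common back-of-the-envelope estimate is host writes = capacity x PE cycles / write amplification factor. The sketch below applies it; the 3,000-cycle rating, the factor of 2, and the 50GB-per-day workload are illustrative assumptions, not figures for any specific drive.

```python
def estimated_host_writes_tb(capacity_gb: float, pe_cycles: int, waf: float = 1.0) -> float:
    """Estimate total host writes in TB (decimal units: 1 TB = 1000 GB)."""
    if waf <= 0:
        raise ValueError("write amplification factor must be positive")
    return capacity_gb * pe_cycles / waf / 1000


def years_of_life(total_tb: float, daily_writes_gb: float) -> float:
    """Years until the estimated write budget is used up at a steady daily rate."""
    return total_tb * 1000 / daily_writes_gb / 365


ideal = estimated_host_writes_tb(256, 3000)            # no amplification
realistic = estimated_host_writes_tb(256, 3000, 2.0)   # assumed factor of 2

print(ideal, realistic)                        # 768.0 384.0
print(round(years_of_life(realistic, 50), 1))  # 21.0
```

Even the halved figure lasts decades at desktop-level write volumes, which is why endurance mostly matters for write-heavy workloads.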

Write Endurance

The write endurance of an SSD is its total lifetime write capacity, i.e. the total amount of data that can be written to the drive before wear makes cells unreliable. Several factors affect write endurance:

  • PE cycles – The higher the PE rating, the more writes cells can endure.
  • Wear leveling effectiveness – Better wear leveling allows writes to be distributed evenly.
  • Over-provisioning – Extra unused capacity allows more area for wear leveling.
  • DRAM cache – Absorbs writes to reduce flash wearing.
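  • Workload – Large sequential writes cause less amplification than small random ones; data compression and deduplication reduce the amount written.
  • Drive capacity – Higher capacity means more cells to spread writes across, so total endurance rises.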

Higher-end enterprise SSDs maximize these factors to improve endurance. Consumer SSDs have lower ratings as they optimize for cost.

Write Endurance Estimates

The SSD firmware uses proprietary algorithms to estimate the remaining write endurance of the drive based on measured cell wear and historical write workloads. Tools provided by SSD manufacturers can give you an endurance estimate percentage or total remaining terabytes written (TBW).

For example, Samsung’s Magician SSD toolbox will show the percentage of remaining endurance and total TBW remaining for its drives. The SMART attributes can also provide raw values related to wear that tools can interpret.
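Vendor tools use their own proprietary methods, but the basic arithmetic behind a "remaining endurance" figure can be approximated linearly from the rated TBW and the data written so far. The sketch below assumes a hypothetical drive rated for 150 TBW with 45TB already written and 30GB written per day; none of these numbers come from a real tool's output.

```python
def remaining_endurance(rated_tbw: float, written_tb: float) -> tuple[float, float]:
    """Return (remaining TB, remaining percent) against the rated TBW."""
    remaining_tb = max(rated_tbw - written_tb, 0.0)
    return remaining_tb, 100.0 * remaining_tb / rated_tbw


def days_until_rated_limit(remaining_tb: float, avg_daily_writes_gb: float) -> float:
    """Days left at a steady write rate (decimal units: 1 TB = 1000 GB)."""
    return remaining_tb * 1000 / avg_daily_writes_gb


remaining_tb, remaining_pct = remaining_endurance(rated_tbw=150, written_tb=45)
print(remaining_tb, remaining_pct)                # 105 70.0
print(days_until_rated_limit(remaining_tb, 30))   # 3500.0
```

A linear projection like this ignores changes in workload and write amplification, so it is best treated as a rough planning figure.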

What Happens When Endurance is Exceeded

So what exactly happens when the estimated write endurance limit of an SSD is finally exceeded? In most cases there is no immediate catastrophic failure or complete bricking of the drive. Rather, the performance and reliability of the SSD will slowly degrade.

Even with wear leveling, individual cells and blocks that have undergone the most erase cycles will wear out first. As more and more blocks reach endurance limits:

  • Read/write speeds will drop as the drive relies more on error correction.
  • Latency and I/O errors will increase as the drive firmware has to work harder to write data.
  • The risk of uncorrectable read errors goes up, leading to data loss.
  • Bad blocks may be retired, reducing usable capacity.
  • Write failures and reallocation events will become more common.

In most cases, the performance of a worn out SSD will deteriorate to unacceptable levels long before total failure occurs. The degradation happens gradually though, rather than all at once.

SSD Failure Modes

As SSD cells wear out, several failure modes can appear within the drive:

  • Bit Errors: Unable to correctly read/write bits due to worn out cells.
  • Block/Page Failures: Entire blocks or pages exceed threshold voltage margins.
  • Die/Chip Failures: Accumulated cell failures cause a memory die or chip to fail.
  • Failure to Program: High voltage cannot program worn cells.
  • Failure to Erase: Cells cannot be erased due to voltage margins.
  • Read Disturb: Repeated reads degrade neighboring cells, and worn cells are more susceptible.

These modes compound endurance problems. The SSD will attempt to resolve them using error correction and bad block reallocation. But eventually the drive will run out of spare capacity needed to remap blocks.

SSD Reallocation Spare

To compensate for bad NAND blocks, SSDs set aside extra spare capacity that can be substituted for failed blocks. This over-provisioning space may be 5-25% of an SSD’s total flash capacity. Extra spare area improves performance and extends the useful life of the drive.

When the SSD controller detects a bad block that has exceeded program/erase cycles, it will retire it and remap writes to a spare block with lower wear. This helps avoid failures and data loss. However, once all spare area is used up, no more reallocation is possible and write failures occur.
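The remap-until-exhausted behavior can be modeled with a simple counter. This is a toy sketch: the `SparePool` class and its method are hypothetical names, and real controllers track far more state per block.

```python
class SparePool:
    """Toy model: worn blocks are swapped for spares until none remain."""

    def __init__(self, spare_blocks: int) -> None:
        self.spares_left = spare_blocks
        self.retired = 0

    def retire_block(self) -> bool:
        """Retire one worn block; return False once no spare is available."""
        if self.spares_left == 0:
            return False  # no remap target left, so the write fails
        self.spares_left -= 1
        self.retired += 1
        return True


pool = SparePool(spare_blocks=3)
results = [pool.retire_block() for _ in range(5)]
print(results)           # [True, True, True, False, False]
print(pool.spares_left)  # 0
```

The first three bad blocks are absorbed silently; from the fourth onward there is nothing left to remap to, which is the point at which write failures surface.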

Results of Exceeding Endurance

The consequences of exceeding rated endurance depend on workload. Light-use SSDs may still function beyond specifications. But heavy workloads like databases will quickly see impacts:

  • Performance drops below requirements.
  • Uncorrectable errors lead to data loss.
  • Failures disrupt operations and crash systems.
  • Data recovery needed from backups.
  • Replacement SSDs needed to restore proper function.

Before the SSD completely fails, systems may experience crashes, hangs, lags, corrupted data, and errors resulting from the accumulating cell failures and reallocation issues.

Monitoring Wear Levels

To avoid exceeding endurance limits, monitoring tools can track SSD wear indicators:

  • Lifetime Writes: Percentage or TBW used, via SMART attributes.
  • Spare Blocks: Remaining over-provisioned area.
  • Reallocated Sectors: Rising bad block remaps indicate wear.
  • Errors & Events: Uncorrectable errors and failures.

These can provide warnings for replacement before failure causes problems. Drives under heavy write workloads such as databases, and drives nearing their rated lifetime writes, should be watched closely.
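A monitoring check over these indicators might look like the sketch below. The field names follow the NVMe SMART/health log as tools such as smartctl report it in JSON mode; treat the exact keys as an assumption and check your own tool's output. The 80% threshold and all sample values are invented for illustration.

```python
def wear_warnings(health: dict, max_percentage_used: int = 80) -> list[str]:
    """Return human-readable warnings for a dict of health-log fields."""
    warnings = []
    if health["percentage_used"] >= max_percentage_used:
        warnings.append("endurance nearly consumed")
    if health["available_spare"] <= health["available_spare_threshold"]:
        warnings.append("spare blocks low")
    if health["media_errors"] > 0:
        warnings.append("media errors reported")
    return warnings


def data_units_to_tb(data_units: int) -> float:
    """NVMe counts data units in thousands of 512-byte blocks (512,000 bytes)."""
    return data_units * 512_000 / 1e12


# Hypothetical sample values, not taken from a real drive.
sample = {
    "percentage_used": 85,
    "available_spare": 9,
    "available_spare_threshold": 10,
    "media_errors": 0,
    "data_units_written": 300_000_000,
}

print(wear_warnings(sample))
print(data_units_to_tb(sample["data_units_written"]))
```

SATA drives expose similar information through vendor-specific SMART attributes, so the keys and thresholds would need adapting per drive model.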

Extending SSD Life Span

Avoiding unnecessary writes can help extend SSD lifespan. Methods include:

  • Keep TRIM enabled so the drive’s idle garbage collection can skip deleted data, which lowers write amplification.
  • Minimize swap file size and turn off hibernation.
  • Selectively disable features like search indexing.
  • Use RAM disks for temporary files instead of SSD.
  • Choose a drive with a higher TBW rating when purchasing an SSD.

For critical SSDs at risk of exceeding endurance, it may make sense to proactively replace the drive once warning thresholds are reached.

Conclusions

Exceeding the rated endurance of an SSD does not instantly break the drive. Instead, gradually accumulating cell wear slowly degrades performance and reliability. Reallocated spares help avoid failure until they are fully consumed. Well before complete failure, the SSD will exhibit unacceptable errors and slow speeds. Monitoring wear levels allows worn-out SSDs to be replaced before problems occur. With proper SSD selection and workload management, an SSD’s useful life can be extended.