What is usually found only in high-end servers, mainframe computers, and RAID storage devices?

The short answer is error-correcting code memory, also known as ECC memory: a type of computer data storage commonly found in high-end servers, mainframe computers, and RAID storage devices. ECC memory stores extra check bits alongside the data, typically on additional memory chips, and uses them to detect and correct bits that have flipped while the data was held in memory. This makes ECC memory more reliable and stable than non-ECC memory.

What is ECC Memory?

ECC stands for error correcting code. ECC memory is a type of computer data storage that can detect and correct the most common kinds of internal data corruption. ECC technology works by adding parity data to the contents of a memory device. The extra parity bits allow errors to be detected and corrected by examining the data and checking for inconsistencies. ECC memory is able to identify and repair single-bit errors and detect (but not repair) double-bit errors.

Standard non-ECC memory modules give the system no way to detect or correct errors. An error in non-ECC memory goes unnoticed: the corrupted value stays in place until it is overwritten, or until a program reads it and acts on bad data. Undetected memory errors can lead to system crashes, data corruption, and other problems. ECC technology helps prevent this by using redundancy to fix errors on the fly.

Some key capabilities of ECC memory include:

  • Error detection – ECC can detect any single-bit error and detect (but not correct) double-bit errors.
  • Error correction – Single-bit errors can be corrected on the fly as data passes through the ECC module.
  • Data integrity – By fixing errors, ECC helps prevent data corruption and system crashes due to memory errors.
  • Higher reliability – ECC makes memory more reliable and stable compared to non-ECC alternatives.

Why is ECC Memory Used in Servers and Mainframes?

ECC memory plays an important role in achieving high reliability and uptime in critical computing systems like servers and mainframes. There are a few key reasons why ECC is commonly deployed in these environments:

  • Large memory capacities – Servers often contain hundreds of gigabytes of RAM or more. The more bits a system holds, the more opportunities there are for one of them to flip, so the expected number of errors grows with memory size.
  • Critical applications – Server downtime and data corruption can have major business impacts when supporting mission-critical, high-value applications and large numbers of end users.
  • High memory utilization – Server workloads often stress the memory subsystem heavily for long durations, increasing the probability of errors over time.
  • Sensitive data – Data integrity is extremely important for sensitive information like financial data, medical records, transaction information, etc.

ECC provides an extra layer of data integrity protection that helps prevent random memory errors from accumulating and impacting the server’s ability to operate reliably. The costs associated with server crashes, failed transactions, data restoration, and other impacts typically far outweigh the marginal additional cost of ECC memory. For mainframes and high-end servers that need to maximize uptime, ECC is practically a necessity.

How Does ECC Memory Work?

The ECC technology works by adding redundant parity bits to each data word stored in memory. These parity bits are calculated based on the value of the data bits. When the data words are read out from memory, the ECC module checks the parity bits against the data bits. If no error has occurred, the parity bits will match the data value as expected. But if one or more bits have changed state due to a memory error, the parity bits will indicate the discrepancy.

For single-bit errors, the ECC logic can determine exactly which bit needs to be flipped back to the original value based on the parity bits. This allows single-bit errors to be corrected on the fly as data passes through the ECC module. Double-bit errors can also be detected, but there is not enough redundancy to correct them (only enough to detect the error).
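
As a small-scale illustration of how parity bits can pinpoint a flipped bit, here is a minimal sketch using the classic Hamming(7,4) code: 4 data bits protected by 3 parity bits. The function names are hypothetical and chosen for this example; real ECC memory works on much wider words and is implemented in hardware, but the principle of the parity checks naming the bad position is the same.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits (a list of 0/1) into a 7-bit Hamming codeword.

    Positions are numbered 1..7. Positions 1, 2 and 4 hold parity bits,
    the remaining positions hold the data bits.
    """
    code = [0] * 8  # index 0 is unused so that list indices match positions
    code[3], code[5], code[6], code[7] = data_bits
    code[1] = code[3] ^ code[5] ^ code[7]  # checks positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # checks positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # checks positions 4, 5, 6, 7
    return code[1:]


def hamming74_syndrome(codeword):
    """Return 0 if every parity check passes, else the 1-based position of the flipped bit."""
    syndrome = 0
    for position, bit in enumerate(codeword, start=1):
        if bit:
            syndrome ^= position  # XOR of the positions of all 1 bits
    return syndrome


def hamming74_correct(codeword):
    """Repair a single flipped bit; return (corrected codeword, error position or 0)."""
    fixed = list(codeword)
    position = hamming74_syndrome(fixed)
    if position:
        fixed[position - 1] ^= 1  # flip the bad bit back
    return fixed, position


stored = hamming74_encode([1, 0, 1, 1])
damaged = list(stored)
damaged[4] ^= 1  # simulate a memory error at position 5
repaired, where = hamming74_correct(damaged)
print("error at position", where, "repaired:", repaired == stored)
```

Note that this small code only corrects single-bit errors. If two bits flip, the syndrome points at the wrong position, which is why practical schemes add a further parity bit to detect double-bit errors.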

There are different ECC schemes that vary in the number of extra parity bits used. Common implementations in server memory include:

  • SECDED – Single error correct, double error detect
  • Chipkill – Corrects multiple bit errors within a single DRAM chip
  • DEC – Double error correct (can repair two simultaneous bit errors)

The most widely used ECC scheme is single error correct, double error detect (SECDED). This strikes a balance between robustness and low overhead. SECDED requires only 8 extra check bits per 64-bit memory word (a 12.5% storage overhead) to implement single error correction and double error detection.
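
The overhead of SECDED can be worked out with a short calculation. The sketch below assumes a Hamming-style code extended with one overall parity bit, which is one standard way to build SECDED; the function name `secded_check_bits` is made up for this example.

```python
def secded_check_bits(data_bits):
    """Check bits needed by an extended Hamming (SECDED) code for a given data width."""
    r = 0
    # A Hamming code needs 2**r >= data_bits + r + 1 so that the check result
    # can name every bit position in the code word plus the "no error" case.
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1  # one extra overall parity bit adds double-error detection


for width in (8, 16, 32, 64, 128):
    extra = secded_check_bits(width)
    print(f"{width:>3} data bits -> {extra} check bits ({extra / width:.1%} overhead)")
```

For 64 data bits this gives 8 check bits. The relative overhead shrinks as the word gets wider, which is part of why protecting whole 64-bit words is cheaper than protecting individual bytes.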

SECDED ECC Example

Here is a simplified example of how SECDED ECC works:

  1. Take a 64-bit data word to store in memory. This will hold the actual data.
  2. Calculate 8 additional check bits based on the 64 data bits. These check bits provide the redundancy.
  3. The full 72-bit code word (64 data bits + 8 check bits) is written to memory.
  4. Later, when that code word is read from memory, the ECC logic re-generates the 8 check bits from the 64 data bits.
  5. The re-generated check bits are compared to the original check bits that were stored.
  6. If they match, no error is detected – the data can be sent to the CPU.
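  7. If they do not match, the pattern of mismatched check bits (known as the syndrome) tells the ECC logic which single bit to flip back. A double-bit error produces a mismatch that cannot be corrected and is reported as an uncorrectable error.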

This allows the ECC memory to detect and correct single-bit errors. By correcting memory errors as they are read rather than letting them go unnoticed, overall system stability and reliability are greatly improved.
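
The steps above can be sketched in software. The following is a minimal illustration, assuming an extended Hamming code with 7 positional parity bits plus 1 overall parity bit; real memory controllers may use different code constructions and do all of this in hardware. The names `encode`, `decode`, and the bit layout are assumptions made for this example. Instead of regenerating and comparing the check bits separately, the sketch computes the syndrome directly, which is equivalent to the comparison in steps 4 and 5.

```python
DATA_BITS = 64
HAMMING_BITS = 7                       # parity bits at positions 1, 2, 4, ..., 64
TOTAL = DATA_BITS + HAMMING_BITS       # Hamming positions 1..71; position 0 is overall parity

PARITY_POSITIONS = [1 << i for i in range(HAMMING_BITS)]
DATA_POSITIONS = [p for p in range(1, TOTAL + 1) if p not in PARITY_POSITIONS]


def _syndrome(bits):
    """XOR together the positions (1..71) of all bits that are set."""
    s = 0
    for pos in range(1, TOTAL + 1):
        if bits[pos]:
            s ^= pos
    return s


def encode(data):
    """Steps 1-3: turn a 64-bit integer into a 72-bit code word (list of 0/1)."""
    bits = [0] * (TOTAL + 1)
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (data >> i) & 1
    s = _syndrome(bits)                # parity values that make the syndrome zero
    for i, pos in enumerate(PARITY_POSITIONS):
        bits[pos] = (s >> i) & 1
    bits[0] = sum(bits) & 1            # overall parity: total number of 1s becomes even
    return bits


def decode(stored):
    """Steps 4-7: return (data, status); data is None if the error is uncorrectable."""
    bits = list(stored)
    s = _syndrome(bits)
    overall_ok = sum(bits) % 2 == 0
    if s == 0 and overall_ok:
        status = "ok"
    elif not overall_ok and s <= TOTAL:
        bits[s] ^= 1                   # single-bit error; s == 0 means the overall parity bit
        status = "corrected"
    else:
        return None, "uncorrectable"   # e.g. a double-bit error: detected, not repaired
    data = 0
    for i, pos in enumerate(DATA_POSITIONS):
        data |= bits[pos] << i
    return data, status


word = 0x0123456789ABCDEF
code_word = encode(word)

one_flip = list(code_word)
one_flip[17] ^= 1                      # simulate a single-bit memory error
print(decode(one_flip)[1])

two_flips = list(code_word)
two_flips[3] ^= 1
two_flips[40] ^= 1                     # simulate a double-bit memory error
print(decode(two_flips)[1])
```

The overall parity bit is what separates the two cases: a single flip makes the total parity odd, while a double flip leaves it even but still produces a nonzero syndrome.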

ECC Memory in RAID Storage

In addition to servers and mainframes, ECC memory also plays an important role in RAID storage systems. RAID (Redundant Array of Independent Disks) involves combining multiple disk drives together into a logical unit for greater storage capacity, redundancy, and/or performance.

Many RAID systems use ECC memory for the same reliability reasons as servers. Storage arrays often use large amounts of memory for cache, buffers, and metadata, and, as with servers, downtime and data errors can have serious business impacts. ECC provides an additional safeguard against data loss or corruption on storage devices.

Some examples of how ECC is used in RAID storage systems include:

  • RAID controller cache – The RAID controller contains cache memory to speed up repetitive read/write operations. ECC protects this cache from bit errors that could otherwise be written to disk or returned to the host as corrupted data.
  • Read/write buffers – Buffers used to coordinate data transfers between multiple drives are protected by ECC.
  • Metadata storage – Metadata about the structure of the RAID system is stored in ECC memory so errors do not impact operations.

Overall, the benefits of ECC for RAID storage are similar to servers – improved reliability and resilience. Critical business data on a high-performance RAID system deserves the best memory protection available.

Disadvantages of ECC Memory

Although ECC provides valuable error detection and correction capabilities, it also comes with some disadvantages and limitations including:

  • Increased cost – ECC modules are more expensive than standard non-ECC alternatives because of the extra integrated circuits required.
  • Storage overhead – The check bits need somewhere to live. ECC DIMMs typically carry extra memory chips (72 bits stored for every 64 bits of data), so more DRAM is required to deliver the same usable capacity.
  • Performance impact – ECC adds some processing overhead to generate and check the parity bits, which can decrease performance slightly.
  • Limited multi-bit protection – With typical SECDED implementations, single-bit errors are corrected and double-bit errors are detected but not repaired. Errors affecting three or more bits in the same word may go undetected or be miscorrected.

These downsides are relatively minor compared to the reliability enhancement ECC offers in enterprise environments. The additional costs are justified for mission-critical server and storage workloads where downtime is simply unacceptable. However, for less critical applications like desktop PCs, ECC memory is typically overkill and standard non-ECC DIMMs are sufficient.

Conclusion

ECC memory with built-in error checking and correction capabilities provides a crucial data integrity safeguard for high-end servers, mainframes, and RAID storage systems. The ability to detect and repair memory errors helps prevent system crashes, data corruption, and other issues caused by random memory faults. While ECC memory comes at an additional cost, this investment pays off by minimizing disruptive outages and loss of critical business data. For applications where uptime and fault tolerance are imperative, the benefits of ECC memory justify its widespread use in enterprise infrastructure.