Support for Reliability, Availability, and Serviceability (RAS) is one of the
quintessential features of computing systems targeting the server and mission-critical
markets. Among these RAS features, Chipkill* stands out as the most
crucial for main memory protection. IBM Chipkill protects the main memory
from the failure of an entire memory chip, as well as multi-bit faults from any
portion of a memory chip. Similar technologies from other vendors are Single
Device Data Correction (SDDC) from Intel, Sun Extended ECC* and HP
However, some advanced memory technologies (such as GDDR5) do not
allow a traditional SDDC implementation, since their specifications do not
include extra devices to store error correction codes (ECC).
Some future high performance computing products hitting the server market
will be based on these advanced memory technologies. In this article, we
propose a method to provide SDDC support at
the memory controller level for memory technologies that inherently have no
RAS support for protecting memory contents. Specifically, we focus on how
to provide SDDC support for GDDR5 memory. The technique
allows the failure of 1/8 of the memory devices to be tolerated by using 25
percent of the memory to store error correction codes.
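
The capacity trade-off stated above can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the function names, the 8 GiB total, and the device counts are assumptions chosen for the example and do not come from the article.

```python
from fractions import Fraction

# Figures quoted in the text: 25 percent of memory holds ECC,
# and the failure of 1/8 of the devices can be tolerated.
ECC_FRACTION = Fraction(1, 4)
TOLERATED_FRACTION = Fraction(1, 8)


def usable_capacity(total_bytes: int, ecc_fraction: Fraction = ECC_FRACTION) -> int:
    """Bytes left for data once ecc_fraction of the memory is reserved for ECC."""
    return int(total_bytes * (1 - ecc_fraction))


def tolerated_devices(num_devices: int, fraction: Fraction = TOLERATED_FRACTION) -> int:
    """Number of failed devices that can be tolerated out of num_devices."""
    return int(num_devices * fraction)


def overhead_per_data_byte(ecc_fraction: Fraction = ECC_FRACTION) -> Fraction:
    """ECC bytes stored for every byte of user data."""
    return ecc_fraction / (1 - ecc_fraction)


if __name__ == "__main__":
    total = 8 * 2**30  # hypothetical 8 GiB of installed memory
    print(usable_capacity(total) // 2**30, "GiB usable")
    print(tolerated_devices(8), "device failure tolerated out of 8")
    print(overhead_per_data_byte(), "ECC bytes per data byte")
```

Under these assumptions, reserving a quarter of the memory means one ECC byte is stored for every three data bytes, which is the price paid for having no dedicated ECC devices.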
We also describe how the technique can be implemented for RAS-less memory
technologies feeding a wider data bus than GDDR5 (such as DDR3, whose
individual devices are in fact narrower). This opens the possibility of offering high
reliability with inexpensive DIMM devices. Finally, we describe how to provide SDDC support
without the use of lockstepped memory channels.