Reliability Extension Architecture For Cost-Effective HBM (RPI, ScaleFlux, IBM TJ Watson)
TL;DR
Researchers from Rensselaer Polytechnic Institute, ScaleFlux, and IBM T.J. Watson Research Center have published a technical paper titled "Making Strong Error-Correcting Codes Work Effectively for HBM in AI Inference." The paper introduces REACH, a controller-managed ECC design that aims to preserve end-to-end correctness and throughput for HBM while tolerating higher raw bit error rates. Using a two-level Reed-Solomon scheme, REACH is reported to significantly extend the device error rates that can be tolerated while reducing ECC area and power consumption compared to traditional methods. Because the standard HBM interface is left unchanged, the approach could enable lower-cost HBM implementations.
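To make the link between code strength and tolerable raw bit error rate concrete, here is a minimal sketch of the standard failure-probability calculation for a single Reed-Solomon codeword. It is not the REACH design and does not model its two-level structure. The function names, the 8-bit symbol size, the code lengths, and the assumption of independent bit errors are all illustrative assumptions, not values taken from the paper.

```python
from math import comb


def symbol_error_prob(ber: float, bits_per_symbol: int = 8) -> float:
    """Probability that a symbol contains at least one flipped bit,
    assuming independent bit errors (an illustrative assumption)."""
    return 1.0 - (1.0 - ber) ** bits_per_symbol


def rs_failure_prob(n: int, k: int, ber: float, bits_per_symbol: int = 8) -> float:
    """Probability that an RS(n, k) codeword has more symbol errors
    than it can correct, i.e. more than t = (n - k) // 2."""
    t = (n - k) // 2
    ps = symbol_error_prob(ber, bits_per_symbol)
    # Sum the binomial tail directly to avoid cancellation in 1 - P(ok).
    return sum(
        comb(n, i) * ps**i * (1.0 - ps) ** (n - i)
        for i in range(t + 1, n + 1)
    )


if __name__ == "__main__":
    # Hypothetical parameters: 64 data symbols with 8 or 16 parity symbols.
    for n in (72, 80):
        print(n, rs_failure_prob(n, 64, ber=1e-4))
```

Under these assumptions, adding parity symbols lowers the codeword failure probability at a fixed raw bit error rate, which is the general reason a stronger code can accommodate a less reliable device.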