Algorithm–HW Co-Design Framework for Accelerating Attention in Large-Context Scenarios (Cornell)
TL;DR
Researchers at Cornell University have published a technical paper introducing LongSight, a framework designed to accelerate attention in large-context scenarios. LongSight offloads Key-Value (KV) cache storage and retrieval to a compute-enabled CXL memory device, enabling context lengths of up to 1 million tokens for Llama models. In doing so, it elevates lower-cost LPDDR DRAM to a role comparable to high-end HBM, improving performance efficiency. The paper details how LongSight addresses the challenges of large input context windows in transformer-based models, improving output accuracy and personalization.
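To make the KV-cache offloading idea concrete, here is a minimal, hypothetical sketch of a tiered cache: recent tokens stay in a small fast tier (standing in for HBM), while older tokens spill to a larger, slower tier (standing in for CXL-attached LPDDR DRAM). The class name, tier sizes, and methods are illustrative assumptions, not LongSight's actual API, and the attention math is the standard single-query softmax form.

```python
import numpy as np

class TieredKVCache:
    """Hypothetical sketch of KV-cache offloading: recent tokens stay in a
    fast "HBM" tier, older tokens spill to a larger, slower "CXL" tier.
    Names and tier sizes are illustrative, not LongSight's actual design."""

    def __init__(self, head_dim, fast_capacity):
        self.head_dim = head_dim
        self.fast_capacity = fast_capacity   # tokens kept in the fast tier
        self.fast_k, self.fast_v = [], []    # stands in for HBM
        self.slow_k, self.slow_v = [], []    # stands in for CXL-attached DRAM

    def append(self, k, v):
        self.fast_k.append(k)
        self.fast_v.append(v)
        if len(self.fast_k) > self.fast_capacity:
            # Evict the oldest entry to the slow tier.
            self.slow_k.append(self.fast_k.pop(0))
            self.slow_v.append(self.fast_v.pop(0))

    def attend(self, q):
        # Gather both tiers for a single-query attention step; a
        # compute-enabled CXL device could instead run the dot products
        # near the slow memory and return only partial results.
        K = np.stack(self.slow_k + self.fast_k)
        V = np.stack(self.slow_v + self.fast_v)
        scores = K @ q / np.sqrt(self.head_dim)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ V

rng = np.random.default_rng(0)
cache = TieredKVCache(head_dim=8, fast_capacity=4)
for _ in range(10):
    cache.append(rng.normal(size=8), rng.normal(size=8))
out = cache.attend(rng.normal(size=8))
print(len(cache.fast_k), len(cache.slow_k), out.shape)  # 4 6 (8,)
```

The point of near-memory compute in this picture is the `attend` step: instead of copying the slow tier's keys and values back over the memory bus, the device alongside the DRAM would compute the scores locally, which is what makes cheap, capacious memory competitive for very long contexts.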