HW-SW Co-Designed System With 3 Core Optimization Pathways For Long-Context Agentic LLM Inference (Cambridge, ICL)
TL;DR
Researchers from the University of Cambridge, Imperial College London, and the University of Edinburgh have published a technical paper on optimizing long-context agentic LLM inference. They introduce PLENA, a hardware-software co-designed system with three core optimization pathways that target the memory wall in long-context scenarios: an efficient hardware implementation, a novel flattened systolic array architecture, and native support for FlashAttention. In simulation, PLENA achieves significantly higher utilization and throughput than existing accelerators such as the A100 GPU and TPU v6e. The full PLENA system will be open-sourced.
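The summary above mentions FlashAttention support as one of PLENA's optimization pathways. As background (not the authors' implementation), the core FlashAttention idea is to compute attention over key/value tiles with an online softmax, so the full seq_len × seq_len score matrix is never materialized. A minimal NumPy sketch of that tiling trick:

```python
import numpy as np

def tiled_attention(Q, K, V, block=64):
    """Single-head attention computed block-by-block over K/V using the
    online-softmax recurrence FlashAttention is built on. Only (n, block)
    partial score tiles exist at any time, never the full (n, n) matrix."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, d))
    m = np.full(n, -np.inf)   # running per-row max of scores
    l = np.zeros(n)           # running per-row sum of exp(score - m)
    for start in range(0, K.shape[0], block):
        Kb = K[start:start + block]
        Vb = V[start:start + block]
        S = (Q @ Kb.T) * scale              # partial scores for this tile
        m_new = np.maximum(m, S.max(axis=1))
        alpha = np.exp(m - m_new)           # rescale old accumulators
        P = np.exp(S - m_new[:, None])
        l = l * alpha + P.sum(axis=1)
        out = out * alpha[:, None] + P @ Vb
        m = m_new
    return out / l[:, None]

def naive_attention(Q, K, V):
    """Reference implementation that materializes the full score matrix."""
    S = (Q @ K.T) / np.sqrt(Q.shape[1])
    P = np.exp(S - S.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    return P @ V
```

Both functions produce the same output up to floating-point error; the tiled version trades a small amount of recomputation (the rescaling by `alpha`) for a much smaller memory footprint, which is what makes it attractive for long-context inference.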