
HW-SW Co-Designed System With 3 Core Optimization Pathways For Long-Context Agentic LLM Inference (Cambridge, ICL)

Source: SemiEngineering

TL;DR (AI Generated)

Researchers from the University of Cambridge, Imperial College London, and the University of Edinburgh have published a technical paper on optimizing long-context agentic LLM inference. They introduce PLENA, a hardware-software co-designed system with three core optimization pathways that address the memory wall in long-context scenarios: an efficient hardware implementation, a novel flattened systolic array architecture, and native support for FlashAttention. In simulation, PLENA achieves significantly higher utilization and throughput than existing accelerators such as the A100 GPU and TPU v6e. The full PLENA system will be open-sourced.
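FlashAttention targets the memory wall by never materializing the full attention score matrix: keys and values are processed in blocks, with only a running row maximum, a running softmax normalizer, and an output accumulator kept per query. The sketch below illustrates that online-softmax idea in NumPy; it is an illustrative software analogue, not PLENA's hardware design, and all function names here are hypothetical.

```python
import numpy as np

def attention_reference(Q, K, V):
    # Standard attention: materializes the full n-by-n score matrix S = QK^T.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def attention_tiled(Q, K, V, block=64):
    # FlashAttention-style tiling: stream K/V in blocks, keeping only a
    # running row max (m), running normalizer (l), and output accumulator (O).
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((n, V.shape[-1]))
    m = np.full((n, 1), -np.inf)   # running row maximum for stability
    l = np.zeros((n, 1))           # running softmax denominator
    for s in range(0, K.shape[0], block):
        Kb, Vb = K[s:s + block], V[s:s + block]
        Sb = (Q @ Kb.T) * scale
        m_new = np.maximum(m, Sb.max(axis=-1, keepdims=True))
        Pb = np.exp(Sb - m_new)
        corr = np.exp(m - m_new)   # rescale previously accumulated partial sums
        l = l * corr + Pb.sum(axis=-1, keepdims=True)
        O = O * corr + Pb @ Vb
        m = m_new
    return O / l

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 32)) for _ in range(3))
assert np.allclose(attention_reference(Q, K, V), attention_tiled(Q, K, V))
```

Because the tiled version touches each K/V block once and keeps only O(n) running state, its working set is independent of context length, which is the property that makes long-context inference tractable on memory-constrained accelerators.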