
HW-SW Co-Designed System With 3 Core Optimization Pathways For Long-Context Agentic LLM Inference (Cambridge, ICL)


Researchers from the University of Cambridge, Imperial College London, and the University of Edinburgh have published a technical paper on optimizing long-context agentic LLM inference. They introduce PLENA, a hardware-software co-designed system with three core optimization pathways aimed at the memory wall: an efficient hardware implementation, a novel flattened systolic array architecture, and native support for FlashAttention for long-context LLM workloads. In simulation, PLENA achieves significantly higher utilization and throughput than existing accelerators such as the A100 GPU and TPU v6e. The full PLENA system will be open-sourced.
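The memory-wall benefit of FlashAttention comes from tiling: the accelerator streams K/V blocks through on-chip buffers and maintains a running softmax, so the full L×L score matrix is never materialized. The paper's hardware implementation is not described here; the following is only a minimal numerical sketch of that online-softmax tiling idea (function names and the block size are illustrative, not from the paper):

```python
import numpy as np

def naive_attention(Q, K, V):
    # Reference: materializes the full L x L score matrix (memory-bound).
    S = (Q @ K.T) / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def tiled_attention(Q, K, V, block=4):
    # FlashAttention-style sketch: process K/V in tiles, keeping a
    # running row-max (m) and normalizer (l) so only an L x block
    # score tile lives in fast memory at any time.
    L, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)
    m = np.full(L, -np.inf)   # running max per query row
    l = np.zeros(L)           # running softmax normalizer per row
    for j in range(0, K.shape[0], block):
        Kb, Vb = K[j:j + block], V[j:j + block]
        S = (Q @ Kb.T) * scale                 # scores for this tile only
        m_new = np.maximum(m, S.max(axis=-1))
        alpha = np.exp(m - m_new)              # rescale earlier accumulators
        P = np.exp(S - m_new[:, None])
        l = l * alpha + P.sum(axis=-1)
        O = O * alpha[:, None] + P @ Vb
        m = m_new
    return O / l[:, None]
```

Both routines compute identical outputs; the tiled version simply trades one large score buffer for a per-tile buffer plus two length-L running statistics, which is what makes it attractive for an accelerator with limited on-chip SRAM.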

SemiEngineering

