Kimi Linear: An Expressive, Efficient Attention Architecture
Source: Hacker News
TL;DR
AI generated. Kimi Linear is a hybrid linear attention architecture reported to outperform traditional full attention across a variety of contexts, offering faster processing and stronger task performance. Its core component is Kimi Delta Attention (KDA), which optimizes memory usage through a refined gating mechanism. The architecture reduces KV cache requirements by up to 75% and increases decoding throughput by up to 6x on long-context tasks. The model has been open-sourced, with two versions available for download.
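The constant-memory behavior described above comes from the delta-rule family of linear attention that KDA belongs to: instead of appending keys and values to a growing cache, the model updates a fixed-size state matrix at each step. As a rough illustration of that general technique, not Kimi Linear's actual implementation, here is a minimal NumPy sketch; the function name, shapes, and per-channel forget gate `alpha` are illustrative assumptions:

```python
import numpy as np

def gated_delta_attention(q, k, v, beta, alpha):
    """Gated delta-rule linear attention (simplified sketch).

    A fixed-size state matrix S (d_k x d_v) replaces the growing KV
    cache of softmax attention, so memory stays constant with sequence
    length; that is where the cache savings come from.
    """
    T, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))            # constant-size recurrent state
    out = np.zeros((T, d_v))
    for t in range(T):
        S = alpha[t][:, None] * S       # per-channel forget gate (illustrative)
        pred = k[t] @ S                 # what the state currently returns for k_t
        S = S + beta[t] * np.outer(k[t], v[t] - pred)  # delta-rule correction
        out[t] = q[t] @ S               # read out with the query
    return out

# Toy usage: state memory is O(d_k * d_v) regardless of sequence length T.
rng = np.random.default_rng(0)
T, d_k, d_v = 8, 4, 3
q = rng.standard_normal((T, d_k))
k = rng.standard_normal((T, d_k))
v = rng.standard_normal((T, d_v))
beta = rng.uniform(0.0, 1.0, size=T)          # write strength per step
alpha = rng.uniform(0.9, 1.0, size=(T, d_k))  # decay gates per channel
out = gated_delta_attention(q, k, v, beta, alpha)
```

A hybrid architecture in this style would interleave layers like the above with ordinary full-attention layers, which is consistent with the reduced (rather than eliminated) KV cache reported in the summary.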