
Google's TurboQuant reduces LLM KV cache memory requirements by at least six times — up to 8x performance boost on Nvidia H100 GPUs, compresses KV caches to 3 bits with no accuracy loss

Source

Tom's Hardware

TL;DR

AI Generated

Google's TurboQuant, a compression algorithm, reduces the memory footprint of LLM KV caches by at least 6x, delivering up to an 8x performance boost on Nvidia H100 GPUs. It compresses KV caches to 3 bits per value without any loss in model accuracy, and it eliminates quantization overhead through a two-stage process combining PolarQuant with a Quantized Johnson-Lindenstrauss (QJL) transform. TurboQuant achieved perfect downstream scores on various benchmarks and showed strong results in vector search, outperforming baselines. The training-free algorithm is suitable for production inference and large-scale vector search systems, and will be presented at ICLR 2026.
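The article names the two stages (PolarQuant and QJL) but gives no implementation details. As a rough intuition for how low-bit KV-cache quantization of this flavor can work, here is a minimal, hypothetical sketch: a random orthogonal (Johnson-Lindenstrauss-style) rotation to spread outliers across dimensions, followed by uniform 3-bit quantization with a per-vector scale. All function names and the scaling scheme are assumptions for illustration, not TurboQuant's actual method.

```python
import numpy as np

def quantize_kv_3bit(kv, rng):
    """Illustrative sketch: rotate, then quantize to 3 bits (8 levels).

    This is NOT TurboQuant; it only demonstrates the general
    rotate-then-quantize idea the article alludes to.
    """
    d = kv.shape[-1]
    # Stage 1: random orthogonal rotation (JL-style) via QR decomposition.
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    rotated = kv @ q
    # Stage 2: per-vector symmetric scale, then round to 3-bit codes
    # in [-4, 3]. Codes are stored in int8 here for simplicity; a real
    # implementation would pack them at 3 bits each.
    scale = np.abs(rotated).max(axis=-1, keepdims=True) / 3.5
    codes = np.clip(np.round(rotated / scale), -4, 3).astype(np.int8)
    return codes, scale, q

def dequantize_kv(codes, scale, q):
    # Undo the quantization and the orthogonal rotation (q.T = q^-1).
    return (codes.astype(np.float32) * scale) @ q.T
```

With 3-bit codes plus a small per-vector scale, storage drops by roughly 5x versus fp16, in the ballpark of the article's "at least six times" figure; the rotation is what keeps the crude uniform quantizer from being wrecked by outlier channels.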