Google's TurboQuant cuts LLM KV cache memory requirements by at least 6x, delivering up to an 8x performance boost on Nvidia H100 GPUs and compressing KV caches to 3 bits with no loss in accuracy
TL;DR
AI Generated
Google's TurboQuant is a compression algorithm that reduces LLM KV cache memory requirements by at least six times, providing up to an 8x performance boost on Nvidia H100 GPUs. It compresses KV caches to 3 bits without any loss in model accuracy, eliminating quantization overhead through a two-stage process that combines PolarQuant with a Quantized Johnson-Lindenstrauss transform (QJL). TurboQuant achieved perfect downstream scores on various benchmarks and showed strong results in vector search, outperforming baselines. The training-free algorithm, suitable for production inference and large-scale vector search systems, will be presented at ICLR 2026.
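To make the memory math concrete, here is a minimal sketch of uniform 3-bit quantization of a KV-cache vector. This is illustrative only: TurboQuant's actual pipeline (PolarQuant plus QJL) is more sophisticated, and the function names and scaling scheme below are assumptions, not Google's implementation. It shows why 3-bit codes shrink a 16-bit-per-element cache by roughly 5.3x per element.

```python
import numpy as np

def quantize_3bit(x):
    """Uniformly quantize a float vector to 3-bit codes (8 levels).

    Illustrative sketch only -- not TurboQuant's PolarQuant/QJL pipeline.
    Stores one uint8 code per element plus two fp32 scalars (lo, scale).
    """
    lo, hi = float(x.min()), float(x.max())
    # 2**3 - 1 = 7 intervals between the 8 representable levels.
    scale = (hi - lo) / 7.0 if hi > lo else 1.0
    codes = np.round((x - lo) / scale).astype(np.uint8)  # values in 0..7
    return codes, lo, scale

def dequantize_3bit(codes, lo, scale):
    """Reconstruct approximate floats from 3-bit codes."""
    return codes.astype(np.float32) * scale + lo

# A toy "key" vector such as one row of a KV cache, normally fp16/fp32.
rng = np.random.default_rng(0)
k = rng.standard_normal(128).astype(np.float32)

codes, lo, scale = quantize_3bit(k)
k_hat = dequantize_3bit(codes, lo, scale)

# Per-element storage drops from 16 bits (fp16) to 3 bits once codes
# are bit-packed: a ~5.3x reduction, in the range the article reports.
compression_ratio = 16 / 3
```

The rounding step bounds the per-element reconstruction error by half the step size (`scale / 2`), which is why low-bit schemes can preserve accuracy when the value range per vector is small.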