Nvidia details efficiency of the NVFP4 format for LLM training — new paper reveals how NVFP4 offers benefits over FP8 and BF16
Nvidia's NVFP4 format, designed for its Blackwell GPUs, offers efficiency gains for both training and inference. The format pairs a compact 4-bit data representation with a multi-level scaling strategy, achieving accuracy close to BF16 while cutting memory usage and compute cost. Nvidia successfully trained a 12-billion-parameter model on a 10-trillion-token dataset using NVFP4, closely matching FP8 baseline results. Techniques such as mixed precision, consistent scaling, stochastic rounding, and outlier handling proved crucial for stable training at 4-bit precision. NVFP4 also outperformed the MXFP4 format in convergence and data efficiency, showing promise for training large-scale language models efficiently.
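To make the block-scaling idea concrete, here is a minimal NumPy sketch of 4-bit quantization with one scale per block of values. It assumes the E2M1 value grid commonly used for FP4; the function name `quantize_block`, the block size, and keeping the scales in full precision are illustrative choices for this sketch, not Nvidia's actual implementation (which layers a per-block scale under a per-tensor scale).

```python
import numpy as np

# Representable magnitudes of a 4-bit E2M1 float: 1 sign bit,
# 2 exponent bits, 1 mantissa bit -> {0, 0.5, 1, 1.5, 2, 3, 4, 6}.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(x, block=16):
    """Quantize-dequantize a 1-D array with one scale per `block` values.

    Illustrative sketch only: scales are kept in float64 here, whereas a
    real multi-level scheme would itself store them in a low-bit format.
    """
    x = np.asarray(x, dtype=np.float64)
    out = np.empty_like(x)
    scales = []
    for start in range(0, len(x), block):
        chunk = x[start:start + block]
        # Scale so the largest magnitude in the block maps to the FP4 max (6.0).
        s = float(np.abs(chunk).max()) / FP4_GRID[-1]
        if s == 0.0:
            s = 1.0  # all-zero block: any scale works
        # Round each scaled magnitude to the nearest representable FP4 value.
        idx = np.abs(np.abs(chunk)[:, None] / s - FP4_GRID[None, :]).argmin(axis=1)
        out[start:start + block] = np.sign(chunk) * FP4_GRID[idx] * s
        scales.append(s)
    return out, scales

vals = np.array([0.01, -0.2, 0.5, 3.7, -6.0, 0.0, 1.1, 2.9])
deq, scales = quantize_block(vals, block=8)
```

Because every element in a block shares one scale, a single outlier stretches the grid for its neighbors, which is why the paper's outlier handling and fine-grained (small-block) scaling matter for training stability.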