A bug that taught me more about PyTorch than years of using it
TL;DR
A loss plateau during training turned out to be a niche bug in PyTorch's MPS backend on Apple Silicon GPUs: the encoder weights silently froze because the addcmul_ and addcdiv_ kernels used by the Adam optimizer failed on non-contiguous memory layouts. The fix was to make the weights contiguous at initialization and to upgrade to PyTorch ≥2.4, where the kernel bug is resolved. The investigation surfaced insights into PyTorch internals, memory layouts, and kernel implementations, and a PR was submitted to address similar issues in other operations.
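As a minimal sketch of the workaround described above (the tensor shapes here are illustrative, not the author's actual model): a transposed view shares storage with the original tensor but is not contiguous, and calling `.contiguous()` produces a copy with a standard memory layout that the affected kernels handle correctly.

```python
import torch

# A transposed view reuses the original storage, so its strides no
# longer describe a row-major layout: it is non-contiguous.
w = torch.randn(4, 8).t()
print(w.is_contiguous())  # False

# On affected PyTorch versions (< 2.4) with the MPS backend, in-place
# ops such as addcmul_ / addcdiv_ inside Adam could misbehave on such
# layouts. Making the parameter contiguous at initialization avoids
# handing the optimizer a non-contiguous tensor in the first place.
w_fixed = torch.nn.Parameter(w.contiguous())
print(w_fixed.is_contiguous())  # True
```

The same `.is_contiguous()` check is a cheap way to audit a model's parameters before training if upgrading PyTorch is not an option.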