Efficient Synchronous Dataflow Execution For GPUs (NVIDIA, UW-Madison)

Source: SemiEngineering

TL;DR (AI generated)

Researchers from NVIDIA and the University of Wisconsin-Madison have published a technical paper titled “Kitsune: Enabling Dataflow Execution on GPUs with Spatial Pipelines.” The paper argues that the bulk-synchronous execution model of GPUs limits their efficiency for deep learning applications. To address this, the researchers introduce Kitsune, a set of primitives and an end-to-end compiler built on PyTorch Dynamo that enables dataflow execution on GPUs. Across a range of applications, Kitsune reportedly delivers significant speedups and reduces off-chip traffic for both inference and training.

Similar Articles

Intel has reportedly cancelled discrete gaming GPUs for the upcoming Xe3P Arc "Celestial" family — gaming GPU remains uncertain even for the next-gen Xe4 "Druid" lineup that lands in 2027

Intel has reportedly scrapped plans for discrete gaming GPUs in the upcoming Xe3P Arc "Celestial" family, leaving the fate of gaming GPUs uncertain even for the Xe4 "Druid" lineup expected in 2027. The Celestial GPU was originally intended for a 2025 launch but was replaced by Battlemage, with Xe3P now serving other purposes. Intel's focus seems to be shifting towards AI applications, with leaks suggesting a potential late-2027 release for the Druid architecture. The future of dedicated gaming GPUs from Intel remains speculative, with the possibility of a revival with the Druid lineup.

Tom's Hardware

Three reasons why DeepSeek’s new model matters

DeepSeek's new V4 model is significant for three reasons. First, it offers high performance at a fraction of the cost of comparable models, making cutting-edge AI capabilities more accessible. Second, V4 introduces a more memory-efficient approach that supports a context window of 1 million tokens while significantly reducing the computing power and memory it requires. Finally, V4 marks a shift toward optimization for Chinese chips, specifically Huawei's Ascend chips, challenging the dominance of US chip giant Nvidia and potentially signaling China's progress in building a parallel AI infrastructure.

MIT Technology Review

SpaceX says it is going to begin manufacturing GPUs — $1.75 trillion IPO listing reportedly includes in-house GPU production

SpaceX's confidential IPO filing, which reportedly targets a $1.75 trillion listing, reveals plans to manufacture its own GPUs and to invest billions in in-house processor production because the company lacks long-term supply agreements with silicon suppliers. The filing reportedly specifies GPUs rather than specialized AI accelerators, though what SpaceX will call them remains unclear. SpaceX's CEO has confirmed plans for high-volume semiconductor manufacturing, but the specifications of the GPUs are unknown, raising questions about potential competition with established AI GPU makers such as AMD and Nvidia. Because the S-1 was filed confidentially, its contents cannot be independently verified, leaving room for speculation about SpaceX's semiconductor ambitions.

Tom's Hardware

Testing DirectStorage with GPU decompression — do Blackwell GPUs have the upper hand?

The article tests DirectStorage with GPU decompression to determine whether Nvidia's Blackwell GPUs have an advantage with this technology. DirectStorage is designed to speed up asset streaming from storage and reduce CPU overhead; support for GPU decompression was added in version 1.1. Although Nvidia GPUs initially struggled with DirectStorage, Blackwell GPUs such as the RTX 5090 performed better with GPU decompression enabled. Tests on other Blackwell GPUs, including the RTX 5070 and RTX 5060, showed consistent performance gains with DirectStorage. The article attributes Blackwell's more effective handling of GPU decompression to advances in its architecture and scheduling capabilities.

Tom's Hardware