- A new approach called DualPipe appears to be key to DeepSeek's success
- An expert describes it as a virtual DPU in the GPU that maximizes bandwidth efficiency
- While DeepSeek has used only Nvidia GPUs, one wonders how it would fare on AMD Instinct
China's DeepSeek AI chatbot has surprised the technology industry, presenting a credible alternative to OpenAI's ChatGPT at a fraction of the cost.
A recent paper reveals that DeepSeek-V3 was trained on a cluster of 2,048 Nvidia H800 GPUs: crippled versions of the H100 (we can only imagine how much more powerful it would be running on AMD Instinct accelerators!). As reported, it required 2.79 million GPU-hours for pre-training and fine-tuning on 14.8 trillion tokens, and cost, according to calculations by The Next Platform, only $5.58 million.
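That headline figure is easy to sanity-check. The following back-of-the-envelope calculation assumes the commonly cited rental rate of $2 per H800 GPU-hour (an assumption, not a figure from this article):

```python
# Back-of-the-envelope reproduction of the reported training cost.
gpu_hours = 2.79e6          # total GPU-hours reported for DeepSeek-V3
rate_per_gpu_hour = 2.00    # USD per H800 GPU-hour -- assumed rental price
cost = gpu_hours * rate_per_gpu_hour
print(f"${cost / 1e6:.2f} million")  # → $5.58 million
```

Note the cost covers compute rental only, not research staff, data acquisition, or earlier experimental runs.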
Exactly how DeepSeek's developers achieved this feat probably comes down to an intelligent trick.
A virtual DPU in the GPU itself
First, some background. DeepSeek is an advanced mixture-of-experts (MoE) language model designed to optimize performance by selectively activating only the parts of its architecture most relevant to each task. The third version of the model, DeepSeek-V3, has 671 billion parameters in total, with only 37 billion activated for any given token prediction. This selective activation massively reduces computational cost while maintaining high performance and accuracy, as you will see if you try it.
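The core idea of selective activation can be sketched in a few lines. This is a toy top-k router, not DeepSeek's actual routing code: a small gating network scores every expert, but only the k best experts per token ever execute, so most of the layer's parameters stay idle on each prediction.

```python
import numpy as np

def moe_forward(x, experts, gate_w, k=2):
    """Toy mixture-of-experts layer: only the top-k scoring experts
    run for each token (illustrative sketch, not DeepSeek's code)."""
    scores = x @ gate_w                          # (tokens, n_experts) router logits
    topk = np.argsort(scores, axis=-1)[:, -k:]   # indices of the k best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = scores[t, topk[t]]
        weights = np.exp(sel - sel.max())
        weights /= weights.sum()                 # softmax over the selected experts only
        for w, e in zip(weights, topk[t]):
            out[t] += w * experts[e](x[t])       # only k of n experts actually execute
    return out

# tiny demo: 4 experts, only 2 run per token
rng = np.random.default_rng(0)
d, n_experts = 8, 4
experts = [lambda v, W=rng.normal(size=(d, d)): v @ W for _ in range(n_experts)]
gate_w = rng.normal(size=(d, n_experts))
x = rng.normal(size=(3, d))
y = moe_forward(x, experts, gate_w)
print(y.shape)  # (3, 8)
```

Scaled up to DeepSeek-V3's proportions, the same principle means roughly 37B of 671B parameters do work per token.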
It is easy to be skeptical of DeepSeek and the claims made about its training, but the paper reveals some of the magic the developers came up with to make the most of the crippled hardware they had to work with. This includes the creation of the DualPipe algorithm for efficient pipeline parallelism.
According to the information DeepSeek has published, DualPipe overlaps forward and backward computation, reducing latency and optimizing data movement across GPUs. By managing communication efficiently, it minimizes idle time (pipeline bubbles) and dynamically balances GPU compute resources (streaming multiprocessors, or SMs) between computation and communication, avoiding data-transfer bottlenecks as the model scales.
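A toy cost model shows why hiding communication behind computation pays off. This is a deliberate simplification of the idea, not DeepSeek's actual scheduler: in the overlapped schedule, while micro-batch i's results are in flight, micro-batch i+1 is already computing, so each middle step costs the maximum of the two times rather than their sum.

```python
def no_overlap_time(n_microbatches, compute, comm):
    """Naive schedule: each micro-batch computes, then waits for its
    communication to finish before the next one starts."""
    return n_microbatches * (compute + comm)

def overlapped_time(n_microbatches, compute, comm):
    """Overlapped schedule (toy model of the DualPipe idea): communication
    of micro-batch i is hidden behind computation of micro-batch i+1, so
    each middle step costs max(compute, comm) instead of compute + comm."""
    return compute + (n_microbatches - 1) * max(compute, comm) + comm

# with 8 micro-batches and equal per-step compute and communication cost:
print(no_overlap_time(8, 2, 2))   # 32 time units
print(overlapped_time(8, 2, 2))   # 18 time units -- the bubbles are mostly hidden
```

When compute and communication take similar time, overlap nearly halves the total: the GPUs spend far less time sitting idle waiting on data transfers.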
A commenter on The Next Platform describes DualPipe as "essentially creating a virtual DPU on the GPU itself to handle all-to-all communication", highlighting its role in optimizing data-transfer efficiency.
The paper goes into more detail: "In order to ensure sufficient computational performance for DualPipe, we customize efficient cross-node all-to-all communication kernels (including dispatching and combining) to conserve the number of SMs dedicated to communication. The implementation of the kernels is co-designed with the MoE gating algorithm and the network topology of our cluster."
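The "dispatching and combining" mentioned in the quote are the two halves of the MoE all-to-all exchange: tokens are first grouped by the expert the router chose for them, and the expert outputs are then scattered back into the original token order. The sketch below shows that logic on the host in plain Python; the function names are hypothetical, and DeepSeek's real version is implemented as custom GPU kernels co-designed with the network topology.

```python
from collections import defaultdict

def dispatch(tokens, assignments):
    """Group token indices by destination expert -- the 'dispatch'
    half of the all-to-all (hypothetical sketch)."""
    buckets = defaultdict(list)
    for idx, expert in enumerate(assignments):
        buckets[expert].append(idx)
    return buckets

def combine(results, buckets, n_tokens):
    """Scatter per-expert outputs back into original token order --
    the 'combine' half of the all-to-all."""
    out = [None] * n_tokens
    for expert, idxs in buckets.items():
        for pos, idx in enumerate(idxs):
            out[idx] = results[expert][pos]
    return out

# demo: 4 tokens routed to 2 experts; each expert "processes" its tokens
tokens = ["a", "b", "c", "d"]
assignments = [1, 0, 1, 0]                  # expert chosen for each token
buckets = dispatch(tokens, assignments)
results = {e: [tokens[i].upper() for i in idxs] for e, idxs in buckets.items()}
print(combine(results, buckets, len(tokens)))  # ['A', 'B', 'C', 'D']
```

In a cross-node deployment, each bucket becomes a network send to the node hosting that expert, which is why these kernels compete with computation for SMs unless carefully budgeted.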