- A new approach called DualPipe appears to be key to DeepSeek's success
- An expert describes it as a virtual DPU in the GPU that maximizes bandwidth efficiency
- While DeepSeek has used only Nvidia GPUs, one wonders how it would fare on AMD Instinct
China's DeepSeek AI chatbot has surprised the technology industry, presenting a credible alternative to OpenAI's ChatGPT at a fraction of the cost.
A recent paper reveals that DeepSeek-V3 was trained on a cluster of 2,048 Nvidia H800 GPUs: crippled versions of the H100 (we can only imagine how much more powerful it would be running on AMD Instinct accelerators!). As reported, it required 2.79 million GPU-hours for pre-training and fine-tuning on 14.8 trillion tokens, and cost, according to calculations by The Next Platform, only $5.58 million.
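That headline figure is easy to sanity-check. The following back-of-the-envelope calculation assumes the commonly cited rental rate of $2 per H800 GPU-hour (an assumption, not a figure from this article):

```python
# Back-of-the-envelope reproduction of the reported training cost.
gpu_hours = 2.79e6          # total GPU-hours reported for DeepSeek-V3
rate_per_gpu_hour = 2.00    # USD per H800 GPU-hour -- assumed rental price
cost = gpu_hours * rate_per_gpu_hour
print(f"${cost / 1e6:.2f} million")  # → $5.58 million
```

Note the cost covers compute rental only, not research staff, data acquisition, or earlier experimental runs.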
Exactly how DeepSeek's developers achieved this feat probably comes down to an intelligent trick.
A virtual DPU in the GPU itself
First, some background. DeepSeek is an advanced mixture-of-experts (MoE) language model designed to optimize performance by selectively activating only the parts of its architecture most relevant to each task. The third version of the model, DeepSeek-V3, has 671 billion parameters in total, with only 37 billion activated for any given token prediction. This selective activation massively reduces computational cost while maintaining high performance and accuracy, as you will see if you try it.
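The core idea of selective activation can be sketched in a few lines. This is a toy top-k router, not DeepSeek's actual routing code: a small gating network scores every expert, but only the k best experts per token ever execute, so most of the layer's parameters stay idle on each prediction.

```python
import numpy as np

def moe_forward(x, experts, gate_w, k=2):
    """Toy mixture-of-experts layer: only the top-k scoring experts
    run for each token (illustrative sketch, not DeepSeek's code)."""
    scores = x @ gate_w                          # (tokens, n_experts) router logits
    topk = np.argsort(scores, axis=-1)[:, -k:]   # indices of the k best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = scores[t, topk[t]]
        weights = np.exp(sel - sel.max())
        weights /= weights.sum()                 # softmax over the selected experts only
        for w, e in zip(weights, topk[t]):
            out[t] += w * experts[e](x[t])       # only k of n experts actually execute
    return out

# tiny demo: 4 experts, only 2 run per token
rng = np.random.default_rng(0)
d, n_experts = 8, 4
experts = [lambda v, W=rng.normal(size=(d, d)): v @ W for _ in range(n_experts)]
gate_w = rng.normal(size=(d, n_experts))
x = rng.normal(size=(3, d))
y = moe_forward(x, experts, gate_w)
print(y.shape)  # (3, 8)
```

Scaled up to DeepSeek-V3's proportions, the same principle means roughly 37B of 671B parameters do work per token.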
It is easy to be skeptical of DeepSeek and the claims made about its training, but the paper reveals some of the magic the developers came up with to make the most of the crippled hardware they had to work with. This includes the creation of the DualPipe algorithm for efficient pipeline parallelism.
According to the information DeepSeek has published, DualPipe overlaps forward and backward computation, reducing latency and optimizing data movement across GPUs. By managing communication efficiently, it minimizes idle time (pipeline bubbles) and dynamically balances GPU compute resources (streaming multiprocessors, or SMs) between computation and communication, avoiding data-transfer bottlenecks as the model scales.
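A toy cost model shows why hiding communication behind computation pays off. This is a deliberate simplification of the idea, not DeepSeek's actual scheduler: in the overlapped schedule, while micro-batch i's results are in flight, micro-batch i+1 is already computing, so each middle step costs the maximum of the two times rather than their sum.

```python
def no_overlap_time(n_microbatches, compute, comm):
    """Naive schedule: each micro-batch computes, then waits for its
    communication to finish before the next one starts."""
    return n_microbatches * (compute + comm)

def overlapped_time(n_microbatches, compute, comm):
    """Overlapped schedule (toy model of the DualPipe idea): communication
    of micro-batch i is hidden behind computation of micro-batch i+1, so
    each middle step costs max(compute, comm) instead of compute + comm."""
    return compute + (n_microbatches - 1) * max(compute, comm) + comm

# with 8 micro-batches and equal per-step compute and communication cost:
print(no_overlap_time(8, 2, 2))   # 32 time units
print(overlapped_time(8, 2, 2))   # 18 time units -- the bubbles are mostly hidden
```

When compute and communication take similar time, overlap nearly halves the total: the GPUs spend far less time sitting idle waiting on data transfers.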
A commenter on The Next Platform describes DualPipe as "essentially creating a virtual DPU on the GPU itself to handle all-to-all communication", highlighting its role in optimizing data-transfer efficiency.
The paper goes into more detail: "In order to ensure sufficient computational performance for DualPipe, we customize efficient cross-node all-to-all communication kernels (including dispatching and combining) to conserve the number of SMs dedicated to communication. The implementation of the kernels is co-designed with the MoE gating algorithm and the network topology of our cluster."
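The "dispatching and combining" mentioned in the quote are the two halves of the MoE all-to-all exchange: tokens are first grouped by the expert the router chose for them, and the expert outputs are then scattered back into the original token order. The sketch below shows that logic on the host in plain Python; the function names are hypothetical, and DeepSeek's real version is implemented as custom GPU kernels co-designed with the network topology.

```python
from collections import defaultdict

def dispatch(tokens, assignments):
    """Group token indices by destination expert -- the 'dispatch'
    half of the all-to-all (hypothetical sketch)."""
    buckets = defaultdict(list)
    for idx, expert in enumerate(assignments):
        buckets[expert].append(idx)
    return buckets

def combine(results, buckets, n_tokens):
    """Scatter per-expert outputs back into original token order --
    the 'combine' half of the all-to-all."""
    out = [None] * n_tokens
    for expert, idxs in buckets.items():
        for pos, idx in enumerate(idxs):
            out[idx] = results[expert][pos]
    return out

# demo: 4 tokens routed to 2 experts; each expert "processes" its tokens
tokens = ["a", "b", "c", "d"]
assignments = [1, 0, 1, 0]                  # expert chosen for each token
buckets = dispatch(tokens, assignments)
results = {e: [tokens[i].upper() for i in idxs] for e, idxs in buckets.items()}
print(combine(results, buckets, len(tokens)))  # ['A', 'B', 'C', 'D']
```

In a cross-node deployment, each bucket becomes a network send to the node hosting that expert, which is why these kernels compete with computation for SMs unless carefully budgeted.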