- ReDrafter delivers 2.7x more tokens per second than traditional auto-regressive decoding
- ReDrafter could reduce latency for users while requiring fewer GPUs
- Apple has not said whether ReDrafter will come to rival AI GPUs from AMD and Intel.
Apple has announced a collaboration with Nvidia to accelerate large language model inference using its open source technology, Recurrent Drafter (or ReDrafter for short).
The partnership aims to tackle the computational cost of auto-regressive token generation, a key obstacle to improving efficiency and reducing latency in real-time LLM applications.
ReDrafter, introduced by Apple in November 2024, takes a speculative decoding approach that combines a recurrent neural network (RNN) draft model with beam search and dynamic tree attention. Apple's benchmarks show the method generating 2.7 times more tokens per second than traditional auto-regressive decoding.
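For readers curious how speculative decoding works in principle, here is a minimal illustrative sketch in Python. The `draft_next` and `target_logits` functions are toy stand-ins invented for this example, not Apple's or Nvidia's code, and the acceptance rule is a simplified greedy variant; ReDrafter itself pairs the RNN draft model with beam search and dynamic tree attention so that several candidate continuations can be verified at once.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 50

def draft_next(context):
    # Cheap "draft" step: a toy rule standing in for the RNN draft model.
    return (context[-1] * 7 + 3) % VOCAB

def target_logits(context):
    # Expensive "target" step: toy logits standing in for the full LLM.
    logits = rng.normal(size=VOCAB)
    logits[(context[-1] * 7 + 3) % VOCAB] += 4.0  # usually, but not always, agrees with the draft
    return logits

def speculative_step(context, k=4):
    """Draft k tokens cheaply, then verify them against the target model,
    keeping the longest accepted prefix plus one corrected token."""
    draft, ctx = [], list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)

    # In a real system the target model scores all k drafted positions in a
    # single forward pass; the per-position loop here is only for clarity.
    accepted, ctx = [], list(context)
    for tok in draft:
        best = int(np.argmax(target_logits(ctx)))
        accepted.append(best)
        ctx.append(best)
        if best != tok:  # first disagreement: discard the rest of the draft
            break
    return accepted

context = [1, 2, 3]
for _ in range(5):
    context += speculative_step(context)
print(context)
```

Because several drafted tokens can often be accepted per verification pass, the expensive model runs far fewer times per generated token, which is where the throughput gain over plain auto-regressive decoding comes from.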
Could it extend beyond Nvidia?
With its integration into Nvidia’s TensorRT-LLM framework, ReDrafter now enables faster LLM inference on the Nvidia GPUs widely used in production environments.
To accommodate ReDrafter’s algorithms, Nvidia introduced new operators and modified existing ones within TensorRT-LLM, making the technology available to any developer looking to optimize performance for large-scale models.
In addition to speed improvements, Apple says ReDrafter has the potential to reduce user latency while requiring fewer GPUs. This efficiency not only lowers computational costs but also cuts power consumption, a vital consideration for organizations managing large-scale AI deployments.
While the focus of this collaboration remains on Nvidia’s infrastructure for now, it’s possible that similar performance benefits could eventually extend to rival GPUs from AMD or Intel.
Advances like this can help improve the efficiency of machine learning. As Nvidia says, “This collaboration has made TensorRT-LLM more powerful and more flexible, allowing the LLM community to innovate more sophisticated models and easily deploy them with TensorRT-LLM to achieve unparalleled performance on Nvidia GPUs. These new features open up exciting possibilities, and we eagerly anticipate the next generation of advanced community models that leverage the capabilities of TensorRT-LLM, driving further improvements in LLM workloads.”
You can read more about the collaboration with Apple on the Nvidia Developer Tech Blog.