Apple and Nvidia have announced the integration of Apple’s ReDrafter technology into Nvidia’s TensorRT-LLM framework, enabling faster processing of large language models (LLMs) on Nvidia GPUs. ReDrafter, an open-source speculative decoding approach developed by Apple, uses a recurrent neural network draft model to predict future tokens during text generation, combined with beam search and tree attention algorithms.
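To make the draft-and-verify idea concrete, here is a minimal, hypothetical sketch of greedy speculative decoding. The toy `draft_model` and `target_model` below are stand-ins for illustration only, not Apple’s ReDrafter RNN or any TensorRT-LLM API; a real system would also verify all draft positions in a single batched forward pass of the target model rather than one call per position.

```python
def draft_model(prefix, k):
    # Toy drafter (assumption, not ReDrafter): predicts each next token
    # as (last + 1) mod 10, looking k tokens ahead.
    out, last = [], prefix[-1]
    for _ in range(k):
        last = (last + 1) % 10
        out.append(last)
    return out

def target_model(prefix):
    # Toy target model (assumption): agrees with the drafter except it
    # caps tokens at 7, so drafts of 8 or 9 get rejected.
    return min((prefix[-1] + 1) % 10, 7)

def speculative_step(prefix, k=4):
    """One decode step: draft k tokens cheaply, verify with the target
    model, accept the longest prefix of drafts the target agrees with,
    then append one target-chosen token, so progress is always >= 1."""
    drafts = draft_model(prefix, k)
    accepted, ctx = [], list(prefix)
    for d in drafts:
        t = target_model(ctx)
        if t == d:
            accepted.append(d)      # draft confirmed, keep going
            ctx.append(d)
        else:
            accepted.append(t)      # target's correction replaces the draft
            return prefix + accepted
    return prefix + accepted

# All four drafts accepted in one step:
print(speculative_step([0, 1, 2], k=4))  # → [0, 1, 2, 3, 4, 5, 6]
# Draft rejected at position 2; target's token substituted:
print(speculative_step([6], k=4))        # → [6, 7, 7]
```

The speedup comes from the cheap drafter letting the expensive target model emit several tokens per verification pass instead of one.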
The collaboration has resulted in significant performance improvements, with Apple reporting a 2.7x speed increase in token generation when testing a production model with tens of billions of parameters on Nvidia GPUs. This acceleration is achieved by moving validation and drafting processes inside the TensorRT-LLM engine, rather than relying on separate engines or runtime processing, which reduces computational overhead.
The implementation is compatible with Nvidia’s in-flight batching strategy, which allows simultaneous processing of multiple requests. According to Nvidia, the performance benefits of ReDrafter depend on various factors, including GPU utilization, token acceptance rates, and the specific task being performed. The technology is now available to developers through the TensorRT-LLM framework, offering potential benefits for production LLM applications across the industry.