Google has released Multi-Token Prediction (MTP) drafters for its Gemma 4 family of open AI models. The drafters can speed up inference by up to three times without degrading output quality. They target a key bottleneck in large language models: generation speed is often limited by memory bandwidth rather than raw computing power, since producing each token requires reading the model's weights from memory.
The enhancement uses a form of speculative decoding, in which a smaller, faster "drafter" model proposes several future tokens of text at once. The main, more powerful Gemma 4 model then verifies the proposed tokens in a single parallel pass instead of generating them one by one, and keeps only the tokens it agrees with, which is why output quality is preserved. The result is significantly faster text generation for applications running on local computers, on mobile devices, and in the cloud. The drafters are available under an open-source license.
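
To make the draft-then-verify loop concrete, here is a minimal Python sketch of speculative decoding with greedy acceptance. It is an illustration under stated assumptions, not Gemma 4's implementation or API: `target_next` and `drafter_next` are hypothetical toy functions standing in for the main model and the drafter, tokens are plain integers, and the verification step is written as a loop even though a real system would score all drafted positions in one parallel forward pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_next(context: List[int]) -> int:
    """Hypothetical stand-in for the main model: deterministic greedy next token."""
    return (3 * context[-1] + len(context)) % 10


def drafter_next(context: List[int]) -> int:
    """Hypothetical stand-in for the drafter: usually agrees with the target."""
    guess = target_next(context)
    # Deliberately wrong whenever the context length is a multiple of 3,
    # so the rejection path is exercised.
    return (guess + 1) % 10 if len(context) % 3 == 0 else guess


def greedy_generate(prompt: List[int], target: NextToken, num_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(num_new):
        out.append(target(out))
    return out


def speculative_generate(
    prompt: List[int],
    target: NextToken,
    drafter: NextToken,
    num_new: int,
    k: int = 4,
) -> Tuple[List[int], int]:
    """Return the generated tokens and the number of verification passes."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < num_new:
        # 1. The drafter proposes k tokens, each conditioned on the previous ones.
        draft: List[int] = []
        for _ in range(k):
            draft.append(drafter(out + draft))

        # 2. The target checks every drafted position. This loop simulates
        #    what would be a single parallel pass in a real system.
        passes += 1
        accepted: List[int] = []
        for i in range(k):
            expected = target(out + draft[:i])
            accepted.append(expected)  # equals draft[i] when the draft is right
            if draft[i] != expected:
                break  # first mismatch: keep the target's token, drop the rest
        else:
            # Every drafted token matched, so the same pass yields one extra token.
            accepted.append(target(out + draft))
        out.extend(accepted)
    return out[: len(prompt) + num_new], passes


baseline = greedy_generate([1], target_next, 12)
tokens, passes = speculative_generate([1], target_next, drafter_next, 12)
print(tokens == baseline, passes)
```

With these toy functions, the speculative loop produces the same 12 tokens as the one-by-one baseline while calling the verification step 4 times instead of 12. The sketch uses exact-match (greedy) acceptance for simplicity; systems that sample from the model typically use a probabilistic acceptance rule instead.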