How we made state-of-the-art speech synthesis scalable with
Modular
We’ve partnered with Modular to supercharge
Inworld TTS, combining our
state-of-the-art voice quality with Modular's world-class serving
stack to deliver breakthrough speed and affordability for every
developer.
The challenge
Real-time AI applications, particularly speech synthesis, demand more
than traditional machine learning infrastructure can provide. The
computational intensity of generating realistic, low-latency speech
makes it a significant challenge. To enable scalable and economically
viable voice AI applications, specialized APIs are essential. But
high-quality voice APIs are usually either expensive, slow, or both,
while cheaper and faster alternatives often lack realism. We designed
Inworld TTS to eliminate this trade-off.
Solving this required more than just optimizing our model; it demanded a
fundamental redesign of the entire inference stack. This is where our
collaboration with Modular began.
The solution
Our partnership with Modular is a co-engineered effort: both companies'
engineering teams worked together to serve our proprietary text-to-speech
model on Modular's MAX framework. Together, we went from initial engagement
to the world's most advanced speech pipeline running on NVIDIA Blackwell
(B200) GPUs in less than eight weeks.
By using MAX, we achieved a remarkable improvement in both latency and
throughput. In streaming mode, the API now returns the first two seconds of
synthesized audio ~70% faster on average compared to our vanilla vLLM-based
implementation. This allowed us to serve more QPS at lower latency and,
ultimately, to offer the API at a ~60% lower price than would have been
possible without Modular's stack.
Let’s dive deeper into how this was done, starting with the modeling side so
it's clear exactly what we had to optimize.
TTS architecture
Inworld TTS relies on adapting and scaling a cutting-edge,
open-source-inspired tech stack. The model architecture is a
Speech-Language Model (SpeechLM) built upon an in-house neural audio
codec and an LLM backbone.
Two-Stage TTS Architecture
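Conceptually, inference runs in two stages: the SpeechLM turns text into
discrete speech tokens, and the audio codec's decoder turns those tokens into
a waveform. The sketch below is a minimal, hypothetical illustration of that
flow; the class and method names (`SpeechLM`, `AudioDecoder`,
`generate_speech_tokens`, `decode_to_waveform`) are placeholders, not
Inworld's actual API.

```python
import numpy as np

class SpeechLM:
    """Stage 1 (illustrative): an LLM backbone mapping text to discrete speech tokens."""
    def generate_speech_tokens(self, text: str, voice_id: str) -> np.ndarray:
        # The real model decodes codec tokens autoregressively; this just
        # returns a dummy token sequence for illustration.
        return np.random.randint(0, 1024, size=(len(text) * 4,), dtype=np.int32)

class AudioDecoder:
    """Stage 2 (illustrative): a neural audio codec decoder producing waveform samples."""
    def decode_to_waveform(self, tokens: np.ndarray) -> np.ndarray:
        # Placeholder: a real decoder reconstructs audio from the codec tokens.
        return np.zeros(tokens.shape[0] * 256, dtype=np.float32)

def synthesize(text: str, voice_id: str = "default") -> np.ndarray:
    """Full two-stage pipeline: text -> speech tokens -> waveform."""
    speech_lm, decoder = SpeechLM(), AudioDecoder()
    tokens = speech_lm.generate_speech_tokens(text, voice_id)
    return decoder.decode_to_waveform(tokens)

audio = synthesize("Hello from a two-stage TTS pipeline!")
```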
Its training process involved three key stages:
- We pre-trained the model on a massive dataset of ~1 million hours of
diverse audio and ~200 billion text tokens to build a foundational
understanding of human speech, prosody, and language.
- The model was then refined through supervised fine-tuning on ~200,000
hours of high-quality annotated audio to teach it to perform speech
synthesis.
- Finally, we used RLHF on a curated multilingual dataset to enhance the
model's stability.
Ultimately, this training approach produced a model that is inherently
robust, covering the diverse voices and challenging edge cases found in
the real world.
This straightforward two-component design makes the system streaming-ready
out of the box, which is essential for real-time use cases. When streaming,
clients receive the first meaningful audio segment (approximately two
seconds long) within a median latency as low as 200ms. This chunk can be
played back immediately while the rest of the audio is synthesized, which
usually takes less time than the initial playback.
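To make the client side of this concrete, here is a hypothetical sketch of
consuming a streaming TTS response and starting playback as soon as the
first chunk arrives. The endpoint URL, request payload, and `play_chunk`
helper are assumptions for illustration only, not Inworld's actual API.

```python
import requests

def play_chunk(pcm_bytes: bytes) -> None:
    # Placeholder for an audio sink (e.g., a sound device or file writer).
    print(f"playing {len(pcm_bytes)} bytes of audio")

# Hypothetical streaming TTS request; URL and payload fields are illustrative.
response = requests.post(
    "https://api.example.com/tts/v1/stream",
    json={"text": "Streaming lets playback start before synthesis finishes.",
          "voice": "default"},
    stream=True,
    timeout=60,
)
response.raise_for_status()

# The first chunk (~2 seconds of audio) typically arrives within ~200 ms,
# so playback can begin while the remaining audio is still being generated.
for chunk in response.iter_content(chunk_size=None):
    if chunk:
        play_chunk(chunk)
```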
The tricky part is that the two components have different performance
characteristics, which can lead to bottlenecks. For example, the audio
decoder can get stuck waiting for the SpeechLM to produce tokens,
wasting valuable GPU time. To overcome this, the Modular team worked
closely with us to port the original implementation - thousands of lines
of PyTorch modeling code - into MAX’s Python-based graph format. This
enabled efficient, end-to-end execution of the full TTS pipeline with
high throughput and low latency on NVIDIA Blackwell GPUs.
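The sketch below illustrates the scheduling problem in a deliberately
simplified form: if token generation and audio decoding run strictly
back-to-back, the decoder idles while waiting for tokens; overlapping them
as a producer/consumer pipeline keeps both stages busy. The queue-and-thread
structure is a generic illustration of the idea, not Modular's scheduler.

```python
import queue
import threading
import time

token_queue: "queue.Queue[list[int] | None]" = queue.Queue(maxsize=8)

def speech_lm_producer(num_chunks: int = 5) -> None:
    """Stand-in for the SpeechLM: emits batches of speech tokens as they are generated."""
    for i in range(num_chunks):
        time.sleep(0.05)                      # pretend autoregressive decoding takes time
        token_queue.put(list(range(i * 64, (i + 1) * 64)))
    token_queue.put(None)                     # sentinel: generation finished

def audio_decoder_consumer() -> None:
    """Stand-in for the audio decoder: starts decoding as soon as tokens arrive
    instead of waiting for the full sequence, so compute time is not wasted idling."""
    while (tokens := token_queue.get()) is not None:
        time.sleep(0.02)                      # pretend codec decoding takes time
        print(f"decoded {len(tokens)} tokens into an audio chunk")

producer = threading.Thread(target=speech_lm_producer)
consumer = threading.Thread(target=audio_decoder_consumer)
producer.start(); consumer.start()
producer.join(); consumer.join()
```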
The Inworld TTS Serving Stack
At the heart of the Inworld TTS serving stack are two foundational
technologies that are part of the Modular Platform:
- 🧑🏻‍🚀 MAX is a universal, high-performance AI serving framework built
to deliver state-of-the-art performance across GPU and CPU platforms.
It combines advanced batching, graph-level optimizations, memory
planning, and fine-grained kernel scheduling in a single containerized
runtime with an OpenAI-compatible serving endpoint (see the example
after this list).
- 🔥 Mojo is a systems programming language focused on AI
kernels. It bridges the gap between Python's ease of use and
speed-of-light performance, offering fine-grained control over memory
layout, parallelism, and vectorized execution. Mojo serves as the
foundational layer for writing kernels and graph transformations
within MAX - making the entire stack programmable and portable,
without sacrificing efficiency.
MAX is a vertically integrated, high-performance AI inference stack
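One practical benefit of an OpenAI-compatible endpoint is that standard
client code keeps working when you just point it at the server. The snippet
below is a hedged illustration of that pattern using the `openai` Python
client; the base URL, API key, model name, and the availability of a
speech route on this particular deployment are assumptions, not a documented
Inworld or MAX interface.

```python
from openai import OpenAI

# Point the standard OpenAI client at a self-hosted, OpenAI-compatible endpoint.
# The base_url, api_key, and model name below are placeholders for illustration.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Hypothetical speech request using the OpenAI-style audio API surface.
with client.audio.speech.with_streaming_response.create(
    model="inworld-tts",
    voice="default",
    input="An OpenAI-compatible endpoint keeps client code unchanged.",
) as response:
    response.stream_to_file("output.wav")
```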
MAX and Mojo enabled deep system-level optimization across the TTS
stack. MAX’s streaming-aware scheduler, designed to minimize
time-to-first-audio, combined with its optimized kernel library,
delivered a ~1.6x speedup on the SpeechLM. Additional system-level gains
came from enabling faster data types in the audio decoder, overlapping
dependent kernel execution, and optimizing both sampling logic and
memory allocation.
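For a sense of what "faster data types" and overlapped execution look like
in practice, here is a minimal, generic PyTorch sketch under stated
assumptions (it is not the MAX implementation): the stand-in decoder runs in
bfloat16, and independent work is issued on a separate CUDA stream so it can
overlap with work on the default stream.

```python
import torch
import torch.nn as nn

# Generic stand-in for an audio decoder; the real decoder is far more complex.
decoder = nn.Sequential(nn.Linear(1024, 2048), nn.GELU(), nn.Linear(2048, 1024))

if torch.cuda.is_available():
    device = torch.device("cuda")
    # Faster data type: run the decoder in bfloat16 instead of float32.
    decoder = decoder.to(device=device, dtype=torch.bfloat16)
    tokens_a = torch.randn(64, 1024, device=device, dtype=torch.bfloat16)
    tokens_b = torch.randn(64, 1024, device=device, dtype=torch.bfloat16)

    side_stream = torch.cuda.Stream()
    with torch.no_grad():
        out_a = decoder(tokens_a)             # work on the default stream
        with torch.cuda.stream(side_stream):  # independent work overlaps on another stream
            out_b = decoder(tokens_b)
    torch.cuda.synchronize()                  # wait for both streams to finish
```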
Mojo adds another layer of performance and flexibility through its
ability to define high-efficiency custom kernels. Using this capability,
we accelerated streaming with a tailored silence-detection kernel
that runs directly on the GPU, enabling on-device output processing and
maximizing GPU utilization. We’re also actively exploring additional
enhancements to streamline intermediate token processing - specifically,
eliminating redundant memory transfers between the SpeechLM and the audio
decoder - which will further reduce latency and improve throughput.
What this means for your applications
Inworld TTS already uses this optimized infrastructure in production
with measurable improvements in both user experience and operational
efficiency. This validation was crucial as it demonstrated that the
Modular Platform can deliver on its promises in real-world scenarios.
When you build with Inworld, your end users get the direct benefits of
our Modular collaboration:
- Deliver truly instant interactions. Thanks to MAX's streaming-aware
scheduler, your application gets the first chunk of audio in as little as
200ms, eliminating awkward pauses and keeping users immersed.
- Scale your application without fear of cost. By optimizing the entire
stack for high throughput, we cut the price by ~60%. You can now serve more
users and deploy rich voice experiences at a cost that is ~20x lower than
alternatives.
- Ensure seamless performance under load. Our architecture is built for
high load, ensuring your application can serve the QPS you need. The user
experience remains seamless and responsive, even during traffic spikes.
We'll soon share exciting stories about how our customers are benefiting
from our partnership with Modular.
The future of AI infrastructure
Our collaboration with Modular is a glimpse into the future of accessible AI
infrastructure. We envision a world where developers can focus on delivering
the best user experience for their AI applications without worrying about
the underlying complexities of hardware optimization, vendor lock-in, or
infrastructure scalability.
The improvements we've demonstrated together prove that strategic
partnerships can accelerate innovation across the entire AI ecosystem.
By combining Modular's infrastructure expertise with our modeling
solutions, we're pushing the boundaries of what's possible in
AI. More is yet to come, so stay tuned!
Appendix