How we made state-of-the-art speech synthesis scalable with
Modular
We’ve partnered with Modular to supercharge
Inworld TTS, combining our
state-of-the-art voice quality with Modular's world-class serving
stack to deliver breakthrough speed and affordability for every
developer.
The challenge
Real-time AI applications, particularly speech synthesis, demand more
than traditional machine learning infrastructure can provide. The
computational intensity of generating realistic, low-latency speech
makes it a significant challenge. To enable scalable and economically
viable voice AI applications, specialized APIs are essential. But
high-quality voice APIs are usually either expensive, slow, or both,
while cheaper and faster alternatives often lack realism. We designed
Inworld TTS to eliminate this trade-off.
Solving this required more than just optimizing our model; it demanded a
fundamental redesign of the entire inference stack. This is where our
collaboration with Modular began.
The solution
Our partnership with Modular is a co-engineered effort: both companies'
engineering teams worked together to serve our proprietary text-to-speech
model on Modular's MAX framework. Together, we went from initial engagement
to the world's most advanced speech pipeline running on NVIDIA Blackwell
(B200) GPUs in less than eight weeks.
By using MAX, we achieved a remarkable improvement in both latency and
throughput. In streaming mode, the API now returns the first two seconds of
synthesized audio ~70% faster on average compared to our vanilla vLLM-based
implementation. This allowed us to serve more QPS at lower latency and,
ultimately, to offer the API at a ~60% lower price than would have been
possible without Modular's stack.
Let’s dive deeper into how this was done, starting with the modeling side so
it's clear exactly what we had to optimize.
TTS architecture
Inworld TTS relies on adapting and scaling a cutting-edge,
open-source-inspired tech stack. The model architecture is a
Speech-Language Model (SpeechLM) built upon an in-house neural audio
codec and an LLM backbone.
Two-Stage TTS Architecture
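Conceptually, inference runs in two stages: the SpeechLM turns text into
discrete speech tokens, and the audio codec's decoder turns those tokens into
a waveform. The sketch below is a minimal, hypothetical illustration of that
flow; the class and method names (`SpeechLM`, `AudioDecoder`,
`generate_speech_tokens`, `decode_to_waveform`) are placeholders, not
Inworld's actual API.

```python
import numpy as np

class SpeechLM:
    """Stage 1 (illustrative): an LLM backbone mapping text to discrete speech tokens."""
    def generate_speech_tokens(self, text: str, voice_id: str) -> np.ndarray:
        # The real model decodes codec tokens autoregressively; this just
        # returns a dummy token sequence for illustration.
        return np.random.randint(0, 1024, size=(len(text) * 4,), dtype=np.int32)

class AudioDecoder:
    """Stage 2 (illustrative): a neural audio codec decoder producing waveform samples."""
    def decode_to_waveform(self, tokens: np.ndarray) -> np.ndarray:
        # Placeholder: a real decoder reconstructs audio from the codec tokens.
        return np.zeros(tokens.shape[0] * 256, dtype=np.float32)

def synthesize(text: str, voice_id: str = "default") -> np.ndarray:
    """Full two-stage pipeline: text -> speech tokens -> waveform."""
    speech_lm, decoder = SpeechLM(), AudioDecoder()
    tokens = speech_lm.generate_speech_tokens(text, voice_id)
    return decoder.decode_to_waveform(tokens)

audio = synthesize("Hello from a two-stage TTS pipeline!")
```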
Its training process involved three key stages:
- We pre-trained the model on a massive dataset of ~1 million hours of
diverse audio and ~200 billion text tokens to build a foundational
understanding of human speech, prosody, and language.
- The model was then refined through supervised fine-tuning on ~200,000
hours of high-quality annotated audio to teach it to perform speech
synthesis.
- Finally, we used RLHF on a curated multilingual dataset to enhance the
model's stability.
Ultimately, this training approach produced a model that is inherently
robust, covering the diverse voices and challenging edge cases found in
the real world.
This straightforward two-component design makes the system streaming-ready
out of the box, which is essential for real-time use cases. When streaming,
clients receive the first meaningful audio segment (approximately two
seconds long) within a median latency as low as 200ms. This chunk can be
played back immediately while the rest of the audio is synthesized, which
usually takes less time than the initial playback.
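To make the client side of this concrete, here is a hypothetical sketch of
consuming a streaming TTS response and starting playback as soon as the
first chunk arrives. The endpoint URL, request payload, and `play_chunk`
helper are assumptions for illustration only, not Inworld's actual API.

```python
import requests

def play_chunk(pcm_bytes: bytes) -> None:
    # Placeholder for an audio sink (e.g., a sound device or file writer).
    print(f"playing {len(pcm_bytes)} bytes of audio")

# Hypothetical streaming TTS request; URL and payload fields are illustrative.
response = requests.post(
    "https://api.example.com/tts/v1/stream",
    json={"text": "Streaming lets playback start before synthesis finishes.",
          "voice": "default"},
    stream=True,
    timeout=60,
)
response.raise_for_status()

# The first chunk (~2 seconds of audio) typically arrives within ~200 ms,
# so playback can begin while the remaining audio is still being generated.
for chunk in response.iter_content(chunk_size=None):
    if chunk:
        play_chunk(chunk)
```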
The tricky part is that the two components have different performance
characteristics, which can lead to bottlenecks. For example, the audio
decoder can get stuck waiting for the SpeechLM to produce tokens,
wasting valuable GPU time. To overcome this, the Modular team worked
closely with us to port the original implementation - thousands of lines
of PyTorch modeling code - into MAX’s Python-based graph format. This
enabled efficient, end-to-end execution of the full TTS pipeline with
high throughput and low latency on NVIDIA Blackwell GPUs.
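The sketch below illustrates the scheduling problem in a deliberately
simplified form: if token generation and audio decoding run strictly
back-to-back, the decoder idles while waiting for tokens; overlapping them
as a producer/consumer pipeline keeps both stages busy. The queue-and-thread
structure is a generic illustration of the idea, not Modular's scheduler.

```python
import queue
import threading
import time

token_queue: "queue.Queue[list[int] | None]" = queue.Queue(maxsize=8)

def speech_lm_producer(num_chunks: int = 5) -> None:
    """Stand-in for the SpeechLM: emits batches of speech tokens as they are generated."""
    for i in range(num_chunks):
        time.sleep(0.05)                      # pretend autoregressive decoding takes time
        token_queue.put(list(range(i * 64, (i + 1) * 64)))
    token_queue.put(None)                     # sentinel: generation finished

def audio_decoder_consumer() -> None:
    """Stand-in for the audio decoder: starts decoding as soon as tokens arrive
    instead of waiting for the full sequence, so compute time is not wasted idling."""
    while (tokens := token_queue.get()) is not None:
        time.sleep(0.02)                      # pretend codec decoding takes time
        print(f"decoded {len(tokens)} tokens into an audio chunk")

producer = threading.Thread(target=speech_lm_producer)
consumer = threading.Thread(target=audio_decoder_consumer)
producer.start(); consumer.start()
producer.join(); consumer.join()
```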
The Inworld TTS Serving Stack
At the heart of the Inworld TTS serving stack are two foundational
technologies that are part of the Modular Platform:
- 🧑🏻‍🚀 MAX is a universal, high-performance AI serving framework built
to deliver state-of-the-art performance across GPU and CPU platforms.
It combines advanced batching, graph-level optimizations, memory
planning, and fine-grained kernel scheduling in a single containerized
runtime with an OpenAI-compatible serving endpoint (see the example
after this list).
- 🔥 Mojo is a systems programming language focused on AI
kernels. It bridges the gap between Python's ease of use and
speed-of-light performance, offering fine-grained control over memory
layout, parallelism, and vectorized execution. Mojo serves as the
foundational layer for writing kernels and graph transformations
within MAX - making the entire stack programmable and portable,
without sacrificing efficiency.
MAX is a vertically integrated, high-performance AI inference stack
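One practical benefit of an OpenAI-compatible endpoint is that standard
client code keeps working when you just point it at the server. The snippet
below is a hedged illustration of that pattern using the `openai` Python
client; the base URL, API key, model name, and the availability of a
speech route on this particular deployment are assumptions, not a documented
Inworld or MAX interface.

```python
from openai import OpenAI

# Point the standard OpenAI client at a self-hosted, OpenAI-compatible endpoint.
# The base_url, api_key, and model name below are placeholders for illustration.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Hypothetical speech request using the OpenAI-style audio API surface.
with client.audio.speech.with_streaming_response.create(
    model="inworld-tts",
    voice="default",
    input="An OpenAI-compatible endpoint keeps client code unchanged.",
) as response:
    response.stream_to_file("output.wav")
```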
MAX and Mojo enabled deep system-level optimization across the TTS
stack. MAX’s streaming-aware scheduler, designed to minimize
time-to-first-audio, combined with its optimized kernel library,
delivered a ~1.6x speedup on the SpeechLM. Additional system-level gains
came from enabling faster data types in the audio decoder, overlapping
dependent kernel execution, and optimizing both sampling logic and
memory allocation.
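For a sense of what "faster data types" and overlapped execution look like
in practice, here is a minimal, generic PyTorch sketch under stated
assumptions (it is not the MAX implementation): the stand-in decoder runs in
bfloat16, and independent work is issued on a separate CUDA stream so it can
overlap with work on the default stream.

```python
import torch
import torch.nn as nn

# Generic stand-in for an audio decoder; the real decoder is far more complex.
decoder = nn.Sequential(nn.Linear(1024, 2048), nn.GELU(), nn.Linear(2048, 1024))

if torch.cuda.is_available():
    device = torch.device("cuda")
    # Faster data type: run the decoder in bfloat16 instead of float32.
    decoder = decoder.to(device=device, dtype=torch.bfloat16)
    tokens_a = torch.randn(64, 1024, device=device, dtype=torch.bfloat16)
    tokens_b = torch.randn(64, 1024, device=device, dtype=torch.bfloat16)

    side_stream = torch.cuda.Stream()
    with torch.no_grad():
        out_a = decoder(tokens_a)             # work on the default stream
        with torch.cuda.stream(side_stream):  # independent work overlaps on another stream
            out_b = decoder(tokens_b)
    torch.cuda.synchronize()                  # wait for both streams to finish
```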
Mojo adds another layer of performance and flexibility through its
ability to define high-efficiency custom kernels. Using this capability,
we accelerated streaming with a tailored silence-detection kernel
that runs directly on the GPU, enabling on-device output processing and
maximizing GPU utilization. We’re also actively exploring additional
enhancements to streamline intermediate token processing - specifically,
eliminating redundant memory transfers between the SpeechLM and the audio
decoder - which will further reduce latency and improve throughput.
What this means for your applications
Inworld TTS already uses this optimized infrastructure in production
with measurable improvements in both user experience and operational
efficiency. This validation was crucial as it demonstrated that the
Modular Platform can deliver on its promises in real-world scenarios.
When you build with Inworld, your end users get the direct benefits of
our Modular collaboration:
- Deliver truly instant interactions. Thanks to MAX's streaming-aware
scheduler, your application gets the first chunk of audio in as little as
200ms, eliminating awkward pauses and keeping users immersed.
- Scale your application without fear of cost. By optimizing the entire
stack for high throughput, we cut the price by ~60%. You can now serve more
users and deploy rich voice experiences at a cost that is ~20x lower than
alternatives.
- Ensure seamless performance under load. Our architecture is built for
high load, ensuring your application can serve the QPS you need. The user
experience remains seamless and responsive, even during traffic spikes.
We'll soon share exciting stories about how our customers are benefiting
from our partnership with Modular.
The future of AI infrastructure
Our collaboration with Modular is a glimpse into the future of accessible AI
infrastructure. We envision a world where developers can focus on delivering
the best user experience for their AI applications without worrying about
the underlying complexities of hardware optimization, vendor lock-in, or
infrastructure scalability.
The improvements we've demonstrated together prove that strategic
partnerships can accelerate innovation across the entire AI ecosystem.
By combining Modular's infrastructure expertise with our modeling
solutions, we're pushing the boundaries of what's possible in
AI. More is yet to come, so stay tuned!
Appendix