Tacotron 2 Research
Realistic Voice Synthesis in Game Development
AI Voices and Keeping Curse of the Dragonguard True to Oblivion
One of the biggest challenges in game modding is voice acting. It’s often the first thing that breaks immersion, and for Curse of the Dragonguard, I wanted a solution that felt like a natural extension of Oblivion, not a tacked-on addition. That meant finding a way to create new dialogue that blended seamlessly with the original game—without relying on stiff, low-quality text-to-speech (TTS) or the time-consuming process of splicing together old voice lines.
None of the existing options were good enough. Pre-made TTS models lacked the nuance needed to match the game’s tone, and many locked their best features behind paywalls. Editing existing voice clips was tedious and limiting. So, I took another approach—one that even Bethesda never pursued. I trained an AI model to replicate the original voice actors as accurately as possible.
Building an AI Voice Model from Scratch
Wes Johnson, the voice behind many of Oblivion’s Imperial males, was the perfect starting point. His extensive in-game dialogue provided a strong dataset, but getting an AI model to generate convincing voice lines wasn’t as simple as pressing a button. Every step had to be done manually: extracting and transcribing the game’s voice files, formatting them properly, and preparing them for Tacotron 2, a neural network for speech synthesis.
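For anyone curious what that formatting step actually involves, it boils down to pairing each extracted clip with its transcript in the plain "audio_path|text" filelist layout that the open-source Tacotron 2 implementation reads. The sketch below is illustrative only: the folder names and the transcripts.csv file (with "filename" and "text" columns) are hypothetical stand-ins for whatever your transcription pass produces.

```python
import csv
from pathlib import Path

# Hypothetical layout: extracted .wav clips in wavs/ and a transcripts.csv
# with "filename" and "text" columns produced during manual transcription.
WAV_DIR = Path("wavs")
TRANSCRIPTS = Path("transcripts.csv")
OUT_FILELIST = Path("filelists/imperial_male_train.txt")

def build_filelist():
    """Write a Tacotron 2-style filelist: one 'path|text' pair per line."""
    OUT_FILELIST.parent.mkdir(parents=True, exist_ok=True)
    lines = []
    with TRANSCRIPTS.open(newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            wav_path = WAV_DIR / row["filename"]
            text = row["text"].strip()
            if wav_path.exists() and text:  # skip missing audio or empty transcripts
                lines.append(f"{wav_path}|{text}")
    OUT_FILELIST.write_text("\n".join(lines) + "\n", encoding="utf-8")
    print(f"Wrote {len(lines)} entries to {OUT_FILELIST}")

if __name__ == "__main__":
    build_filelist()
```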
This wasn’t a plug-and-play process. Training an AI model isn’t just about having enough data—it’s about structuring it the right way. The voice files had to be cleaned up, aligned with transcripts, and formatted carefully. Training itself was another hurdle, especially after Google Colab dropped support for older PyTorch versions, forcing a shift to local hardware.
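The cleanup stage is mundane but essential. Here is a minimal sketch of what it can look like, assuming the librosa and soundfile libraries: each clip is resampled to mono 22.05 kHz (the rate Tacotron 2 is typically trained at) and has leading and trailing silence trimmed. The directory names and the 30 dB trim threshold are illustrative choices, not the exact settings used for this project.

```python
from pathlib import Path

import librosa
import soundfile as sf

IN_DIR = Path("wavs_raw")   # hypothetical: raw clips extracted from the game
OUT_DIR = Path("wavs")      # cleaned clips the training filelist points at
TARGET_SR = 22050           # sample rate Tacotron 2 models are usually trained on
TRIM_DB = 30                # anything 30 dB below peak is treated as silence

def clean_clip(src: Path, dst: Path) -> None:
    """Resample a clip to mono 22.05 kHz and trim leading/trailing silence."""
    audio, _ = librosa.load(src, sr=TARGET_SR, mono=True)
    trimmed, _ = librosa.effects.trim(audio, top_db=TRIM_DB)
    sf.write(dst, trimmed, TARGET_SR)

if __name__ == "__main__":
    OUT_DIR.mkdir(parents=True, exist_ok=True)
    for wav in sorted(IN_DIR.glob("*.wav")):
        clean_clip(wav, OUT_DIR / wav.name)
```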
The effort paid off. The AI didn't just sound like Wes Johnson; it captured the cadence and delivery of an Oblivion NPC. The result wasn't a synthetic voice reading new lines, but a character who felt like he had always been there.
Refinements and the Future of AI Voices in Modding
There’s still room for improvement. A larger dataset (around 500 lines) would refine accuracy, and better preprocessing—trimming silence, categorizing lines by tone, and filtering for quality—would make the model more adaptable. The real game-changer, though, would be automation: tools that streamline dataset preparation so modders don’t have to do everything by hand.
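As a taste of what that automation might look like, here is a rough quality filter that drops clips outside a sensible duration range or below a loudness floor, so low-quality lines never reach the training set. The thresholds are illustrative guesses rather than tuned values from this project.

```python
from pathlib import Path

import numpy as np
import soundfile as sf

WAV_DIR = Path("wavs")
MIN_SEC, MAX_SEC = 1.0, 10.0  # illustrative duration bounds for a usable clip
MIN_RMS = 0.01                # illustrative loudness floor; very quiet clips are often noise

def keep_clip(path: Path) -> bool:
    """Return True if the clip falls within duration and loudness bounds."""
    audio, sr = sf.read(path)
    if audio.ndim > 1:        # fold stereo to mono before measuring
        audio = audio.mean(axis=1)
    duration = len(audio) / sr
    rms = float(np.sqrt(np.mean(np.square(audio))))
    return MIN_SEC <= duration <= MAX_SEC and rms >= MIN_RMS

if __name__ == "__main__":
    kept = [w for w in sorted(WAV_DIR.glob("*.wav")) if keep_clip(w)]
    print(f"Kept {len(kept)} of {len(list(WAV_DIR.glob('*.wav')))} clips")
```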
The real challenge isn’t the technology—it’s how it’s used. AI-generated voices aren’t about replacing creativity; they’re about making new things possible. Without tools like this, modders are stuck recycling the same old voice lines, rather than expanding on the worlds they love.
The future of AI voice in modding isn’t about whether it can be done. It’s about whether creators will use it to push boundaries or stick to the limitations of the past. The tools are here—the only question is how far we’re willing to take them.
AI-Generated TTS Research Demonstration: A Simulation of Wes Johnson's Voice
Disclaimer: This audio is a product of research and is intended for demonstration purposes only.
Shen, Jonathan, et al. "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions." 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2018.