Voice acting is critical to immersive storytelling, especially in expansive, narrative-driven games like Bethesda's The Elder Scrolls IV: Oblivion. Traditional modding struggles to deliver seamless dialogue expansions, because new lines must either be recorded manually or spliced together from existing voice clips. The Curse of the Dragonguard project addresses this challenge by leveraging neural voice synthesis.
The objective of this research is to create a multimodal Text-to-Speech (TTS) pipeline capable of generating highly realistic, expressive character dialogue that matches the original voice actors' delivery. By utilizing AI-driven TTS techniques, modders can dynamically generate context-sensitive speech, enriching the player’s experience without breaking immersion.
Reference: Shen, Jonathan, et al. "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions." 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
Our solution combines the Tacotron 2 and WaveNet architectures (a minimal inference sketch follows the lists below):
Tacotron 2: A sequence-to-sequence network for end-to-end speech synthesis that converts input text into mel spectrograms, using a bidirectional LSTM encoder and a location-sensitive attention decoder.
WaveNet: An autoregressive generative model built from dilated causal convolutions that serves as the vocoder, transforming the predicted mel spectrograms into realistic audio waveforms.
This hybrid approach allows for:
Accurate replication of original voice actor characteristics (e.g., Wes Johnson’s Imperial Male voices).
Rich, nuanced vocal expressions, including emotional intonations and natural speech cadence.
Scalability and adaptability for diverse characters with limited datasets.
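The two stages chain together as text → mel spectrogram → waveform. As a minimal sketch of that flow, the example below uses NVIDIA's publicly available Tacotron 2 and WaveGlow torch.hub checkpoints as stand-ins; the project itself pairs Tacotron 2 with a WaveNet vocoder trained on Oblivion audio, so the checkpoint names, the sample line, and the GPU setup here are illustrative assumptions rather than the project's actual configuration.

```python
import torch

# Stand-in checkpoints from NVIDIA's published torch.hub examples; the project's
# own pipeline would swap in a WaveNet vocoder fine-tuned on Oblivion voice data.
device = 'cuda'
tacotron2 = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub',
                           'nvidia_tacotron2', model_math='fp16').to(device).eval()
waveglow = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub',
                          'nvidia_waveglow', model_math='fp16')
waveglow = waveglow.remove_weightnorm(waveglow).to(device).eval()
utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tts_utils')

# Hypothetical line of mod dialogue.
text = "The Dragonguard will not fall while I still draw breath."
sequences, lengths = utils.prepare_input_sequence([text])

with torch.no_grad():
    mel, _, _ = tacotron2.infer(sequences, lengths)  # stage 1: text -> mel spectrogram
    audio = waveglow.infer(mel)                      # stage 2: mel -> waveform

waveform = audio[0].float().cpu().numpy()            # 22,050 Hz mono waveform
```

Because the two stages communicate only through the mel spectrogram, either component can be retrained or replaced independently, which is what makes per-character vocoder fine-tuning practical.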
A structured dataset of approximately 60 hours of lossless audio extracted from Oblivion’s game files forms the backbone of this research. Audio preprocessing involves the following steps (a worked example follows the list):
Precise transcription and text normalization
Silence trimming and alignment for accurate model training
Categorization of voice lines by tone, emotion, and speaker identity
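To make the trimming and feature-extraction steps concrete, the sketch below uses librosa with typical Tacotron 2 settings (22,050 Hz sample rate, 1,024-point FFT, 256-sample hop, 80 mel bands); the file path and parameter values are assumptions for illustration, not the project's exact configuration.

```python
import librosa
import numpy as np

# Hypothetical path to one extracted Oblivion voice line.
WAV_PATH = "data/imperial_male/greeting_001.wav"

# Load and resample to the target rate used for acoustic-model training.
audio, sr = librosa.load(WAV_PATH, sr=22050)

# Trim leading/trailing silence so text-audio alignments start on actual speech.
audio, _ = librosa.effects.trim(audio, top_db=30)

# Compute the 80-band mel spectrogram the acoustic model learns to predict.
mel = librosa.feature.melspectrogram(
    y=audio, sr=sr, n_fft=1024, hop_length=256, win_length=1024, n_mels=80
)

# Log-compress, since Tacotron-style models are trained on log-mel features.
log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))

print(f"{log_mel.shape[1]} frames of 80-band log-mel features")
```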
Preliminary training on Wes Johnson’s character lines has demonstrated significant success. Below is a synthesized voice sample generated with the developed Tacotron 2 + WaveNet pipeline:
AI-Generated Voice Sample:
Disclaimer: This audio is a product of research and is intended for demonstration purposes only.
Future research will focus on:
Enhancing dataset preparation through automated preprocessing tools
Exploring zero-shot speaker cloning techniques for broader voice applicability
Integrating contextual multimodal inputs (game events, emotional tags) to dynamically adapt dialogue synthesis; one possible conditioning approach is sketched after this list
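As one possible, not yet finalized, way to integrate emotional tags, the sketch below adds a learned emotion embedding to Tacotron-style encoder outputs before attention; the module, tag names, and dimensions are hypothetical illustrations rather than a committed design.

```python
import torch
import torch.nn as nn

class EmotionConditioner(nn.Module):
    """Adds a learned emotion embedding to encoder outputs (hypothetical sketch)."""

    def __init__(self, num_emotions: int, encoder_dim: int = 512):
        super().__init__()
        self.embedding = nn.Embedding(num_emotions, encoder_dim)

    def forward(self, encoder_outputs: torch.Tensor, emotion_id: torch.Tensor) -> torch.Tensor:
        # encoder_outputs: (batch, time, encoder_dim); emotion_id: (batch,)
        emotion = self.embedding(emotion_id).unsqueeze(1)  # (batch, 1, encoder_dim)
        return encoder_outputs + emotion                   # broadcast over time steps

# Example: condition a batch of encoder states on an "angry" tag (indices assumed).
EMOTIONS = ["neutral", "happy", "angry", "afraid"]
conditioner = EmotionConditioner(num_emotions=len(EMOTIONS))
encoder_outputs = torch.randn(2, 120, 512)                 # dummy encoder states
emotion_id = torch.tensor([EMOTIONS.index("angry")] * 2)
conditioned = conditioner(encoder_outputs, emotion_id)
```

In practice, such tags could be supplied by the mod's quest scripts at synthesis time, letting the same written line be rendered calmly in one scene and fearfully in another.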
Beyond modding, this research has implications for interactive media development, providing insights into creating adaptive dialogue systems for games, virtual environments, and narrative AI applications.