Automatic TTS Model Research

Using Tacotron2 for Realistic Voice Synthesis in Game Development

By: Justin Sabatini

In the development of 'The Last Akaviri,' I grappled with a considerable challenge: adding realistic, convincing spoken dialogue to my quest. My ambition for the project was to create a DLC-like experience within an aging and poorly documented game engine, so that the added content felt as seamless as the main game.


One pivotal component of this immersive atmosphere was a compelling storyline underpinned by credible voice acting. I investigated numerous possibilities, from clipping together existing voice lines to exploring Text-to-Speech (TTS) generators, but each option had significant drawbacks. Stitching together pre-existing voice lines proved laborious and could not produce the organic flow of dialogue I sought. Meanwhile, the TTS generators I found online either lacked tonal diversity or were locked behind paid subscriptions, and the quality of the models available in free trials left much to be desired.


Consequently, I wondered whether I could engineer my own model with a specific timbre to match that of a professional voice actor. My exploration led me to a semi-automatic method of training a voice model from scratch, with Tacotron 2 emerging as the best fit for my requirements.


Tacotron 2 is a neural network architecture that synthesizes speech directly from text (Shen et al., 2018). Beyond the model parameters, training a Tacotron model requires two external inputs: a text manifest listing each audio clip alongside its transcribed spoken words, and the audio clips themselves, which must align exactly with that manifest.
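For concreteness, the manifest used by the common NVIDIA Tacotron 2 implementation is an LJSpeech-style filelist: one clip per line, with the WAV path and its transcript separated by a pipe character. The paths and transcripts below are illustrative placeholders, not my actual data:

```
wavs/imperial_001.wav|Welcome to the Imperial City.
wavs/imperial_002.wav|The roads are safer now, but bandits still trouble travellers.
```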


My preferred candidate for training data was Wes Johnson, whose substantial amount of in-game dialogue and extraordinary voice-acting prowess made him an ideal choice. To obtain the voice lines, I decompressed the game's voice BSA archive, extracting the dialogue audio files into a local directory. I then selected the first 100 male Imperial voice lines and used speech-recognition ("vocal interpretation") software to transcribe the spoken words into a text file. Finally, I used Bash scripts to automatically reformat that file into the layout the manifest expects.
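My actual scripts were ad hoc, but the transcription-and-manifest step can be sketched in Python. Here I substitute OpenAI's Whisper for the speech-recognition tool I used; the directory names, file names, and clip count below are assumptions for illustration only:

```python
from pathlib import Path

import whisper  # stand-in for the vocal interpretation software actually used

# Assumed layout: extracted BSA dialogue WAVs live in extracted_voices/.
clips = sorted(Path("extracted_voices").glob("*.wav"))[:100]  # first 100 lines

model = whisper.load_model("base")

Path("filelists").mkdir(exist_ok=True)
with open("filelists/imperial_train.txt", "w") as manifest:
    for clip in clips:
        # Transcribe each clip and write one manifest line: <wav path>|<transcript>
        text = model.transcribe(str(clip))["text"].strip()
        manifest.write(f"{clip}|{text}\n")
```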


I initially trained the model in a Google Colab notebook, which presented some obstacles, primarily because Colab had dropped support for the older PyTorch versions the implementation required. Although I got a model working there, training was sluggish, prompting me to set up a local training pipeline instead. After training, validation matched the expected results, producing a synthesized voice that was impressive in its realism.
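The synthesis step used to check the results follows the standard Tacotron 2 plus WaveGlow inference pattern. Below is a minimal sketch built on NVIDIA's published torch.hub models as stand-ins for my fine-tuned checkpoint; the sample text and output file name are illustrative:

```python
import torch
from scipy.io.wavfile import write

# Load Tacotron 2 and the WaveGlow vocoder from NVIDIA's torch.hub entry points.
# In the actual project the Tacotron 2 weights came from my own checkpoint;
# the pretrained models here keep the sketch self-contained.
tacotron2 = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub',
                           'nvidia_tacotron2', model_math='fp32')
tacotron2 = tacotron2.to('cuda').eval()

waveglow = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub',
                          'nvidia_waveglow', model_math='fp32')
waveglow = waveglow.remove_weightnorm(waveglow).to('cuda').eval()

utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tts_utils')

text = "Welcome to the Imperial City."  # illustrative input line
sequences, lengths = utils.prepare_input_sequence([text])

with torch.no_grad():
    mel, _, _ = tacotron2.infer(sequences, lengths)  # text -> mel spectrogram
    audio = waveglow.infer(mel)                      # mel -> waveform

# Tacotron 2 / WaveGlow operate at a 22,050 Hz sample rate.
write("validation_sample.wav", 22050, audio[0].data.cpu().numpy())
```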


Reflecting on the endeavour, some adjustments would have yielded even better outcomes. For instance, a more extensive training data set (around 500 lines) would likely have improved results. Moreover, preprocessing the audio files into shorter sections, removing trailing silence, and categorizing lines by their delivery could have enabled more specific and accurate models. However, special care would be needed to keep the pipeline streamlined.
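The silence-trimming step, for example, could be folded into the pipeline with a few lines of Python. This is a sketch using librosa; the directory names and the top_db threshold are assumptions that would need tuning per data set:

```python
from pathlib import Path

import librosa
import soundfile as sf

src = Path("extracted_voices")   # assumed input directory of raw clips
dst = Path("trimmed_voices")     # assumed output directory for cleaned clips
dst.mkdir(exist_ok=True)

for wav in src.glob("*.wav"):
    # Resample to 22.05 kHz, the rate the Tacotron 2 implementation expects.
    audio, sr = librosa.load(str(wav), sr=22050)
    # Trim leading/trailing silence; top_db sets how quiet counts as silence.
    trimmed, _ = librosa.effects.trim(audio, top_db=30)
    sf.write(str(dst / wav.name), trimmed, sr)
```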


In retrospect, the results were notably impressive considering how simple the automation behind the training data generation was. It may be worth building a software tool that automates the whole process: the user would supply a set of audio files, and automatic speech recognition plus simple file manipulation would handle the formatting and training of a TTS model.


Such a tool would primarily benefit mods for games with existing voice lines, giving custom characters natural-sounding voices; marketing it as general-purpose software would be more complex. Unless it were integrated with voice-assistant user data or trained at a significantly larger scale, the output would likely disappoint. Finally, TTS quality might be further improved by adding vocal-intensity annotations to the manifest: the current model tends to extrapolate too much without that context, sometimes resulting in uneven dialogue pacing.

AI-Generated TTS Research Demonstration: Simulated Wes Johnson Voice

Disclaimer: This audio is a product of research and is intended for demonstration purposes only.

Citations

Shen, Jonathan, et al. "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions." 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2018.

For business inquiries, please email justin_sabatini@snazzygamesinc.com