Bridging Controllable Speech and Singing Voice Generation via Unified Prosody Learning

Abstract

Controllable human voice generation, particularly for expressive domains like singing, remains a significant challenge. This paper introduces VersaVoice, a unified framework for controllable speech and singing voice generation. To tackle the scarcity of annotated singing data and to enable flexible controllability, VersaVoice is built on two audio tokenizers: (1) a music-notation-free prosody tokenizer that captures prosody and melody from speech, singing, and even instrumental sounds, and (2) a low-frame-rate content-style tokenizer that encodes linguistic content, prosody, and style in the human voice while achieving timbre disentanglement. VersaVoice consists of an auto-regressive (AR) content-style modeling stage, which enables controllability over text, prosody, and style, and a flow-matching (FM) acoustic modeling stage that allows for timbre control. In particular, during pre-training of the AR model, we propose both explicit and implicit prosody learning strategies to bridge speech and singing voice. Moreover, to further enhance the AR model's ability to follow text and prosody, we design a multi-objective post-training task that integrates both intelligibility and prosody similarity alignment. Experimental results show that unified modeling in VersaVoice brings mutual benefits to both speech and singing voice generation. Additionally, VersaVoice's effectiveness across a wide range of synthesis, conversion, and editing tasks for both speech and singing further demonstrates its strong generalization ability and versatility.
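To make the architecture concrete, here is a minimal sketch of the two-stage inference pipeline described above, assuming a hypothetical Python interface; all names (`VersaVoice`, `prosody_tok`, `ar.generate`, `fm.sample`) are illustrative placeholders rather than the released API.

```python
# Minimal sketch of the two-stage pipeline described in the abstract.
# All names are illustrative placeholders, not the released API.

class VersaVoice:
    def __init__(self, prosody_tok, content_style_tok, ar_model, fm_model, vocoder):
        self.prosody_tok = prosody_tok              # music-notation-free prosody tokenizer
        self.content_style_tok = content_style_tok  # low-frame-rate, timbre-disentangled
        self.ar = ar_model                          # stage 1: AR content-style modeling
        self.fm = fm_model                          # stage 2: flow-matching acoustic model
        self.vocoder = vocoder

    def synthesize(self, text, style_ref, timbre_ref, prosody_src=None):
        # Optional explicit prosody/melody control: the prosodic source can be
        # speech, singing, humming, whistling, or instrumental audio.
        prosody = self.prosody_tok(prosody_src) if prosody_src is not None else None
        style_prompt = self.content_style_tok(style_ref)
        # Stage 1 (AR): generate content-style tokens conditioned on the text,
        # the prosody tokens, and the style prompt.
        cs_tokens = self.ar.generate(text=text, prosody=prosody, prompt=style_prompt)
        # Stage 2 (FM): reconstruct acoustic features with the target timbre,
        # then vocode to a waveform.
        mel = self.fm.sample(cs_tokens, timbre_ref=timbre_ref)
        return self.vocoder(mel)
```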

We will showcase VersaVoice's capabilities through the following examples.
Extending Zero-Shot TTS to the Singing Voice Domain

Under the standard zero-shot text-to-speech (TTS) inference pipeline, VersaVoice accepts not only speech but also singing voice as the reference prompt, which we define as the text-to-singing task. Notably, even without explicit prosodic-contour control (i.e., without a prosodic source), the synthesized output can follow and extend the reference's melodic pattern, demonstrating melody imitation and continuation.

Note: The prosodic source is not used during inference.
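In terms of the placeholder interface sketched after the abstract, text-to-singing is an ordinary zero-shot call with a singing prompt and no prosodic source (a hypothetical usage, not the released API):

```python
# Text-to-singing: a singing clip is the zero-shot prompt; no prosodic source
# is passed, so the melody is imitated/continued implicitly from the prompt.
audio = model.synthesize(
    text="new lyrics to sing",
    style_ref="singing_prompt.wav",   # singing, rather than speech, as the prompt
    timbre_ref="singing_prompt.wav",
    prosody_src=None,                 # no explicit melody control
)
```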

Versatile Melody Controls for Singing Voice Generation

VersaVoice supports using an explicit prosodic source to control the prosody/melody of the synthesized output. In particular, it accommodates various forms of prosodic sources, including general speech and singing voice (as demonstrated in the editing tasks section below), whistling (the first case in the following table), humming (the second case), and even instrumental musical sounds (the last two cases).

Note: To accommodate melody control through a MIDI score (a standard approach in the conventional SVS task), we render the MIDI as instrumental sound and use it as the prosodic source during inference (as demonstrated in the third case of the table).
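The note above only states that the MIDI is rendered to instrumental sound; as one concrete (assumed) way to do this, the pretty_midi library can synthesize a score to audio that then serves as the prosodic source:

```python
# Render a MIDI score to instrumental audio for use as a prosodic source.
# pretty_midi is our assumed renderer; the authors' exact tooling is unspecified.
import pretty_midi
import soundfile as sf

midi = pretty_midi.PrettyMIDI("score.mid")
# synthesize() renders each note with a sine wave, which is already enough to
# carry the melodic contour; fluidsynth() with a SoundFont is an alternative.
audio = midi.synthesize(fs=24000)
sf.write("prosody_source.wav", audio, 24000)
```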

Prosody-preserved Speech and Singing Lyric Editing

For editing tasks, VersaVoice not only delivers high naturalness in the synthesized audio but also effectively preserves the original prosody and melody. For instance, in singing lyric editing (the last four cases of the following table), VersaVoice selectively modifies the lyrics while maintaining the original melodic contour of the singing voice.

Note: In the editing tasks, the raw audio serves simultaneously as the prosodic source, style reference, and timbre reference during inference.
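In the placeholder interface above, editing amounts to reusing the raw recording for all three conditioning roles (a simplified sketch; span selection/masking for the edited region is omitted):

```python
# Lyric editing: the raw recording conditions prosody, style, and timbre at
# once, so only the edited words change while the melodic contour is kept.
edited = model.synthesize(
    text="lyrics with the edited words",  # transcript containing the target edits
    style_ref="original.wav",
    timbre_ref="original.wav",
    prosody_src="original.wav",           # preserves the original prosody/melody
)
```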

Unified Zero-Shot VC and SVC

VersaVoice exhibits two distinctive characteristics in voice conversion (VC) and singing voice conversion (SVC) tasks. First, it uses a single unified content-style tokenizer for both the speech and singing voice domains. Second, it supports both style-preserved and style-converted conversion (sketched in the code after the note below):

  • When only the FM stage is used, i.e., without a style reference (as in the first and third cases of the following table), VersaVoice converts only the timbre while preserving the style of the source audio.
  • When both the AR and FM stages are used, VersaVoice converts both the timbre and the style of the source audio, resulting in higher similarity to the target speaker (or singer).

Note: In the VC task, the prosodic source is not used during inference. In the SVC task, the source audio serves as the prosodic source.
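The two conversion modes map onto the placeholder components from the first sketch as follows (a hedged illustration; in particular, how the source content is supplied to the AR stage in mode 2 is our assumption):

```python
# (1) Style-preserved conversion: FM stage only. The timbre-disentangled
# content-style tokens of the source are resynthesized with the target timbre.
cs_tokens = model.content_style_tok("source.wav")
mel = model.fm.sample(cs_tokens, timbre_ref="target.wav")
style_preserved = model.vocoder(mel)

# (2) Style-converted conversion: AR + FM stages. The AR stage re-generates
# content-style tokens under the target's style prompt before FM resynthesis.
source_transcript = "words of the source audio"  # assumption: known or via ASR
style_converted = model.synthesize(
    text=source_transcript,
    style_ref="target.wav",
    timbre_ref="target.wav",
    prosody_src="source.wav",  # SVC uses the source as prosodic source; VC: None
)
```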

Versatile Speech and Singing Style Conversions

VersaVoice enables versatile speech and singing style conversion. Specifically, the system can use different style references (different accents, emotions, singing techniques, and so on) to transform the source audio's style while preserving its original timbre, by using the source audio itself as the timbre reference:

Note: The source audio serves as the timbre reference in these conversion tasks. In the speech domain (the first three cases), the prosodic source is not used. In the singing domain (the last three cases), the source audio also serves as the prosodic source.
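With the same placeholder interface, style conversion swaps only the style reference while the source keeps supplying the timbre (illustrative, not the released API):

```python
# Style conversion with preserved timbre: a style exemplar (accent, emotion,
# singing technique) drives the AR stage; the source supplies the timbre.
source_transcript = "words of the source audio"  # assumption: known or via ASR
styled = model.synthesize(
    text=source_transcript,
    style_ref="style_exemplar.wav",  # e.g., a different accent or vocal technique
    timbre_ref="source.wav",         # keep the original voice
    prosody_src="source.wav",        # singing domain; use None for speech
)
```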