MakeSinger: A Semi-Supervised Training Method for Data-Efficient Singing Voice Synthesis via Classifier-free Diffusion Guidance

Authors: Anonymous Authors

Abstract

In this paper, we propose MakeSinger, a semi-supervised training method for singing voice synthesis (SVS) via classifier-free diffusion guidance. The challenge in SVS lies in the costly process of gathering aligned sets of text, pitch, and audio data. MakeSinger enables training a diffusion-based SVS model on any speech and singing voice data regardless of labeling, thereby enhancing the quality of generated voices with a large amount of unlabeled data. At inference, our novel dual guiding mechanism provides text and pitch guidance at each reverse diffusion step by estimating the scores of masked inputs. Experimental results show that the model trained in a semi-supervised manner outperforms other baselines trained only on the labeled data in terms of pronunciation, pitch accuracy, and overall quality. Furthermore, we demonstrate that by adding Text-to-Speech (TTS) data in training, the model can synthesize the singing voices of TTS speakers even without their singing data.
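The dual guiding mechanism combines the score estimated from the fully conditioned input with scores estimated from masked inputs. The exact combination rule is not given on this page; the sketch below is a minimal illustration following the standard classifier-free guidance form, applied once per condition with separate weights for text and pitch. All names here (denoiser, mask_text, mask_pitch, w_text, w_pitch) are illustrative assumptions, not the paper's API.

import torch

@torch.no_grad()
def dual_guided_score(denoiser, x_t, t, text, pitch,
                      mask_text, mask_pitch,
                      w_text=1.0, w_pitch=1.0):
    """Combine conditional and masked-condition score estimates for one reverse step."""
    # Score estimate with both text and pitch conditions present.
    s_full = denoiser(x_t, t, text, pitch)
    # Score estimates with one condition replaced by its mask embedding.
    s_masked_text = denoiser(x_t, t, mask_text, pitch)
    s_masked_pitch = denoiser(x_t, t, text, mask_pitch)
    # Classifier-free guidance applied separately to each condition:
    # each difference term pushes the sample toward consistency with
    # that condition, scaled by its guidance weight.
    return (s_full
            + w_text * (s_full - s_masked_text)
            + w_pitch * (s_full - s_masked_pitch))

In a sampler, this guided score would replace the plain conditional score in each reverse diffusion update; larger w_text or w_pitch strengthens the corresponding guidance.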


Semi-supervised SVS

The proposed model was trained on 12, 36, or 108 songs with text and pitch labels, with the remaining songs left unlabeled (12 singers).
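One way to realize this semi-supervised training is to replace the text and pitch conditions of unlabeled clips with learned mask embeddings, so a single denoiser learns both the conditional and the masked scores needed for guidance at inference. The sketch below assumes a standard DDPM noise-prediction objective; the beta schedule, helper names, and tensor shapes are assumptions, not the paper's exact recipe.

import torch
import torch.nn.functional as F

NUM_STEPS = 1000
betas = torch.linspace(1e-4, 0.02, NUM_STEPS)          # assumed linear schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0, t, noise):
    """Standard DDPM forward process: x_t = sqrt(a_bar)*x0 + sqrt(1-a_bar)*eps."""
    a_bar = alphas_cumprod.to(x0.device)[t].view(-1, 1, 1)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

def diffusion_loss(denoiser, mel, text, pitch, mask_text, mask_pitch):
    """One training step; text/pitch are None for unlabeled clips."""
    b = mel.size(0)
    t = torch.randint(0, NUM_STEPS, (b,), device=mel.device)  # random timestep
    noise = torch.randn_like(mel)
    x_t = q_sample(mel, t, noise)
    # For unlabeled audio, substitute learned mask embeddings for the labels,
    # so the same denoiser also learns the masked score used for guidance.
    # Shapes assume mel is (batch, channels, frames) and masks are (1, L, D).
    text_c = text if text is not None else mask_text.expand(b, -1, -1)
    pitch_c = pitch if pitch is not None else mask_pitch.expand(b, -1, -1)
    pred = denoiser(x_t, t, text_c, pitch_c)
    return F.mse_loss(pred, noise)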

Audio samples

Texts:
어디로 떠났나 다정한 (Where did you go, my dear)
내려와 거리를 떠돌며 (Coming down, wandering the streets)
시간으로 돌아가고픈 (Longing to go back to that time)

For each text, samples are provided for:

Vocoder Reconstruction
(1) 12 labeled songs: MLP-Singer / VISinger / w/o Semi-Supervision / with Semi-Supervision
(2) 36 labeled songs: MLP-Singer / VISinger / w/o Semi-Supervision / with Semi-Supervision
(3) 108 labeled songs: MLP-Singer / VISinger / w/o Semi-Supervision / with Semi-Supervision


SVS using TTS data

The proposed model was trained on 487 labeled songs, 486 unlabeled songs, and 3 hours of TTS data (80 speakers).

Audio samples

Texts:
왜 이리도 이 세상엔 이별이 많은지 (Why are there so many farewells in this world)
누구도 이처럼 원한 적 없죠 (No one has ever wanted it like this)
너도 가끔씩은 (You too, sometimes)

For each text, samples are provided for:

Reference Speaker
Reference Song
Proposed