Automatic Labeling of Training Data for Singing Voice Detection in Musical Audio

K. Lee and M. Cremer (USA)


singing voice detection, supervised learning, automatic labeling, MIDI, dynamic time warping


We present a novel approach to labeling a large amount of training data for vocal/non-vocal discrimination in musical audio with minimal human labor. To this end, we use MIDI files in which vocal lines are encoded on a separate channel and synthesize them to create audio files. We then align the synthesized audio with real recordings using the dynamic time warping (DTW) algorithm. Note onset/offset information encoded in the vocal lines of the MIDI files provides precise vocal/non-vocal boundaries, and from the minimum-cost alignment path we obtain the corresponding boundaries in the actual recordings. This nearly labor-free labeling process allows us to acquire a large training data set, and experiments using hidden Markov models as a classifier show promising results on an independent test set. We also demonstrate the quality of the automatically generated labels by showing that overall performance increases as more training data is added.
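The label-transfer step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the audio features, the local distance, or the DTW step pattern, so the Euclidean distance, the three-way step pattern, and the function names `dtw_path` and `transfer_labels` are all assumptions made for illustration.

```python
import numpy as np

def dtw_path(X, Y):
    """Minimum-cost alignment path between feature sequences X (n, d) and Y (m, d).

    Assumptions (not from the paper): Euclidean local distance and the
    basic step pattern (diagonal, vertical, horizontal).
    """
    n, m = len(X), len(Y)
    # Pairwise Euclidean distances between all frames of X and Y.
    dist = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    # Accumulated cost with a padded border of infinities.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
            )
    # Backtrack from the end to the start along cheapest predecessors.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(candidates, key=lambda s: acc[s])
        path.append((i - 1, j - 1))
    return path[::-1]

def transfer_labels(synth_labels, path, n_real):
    """Map per-frame vocal labels from synthesized audio onto the real recording.

    A real frame is marked vocal if any synthesized frame aligned to it is
    vocal (a simple rule chosen for this sketch).
    """
    real_labels = np.zeros(n_real, dtype=bool)
    for i, j in path:
        real_labels[j] |= bool(synth_labels[i])
    return real_labels

# Toy example: the "real" sequence is a time-stretched copy of the "synthesized" one.
synth_feats = np.array([[0.0], [0.0], [1.0], [1.0], [0.0]])
real_feats = np.array([[0.0], [0.0], [0.0], [1.0], [1.0], [1.0], [0.0], [0.0]])
synth_vocal = np.array([False, False, True, True, False])  # from MIDI note on/off

path = dtw_path(synth_feats, real_feats)
real_vocal = transfer_labels(synth_vocal, path, len(real_feats))
```

In this toy case the vocal region of the synthesized sequence (frames 2 to 3) maps onto frames 3 to 5 of the stretched sequence. In practice the frame-level labels would be derived from the MIDI vocal channel's note onsets and offsets, and the features would be computed from audio.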
