Robust Acoustic Model Training Against Phoneme Variations for Large Vocabulary Continuous Speech Recognition

Gil Ho Lee and Nam Soo Kim


phoneme variation, acoustic model training, speech recognition


In conventional acoustic model (AM) training, transcriptions are converted into phoneme sequences by using a lexicon. To find the correct phoneme sequences for the transcriptions, forced Viterbi alignment is performed over the training data. If a lexicon contained all phoneme variations corresponding to the speech signals, the phoneme sequences could be represented correctly. However, a lexicon cannot contain all phoneme variations, because they are partly due to the characteristics of individual speakers. In this paper, we propose a data-derived AM training method that is robust against phoneme variations for large vocabulary continuous speech recognition. To reflect speakers' phoneme variations, we expand the lexicon by replacing a phoneme that has a low acoustic score with a candidate phoneme that has a higher acoustic score, where the candidates are selected using phonetic information. We then modify each transcription by substituting the phoneme sequence produced by the expanded lexicon. As a result, the automatic speech recognition (ASR) system using the proposed method achieves a relative word error rate reduction of 9.5% in Korean compared with the ASR system using the conventional method.
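The lexicon expansion step described above can be illustrated with a minimal sketch. The abstract does not specify the score threshold, how candidate phonemes are chosen from phonetic information, or how candidates are rescored, so everything here is an assumption for illustration: the function `expand_pronunciation`, the `confusable` table, the `rescore` callback, the threshold value, and the toy log-likelihood scores are all hypothetical and are not the authors' implementation.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def expand_pronunciation(
    phonemes: Sequence[str],
    scores: Sequence[float],
    confusable: Dict[str, List[str]],
    rescore: Callable[[int, str], float],
    threshold: float,
) -> Tuple[str, ...]:
    """Return a pronunciation variant with low-scoring phonemes replaced.

    phonemes:   phoneme sequence from the forced alignment
    scores:     per-phoneme acoustic scores from that alignment (assumed log-likelihoods)
    confusable: hypothetical table of phonetically plausible substitutes
    rescore:    hypothetical callback giving the acoustic score of a candidate
                phoneme on the segment at a given position
    threshold:  assumed cut-off below which a phoneme counts as "low scored"
    """
    variant = list(phonemes)
    for i, (ph, score) in enumerate(zip(phonemes, scores)):
        if score >= threshold:
            continue  # phoneme matches the audio well enough; keep it
        best_ph, best_score = ph, score
        for cand in confusable.get(ph, []):
            cand_score = rescore(i, cand)
            if cand_score > best_score:
                best_ph, best_score = cand, cand_score
        variant[i] = best_ph  # unchanged if no candidate scored higher
    return tuple(variant)


# Toy data (invented for illustration only).
lexicon = {"gap": [("k", "a", "p")]}
aligned_scores = [-4.0, -3.5, -12.0]           # last phoneme scores poorly
confusable = {"p": ["b", "m"]}                 # assumed phonetic neighbours
toy_scores = {(2, "b"): -5.0, (2, "m"): -15.0}


def rescore(position: int, phoneme: str) -> float:
    # Stand-in for scoring the candidate against the aligned audio segment.
    return toy_scores.get((position, phoneme), float("-inf"))


variant = expand_pronunciation(
    lexicon["gap"][0], aligned_scores, confusable, rescore, threshold=-8.0
)

# Add the new variant to the lexicon so the transcription can be re-aligned with it.
if variant not in lexicon["gap"]:
    lexicon["gap"].append(variant)
```

In this sketch the expanded lexicon entry would then be used to regenerate the phoneme sequence of the transcription before AM training, which is one plausible reading of the substitution step in the abstract.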
