Kade Tuerxun


ALBERT, Chinese speech synthesis, FastSpeech2, Transformer structure, GAN


To improve speech synthesis (SS) technology in speech-navigation apps, in terms of both synthesis quality and synthesis speed, this study proposes an ALBERT-based polyphonic-character disambiguation method and applies it to text-to-phoneme conversion, and further constructs a non-autoregressive Chinese SS technique based on the Transformer. The results indicate that ALBERT achieves the best disambiguation performance, with an average polyphonic-character accuracy of 94.2%, compared with 83.4% for the maximum entropy model (MEM), 83.7% for tree-guided transformation-based learning (TGTBL), 84.3% for the pypinyin tool library, and 87.1% for conditional random fields (CRF). Among common polyphonic characters, "chao" has the highest recognition accuracy at 98.5%, and "wei" has the highest frequency at 11%. The FastSpeech2-GAN model performs best at 100 k training steps, achieving a mean opinion score (MOS) of 3.94 and a Mel cepstral distance (MCD) of 2.8911, followed by the FastSpeech2 model with a MOS of 3.88 and an MCD of 2.9168. The real-time rate of FastSpeech2-GAN is 0.011, matching that of FastSpeech2. The proposed improved Transformer-based non-autoregressive Chinese SS technique thus makes measurable progress in both synthesis speed and synthesis quality.
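For readers unfamiliar with the MCD metric reported above, a minimal sketch of a frame-level Mel cepstral distance computation is shown below. This is an illustrative implementation of the standard MCD formula, not the paper's evaluation code; the function name and the assumption that frames are already time-aligned MFCC vectors (with the energy coefficient c0 excluded) are ours.

```python
import math

def mel_cepstral_distance(ref_frames, syn_frames):
    """Frame-averaged Mel cepstral distance in dB between two
    time-aligned sequences of Mel-cepstral coefficient vectors
    (energy coefficient c0 assumed excluded)."""
    assert len(ref_frames) == len(syn_frames) and ref_frames
    # Standard MCD scaling constant: (10 / ln 10) * sqrt(2)
    const = (10.0 / math.log(10.0)) * math.sqrt(2.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        # Euclidean distance between the two cepstral vectors
        dist = math.sqrt(sum((r - s) ** 2 for r, s in zip(ref, syn)))
        total += const * dist
    return total / len(ref_frames)
```

Lower MCD indicates synthesized cepstra closer to the reference recording, which is why FastSpeech2-GAN's 2.8911 is read as an improvement over FastSpeech2's 2.9168.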
