A FAST ADAPTATION TECHNIQUE FOR BUILDING DIALECTAL MALAY SPEECH SYNTHESIS ACOUSTIC MODEL

Authors

Yen-Min Jasmina Khaw School of Computer Sciences, Universiti Sains Malaysia, 11800 USM, Penang, Malaysia
Tien-Ping Tan School of Computer Sciences, Universiti Sains Malaysia, 11800 USM, Penang, Malaysia

DOI:

https://doi.org/10.11113/jt.v77.6514

Keywords:

Malay dialect, corpus, dialect adaptation system

Abstract

This paper presents a fast adaptation technique for building a hidden Markov model (HMM) based acoustic model for dialectal speech synthesis. Standard Malay is used as the source language and Kelantanese Malay as the target language. The Kelantan dialect is a Malay dialect spoken in the northeast of Peninsular Malaysia. One of the most important and time-consuming steps in building an HMM acoustic model is the alignment of speech sounds, since a good alignment produces clear and natural synthesized speech. This study proposes a quick approach for aligning speech and building a good dialectal speech synthesis acoustic model by starting from an acoustic model of a different source language. Two adaptation approaches are proposed; each uses a source acoustic model together with a different amount of target speech to build the target acoustic model of the speech synthesis system, which is then used to synthesize dialectal Malay sentences. The results show that the dialectal speech synthesis systems built with the adaptation approaches produce much better speech quality than the system built without adaptation.
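To illustrate the alignment step mentioned in the abstract, the following is a minimal sketch of forced alignment by dynamic programming. It is not the authors' implementation: the function name `forced_align` and the input matrix `log_lik` are hypothetical, and the sketch assumes that a source acoustic model has already scored every frame against every phone of the known transcript, with one state per phone and no transition probabilities.

```python
import numpy as np

def forced_align(log_lik):
    """Assign each frame to one phone of a known transcript.

    log_lik[t, k] is the log-likelihood of frame t under the k-th phone of
    the transcript (hypothetical scores from a source acoustic model).
    The path must start in the first phone, end in the last phone, and
    either stay in the current phone or advance by one at each frame.
    """
    T, K = log_lik.shape
    if T < K:
        raise ValueError("need at least one frame per phone")
    score = np.full((T, K), -np.inf)   # best path score ending at (t, k)
    back = np.zeros((T, K), dtype=int)  # predecessor phone index
    score[0, 0] = log_lik[0, 0]
    for t in range(1, T):
        for k in range(K):
            stay = score[t - 1, k]
            move = score[t - 1, k - 1] if k > 0 else -np.inf
            if move > stay:
                score[t, k] = move + log_lik[t, k]
                back[t, k] = k - 1
            else:
                score[t, k] = stay + log_lik[t, k]
                back[t, k] = k
    # Trace back from the last phone at the last frame.
    path = [K - 1]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy example: five frames, two phones; the first three frames fit phone 0.
toy = np.log(np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3],
                       [0.2, 0.8], [0.1, 0.9]]))
alignment = forced_align(toy)
```

The phone boundaries are the frames where the returned index changes. A practical system would use multi-state HMMs with transition probabilities, so this sketch only conveys the idea of reusing scores from an existing model to segment new speech.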

Published

2015-11-30

Issue

Vol. 77 No. 19: Intelligence and Interactivity for Future Computing Vol. 2

Section

Science and Engineering

License

Copyright of articles that appear in Jurnal Teknologi belongs exclusively to Penerbit Universiti Teknologi Malaysia (Penerbit UTM Press). This copyright covers the rights to reproduce the article, including reprints, electronic reproductions, or any other reproductions of similar nature.