Damjan Šonc (2011) Trainable speech synthesis system. PhD thesis.
The ultimate goal of speech synthesis is to build a system that can convert arbitrary written messages into intelligible and natural-sounding speech. Such a system should also run on hardware platforms found in everyday life, such as a personal computer. The solutions that have appeared over the last five decades can be divided into three generations. Unfortunately, even the latest, third-generation systems are far from generating perfectly natural-sounding speech. Currently, the best synthetic speech quality is obtained from systems belonging to the group of unit selection synthesis systems. Building an adequate database of speech units, however, requires a great deal of work by trained engineers.

The main objective of this Ph.D. thesis was to develop a system that can learn to produce high-quality synthetic speech from text and the corresponding speech samples alone, without requiring skilled human labor or trained ASR (Automatic Speech Recognition) systems. Instead, the system should rely on statistical machine-learning techniques and on automatic speech segmentation algorithms that do not require ASR. For the purposes of the thesis, a prototype speech synthesis system named Learn to Speak by Yourself (LSY) was constructed. LSY belongs to the group of unit selection synthesis systems. Its core is a newly developed algorithm for automatic speech segmentation that does not require an ASR system; the algorithm exploits the spectral differences between the different phonemes (allophones) of a language. This approach is particularly useful for Slovene or other languages with relatively few speakers, where it is more difficult to find skilled engineers or well-trained ASR systems for speech database construction. The system can start from scratch, i.e. no speech unit database is required: the database is built automatically during the learning process.
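The thesis does not spell out the segmentation algorithm in this abstract, but the core idea it names (exploiting spectral differences between adjacent phonemes) can be illustrated with a minimal sketch: compute short-time log-magnitude spectra and flag frames where the spectrum changes abruptly as candidate phoneme boundaries. All parameters here (frame length, hop size, the threshold rule) are illustrative assumptions, not the values used in LSY.

```python
import numpy as np

def spectral_boundaries(signal, sr, frame_len=400, hop=160, threshold=2.0):
    """Candidate phoneme boundaries from frame-to-frame spectral change.

    Illustrative sketch only: a Euclidean distance between log-magnitude
    spectra of adjacent frames is computed, and frames whose distance
    exceeds `threshold` times the mean distance are flagged as boundaries.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    spectra = []
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len] * window
        mag = np.abs(np.fft.rfft(frame))
        spectra.append(np.log(mag + 1e-10))  # log compresses dynamic range
    spectra = np.array(spectra)
    # distance between each pair of consecutive frame spectra
    dist = np.linalg.norm(np.diff(spectra, axis=0), axis=1)
    idx = np.where(dist > threshold * dist.mean())[0] + 1
    return idx * hop / sr  # boundary times in seconds
```

On a signal that switches abruptly between two steady spectra, the detected boundaries cluster around the switch point; real speech would of course need smoothing and more robust features than a raw spectrum.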
To generate the speech samples, LSY uses a sinusoidal generator. Statistical results from listening tests show that synthetic speech produced by the generator in an analysis-by-synthesis process cannot be distinguished from natural human speech, so we may conclude that, in principle, perfectly natural-sounding synthetic speech can be produced by LSY. At present, the speech produced by the prototype version of LSY is highly intelligible but not yet natural sounding. The main reason is that only a few minutes of speech samples were fed to the prototype system, whereas research results in the literature recommend at least one hour of speech samples, and systems built from five hours or more are not uncommon. Future work will concentrate on methods for automatically extracting prosody parameters from the speech samples. We would also like to improve the algorithm for automatic speech segmentation.
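The abstract does not describe the sinusoidal generator's parameterization, but the general technique of sinusoidal speech modeling can be sketched as a sum of sinusoids, each with its own frequency, amplitude, and phase. The function below and its example parameters (a 120 Hz fundamental with two harmonics) are hypothetical illustrations of the technique, not LSY's actual model.

```python
import numpy as np

def sinusoidal_frame(freqs, amps, phases, n, sr):
    """Synthesize one frame of n samples as a sum of sinusoids.

    freqs, amps, phases: per-partial frequency (Hz), amplitude, and
    phase (rad). In a full sinusoidal model these parameters would be
    estimated from analyzed speech and interpolated between frames.
    """
    t = np.arange(n) / sr
    out = np.zeros(n)
    for f, a, p in zip(freqs, amps, phases):
        out += a * np.sin(2 * np.pi * f * t + p)
    return out

# Hypothetical vowel-like frame: 120 Hz fundamental plus two harmonics
frame = sinusoidal_frame([120, 240, 360], [1.0, 0.5, 0.25], [0, 0, 0],
                         n=1600, sr=16000)
```

In an analysis-by-synthesis setting, such parameters are extracted from recorded speech frame by frame and then resynthesized, which is why the output can be perceptually indistinguishable from the original when the analysis is accurate.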