Andrej Štrancar (2006) Dynamic Adaptation of Time-Frequency Resolution in Spectral Analysis of Speech Signals. PhD thesis.
Abstract
In speech parametrization, the spectrum of the speech signal is computed over frames with a typical length of 15 to 35 ms. This yields a uniform time-frequency resolution that does not conform well to the properties of human hearing. One of these properties is nonlinear frequency resolution, which can be approximated with multiresolution spectral analysis. In this thesis, the continuous wavelet transform was tried as a possible approach to extracting mel-frequency cepstral coefficients (MFCCs), currently the most common type of parametrization. Like human hearing, the resulting spectrum has a frequency resolution that is fine at low frequencies and grows coarser with increasing frequency. This gives better time resolution at higher frequencies, so spectral changes can be detected more precisely when they are (also) present there, while the lower frequency band is analyzed with fine frequency resolution at the same time. A comparison of the success rates achieved with otherwise identical speech recognition systems showed no advantage of wavelet-based MFCCs over standard MFCCs. Because the continuous wavelet transform is computationally quite intensive, MFCCs based on the discrete wavelet transform were also tested; their use significantly decreased the success rate.

The wavelet transform does not solve the problems related to the nonstationarity of the speech signal, since the time-frequency resolution of its spectrum does not depend on time. This dissertation therefore presents three approaches to dynamically adapting the time-frequency resolution. Each approach requires an estimate of how rapidly the spectrum is changing at a given time; this estimate can be based on known facts about the structure or the production of speech.

In the first approach, adaptive time-frequency resolution is achieved by varying the frame length according to the phonetic structure of the speech. The basic spectral properties of each phoneme are known: the spectrum of vowels and some other long phonemes is nearly stationary, whereas the spectrum of other phonemes, such as stops, changes rapidly. If the phonetic structure is known, the time-frequency resolution can be adapted by using an appropriate frame length for each phoneme. In speech recognition, however, the phonetic structure is not known in advance, so recognition must be done in two passes: in the first pass the phonetic structure is unknown and a fixed frame length is used for parametrization; in the second pass the frame length is selected on the basis of the phonetic structure obtained in the first pass.

In the second approach, the time-frequency resolution is adapted according to Moore's formula, which describes the human perception of intensity changes in the speech signal. Most intensity changes occur in sections of speech where temporal resolution is more important than frequency resolution: large intensity changes accompany short phonemes, such as the burst release of plosives, and intensity changes also mark phoneme transitions. When the intensity changes are large, the wideband spectrum is therefore emphasized; when they are small, the narrowband spectrum is emphasized. Computing intensity changes is far less computationally intensive than determining the phonetic structure in an additional pass.
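As an illustration of the second approach, the sketch below maps a local intensity-change estimate to a per-frame window length. It is only a sketch: the short-time log-energy difference stands in for Moore's intensity-change formula, which is not reproduced in the abstract, and the function name, window sizes, and linear mapping are illustrative assumptions rather than the method used in the thesis.

```python
import numpy as np

def frame_lengths_from_intensity(signal, sr, short_ms=10, long_ms=35, hop_ms=10):
    """Map local intensity change to a per-frame analysis length.

    Sketch only: the rate of change of short-time log energy stands in
    for Moore's intensity-change measure. Large changes -> short frames
    (finer time resolution); small changes -> long frames (finer
    frequency resolution). All sizes and thresholds are illustrative.
    """
    hop = int(sr * hop_ms / 1000)
    win = int(sr * 0.025)  # 25 ms energy window (an assumed value)
    n = max(1, (len(signal) - win) // hop)
    log_e = np.array([np.log(np.sum(signal[i*hop:i*hop+win]**2) + 1e-12)
                      for i in range(n)])
    delta = np.abs(np.diff(log_e, prepend=log_e[0]))        # intensity change per hop
    norm = (delta - delta.min()) / (np.ptp(delta) + 1e-12)  # scale to [0, 1]
    return long_ms - norm * (long_ms - short_ms)            # high change -> short frame
```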
The third approach is based on recognizing voiced and unvoiced speech segments. Producing voiced speech requires the vocal folds to close and obstruct the airflow, and because voiced and unvoiced segments are delimited by the opening and closing of the vocal folds, a voiced segment cannot be very short. Most voiced phonemes are long and have a nearly stationary spectrum, so in feature extraction a longer frame was used on voiced segments and a shorter frame on unvoiced segments.

All three approaches to dynamically adapting the time-frequency resolution were tested with the same speech recognition system on two speech databases; robustness was tested with several additive and two convolutive distortions. In our experiments, adapting the frame length to the phonetic structure of the speech proved too complicated: it is computationally demanding and was tested only with the smaller speech database, where the success rate remained almost unchanged and robustness decreased slightly compared with the original speech recognition system using standard MFCCs. Adapting the time-frequency resolution to intensity changes increased both the success rate and robustness; the improvement was quite large and very consistent. Adapting the frame length to voiced and unvoiced speech segments improved the robustness and, in some experiments, the success rate.
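A minimal sketch of the frame-length selection in the third approach follows. The abstract does not specify the voicing detector used in the thesis, so a common zero-crossing-rate plus short-time-energy heuristic is assumed here; the function name, thresholds, and frame lengths are illustrative.

```python
import numpy as np

def select_frame_ms(frame, zcr_thresh=0.1, energy_thresh=1e-4,
                    voiced_ms=32, unvoiced_ms=16):
    """Pick a frame length from a crude voiced/unvoiced decision.

    Assumed heuristic: voiced frames show low zero-crossing rate and
    high short-time energy, so they get the longer analysis window;
    unvoiced frames get the shorter one. Thresholds are not tuned.
    """
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0  # crossings per sample
    energy = np.mean(frame**2)                            # short-time energy
    return voiced_ms if (energy > energy_thresh and zcr < zcr_thresh) else unvoiced_ms
```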