Robert Rozman (2005) Asymmetric window functions in speech recognition systems. PhD thesis.
PDF (Dissertation) Download (1705Kb)
PDF (Dissertation - title pages) Download (57Kb)
PDF (Dissertation - abstract) Download (86Kb)
Abstract
The fundamental problem of automatic speech recognition is the variability of speech signals. Each written word has several possible spoken variants. In addition, the speech signal is often distorted, which reduces the success rate of speech recognition systems (SRS). The undesired influence of distortions is addressed in different ways. The most common procedure is to include the expected distortions in the training phase. In practice this is difficult because of the diversity of audio devices, channels and acoustic environments involved in spoken communication. This means that in real applications recognition systems encounter different types of distortions. In such situations it is very important that they maintain their success rate as far as possible.
Symmetric windows are widely used in digital signal processing because they are easy to design and have linear phase. It is well known, however, that asymmetric windows can, despite their nonlinear phase, be better in other properties. In speech recognition this can lead to a more robust signal representation from the parametrization process. A shorter time delay can also be achieved, a property that is gaining importance in contemporary spoken communication.
Human listeners perform substantially better than SRS in the presence of distortions. Therefore more and more properties of human hearing are taken into consideration when SRS are designed. Still, SRS contain several details that disregard properties of human hearing. The symmetric window function used in the frequency analysis of speech is certainly one of them. Human speech perception is quite insensitive to phase distortions of the speech signal, and this fact is disregarded when symmetric windows are used in existing SRS.
The design of asymmetric windows, and even more so their use in SRS, has not been thoroughly researched. Little is known about the influence of window properties on the success rate or robustness[1] of SRS.
However, the popularity of asymmetric windows in speech coding suggests that their advantages could be applied to practical systems. Our previous research on the application of asymmetric windows to speech recognition confirmed a noticeable increase in recognition robustness. These facts motivated this thesis. Its basic purpose is to extend knowledge about asymmetric windows and their influence on the robustness of SRS.
The first part of the thesis gives a general overview of automatic speech recognition and of the reference testing environment (two SRS and two speech databases) that was used for practical experiments. In the second part the possibilities for asymmetric window design are discussed, and the properties of the windows and their influence on SRS robustness are examined. We start with the potential advantages of asymmetric window properties and their expected positive influence on recognition robustness. Since very little is known about these properties, we used knowledge from the related fields of general frequency analysis and automatic speech recognition. On this basis some typical asymmetric windows were designed, each with one selected property enhanced. It is important to note that the SRS success rate was used as a criterion.
Because of the diversity of design possibilities, the use of general optimization methods was carefully examined. These methods provide a framework for handling different criteria and desired properties of optimal solutions. Several methods for the design of asymmetric windows were tried. We started with methods for the design of FIR digital filters. Despite good results, these methods proved to be computationally too demanding. We therefore examined two additional groups of simpler methods. These solutions are less flexible and have a specific shape of amplitude response, but the design procedures are much faster.
The first group relies on a parametric model with properties that are very close to those of human hearing (the IIR family of windows). This model reduces the number of free system parameters. The resulting windows, however, are in some respects even better than those obtained with the FIR filter design methods. The second group of methods is based on simple operations applied to symmetric windows, whose properties are merged to give an asymmetric window. The main advantage of this approach is the reuse of knowledge about symmetric windows. This approach was used to define the ITU asymmetric window, which is popular in speech coding because of its shorter time delay. The ITU window was tested on speech recognition, although the results were not promising.
Our approach to the design of asymmetric windows was first to examine their properties from a general frequency analysis viewpoint. This was followed by experiments that measured their influence on the robustness of SRS. To ensure the generality of the conclusions we used two different SRS implementations. The first is based on the statistical approach using hidden Markov models. The second uses a neural network in the form of a multi-layer perceptron. All experiments were carried out on two different speech databases: one in English and one in Slovene. The architectures of both systems were left unchanged and standard procedures for the calculation of MFCC features were used. This reduced the influence of less important factors.
Practical evaluation showed a considerable increase in robustness for both SRS when asymmetric windows were applied in place of the standard symmetric Hamming window. It should be stressed that replacing a window function is a simple procedure that does not increase the time or space complexity of an SRS. Results of sequential tests confirmed a similar increase in robustness even on SRS that were trained using the Hamming window.
These effects make window replacement much simpler and cheaper than other procedures for enhancing robustness. Practical experience with implementations of SRS shows the great importance of robustness. It is reasonable to expect that robustness will become even more important in the future because of the growing diversity of acoustic environments. Robustness of automatic speech recognition and procedures for its enhancement will therefore remain an important research target.
--------------------------------------------------------------------------------
[1] Robustness means successful recognition even in the presence of distortions.
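The idea of merging parts of symmetric windows into an asymmetric one, and then swapping that window into the frame analysis in place of the Hamming window, can be illustrated with a minimal sketch. The abstract does not give any window formula, so everything below is an assumption made for illustration: the shape (rising half of a Hamming window followed by a quarter cosine cycle), the 240-sample frame with a 200/40 split, the 8 kHz sampling rate, and the helper names `asymmetric_window` and `windowed_power_spectrum`. It is not the thesis's design procedure.

```python
import numpy as np


def asymmetric_window(length, rise):
    """Hypothetical asymmetric window: a long Hamming-shaped rise
    followed by a short quarter-cosine fall (assumed shape)."""
    if not 0 < rise < length:
        raise ValueError("rise must satisfy 0 < rise < length")
    fall = length - rise
    # Rising half of a symmetric Hamming window of length 2 * rise.
    n_rise = np.arange(rise)
    rising = 0.54 - 0.46 * np.cos(2 * np.pi * n_rise / (2 * rise - 1))
    # First quarter of a cosine cycle, starting at 1 and decaying towards 0.
    n_fall = np.arange(fall)
    falling = np.cos(2 * np.pi * n_fall / (4 * fall - 1))
    return np.concatenate([rising, falling])


def windowed_power_spectrum(frame, window, n_fft=512):
    """Power spectrum of one windowed frame (the first step of an
    MFCC-style front end; mel filtering and DCT are omitted here)."""
    return np.abs(np.fft.rfft(frame * window, n=n_fft)) ** 2


# Usage: analyse the same frame with a Hamming and an asymmetric window.
sample_rate = 8000                      # assumed sampling rate in Hz
t = np.arange(240) / sample_rate        # one 30 ms frame
frame = np.sin(2 * np.pi * 1000 * t)    # 1 kHz test tone

hamming_spec = windowed_power_spectrum(frame, np.hamming(240))
asym_spec = windowed_power_spectrum(frame, asymmetric_window(240, 200))
print("peak bin (Hamming):   ", int(np.argmax(hamming_spec)))
print("peak bin (asymmetric):", int(np.argmax(asym_spec)))
```

Only the window vector changes between the two calls, which is consistent with the abstract's remark that replacing the window does not add time or space complexity. The peak of the sketched window lies near the end of the frame, so the most recent samples receive the largest weight.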
Item Type: | Thesis (PhD thesis)
---|---
Keywords: |
Language of Content: | Slovenian
Mentor / Co-mentors: |
Link to COBISS: | http://www.cobiss.si/scripts/cobiss?command=search&base=50070&select=(ID=19055399)
Institution: | University of Ljubljana
Department: | Faculty of Computer and Information Science
Item ID: | 772
Date Deposited: | 11 Dec 2008 16:58
Last Modified: | 13 Aug 2011 00:34
URI: | http://eprints.fri.uni-lj.si/id/eprint/772