Hybrid statistical static and dynamic pronunciation models designed to be trained by a medium-size corpus
Abstract
Generating pronunciation variants of words is an important applied topic in speech research, used extensively in automatic speech recognition and segmentation systems. Decision trees are widely used to model pronunciation variants of words and sub-word units. For word units and a very large vocabulary, training the necessary decision trees requires a huge corpus of speech utterances containing every word in the vocabulary with a sufficient number of repetitions of each; moreover, an extra corpus is needed for every word that is not included in the original training corpus and may later be added to the vocabulary. To overcome these drawbacks, we have designed generalized decision trees, which can be trained on a medium-size corpus over groups of similar words that share pronunciation information, instead of training a separate tree for every single word. Generalized decision trees predict the places in a word where substitution, deletion, and insertion of phonemes may occur. In a subsequent step, to determine the word variants precisely, appropriate statistical contextual rules are applied at the permitted places. Hybrids of generalized decision trees and contextual rules are designed in static and dynamic versions. The hybrid static pronunciation models take word phonological structure, unigram probabilities, stress, and phone-context information into account simultaneously, while the hybrid dynamic models consider an extra feature, speaking rate, when generating pronunciation variants of words. Using the word variants generated by the static and dynamic models in the lexicon of the SHENAVA Persian continuous speech recognizer, relative word error rate reductions as high as 8.1% and 10.3%, respectively, are obtained.
Keywords
Pronunciation Models, Continuous Speech Recognition, Lexicon