Significance of spectral cues in automatic speech segmentation for Indian language speech synthesizers

[Journal] Speech Communication: Volume 123, October 2020, Pages 10-25


Arun Baby, Jeena J.Prakash, Aswin Shanmugam Subramanian, Hema A.Murthy


Building speech synthesis systems for Indian languages is challenging owing to the fact that digital resources for these languages are hardly available. Vocabulary independent speech synthesis requires that a given text is split at the level of the smallest sound unit, namely, phone. The waveforms or models of phones are concatenated to produce speech. The waveforms corresponding to that of the phones are obtained manual (listening and marking) when digital resources are scarce. But the manual labeling of speech data (also known as speech segmentation) can lead to inconsistencies as the duration of phones can be as short as 10ms.

The most common approach to automatic segmentation of speech is to perform forced alignment using monophone hidden Markov models (HMMs) that have been obtained using embedded re-estimation after flat start initialization. These results are then used in neural network frameworks to build better acoustic models for speech synthesis/recognition. Segmentation using this approach requires large amounts of data and does not work very well for low resource languages. To address the issue of paucity of data, signal processing cues like short-term energy (STE) and sub-band spectral flux (SBSF) are used in tandem with HMM based forced alignment for automatic speech segmentation.

STE and SBSF are computed on the speech waveforms. STE yields syllable boundaries, while SBSF provides locations of significant change in spectral flux that are indicative of fricatives, affricates, and nasals. STE and SBSF cannot be used directly to segment an utterance. Minimum phase group delay based smoothing is performed to preserve these landmarks, while at the same time reducing the local fluctuations. The boundaries obtained with HMMs are corrected at the syllable level, wherever it is known that the syllable boundaries are correct. Embedded re-estimation of monophone HMM models is again performed using the corrected alignment. Thus, using signal processing cues and HMM re-estimation in tandem, robust monophone HMM models are built. These models are then used in Gaussian mixture model (GMM), deep neural network (DNN) and convolutional neural network (CNN) frameworks to obtain state-level frame posteriors. The boundaries are again iteratively corrected and re-estimated.

Text-to-speech (TTS) systems are built for different Indian languages using phone alignments obtained with and without the use of signal processing based boundary corrections. Unit selection based and statistical parametric based TTS systems are built. The result of the listening tests showed a significant improvement in the quality of synthesis with the use of signal processing based boundary correction.


title = "Significance of spectral cues in automatic speech segmentation for Indian language speech synthesizers",
journal = "Speech Communication",
volume = "123",
pages = "10 - 25",
year = "2020",
issn = "0167-6393",
doi = "",
url = "",
author = "Arun Baby and Jeena J. Prakash and Aswin Shanmugam Subramanian and Hema A. Murthy",
keywords = "Speech segmentation, Signal processing cues, Short-term energy, Sub-band spectral flux, Hidden markov model, Gaussian mixture model, Deep neural network, Convolutional neural network",




Segmentation code