1. Introduction
Vocal effort (VE) has been characterized as “the quantity that ordinary speakers vary when they adapt their speech to the demands of an increased or decreased communication distance” [1]. Five VE levels are generally distinguished: whispered, soft, normal, loud, and shouted. A change in VE fundamentally alters speech production and consequently changes the acoustic characteristics of the signal, which reduces the accuracy of speech recognition systems [2,3]. Accurate VE detection can therefore broaden the application range of speech recognition technology and improve its practicality. In addition, it benefits speaker recognition and speech synthesis [4-7].
It is important for VE detection to find salient information regarding the VE level and to obtain features that are sensitive to VE changes. Because the vocal cords barely vibrate during whispering, whispered speech differs markedly from the other VE levels in both production mechanism and acoustic characteristics. Therefore, as a typical representative of VE, whisper has been studied since the 1960s, and the accuracy of whisper detection is satisfactory [8,9]. In [10], the average energy ratio between the high-energy and low-energy segments of the low-frequency band is computed and used to decide whether an utterance is whispered or normally voiced. Zhang and Hansen [11] proposed a method for detecting vocal effort change points. In [12], a whisper detection algorithm is proposed using features obtained from waveform energy and wave period. In addition, an accurate whispered speech detection method, which uses auditory-inspired modulation spectral-based features to separate speech from environment-based components, is proposed in [13].
For the remaining four VE levels, there are no significant differences in the manner of pronunciation and no pronounced changes in the spectrum; this is even more true for adjacent VE levels. Therefore, only a few studies consider the detection of all five levels jointly, and the reported performance is limited. In [2,14], spectrum features including sound intensity level, sentence duration, frame energy distribution, and spectral tilt are extracted to recognize all VE levels. Experimental results show that these spectrum features distinguish the whisper mode well but perform poorly on the other VE modes. In [3], a VE classification method using a Support Vector Machine (SVM) based on Mel-frequency cepstral coefficients (MFCCs) is proposed. In addition, a detection method integrating spectrum features and MFCCs is proposed for the identification of VE levels in robust speech recognition [15]. Compared with the spectrum features, MFCCs show a stronger ability to distinguish all VE levels. Nevertheless, MFCCs were designed for speech recognition: they carry salient information about the speech content rather than about the VE level. Thus, MFCCs have limited potential in VE detection.
To further improve the detection accuracy over all five VE levels, this paper proposes the spectral information entropy (SIE) feature, which distinguishes all VE levels better than MFCCs and spectrum features. Because the spectrum features, MFCCs, and SIE describe the speech signal from different aspects, the salient VE-level information they carry does not completely overlap, so the three features should be complementary for VE detection. This paper therefore proposes a VE detection method based on model fusion that integrates the three features effectively.
This paper is organized as follows. Section 2 introduces the spectral information entropy. Section 3 presents the model fusion method based on complementary models. The performance of the proposed method is reported in Section 4, and Section 5 concludes the work.
2. Spectral Information Entropy
As noted above, VE detection requires features that carry salient information about the VE level and are sensitive to VE changes. In view of the drawbacks of the global spectrum features discussed in Section 1, frame-based features, which can capture small differences in acoustic properties, are introduced. However, the MFCC feature was designed for speech recognition, so it mainly reflects acoustic properties caused by different pronunciations rather than by VE changes. To detect all VE levels accurately, we propose the spectral information entropy feature, which contains more salient information regarding the VE level.
2.1 Feature Extraction
For each frame, the spectrum obtained from the FFT can be viewed as a vector of coefficients in an orthonormal basis. Hence, a probability density function (pdf) can be estimated by normalizing over all frequency components, and the spectral information entropy is computed from this estimated pdf.
The spectrum of each frame is evenly divided into six sub-bands over the frequency range 0–4,000 Hz, and the SIE of each sub-band is calculated, forming a 6-dimensional SIE feature per frame. The six sub-bands and their frequency ranges are shown in Table 1.
Table 1. The six sub-bands and their frequency ranges over 0–4,000 Hz
For each sub-band, the spectral information entropy can be obtained as follows. Assuming $X(k)$ is the power spectrum of a speech frame $x(n)$ and $k$ varies from $k_1$ to $k_M$ in a sub-band, the portion of the frequency content at bin $k$ relative to the entire sub-band response is written as

$$p(k) = \frac{X(k)}{\sum_{l=k_1}^{k_M} X(l)}, \qquad k = k_1, \ldots, k_M. \quad (1)$$
Since $\sum_{k=k_1}^{k_M} p(k) = 1$, $p(k)$ has the properties of a probability distribution. The spectral information entropy for the sub-band can then be calculated as

$$H = -\sum_{k=k_1}^{k_M} p(k) \log p(k). \quad (2)$$
Using the power spectrum of each frame, the above calculation is performed for each of the six sub-bands, yielding the 6-D SIE feature over the frequency domain.
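As a concrete illustration, the following is a minimal sketch of this extraction in Python; the frame length, the flooring constant added before the logarithm, and the use of NumPy's real FFT are assumptions of this sketch rather than details specified above.

```python
import numpy as np

def sie_features(frame, sample_rate=16000, n_bands=6, max_freq=4000.0):
    """Compute the 6-D spectral information entropy (SIE) of one speech frame.

    `frame` is a 1-D array of time-domain samples; the band edges follow the
    even split of 0-4,000 Hz into six sub-bands described above (Table 1).
    """
    spectrum = np.abs(np.fft.rfft(frame)) ** 2            # power spectrum X(k)
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)

    sie = np.empty(n_bands)
    edges = np.linspace(0.0, max_freq, n_bands + 1)       # sub-band boundaries
    for b in range(n_bands):
        band = spectrum[(freqs >= edges[b]) & (freqs < edges[b + 1])]
        p = band / (band.sum() + 1e-12)                   # Eq. (1): normalize to a pdf
        sie[b] = -np.sum(p * np.log(p + 1e-12))           # Eq. (2): entropy of the band
    return sie

# Example: one 32-ms frame (512 samples at 16 kHz) of white noise
frame = np.random.randn(512)
print(sie_features(frame))
```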
2.2 Salient Information Analysis of SIE
From the perspective of speech perception, speech signals are composed of vowels, consonants, and silent segments. Silent segments obviously contain no salient information regarding the VE level, so we only need to determine whether vowels or consonants contain more salient information.
To facilitate the analysis, we assume that speech segments whose spectra change more when the VE level changes contain more salient information regarding the VE level. For this purpose, a Euclidean distance-based cepstral distance measure $D_C$ is used:

$$D_C\left(VE_j, VE_k\right) = \sqrt{\sum_{i=1}^{N} \left( c_p^{VE_j}(i) - c_p^{VE_k}(i) \right)^2}, \quad (3)$$
where $N$ is the number of SIE coefficients and $c_p^{VE_j}(i)$ is the $i$th SIE coefficient of phoneme $p$ at vocal effort level $VE_j$. The average distance over all pairs of VE levels was then computed for each phoneme. After normalization, the obtained average distances for all phonemes are shown in Fig. 1 (sorted in descending order). The highest average distances were obtained for the five vowels (/a/, /e/, /o/, /i/, /u/) and the consonants /j/, /g/, and /y/. Since these consonants appear less frequently in words, the vowels are the best candidates for VE classification.
Fig. 1. Sorted average cepstral distances among the five VE levels for all phonemes.
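A minimal sketch of this pairwise averaging is given below; representing each phoneme by one mean SIE vector per VE level is an assumption of the sketch, not a detail specified above.

```python
import numpy as np
from itertools import combinations

def avg_pairwise_distance(sie_by_level):
    """Average Euclidean distance D_C (Eq. 3) over all pairs of VE levels.

    `sie_by_level` maps a VE level name to the mean SIE vector of one
    phoneme at that level (the pooling choice is an assumption here).
    """
    levels = list(sie_by_level)
    dists = [np.linalg.norm(sie_by_level[a] - sie_by_level[b])   # Eq. (3)
             for a, b in combinations(levels, 2)]
    return np.mean(dists)

# Toy example with random 6-D SIE vectors for the five VE levels
rng = np.random.default_rng(0)
levels = {ve: rng.random(6) for ve in
          ["whispered", "soft", "normal", "loud", "shouted"]}
print(avg_pairwise_distance(levels))
```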
The average cepstral distances were also computed with MFCC features and compared with those of SIE. Following the analysis above, we compare only the five Chinese vowels /a/, /e/, /o/, /i/, and /u/. As shown in Fig. 2, the average cepstral distance of each vowel is higher with SIE features than with MFCC features, which suggests that SIE contains more salient information regarding the VE level.
Fig. 2. Comparison of average cepstral distances between MFCC and SIE for the five vowels.
It is important to keep in mind that the speech samples grouped into the individual VE levels are not the result of some artificial signal classification but a genuine representation of what the speakers considered to be “whispering”, “soft speech”, “normal speech”, etc. The distances plotted in Figs. 1 and 2 represent a measurable physical quantity that can be used to quantify the difference between VE levels.
3. Vocal Effort Detection Based on Model Fusion
Section 2.2 showed that the simple vowels (/a/, /e/, /o/, /i/, /u/) contain more salient information regarding the VE level than other phonemes, so they are extracted from the speech signal for VE level detection, either by manual segmentation or by vowel endpoint detection. If the simple vowel sequence in a sentence $S$ is $\{v_1, v_2, \ldots, v_n\}$, then

$$S^{VE} = \left\{ v_1^{VE}, v_2^{VE}, \ldots, v_n^{VE} \right\}, \quad (4)$$
where $S^{VE}$ denotes that the vocal effort level of $S$ is $VE$, and $v_i^{VE}$ denotes that the VE level of the $i$th simple vowel is $VE$. Eq. (4) states that if the VE level of $S$ is $VE$, then the VE levels of all simple vowels in $S$ are also $VE$, and the converse is also true. Thus, the most likely VE level of $S$ is $S^{VE^*} = \left\{ v_1^{VE^*}, v_2^{VE^*}, \cdots, v_n^{VE^*} \right\}$, where

$$VE^* = \arg\max_{VE} \, p\left(S^{VE} \mid F, M, I\right), \quad (5)$$
where $F = \{f_1, f_2, \ldots, f_n\}$ is the sequence of spectrum features, $M = \{m_1, m_2, \ldots, m_n\}$ is the sequence of MFCC features, and $I = \{i_1, i_2, \ldots, i_n\}$ is the sequence of SIE features. Under an independence assumption, Eq. (5) can be transformed into Eq. (6):

$$VE^* = \arg\max_{VE} \sum_{t=1}^{n} \left[ \log p\left(v_t^{VE} \mid f_t\right) + \alpha \log p\left(v_t^{VE} \mid m_t\right) + \beta \log p\left(v_t^{VE} \mid i_t\right) \right], \quad (6)$$
where $\log p\left(v_t^{VE} \mid f_t\right)$ is the spectrum-VE model score, $\log p\left(v_t^{VE} \mid m_t\right)$ is the MFCC-VE model score, $\log p\left(v_t^{VE} \mid i_t\right)$ is the SIE-VE model score, and $\alpha$ and $\beta$ are weight coefficients among the three models.
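For illustration, the following sketch evaluates the fused score of Eq. (6) for a vowel sequence; the array shapes and the particular values of $\alpha$ and $\beta$ (picked inside the ranges later reported in Section 4) are assumptions of the sketch.

```python
import numpy as np

def fused_score_eq6(log_f, log_m, log_i, alpha, beta):
    """Eq. (6): per-level fused score of a vowel sequence.

    log_f, log_m, log_i are (n_vowels, n_levels) arrays of per-vowel
    log-posteriors from the spectrum-VE, MFCC-VE, and SIE-VE models.
    """
    return (log_f + alpha * log_m + beta * log_i).sum(axis=0)

# Toy example: 3 vowels, 5 VE levels, illustrative alpha/beta values
rng = np.random.default_rng(1)
log_f, log_m, log_i = (np.log(rng.dirichlet(np.ones(5), size=3))
                       for _ in range(3))
scores = fused_score_eq6(log_f, log_m, log_i, alpha=0.2, beta=0.3)
print("Detected VE level index:", int(np.argmax(scores)))
```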
Eq. (6) is premised on the condition that $F$, $M$, and $I$ are independent. However, this independence does not actually hold for the spectrum, MFCC, and SIE features. To reduce computational complexity, $p\left(S^{VE} \mid F, M, I\right)$ is generally simplified as $p\left(S^{VE} \mid F\right) p\left(S^{VE} \mid M\right) p\left(S^{VE} \mid I\right)$, but this sacrifices detection accuracy to some extent.
Instead of the simplified calculation above, we propose another model fusion method that does not rely on the hypothesis that the spectrum, MFCC, and SIE features are independent. Eq. (5) can be rewritten using

$$p\left(S^{VE} \mid F, M, I\right) = \lambda \cdot p\left(S^{VE} \mid F, M, I\right) + (1 - \lambda) \cdot p\left(S^{VE} \mid F, M, I\right). \quad (7)$$
Eq. (7) is merely a decomposition of Eq. (5). We give $\lambda \cdot p\left(S^{VE} \mid F, M, I\right)$ the new symbol $\lambda \cdot p_1\left(S^{VE} \mid F, M, I\right)$, and $(1-\lambda) \cdot p\left(S^{VE} \mid F, M, I\right)$ the new symbol $(1-\lambda) \cdot p_2\left(S^{VE} \mid F, M, I\right)$. If the same method is used to model $p_1(\cdot)$ and $p_2(\cdot)$, Eq. (7) can be realized by a single traditional machine learning method such as a Gaussian mixture model (GMM), an SVM, or an artificial neural network (ANN). If different methods are used but the independence hypothesis is adopted, Eq. (7) reduces to Eq. (6). If different methods are used to model $p_1(\cdot)$ and $p_2(\cdot)$ and the independence hypothesis is abandoned, a new model fusion method for VE level recognition is obtained, and Eq. (7) can be transformed into Eq. (8):

$$VE^* = \arg\max_{VE} \left[ \lambda \cdot \log p_1\left(S^{VE} \mid F, M, I\right) + \gamma \, (1-\lambda) \cdot \log p_2\left(S^{VE} \mid F, M, I\right) \right], \quad (8)$$

where $\gamma$ balances the score scales of the two models.
Two different machine learning methods can then be used to model $p_1(\cdot)$ and $p_2(\cdot)$ separately, and the two resulting models are complementary to some extent. The idea of complementarity comes from the observation that different systems make different confusions: one model performs VE detection in one way, the other in another way, and the distributions of their results overlap only partially rather than coinciding. A complementary effect therefore exists between them. By means of this model fusion, the spectrum, MFCC, and SIE features are integrated effectively for VE level detection.
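The sketch below illustrates the score interpolation of Eq. (8) for two such complementary models; the example posteriors, the default $\lambda$, and the reading of $\gamma$ as a scale-balancing coefficient are assumptions made for illustration (the experiments in Section 4 use $\gamma = 1$).

```python
import numpy as np

def fused_score_eq8(log_p1, log_p2, lam=0.5, gamma=1.0):
    """Eq. (8): interpolate sentence-level log-scores of two complementary
    models (e.g., an MLP and an SVM, both trained on all three features)."""
    return lam * log_p1 + gamma * (1.0 - lam) * log_p2

# Toy example: sentence-level log-posteriors over the 5 VE levels
log_p1 = np.log([0.10, 0.15, 0.40, 0.20, 0.15])   # model p1 (e.g., MLP*)
log_p2 = np.log([0.05, 0.10, 0.35, 0.30, 0.20])   # model p2 (e.g., SVM*)
print("Detected VE level index:",
      int(np.argmax(fused_score_eq8(log_p1, log_p2))))
```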
4. Experimental Results and Analysis
4.1 Speech Corpora
The corpus used in the experiments consists of 25,000 Mandarin isolated digits (0–9) recorded by twenty male speakers, who contribute to both the training set and the test set. In the training set, each VE level contains 4,000 digits (each speaker recorded the digits 0–9 twenty times); in the test set, each VE level contains 1,000 digits (each speaker recorded the digits 0–9 five times). The corpus was recorded in a laboratory environment and stored at a 16-kHz sampling rate with 16-bit resolution.
4.2 Experimental Setup and Result Analysis
To find out which feature is most advantageous for VE detection, each feature type is first employed alone, i.e., the spectrum feature, the MFCC feature, and the SIE feature are used in turn. This means that $\log p\left(v_t^{VE} \mid f_t\right)$, $\log p\left(v_t^{VE} \mid m_t\right)$, and $\log p\left(v_t^{VE} \mid i_t\right)$ in Eq. (6) are used separately for VE detection. The spectrum feature comprises the sound intensity level, vowel duration, frame energy distribution, and spectral tilt introduced in [2]. GMM, SVM, and multilayer perceptron (MLP) classifiers are selected as detection models. The MLP has one hidden layer with 2N+1 hidden nodes, where N is the number of input nodes. LibSVM is used to train the SVM model [16]. Each GMM has 128 mixture components with diagonal covariance matrices. The detection results can be seen in Table 2.
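A minimal sketch of this model setup using scikit-learn is shown below; the feature dimensionalities and the use of scikit-learn (whose SVC wraps LibSVM) in place of the original toolchain are assumptions of the sketch, not details from the paper.

```python
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# Feature dimensionality per vowel; the exact dimensions of the spectrum
# feature and the MFCC vector are assumptions of this sketch
n_features = 4 + 13 + 6          # spectrum (4), MFCC (13, assumed), SIE (6)

# GMM detector: one 128-mixture, diagonal-covariance GMM per VE level
gmms = {ve: GaussianMixture(n_components=128, covariance_type="diag")
        for ve in range(5)}

# SVM detector (LibSVM-backed)
svm = SVC(kernel="rbf", probability=True)

# MLP with one hidden layer of 2N+1 nodes, N = number of input nodes
mlp = MLPClassifier(hidden_layer_sizes=(2 * n_features + 1,))

# Training would look like (features X, per-vowel VE labels y):
# for ve, g in gmms.items(): g.fit(X[y == ve])
# svm.fit(X, y); mlp.fit(X, y)
```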
Table 2. Two-stage VE detection results
As can be seen in Table 2, regardless of the model used, the SIE feature gives the best performance, while the spectrum feature performs close to SIE when detecting the whisper mode. These results indicate that the proposed SIE feature is more sensitive to VE changes than the spectrum and MFCC features, and that the spectrum feature still provides sufficient salient information for the whisper level. In addition, SVM outperforms GMM and MLP.
Table 3 shows the performance of various combined models integrated according to Eq. (6). Values of $\alpha$ ranging from 0.15 to 0.3 and values of $\beta$ ranging from 0.2 to 0.4 work well, fusing the detection results of the spectrum-VE model (a GMM trained on the spectrum feature), the MFCC-VE model (an MLP trained on the MFCC feature), and the SIE-VE model (an SVM trained on the SIE feature). As Table 3 shows, combining the different features in the manner of Eq. (6) yields better performance than any single feature for all classifiers.
Table 3. Performance of the combined models
Finally, the proposed method, which combines two classifiers by weighting according to Eq. (8), is used for VE detection; the performance is shown in Table 4. Unlike in Table 3, all models in Table 4 (GMM*, MLP*, SVM*) are trained on the spectrum, MFCC, and SIE features together. The value of $\gamma$ in Eq. (8) is set to 1.
Table 4. Performance of the complementary models
Table 4 shows that both the complementary model MLP*/SVM* and the complementary model GMM*/SVM* achieve better performance than the combined models in Table 3. This means that the complementary-model fusion approach of Eq. (8) integrates the different features better than the fusion approach of Eq. (6). The performance of the complementary model GMM*/MLP* is slightly worse than that of the combined models in Table 3, possibly because SVM, which has shown strong classification ability, is not used.
5. Conclusion
In this paper, the spectral information entropy feature, which contains more salient information regarding the VE level, is first presented. After analyzing the sensitivity of the global spectrum features, MFCC, and SIE to changes in VE level, we propose a model fusion method based on complementary models, which yields an average VE detection accuracy of 81.6%.
Future research will focus on more precise detection of VE levels in real-world conditions (e.g., in the presence of additive noise).
Acknowledgement
This work was supported in part by the National Natural Science Foundation of China (No. 61502150, 61300124, and 61403128), the Foundation for University Key Teachers of Henan Province (No. 2015GGJS-068), and the Fundamental Research Funds for the Universities of Henan Province.