Introduction:

Vowel formants (F1, F2, F3, etc.) are prominent frequencies in the spectrum of the human voice that are said to distinguish vowels from one another. However, the formant frequencies of one and the same vowel vary to a very large extent across speakers. How, then, do we recognize and distinguish vowels?

The Formant Ratio Theory (Potter and Steinberg, 1950; Nearey, 1978) suggests that each vowel category is characterized by specific ratios between its formants, and that these ratios do not depend on the absolute formant frequencies in Hz. For example, F1 and F2 of the vowel [i] may vary from token to token and from speaker to speaker, in the range of 190 Hz to 590 Hz for F1 and 2000 Hz to 3610 Hz for F2, whereas the ratio between the two frequencies is, according to the theory, held constant: F1 : F2 = 1 : 9.
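
As a minimal illustration of the claim, the sketch below computes F1:F2 for the four [i] tokens listed in the Appendix (speakers 1 and 2) and then checks that scaling both formants by the same factor leaves the ratio unchanged. The scaling factor 1.2 is a hypothetical value chosen only for illustration; it is not taken from the data.

```r
# F1 and F2 (Hz) of the four [i] tokens from the Appendix (speakers 1 and 2)
f1 <- c(240, 280, 220, 210)
f2 <- c(2280, 2400, 2220, 2360)

# Formant ratio of each token
ratio <- f1 / f2
print(round(ratio, 3))

# Hypothetical uniform scaling of both formants (assumed factor, not from the data)
k <- 1.2
scaled_ratio <- (k * f1) / (k * f2)

# The absolute frequencies change, the ratio does not
print(all.equal(ratio, scaled_ratio))
```

The tokens differ in their absolute frequencies, but a uniform rescaling of the spectrum cannot change the ratio, which is the property the theory relies on.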

Not only does the Formant Ratio Theory offer a plausible explanation for an otherwise puzzling phenomenon, it also draws an important parallel to the human perception of musical intervals and chords, which are likewise recognized and distinguished by the ratios between their component tones. The theory, however, has been challenged (Bondarko, 1984; Johnson, 2005) on the grounds that the variance of the ratios within each vowel category is too great to support it.

Do formant ratios really differ significantly across vowel types? If so, is the variance between categories larger than the variance within each category? Are vowels subject to the same laws of human hearing that govern music? These are the questions we are going to test statistically.

Data:

Data Collection: The data was collected by Peterson and Barney and used in their 1952 JASA paper. Peterson and Barney measured the frequencies of F0, F1, F2 and F3 for 10 vowels produced by 76 speakers. This is how they describe the process of data collection: "A list of words was presented to the speaker and his utterances of the words were recorded with a magnetic tape recorder. The list contained ten monosyllabic words each beginning with [h] and ending with [d] and differing only in the vowel. The words used were heed, hid, head, had, hod, hawed, hood, who'd, hud, and heard. The order of the words was randomized in each list, and each speaker was asked to pronounce two different lists. The purpose of randomizing the words in the list was to avoid practice effects which would be associated with an unvarying order. A total of 76 speakers, including 33 men, 28 women and 15 children, each recorded two lists of 10 words, making a total of 1520 recorded words. Two of the speakers were born outside the United States and a few others spoke a foreign language before learning English. Most of the women and children grew up in the Middle Atlantic speech area. The male speakers represented much broader regional sampling of the United States; the majority of them spoke General American" (Peterson and Barney, 1952).

Cases and variables: The cases (experimental units) are the 1520 recorded vowels. Each case is described by 3 categorical variables - Sex (2 levels: male, female), Type (3 levels: men, women, children) and Vowel (10 levels: 1,2,3,4,5,6,7,8,9,10), also coded as Signs (the same 10 levels: R, i, I, E, A, a, o, V, U, u) - and by 4 continuous numerical variables: F0, F1, F2 and F3. We are going to investigate the relation between "Vowel" (a 10-level independent categorical variable) and "Ratio" (a dependent continuous numerical variable obtained by dividing F1 by F2).
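
The derived "Ratio" variable can be sketched as follows. The example uses six rows copied from the Appendix (speaker 1, vowels i, a and u) rather than the full dataset, and the data frame name pb is a placeholder introduced here for illustration.

```r
# A few rows copied from the Appendix (speaker 1); the full dataset has 1520 rows.
# The name `pb` is a placeholder used only in this sketch.
pb <- data.frame(
  Signs = factor(c("i", "i", "a", "a", "u", "u")),
  F1    = c(240, 280, 740, 800, 240, 270),
  F2    = c(2280, 2400, 1070, 1060, 1040, 930)
)

# Derived dependent variable: the F1:F2 ratio
pb$Ratio <- pb$F1 / pb$F2

# Mean ratio per vowel category
ratio_means <- tapply(pb$Ratio, pb$Signs, mean)
print(round(ratio_means, 3))
```

Signs is declared as a factor explicitly so that it is treated as a categorical variable regardless of the R version's defaults.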

Type of study: This is a retrospective observational study: speakers were recorded pronouncing vowels, they were not divided into treatment and control groups, and no external variables were manipulated. The setting was, however, designed to give as much control as possible over the target variables (gender, which translates into the fundamental frequency F0, and age, which translates into vocal tract length) and to reduce the influence of confounding factors such as regional and social language variation, co-articulation effects and measurement error.

Generalizability: The sample of speakers is in no way representative of the population of the United States, either geographically (Middle Atlantic speech area) or socially (there is not much detail on this, but the speakers were most probably volunteer students, colleagues and their family members, i.e. college-educated and upper-class), and it was not meant to be. Because a great many factors influence vowel spectra and phonetic data are generally volatile, the only way to keep the sample size reasonable AND to obtain statistically sound results is to sample from a linguistically homogeneous group, i.e. from speakers of the same social and regional dialect. Blocking provided the necessary stratification of the sample (33 men, 28 women, 15 children - 7 boys and 8 girls). The sample can be considered representative of a specific dialectal group, and the results can be generalized to that dialect as a whole.

Causality: Do formant ratios (F1:F2, F2:F3) define vowel classes, or do vowel classes define the ratios? It is believed that there is no causal relationship between a sign (a sound) and its meaning (a vowel class). We are going to test whether there is (HA) or is not (H0) a significant association between the F1:F2 ratio (a sign) and the vowel class (a meaning).

Exploratory data analysis:

The summary statistics give a general feel for the data:

summary(data)
##       row         Type    Sex        Speaker         Vowel    
##  Min.   :   1.0   c:300   f:720   1      :  20   1      :152  
##  1st Qu.: 380.8   m:660   m:800   2      :  20   2      :152  
##  Median : 760.5   w:560           3      :  20   3      :152  
##  Mean   : 760.5                   4      :  20   4      :152  
##  3rd Qu.:1140.2                   5      :  20   5      :152  
##  Max.   :1520.0                   6      :  20   6      :152  
##                                   (Other):1400   (Other):608  
##      Signs           F0              F1               F2      
##  a      :152   Min.   : 91.0   Min.   : 190.0   Min.   : 560  
##  A      :152   1st Qu.:133.0   1st Qu.: 420.0   1st Qu.:1100  
##  E      :152   Median :200.0   Median : 540.0   Median :1470  
##  i      :152   Mean   :191.3   Mean   : 563.3   Mean   :1624  
##  I      :152   3rd Qu.:238.0   3rd Qu.: 681.0   3rd Qu.:2100  
##  o      :152   Max.   :350.0   Max.   :1300.0   Max.   :3610  
##  (Other):608                                                  
##        F3      
##  Min.   :1400  
##  1st Qu.:2370  
##  Median :2680  
##  Mean   :2708  
##  3rd Qu.:3030  
##  Max.   :4380  
## 

The scatterplot of the absolute values of F1 and F2 shows considerable variance within each group. Yet the overlapping clouds of different colors, which correspond to different vowel categories, stretch along lines radiating from the origin, which suggests an association between formant ratios (here, the slopes of those lines) and vowel categories.

library(ggplot2)  # provides qplot()
qplot(data=data, F1, F2, color=Signs, xlim=c(0,1500),ylim=c(0,4000),main="F1 versus F2 plot",xlab="F1, Hz", ylab="F2, Hz")

The parallel boxplots of formant ratios for each vowel category show the medians of ratios and their variance across and within the categories. The two ratios (F1:F2) and (F2:F3) form a unique combination for each vowel type.

plot(data$F2/data$F3~data$Vowel, ylim=c(0,1), xlab="Vowel type", 
     ylab="Formant ratios", names=c("R", "i", "I", "E", "A", "a", "o", "V", "U", "u"), col="grey", 
     main="Vowel formant ratios boxplot")
par(new=TRUE)  # draw the next boxplot on top of the current one
plot(data$F1/data$F2~data$Vowel, ylim=c(0,1), xlab="Vowel type", 
     ylab="Formant ratios", names=c("R", "i", "I", "E", "A", "a", "o", "V", "U", "u"), col="pink")
legend("bottomright", c("F1:F2", "F2:F3"), col=c("pink","grey"), pch=16)

Is the observed association statistically significant, given the dispersion we see? Does the variance within the categories exceed the variance between them? This remains to be checked with an ANOVA test.

Inference:

Hypothesis:

H0: mu_1 = mu_2 = ... = mu_10, where mu_k is the mean F1:F2 ratio of vowel category k - there is no difference in formant ratios between vowel categories; all vowel types are equal with respect to formant ratios.

HA: mu_j != mu_k for at least one pair of vowel categories j and k - there is a significant difference in formant ratios between vowel categories; vowel types differ with respect to formant ratios.

Conditions: The speakers form a homogeneous group from the Middle Atlantic speech area, representative of the Standard Mid-Atlantic English dialect, which is what matters most for statistical inference in our case. Regional and social differences between speakers act as a strong confounding factor in cross-gender and cross-generation phonetic studies; the not-quite-random block selection almost eliminated this source of bias, providing data for a linguistically homogeneous group of speakers of different ages and genders. The otherwise observational study is designed (standard vowel context, supervised reading, standard instrumentation and measurement) so as to keep confounding variables to a minimum.

Methods: why and how. We have to compare a continuous numerical variable across several levels of a categorical variable, namely the means of F1/F2 (the dependent numerical variable) for each of the 10 vowels (the 10 levels of the independent categorical variable), and to see with an ANOVA test whether the differences between them are significant. The test shows that at least one pair of means differs significantly.

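# One-way ANOVA of the F1/F2 ratio across the 10 vowel categories
# (the result is not named `anova`, so the anova() function is not masked)
fit_aov <- aov(F1/F2 ~ Vowel, data = data)
summary(fit_aov)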
##               Df Sum Sq Mean Sq F value Pr(>F)    
## Vowel          9  49.01   5.446    1526 <2e-16 ***
## Residuals   1510   5.39   0.004                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
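
The F value in the table above is the ratio of the between-group mean square to the within-group mean square. The following sketch works through that arithmetic by hand on a toy subset (the [i], [a] and [u] tokens of speakers 1 and 2 from the Appendix) and cross-checks it against aov(). It illustrates the computation only and does not reproduce the full analysis.

```r
# Toy subset from the Appendix: F1/F2 of the [i], [a], [u] tokens of speakers 1 and 2
ratio <- c(240/2280, 280/2400, 220/2220, 210/2360,   # i
           740/1070, 800/1060, 650/1040, 670/1100,   # a
           240/1040, 270/930,  280/650,  260/660)    # u
vowel <- factor(rep(c("i", "a", "u"), each = 4))

grand_mean  <- mean(ratio)
group_means <- tapply(ratio, vowel, mean)
group_n     <- tapply(ratio, vowel, length)

# Between-group and within-group sums of squares
ss_between <- sum(group_n * (group_means - grand_mean)^2)
ss_within  <- sum((ratio - group_means[as.character(vowel)])^2)

# Degrees of freedom: (groups - 1) and (observations - groups)
df_between <- nlevels(vowel) - 1
df_within  <- length(ratio) - nlevels(vowel)

# F statistic and its upper-tail p-value
f_manual <- (ss_between / df_between) / (ss_within / df_within)
p_manual <- pf(f_manual, df_between, df_within, lower.tail = FALSE)

# Cross-check against aov()
f_aov <- summary(aov(ratio ~ vowel))[[1]][["F value"]][1]
print(c(manual = f_manual, aov = f_aov, p = p_manual))
```

Applied to the rounded entries of the table above, the same arithmetic, (49.01 / 9) / (5.39 / 1510), comes to roughly 1526, which matches the reported F value up to rounding.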

The pairwise t tests compare the group means and show that the difference is significant for all pairs of groups but one: vowels 5 (A) and 9 (U) do not differ significantly (p = 0.46).

pairwise.t.test(data$F1/data$F2, data$Vowel)
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  data$F1/data$F2 and data$Vowel 
## 
##    1       2       3       4       5       6       7       8       9      
## 2  < 2e-16 -       -       -       -       -       -       -       -      
## 3  < 2e-16 < 2e-16 -       -       -       -       -       -       -      
## 4  < 2e-16 < 2e-16 < 2e-16 -       -       -       -       -       -      
## 5  < 2e-16 < 2e-16 < 2e-16 < 2e-16 -       -       -       -       -      
## 6  < 2e-16 < 2e-16 < 2e-16 < 2e-16 < 2e-16 -       -       -       -      
## 7  < 2e-16 < 2e-16 < 2e-16 < 2e-16 < 2e-16 3.1e-05 -       -       -      
## 8  < 2e-16 < 2e-16 < 2e-16 < 2e-16 < 2e-16 < 2e-16 < 2e-16 -       -      
## 9  < 2e-16 < 2e-16 < 2e-16 < 2e-16 0.46    < 2e-16 < 2e-16 < 2e-16 -      
## 10 5.8e-10 < 2e-16 < 2e-16 < 2e-16 2.7e-06 < 2e-16 < 2e-16 < 2e-16 6.7e-08
## 
## P value adjustment method: holm
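
The p-values above are adjusted by Holm's method, the default in pairwise.t.test(). As a rough sketch of what that adjustment does, the code below applies it by hand to five hypothetical unadjusted p-values (invented for illustration, not taken from the data) and compares the result with p.adjust().

```r
# Hypothetical unadjusted p-values for five comparisons (not from the data)
p_raw <- c(0.001, 0.04, 0.03, 0.20, 0.012)

# Holm: sort ascending, multiply the i-th smallest by (m - i + 1),
# enforce monotonicity with a running maximum, and cap at 1
m   <- length(p_raw)
ord <- order(p_raw)
p_sorted_adj <- pmin(1, cummax((m - seq_len(m) + 1) * p_raw[ord]))

# Put the adjusted values back in the original order
p_holm_manual <- numeric(m)
p_holm_manual[ord] <- p_sorted_adj

# Cross-check against the built-in adjustment
p_holm <- p.adjust(p_raw, method = "holm")
print(rbind(raw = p_raw, holm = p_holm))
```

Because the smallest p-values receive the largest multipliers, a comparison has to be clearly significant before adjustment in order to remain significant among 45 pairwise tests.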

We are also going to build a series of linear regression models with categorical predictors and to run ANOVA on them. Along with the coefficients for each vowel category, these models include coefficients for type and gender, the variables that are significant in models predicting one absolute formant value from another: lm(F1 ~ F2 + Vowel + Sex, data = data) or lm(F1 ~ F2 + Vowel + Type, data = data).

fit0=lm(data=data, F1/F2~Vowel)
summary(fit0)
## 
## Call:
## lm(formula = F1/F2 ~ Vowel, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.21308 -0.03561 -0.00068  0.03040  0.33533 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.333297   0.004845  68.789  < 2e-16 ***
## Vowel2      -0.218971   0.006852 -31.956  < 2e-16 ***
## Vowel3      -0.142966   0.006852 -20.864  < 2e-16 ***
## Vowel4      -0.059850   0.006852  -8.734  < 2e-16 ***
## Vowel5       0.078293   0.006852  11.426  < 2e-16 ***
## Vowel6       0.361086   0.006852  52.696  < 2e-16 ***
## Vowel7       0.331371   0.006852  48.360  < 2e-16 ***
## Vowel8       0.203869   0.006852  29.752  < 2e-16 ***
## Vowel9       0.083355   0.006852  12.165  < 2e-16 ***
## Vowel10      0.044473   0.006852   6.490 1.16e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05974 on 1510 degrees of freedom
## Multiple R-squared:  0.901,  Adjusted R-squared:  0.9004 
## F-statistic:  1526 on 9 and 1510 DF,  p-value: < 2.2e-16
fit1=lm(data=data, F1/F2~Vowel+Type)
fit2=lm(data=data, F1/F2~Vowel+Sex)
anova(fit0, fit1, fit2)
## Analysis of Variance Table
## 
## Model 1: F1/F2 ~ Vowel
## Model 2: F1/F2 ~ Vowel + Type
## Model 3: F1/F2 ~ Vowel + Sex
##   Res.Df    RSS Df  Sum of Sq      F Pr(>F)
## 1   1510 5.3883                            
## 2   1508 5.3778  2  0.0104644 1.4672 0.2309
## 3   1509 5.3856 -1 -0.0077695 2.1786 0.1401
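
The second row of the table above is a partial F test for adding Type to the Vowel-only model. As a sketch, it can be recomputed from the printed values alone; the numbers below are copied from the table, so the result is subject to the table's rounding.

```r
# Values read off the ANOVA table above
rss_vowel_type <- 5.3778      # RSS of Model 2: F1/F2 ~ Vowel + Type
df_vowel_type  <- 1508        # residual degrees of freedom of Model 2
ss_type        <- 0.0104644   # reduction in RSS from adding Type to Model 1
df_type        <- 2           # extra parameters in Model 2

# Partial F test: mean square gained by Type over the residual mean square
f_type <- (ss_type / df_type) / (rss_vowel_type / df_vowel_type)
p_type <- pf(f_type, df_type, df_vowel_type, lower.tail = FALSE)
print(c(F = f_type, p = p_type))
```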

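The linear models that include vowel together with either gender or type show that vowel class is significant and that age and gender are not significant for predicting the formant ratio. Note that the third row of the table compares Model 3 with Model 2, which are not nested; comparing Model 3 directly with the Vowel-only Model 1 (the RSS drops by 0.0104644 - 0.0077695 = 0.0026949 on 1 degree of freedom) gives an F value of about 0.76, which is likewise far from significant. This agrees well with ANOVA testing of the other ratios (F1 : F3, F2 : F3) and with the descriptive visualizations.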

Checking the conditions for ANOVA, we see that the residuals are fairly normally distributed:

library(car)
qqPlot(lm(F1/F2 ~ Vowel, data=data))

The variance, however, differs significantly across the groups. This does not seem to undermine the validity of the inference drawn from the ANOVA test, since all our groups contain an equal number of cases (152 each).

bartlett.test(F1/F2 ~ Vowel, data=data)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  F1/F2 by Vowel
## Bartlett's K-squared = 466.5875, df = 9, p-value < 2.2e-16
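
One common way to check how much the conclusion depends on the equal-variance assumption is Welch's one-way test, which does not assume equal variances. The sketch below is not part of the original analysis; it applies oneway.test() from base R to a toy subset (the [i], [a] and [u] tokens of speakers 1 and 2 from the Appendix) purely to show the call.

```r
# Toy subset from the Appendix: F1/F2 of the [i], [a], [u] tokens of speakers 1 and 2
ratio <- c(240/2280, 280/2400, 220/2220, 210/2360,   # i
           740/1070, 800/1060, 650/1040, 670/1100,   # a
           240/1040, 270/930,  280/650,  260/660)    # u
vowel <- factor(rep(c("i", "a", "u"), each = 4))

# Per-group variances: unequal spread is what Bartlett's test detects
group_var <- tapply(ratio, vowel, var)
print(signif(group_var, 3))

# Welch's one-way test (no equal-variance assumption)
welch <- oneway.test(ratio ~ vowel, var.equal = FALSE)
print(welch)
```

On the full dataset the corresponding call would be oneway.test(F1/F2 ~ Vowel, data = data, var.equal = FALSE).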

Conclusion:

Having explored the variance of formant ratios within and across the 10 vowel types with a series of ANOVA tests and linear regression models with categorical predictors, we found that formant ratios are far from randomly distributed and are strongly associated with vowel type. We therefore reject the null hypothesis that vowel types are equal with respect to formant ratios.

The regression analysis also showed that formant ratios bear no significant dependence on gender or age (two important parameters that correlate with the length of the speaker's vocal tract and the fundamental frequency of his or her voice).

It should be noted, however, that the within-group variance of formant ratios is much larger than the tolerance on the ratios of musical intervals and chords within which they can still be correctly identified and satisfactorily rendered. This difference between vowels and chords will be addressed in a separate study.

Dataset:

The dataset used in the present study can be downloaded from http://teachers-lab.ru/Peterson.txt.

Different formats of the dataset can be found here: http://www.cs.cmu.edu/Groups/AI/areas/speech/database/pb/0.html.

References:

Bondarko, L. V. (1984). Phonetic description of language and phonological description of speech. Leningrad State University.

Johnson, K. (2005). Speaker normalization in speech perception. In: Handbook of speech perception, 363-389.

Nearey, T. M. (1978). Phonetic feature systems for vowels. Indiana University Linguistics Club, Bloomington, IN.

Peterson, G. E., Barney, H. L. (1952). Control methods used in a study of the vowels. JASA 24, 175-184.

Potter, R. K., Steinberg, J. C. (1950). Toward the specification of speech. JASA 22, 807-820.

Appendix:

print(data[1:40,])
##    row Type Sex Speaker Vowel Signs  F0  F1   F2   F3
## 1    1    m   m       1     2     i 160 240 2280 2850
## 2    2    m   m       1     2     i 186 280 2400 2790
## 3    3    m   m       1     3     I 203 390 2030 2640
## 4    4    m   m       1     3     I 192 310 1980 2550
## 5    5    m   m       1     4     E 161 490 1870 2420
## 6    6    m   m       1     4     E 155 570 1700 2600
## 7    7    m   m       1     5     A 140 560 1820 2660
## 8    8    m   m       1     5     A 180 630 1700 2550
## 9    9    m   m       1     8     V 144 590 1250 2620
## 10  10    m   m       1     8     V 148 620 1300 2530
## 11  11    m   m       1     6     a 148 740 1070 2490
## 12  12    m   m       1     6     a 170 800 1060 2640
## 13  13    m   m       1     7     o 161 600  970 2280
## 14  14    m   m       1     7     o 158 660  980 2220
## 15  15    m   m       1     9     U 163 440 1120 2210
## 16  16    m   m       1     9     U 190 400 1070 2280
## 17  17    m   m       1    10     u 160 240 1040 2150
## 18  18    m   m       1    10     u 157 270  930 2280
## 19  19    m   m       1     1     R 177 370 1520 1670
## 20  20    m   m       1     1     R 164 460 1330 1590
## 21  21    m   m       2     2     i 147 220 2220 2910
## 22  22    m   m       2     2     i 148 210 2360 3250
## 23  23    m   m       2     3     I 141 410 1890 2680
## 24  24    m   m       2     3     I 139 420 1850 2500
## 25  25    m   m       2     4     E 136 500 1760 2590
## 26  26    m   m       2     4     E 135 510 1710 2380
## 27  27    m   m       2     5     A 128 690 1610 2560
## 28  28    m   m       2     5     A 131 700 1690 2580
## 29  29    m   m       2     8     V 140 650 1080 2420
## 30  30    m   m       2     8     V 125 625 1060 2490
## 31  31    m   m       2     6     a 140 650 1040 2450
## 32  32    m   m       2     6     a 136 670 1100 2430
## 33  33    m   m       2     7     o 149 580  580 2470
## 34  34    m   m       2     7     o 140 560  560 2410
## 35  35    m   m       2     9     U 145 450  940 1910
## 36  36    m   m       2     9     U 141 410  830 2240
## 37  37    m   m       2    10     u 140 280  650 3300
## 38  38    m   m       2    10     u 137 260  660 3300
## 39  39    m   m       2     1     R 145 510 1210 1570
## 40  40    m   m       2     1     R 145 510 1130 1510