Part of the dissertation Pitch of Voiced Speech in the Short-Time Fourier Transform: Algorithms, Ground Truths, and Evaluation Methods.
© 2020, Bastian Bechtold, Jade Hochschule & Carl von Ossietzky Universität Oldenburg, Germany.

Fundamental Frequency Ground Truth for Speech Corpora from Multi-Algorithm Consensus

The fundamental frequency of the human voice is an essential feature for various speech processing tasks such as speech recognition, speaker identification, and speech compression. Therefore, a large number of fundamental frequency estimation algorithms have been developed. To evaluate the performance of these algorithms, their estimates are often compared against a known ground truth fundamental frequency, typically derived from laryngograph recordings. However, laryngograph recordings are not available for all kinds of speech corpora, and can be tonal where the acoustic speech signal is not. Alternatively, fundamental frequency estimates of speech in noise are compared against clean speech estimates of a reference algorithm. While this works for arbitrary speech recordings, it is highly dependent on the reference algorithm. We therefore propose a new method for deriving a fundamental frequency ground truth from the consensus of a number of state-of-the-art fundamental frequency estimation algorithms, which can be calculated for any speech corpus, is more robust than any single algorithm's estimate, and better reflects the acoustic tonality of speech.

This website contains the new consensus ground truth data as structured JBOF datasets, as well as scripts for calculating the consensus truth from the following five speech corpora:

The source code necessary for calculating the consensus truth is available on GitHub:

https://github.com/bastibe/Consensus-Truth-Scripts
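
As a rough illustration of the underlying idea (not the exact procedure implemented in the scripts above), a per-frame consensus could be formed as follows, assuming the algorithms' estimates and voicing decisions have already been resampled to a common time grid. All names and thresholds in this sketch are hypothetical:

```python
import numpy as np

def consensus_truth(estimates, voicing, max_deviation=0.2):
    """Sketch of a per-frame consensus from several F0 estimates.

    estimates : (n_algorithms, n_frames) array of F0 estimates in Hz
    voicing   : (n_algorithms, n_frames) boolean array of voicing decisions
    Returns a (n_frames,) consensus F0, NaN where no consensus is reached.
    """
    n_algorithms, n_frames = estimates.shape
    consensus = np.full(n_frames, np.nan)
    for frame in range(n_frames):
        voiced = voicing[:, frame]
        # require a majority of algorithms to agree that the frame is voiced:
        if voiced.sum() <= n_algorithms / 2:
            continue
        f0s = estimates[voiced, frame]
        median = np.median(f0s)
        # require the voiced estimates to agree within a relative deviation
        # (reminiscent of the common 20% gross-pitch-error threshold):
        if np.all(np.abs(f0s / median - 1) < max_deviation):
            consensus[frame] = median
    return consensus
```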

Source code for reading JBOF datasets in Python or MATLAB is provided under a free license on GitHub.
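
Reading a consensus dataset in Python might then look roughly like the following sketch. The method and attribute names below are assumptions about a JBOF-style interface, not the documented API; please consult the JBOF repository for the actual calls:

```python
# Sketch only: dataset path, method names, and array names are assumed,
# not taken from the JBOF documentation.
import jbof

dataset = jbof.DataSet('ptdb_tug_consensus')  # path to a JBOF dataset (assumed)
for item in dataset.all_items():              # iterate over items (assumed)
    print(item.metadata)                      # per-item metadata (assumed)
    f0 = item.pitch                           # consensus F0 track (assumed name)
```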

Examples

The following graph compares the consensus truth with the laryngograph-based ground truths in the FDA [2], KEELE [3], and PTDB-TUG [4] corpora.

The majority of all estimates are very similar, as evidenced by the large maximum at a ratio of one; the dashed lines mark the limits of gross correctness according to the common gross pitch error measure.

However, there are discrepancies between the ground truths and the consensus truth at lower frequencies, and particularly at the octave intervals (¹∕₃, ¹∕₂, 2), which indicate octave errors in the ground truths.

Thus, the consensus truth is broadly compatible with existing ground truths for the evaluation of fundamental frequency estimation algorithms, but more representative in edge cases such as subharmonics or obscured fundamentals, which would otherwise lead to octave errors.
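
For reference, the gross pitch error used above can be computed along these lines. This is a minimal sketch using the common 20% threshold; the function and variable names are illustrative, not taken from the dissertation's scripts:

```python
import numpy as np

def gross_pitch_error(estimate, truth, threshold=0.2):
    """Fraction of frames, voiced in both tracks, whose estimate deviates
    from the truth by more than the given relative threshold (commonly 20%)."""
    voiced = (estimate > 0) & (truth > 0)   # frames voiced in both tracks
    ratio = estimate[voiced] / truth[voiced]
    return np.mean(np.abs(ratio - 1) > threshold)
```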

A histogram of the resulting fundamental frequencies of common speech corpora is shown in the next graph:

Note that this histogram includes consensus truths for TIMIT [6], MOCHA-TIMIT [5], and CMU-ARCTIC [1], which do not have a ground truth of their own. Interestingly, this analysis shows significant differences in gender balance and frequency content between the corpora.

The full dissertation contains an entire chapter on these differences between the corpora, and their meaning for the evaluation of fundamental frequency estimation algorithms.

References:

  1. J. Kominek and A. W. Black, “CMU ARCTIC database for speech synthesis,” 2003.
  2. P. C. Bagshaw, S. Hiller, and M. A. Jack, “Enhanced Pitch Tracking and the Processing of F0 Contours for Computer Aided Intonation Teaching,” in EUROSPEECH, 1993.
  3. F. Plante, G. F. Meyer, and W. A. Ainsworth, “A Pitch Extraction Reference Database,” in Fourth European Conference on Speech Communication and Technology, Madrid, Spain, 1995, pp. 837–840.
  4. G. Pirker, M. Wohlmayr, S. Petrik, and F. Pernkopf, “A Pitch Tracking Corpus with Evaluation on Multipitch Tracking Scenario,” in INTERSPEECH, 2011, pp. 1509–1512.
  5. A. Wrench, “MOCHA MultiCHannel Articulatory database: English,” Nov. 1999. [Online]. Available: http://www.cstr.ed.ac.uk/research/projects/artic/mocha.html
  6. J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, N. L. Dahlgren, and V. Zue, “TIMIT Acoustic-Phonetic Continuous Speech Corpus,” 1993. [Online]. Available: https://catalog.ldc.upenn.edu/LDC93S1