Part of the dissertation Pitch of Voiced Speech in the Short-Time Fourier Transform: Algorithms, Ground Truths, and Evaluation Methods.
© 2020, Bastian Bechtold, Jade Hochschule & Carl von Ossietzky Universität Oldenburg, Germany.

Fundamental Frequency Estimation of the Human Voice with the Magnitude and Phase Spectrogram

The fundamental frequency of the human voice is an important feature for speech processing applications such as speech enhancement, speech separation, and speech compression. This paper presents an algorithm that probabilistically combines features from the magnitude spectrum and the phase spectrum, and derives from them a direct pitch confidence measure that avoids both octave errors and ambiguous estimates. The algorithm classifies fewer frames as voiced than comparable algorithms, but remains reliable even at high levels of noise. These characteristics are examined with synthetic tone complexes and a large, freely available corpus of speech and noise recordings.

Code

Source code for MaPS in Python, Matlab, and Julia can be downloaded by cloning this repository:

https://github.com/bastibe/MAPS-Scripts

The code is released under the terms of the GNU LGPL 3 license, © 2018, Bastian Bechtold, Jade Hochschule. This means that the source code can be read, used, and modified freely, but our authorship of the code must be recognized, and any source distributed with our code must be licensed under an LGPL-compatible license. Additionally, we kindly request feedback on how the code is used.

Methods

MaPS consists of two complementary features: one correlates a comb-like template with the magnitude spectrum, the other compares a sawtooth-like template with a derivative of the phase spectrum. The results are combined into a probabilistic measure, which serves both as a voicing detector and as a fundamental frequency estimator.

This combination resolves both the octave ambiguities of the magnitude spectrum and the loudness ambiguities of the phase spectrum, and results in a robust and precise pitch confidence that excludes not only unlikely pitches, but ambiguous estimates as well.

Voice in the Magnitude Spectrum

In the magnitude spectrum, speech forms a comb pattern with comb teeth at the fundamental frequency and its harmonics. MaPS correlates a number of comb templates at different fundamental frequencies with the signal magnitude spectrum:

[Figure: comb templates at several candidate fundamental frequencies, correlated with the magnitude spectrum of a voiced speech frame]
At about 115 Hz, all the peaks in the template match up with the peaks in the spectrum, and the correlation reaches its maximum. This point corresponds to the true fundamental frequency of this spectrum.

However, any magnitude-spectrum-based measure is susceptible to octave errors, since a comb-like template correlates not only at the true fundamental frequency, but also at its harmonics. The negative valleys between the positive comb teeth already counteract this, but some ambiguity remains.
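
As a rough illustration, the following Python sketch correlates simplified comb templates with a single magnitude spectrum frame. The Gaussian tooth shape, the tooth width of 5 % of f0, the number of harmonics, and the helper names comb_template and magnitude_feature are illustrative assumptions, not the exact templates used by MaPS.

    import numpy as np

    def comb_template(f0, freqs, n_harmonics=10):
        """Comb with positive teeth at the harmonics of f0 and negative
        valleys halfway between them, which penalizes octave errors."""
        width = 0.05 * f0
        template = np.zeros_like(freqs)
        for k in range(1, n_harmonics + 1):
            template += np.exp(-0.5 * ((freqs - k * f0) / width) ** 2)
            template -= np.exp(-0.5 * ((freqs - (k + 0.5) * f0) / width) ** 2)
        return template

    def magnitude_feature(magnitude, freqs, candidates):
        """Correlation of comb templates at each candidate f0 with the
        magnitude spectrum; the maximum marks the most likely f0."""
        return np.array([comb_template(f0, freqs) @ magnitude
                         for f0 in candidates])

    # usage (hypothetical 2048-point spectrum at 48 kHz):
    # freqs = np.fft.rfftfreq(2048, 1 / 48000)
    # candidates = np.arange(50.0, 450.0, 1.0)
    # f0 = candidates[np.argmax(magnitude_feature(magnitude, freqs, candidates))]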

Voice in the Phase Spectrum

For the phase spectrum feature, MaPS uses the instantaneous frequency deviation, which is the difference between the instantaneous frequency spectrum and the frequency f, or IF(f) - f. The instantaneous frequency is the time-derivative of the phase spectrum. For the same speech signal as discussed in the introduction, the instantaneous frequency deviation looks like this:

[Figure: instantaneous frequency deviation of a voiced speech frame]
Thus, speech forms a sawtooth pattern with zeros at the fundamental frequency and its harmonics. MaPS compares a number of sawtooth templates at different fundamental frequencies with the instantaneous frequency deviation of the signal spectrum:

[Figure: sawtooth templates at several candidate fundamental frequencies, compared with the instantaneous frequency deviation]
At about 115 Hz, all the zeros in the template match up with the zeros in the spectrum, and the difference reaches its minimum. This frequency corresponds to the fundamental frequency for this spectrum.

However, a phase-spectrum-based measure cannot distinguish quiet parts of the signal from salient speech, since even very quiet tones can have distinctive phase spectra.
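
The following sketch computes one common estimate of the instantaneous frequency, the phase advance between two analysis windows offset by one sample, and compares its deviation against idealized sawtooth templates. The window length, the mean-absolute-difference score, and the helper name phase_feature are assumptions for illustration; the dissertation describes the exact procedure.

    import numpy as np

    def instantaneous_frequency(frame, frame_delayed, samplerate):
        """IF(f) from the phase advance between two windows offset by one
        sample; a pure tone at frequency f yields exactly IF(f) = f."""
        advance = np.angle(np.fft.rfft(frame_delayed) * np.conj(np.fft.rfft(frame)))
        return advance * samplerate / (2 * np.pi)

    def phase_feature(signal, samplerate, candidates, nfft=2048):
        """Mean absolute difference between IF(f) - f and sawtooth
        templates; the minimum marks the most likely f0."""
        freqs = np.fft.rfftfreq(nfft, 1 / samplerate)
        deviation = instantaneous_frequency(signal[:nfft], signal[1:nfft + 1],
                                            samplerate) - freqs  # IF(f) - f
        scores = []
        for f0 in candidates:
            # descending sawtooth with zeros at the harmonics of f0:
            sawtooth = f0 / 2 - (freqs + f0 / 2) % f0
            scores.append(np.mean(np.abs(deviation - sawtooth)))
        return np.array(scores)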

Combination of Features

MaPS combines the magnitude feature and the phase feature into a Bayesian maximum-a-posteriori fundamental frequency estimator and voicing detector, which we call the pitch confidence. Since the error modes of the two features never overlap, the pitch confidence can resolve their ambiguities and produce a highly reliable and precise measure for fundamental frequency estimation.
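
A minimal sketch of the combination step, assuming both feature vectors have already been computed on a shared candidate grid. The mapping from raw feature values to likelihoods, the optional prior, and the threshold are placeholders here; MaPS derives calibrated probabilities, as described in the dissertation.

    import numpy as np

    def pitch_confidence(magnitude_scores, phase_scores, prior=None):
        """Posterior over f0 candidates from both features, treated as
        independent; its maximum serves as the voicing confidence."""
        like_mag = magnitude_scores - magnitude_scores.min()  # higher is better
        like_phase = phase_scores.max() - phase_scores        # lower is better
        posterior = like_mag * like_phase
        if prior is not None:
            posterior = posterior * prior    # e.g. an f0 prior for speech
        return posterior / (posterior.sum() + 1e-12)

    # usage: a frame counts as voiced only if its confidence is unambiguous
    # posterior = pitch_confidence(mag_scores, phase_scores)
    # if posterior.max() > threshold:        # hypothetical threshold
    #     f0_estimate = candidates[np.argmax(posterior)]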

The results of this estimation on a few speech signals from the PTDB-TUG [1] database and noises from the QUT-NOISE [2] corpus can be seen in the following figure:

[Figure: pitch confidence and estimated fundamental frequency tracks for noisy speech examples]
This figure shows how MaPS accurately recognizes the fundamental frequency track in noisy speech recordings. In general, MaPS prefers rejecting ambiguous frames over giving uncertain estimates, and thus accepts some false negatives in exchange for fewer false positives. For many applications this is advantageous, since it is often more important to obtain reliable estimates than plentiful ones.

Evaluation

The fundamental frequency estimation performance of MaPS was evaluated using a large database of f₀-annotated speech recordings from the PTDB-TUG [1] corpus, and acoustic noise recordings from the QUT-NOISE [2] corpus.

Additionally, the same samples were processed with the well-known fundamental frequency estimation algorithms PEFAC [3], RAPT [4], and YIN [5]. The error measures for this evaluation are defined below; a short sketch of their computation follows the definitions:

Gross Pitch Error
Percentage of frames that both the ground truth and the algorithm classify as voiced, but whose pitch estimate deviates by more than 20 % from the true pitch.
Fine Pitch Error
Mean error of the pitch estimates that are within 20 % of the true pitch.
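
A minimal sketch of both measures, assuming per-frame arrays of true and estimated fundamental frequencies in which 0 marks unvoiced frames; the helper name pitch_errors is hypothetical, and the exact accounting in the evaluation may differ.

    import numpy as np

    def pitch_errors(true_f0, est_f0, threshold=0.2):
        """Gross pitch error (fraction of voiced frames off by more than
        the threshold) and fine pitch error (mean relative error of the
        remaining, correctly estimated frames)."""
        voiced = (true_f0 > 0) & (est_f0 > 0)   # frames both call voiced
        relative = np.abs(est_f0[voiced] - true_f0[voiced]) / true_f0[voiced]
        gross = np.mean(relative > threshold)
        fine = np.mean(relative[relative <= threshold])
        return gross, fine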

The high precision of MaPS is in part due to its comparatively conservative voicing detector, which refuses to guess ambiguous fundamentals, for example during phoneme transitions or noisy fricatives. The pitch confidence is thus a true probability of being correct, and not just a maximum-likelihood measure.

Summary

The dissertation introduces a new algorithm for estimating the fundamental frequency of speech. The algorithm includes a voicing detector with the unique ability to reject ambiguous estimates, which keeps it reliable even at high levels of noise.

Please refer to the full dissertation for greater detail on these topics, and for a complete description of the algorithms, methods, and evaluation measures.

References:

  1. Gregor Pirker, Michael Wohlmayr, Stefan Petrik, and Franz Pernkopf. A Pitch Tracking Corpus with Evaluation on Multipitch Tracking Scenario. Proceedings of Interspeech 2011, 2011.
  2. David B. Dean, Sridha Sridharan, Robert J. Vogt, and Michael W. Mason. The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms. Proceedings of Interspeech 2010, 2010.
  3. Sira Gonzalez and Mike Brookes. PEFAC - A Pitch Estimation Algorithm Robust to High Levels of Noise. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(2):518–530, February 2014.
  4. David Talkin. A robust algorithm for pitch tracking (RAPT). In Speech Coding and Synthesis, pages 495–518, 1995.
  5. Alain de Cheveigné and Hideki Kawahara. YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111(4):1917–1930, 2002.