analyze
There are numerous programs out there for performing acoustic analysis, including several open-source options and R packages such as seewave and warbleR. Soundgen offers tools for visualizing audio (oscillograms, spectrograms, modulation spectra, self-similarity matrices), pitch tracking and formant analysis (both as standard functions and as interactive web apps), audio segmentation, as well as more specialized functions for estimating subjective loudness, roughness, acoustic novelty, surprisal, etc.
Many of the large variety of existing tools for acoustic analysis were designed with a particular type of sound in mind, usually human speech or bird songs. Soundgen was originally developed to work with human nonverbal vocalizations such as screams and laughs. These sounds are much harsher and noisier than ordinary speech, but they closely resemble vocalizations of other mammals. Acoustic analysis with soundgen may therefore be particularly useful for extracting a large number of acoustic predictors from collection of vocalizations of humans and other mammals, for example:
The most relevant soundgen functions for acoustic analysis are:
analyze
: extracts a number of acoustic predictors such
as pitch, harmonics-to-noise ratio, mean frequency, peak frequency,
formants, roughness, intensity and loudness, etc. The output includes a
summary per file, with each variable presented as mean / median / SD /
…, as well as detailed statistics per STFT framepitch_app
: opens a shiny app in a web browser for
manually correcting and exporting the pitch contours extracted by
analyze
formant_app
: another web app, for annotating and
manually correcting formant measurements provided by
phonTools::findformants()segment
: segments a long recording into individual
syllables using spectral signal-noise separationssm
: produces a self-similarity matrix and calculates a
novelty contourgetLoudness
: estimates the subjective loudness per STFT
frame, in sonemodulationSpectrum
: calculates a joint
temporal-spectral modulation spectrum and a measure of acoustic
roughnessgetRMS
: measures the root mean square amplitude of
audio filesspectrogram
, osc
: basic plotsTIP For a tour-de-force overview of alternatives together with highly accessible theoretical explanations of sound characteristics, see Sueur (2018) “Sound analysis and synthesis with R”
This vignette is designed to show how soundgen can be used effectively to perform acoustic analysis. It assumes that the reader is already familiar with key concepts of phonetics and bioacoustics.
To demonstrate acoustic analysis in practice, let’s begin by generating a sound with a known pitch contour. To make pitch tracking less trivial and demonstrate some of its challenges, let’s add some noise, subharmonics, and jitter:
## Loading required package: shinyBS
## Soundgen 2.7.0
## Tips & demos on project's homepage: http://cogsci.se/soundgen.html
## Please cite as: Anikin, A. (2019). Soundgen: an open-source tool for synthesizing nonverbal vocalizations. Behavior Research Methods, 51(2), 778-792.
s1 = soundgen(sylLen = 900, temperature = 0.001,
pitch = list(time = c(0, .3, .8, 1),
value = c(300, 900, 400, 1300)),
noise = c(-40, -20),
subFreq = 100, subDep = 20, jitterDep = 0.5,
plot = TRUE, ylim = c(0, 4))
Mathematically, digital audio is a time series - a long vector of real numbers representing sound pressure levels over time. This time series can be visualized in different domains:
osc()
seewave::spec()
,
seewave::meanspec()
. If we want to see how this frequency
composition varies over time, we produce a spectrogram such as
the plot above. The most common method for doing this is Short-Time
Fourier Transform (STFT, function spectrogram()
), but it is
also possible to use a bank of bandpass filters (function
audSpectrogram()
) or the wavelet transform (e.g., function
wavelets::dwt()
).Long-time average spectrum:
Spectrogram with an oscillogram shown underneath:
spectrogram(
s1, samplingRate = 16000,
osc = 'dB', # plot oscillogram in dB
heights = c(2, 1), # spectro/osc height ratio
noiseReduction = .5, # subtract the spectrum of noisy parts
brightness = -1, # reduce brightness
contrast = .5, # increase contrast
colorTheme = 'heat.colors', # pick color theme
cex.lab = .75, cex.axis = .75, # text size and other base graphics pars
grid = 5, # lines per kHz; to customize, add manually with graphics::grid()
ylim = c(0, 5), # always in kHzmain = 'Spectrogram')
main = 'Spectrogram'
)
Auditory spectrogram of the same sound, obtained with a bank of filters, on the bark frequency scale :
audSpectrogram(s1, samplingRate = 16000,
nFilters = 128, step = 5,
colorTheme = 'terrain.colors',
main = 'Auditory spectrogram')
Time-frequency reassigned spectrogram of the same sound showing f0 contours:
spectrogram(
s1, samplingRate = 16000,
specType = 'reassigned',
windowLength = 20, step = 5, yScale = 'bark',
main = 'Reassigned spectrogram')
Time-frequency reassigned spectrogram of the same sound showing formants:
spectrogram(
s1, samplingRate = 16000,
specType = 'reassigned',
windowLength = 2, step = 1,
main = 'Reassigned spectrogram')
analyze()
) or
spectrogram (function modulationSpectrum()
- more detail in
section 7).modulationSpectrum(s1, samplingRate = 16000, colorTheme = 'seewave',
plot = TRUE, main = 'Modulation spectrum')
And here is a phasegram of the same sound, showing Poincare sections through its phase portrait:
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## Setting timeLag to 0.2 ms
TIP: most soundgen functions for acoustic analysis and
visualization accept as input both a single sound and a path to folder,
in which case all the audio files in this folder are processed in one
go. For example, copy a few recordings into ‘~/Downloads/temp’ and run
spectrogram('~/Downloads/temp', savePlots = '~/Downloads/temp/plots')
–> a separate .png file will be saved for each recording. Likewise,
`analyze(‘path_to_folder’) analyzes all audio files in the target
folder.
analyze
Soundgen has specialized functions for particular types of acoustic
analysis. However, for convenience, most of them are called internally
by analyze()
, which provides basic spectral descriptives,
pitch tracking, formant analysis, as well as estimates of roughness,
subjective loudness, novelty, etc.
At the heart of acoustic analysis with soundgen is the short-time Fourier transform (STFT): we look at one short segment of sound at a time (one STFT frame), analyze its spectrum using Fast Fourier Transform (FFT), and then move on to the next - perhaps overlapping - frame. To analyze a sound with default settings and plot its spectrogram, all we need to specify is its sampling rate (the default in soundgen is 16000 Hz):
## Registered S3 method overwritten by 'phonTools':
## method from
## plot.spectrogram nonlinearTseries
# a1$detailed # many acoustic predictors measured for each STFT frame
median(a1$detailed$pitch, na.rm = TRUE) # our estimate of median pitch
## [1] 764.521
There are several key parameters that control the behavior of STFT
and affect all extracted acoustic variables. The same parameters serve
as arguments to spectrogram
. As a result, you can
immediately see what frame-by-frame input you have fed into the
algorithm for acoustic analysis by visually inspecting the produced
spectrogram. If you can hear f0, but can’t see individual harmonics in
the spectrogram, the pitch tracker probably will not see them, either,
and will therefore fail to detect f0 correctly. The first remedy is thus
to adjust STFT settings, using the spectrogram for visual feedback:
windowLength
: the length of sliding STFT window. Longer
windows (e.g., 40 - 50 ms) improve frequency resolution at the expense
of time resolution, so they are good for detecting relatively low,
slowly changing f0. Shorter windows (e.g., 5 - 10 ms) improve time
resolution at the expense of frequency resolution, so they are good for
visualizing formants or for tracking high-frequency, rapidly changing f0
as in bird chirps or dolphin whistles.step
: the step of sliding STFT window. For example, if
windowLength = 50
and step = 25
, each time we
move the analysis frame, there is a 50% overlap with the previous frame.
Short steps improve time resolution while maintaining relatively high
frequency resolution at the price of processing time. For noisy
recordings, pitch contour may actually be more accurate with relatively
large steps and more smoothing.wn
: the type of windowing function used to taper the
analysis frame during STFT. In practice the windowing function doesn’t
seem to have a major effect on the result, as long as you choose
something reasonable like gaussian, hanning, or bartlett.zp
: zero-padding. You can use a short STFT window and
improve its frequency resolution by padding each frame with zeroes. This
is a computational trick for improving frequency resolution while
maintaining relatively high time resolution.silence
: frames with root mean square (RMS) amplitude
below silence threshold are not analyzed at all. Quiet frames are harder
to analyze, because their signal-to-noise ratio is lower. As a result,
we want to strike a good balance. Setting silence
too low
(close to 0) produces a lot of garbage, as the algorithm tries to
analyze frames that are essentially just background noise without any
signal. Setting silence
too high (close to 1) excludes too
many perfectly good frames, misrepresenting the signal. In soundgen
silence
is dynamically updated: it can never be lower than
specified, but it may be raised to the minimum root mean square
amplitude of all frames, if this minimum is higher than
silence
. This ensures that empty frames are not analyzed in
recordings with unusually high levels of steady background noise (e.g.,
microphone hiss).Apart from pitch tracking, analyze
calculates and
returns several acoustic characteristics from each non-silent STFT
frame:
time
: time of the middle of each frame (ms)duration
: total duration (s)duration_noSilence
: duration from the beginning of the
first non-silent STFT frame to the end of the last non-silent STFT
frame, s (NB: depends strongly on windowLength
and
silence
settings)amEnvFreq
,amEnvDep
: frequency (Hz) and
depth (0 to 1) of amplitude modulation estimated from a smoothed
amplitude envelopeamMsFreq
, amMsPurity
: the frequency (Hz)
and purity (dB) of amplitude modulation estimated via
modulationSpectrum
, the arguments to which are passed in
roughness = list()
)ampl
: root mean square of amplitude per frame,
calculated as sqrt(mean(frame ^ 2))
dom
: lowest dominant frequency band (Hz) (see “Pitch
tracking methods / Dominant frequency”)entropy
: Weiner entropy of the spectrum of the current
frame: 0 = pure tone or tonal sound with all energy in harmonics, 1 =
white noiseentropySh
: Normalized Shannon entropy of the spectrum
of the current frame: 0 = pure tone, 1 = white noisef1_freq
, f1_width
, …: the frequency and
bandwidth of the first nFormants
formants per STFT frame,
as calculated by phonTools::findformants
with default
settingsflux
: feature-based flux, the rate of change in
acoustic features such as pitch, HNR, etc.; unitless (0 = none, 1 =
max); an epoch
is an audio segment between two peaks of
flux that exceed a threshold of flux = list(thres = ...)
(listed in output$detailed only)fmDep
, fmFreq
, fmPurity
: the
depth (semitones), frequency (Hz), and purity (0 to 1) of frequency
modulation (FM) such as vibrato or jitterharmEnergy
: the amount of energy in upper harmonics,
namely the ratio of total spectral energy above 1.25 x f0 to the total
spectral energy below 1.25 x f0 (dB)harmHeight
: how high harmonics reach in the spectrum,
based on the best guess at pitch (or the manually provided pitch
values): see soundgen:::harmHeight
for detailsHNR
: harmonics-to-noise ratio (dB), a measure of
harmonicity returned by soundgen:::getPitchAutocor()
(see
“Pitch tracking methods / Autocorrelation”). If HNR = 0 dB, there is as
much energy in harmonics as in noiseloudness
: subjective loudness in sone, assuming a
certain sound pressure level (takes into account the energy in different
frequency bands as well as the sensitivity of human ears to different
frequencies); see getLoudness()
and the section on Loudness
below for detailsnovelty
: spectral novelty - a measure of how variable
the spectrum is on a particular time scale, as estimated by
ssm()
peakFreq
: the frequency with maximum spectral energy
(Hz)quartile25
, quartile50
,
quartile75
: the 25th, 50th, and 75th quantiles of the
spectrum below cutFreq
(Hz) for VOICED framesroughness
: the amount of spectrotemporal modulation in
the “roughness” zone of frequencies (estimated by
modulationSpectrum
, the arguments to which are fpassed in
roughness = list()
)specCentroid
: the center of gravity of the frame’s
spectrum below cutFreq
, first spectral moment (Hz)specSlope
: the slope of linear regression fit to the
spectrum below cutFreq
voiced
: is the current STFT frame voiced? TRUE /
FALSETIP: many of these descriptors are calculated separately for the entire sound and only for the voiced frames, resulting in extra output variables like “amplVoiced”. Output “voiced” gives the proportion of voiced frames out of the total sound duration, while “voiced_noSilence” corresponds to the proportion of voiced frames out of non-silent frames (ie with RMS amplitude above the “silence” threshold), which can exceed 1 if pitch is interpolated over putatively silent frames.
The function soundgen::analyze
returns a few spectral
descriptives that make sense for nonverbal vocalizations, but additional
predictors may be useful for other applications (bird songs,
non-biological sounds, etc.). Many popular spectral descriptors are
mathematically trivial to derive - all you need is the spectrum for each
STFT frame, or perhaps even the average spectrum of the entire sound.
Here is how you can get these spectra.
For the average spectrum of an entire sound, go no further than
seewave::spec
or seewave::meanspec
:
spec = seewave::spec(s1, f = 16000, plot = FALSE) # FFT of the entire sound
avSpec = seewave::meanspec(s1, f = 16000, plot = FALSE) # STFT followed by averaging
# either way, you get a dataframe with two columns: frequencies and their strength
head(avSpec)
## x y
## [1,] 0.00000 0.0003152356
## [2,] 0.03125 0.0012068170
## [3,] 0.06250 0.0026919476
## [4,] 0.09375 0.0038421574
## [5,] 0.12500 0.0045391107
## [6,] 0.15625 0.0057484158
If you are interested in how the spectrum changes over time, extract
frame-by-frame spectra - for example, with
spectrogram(..., output = 'original')
:
spgm = spectrogram(s1, samplingRate = 16000, output = 'original', plot = FALSE)
# rownames give you frequencies in KHz, colnames are time stamps in ms
str(spgm)
## num [1:400, 1:47] 3.44e-04 2.67e-04 1.33e-04 8.16e-05 5.45e-05 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:400] "0" "0.02" "0.04" "0.06" ...
## ..$ : chr [1:47] "0" "25" "50" "75" ...
Let’s say you are working with frame-by-frame spectra and want to calculate skewness, the 66.6th percentile, and the ratio of energy above/below 500 Hz. Before you go hunting for a piece of software that returns exactly those descriptors, consider this. Once you have normalized the spectrum to add up to 1, it basically becomes a probability density function (pdf), so you can summarize it in the same way as you would any other distribution of a random variable. Look up the formulas you need and just do the raw math:
# Transform spectrum to pdf (all columns should sum to 1):
spgm_norm = apply(spgm, 2, function(x) x / sum(x))
# Set up a dataframe to store the output
out = data.frame(skew = rep(NA, ncol(spgm)),
quantile66 = NA,
ratio500 = NA)
# Process each STFT frame
for (i in 1:ncol(spgm_norm)) {
# Absolute spectrum for this frame
df = data.frame(
freq = as.numeric(rownames(spgm_norm)), # frequency (kHz)
d = spgm_norm[, i] # density
)
# plot(df, type = 'l')
# Skewness (see https://en.wikipedia.org/wiki/Central_moment)
m = sum(df$freq * df$d) # spectral centroid, kHz
out$skew[i] = sum((df$freq - m)^3 * df$d)
# 66.6th percentile (2/3 of density below this frequency)
out$quantile66[i] = df$freq[min(which(cumsum(df$d) >= 2/3))] # in kHz
# Energy above/below 500 Hz
out$ratio500[i] = sum(df$d[df$freq >= .5]) / sum(df$d[df$freq < .5])
}
## Warning in min(which(cumsum(df$d) >= 2/3)): no non-missing arguments to min;
## returning Inf
## skew quantile66 ratio500
## Min. : 0.01562 Min. :0.020 Min. : 0.00146
## 1st Qu.: 2.29480 1st Qu.:1.340 1st Qu.: 19.08117
## Median : 2.58338 Median :1.400 Median : 37.32690
## Mean : 2.80528 Mean :1.183 Mean : 50.95397
## 3rd Qu.: 2.75821 3rd Qu.:1.420 3rd Qu.: 90.95906
## Max. :13.32936 Max. :1.500 Max. :148.25746
## NA's :1 NA's :1 NA's :1
If you need to do this analysis repeatedly, just wrap the code into
your own function that takes a wav file as input and returns all these
spectral descriptives. You can also save the actual spectra of different
sound files and add them up to obtain an average spectrum across
multiple sound files, work with cochleograms instead of raw spectra
(check out tuneR::melfcc
), etc. Be your own boss!
Fundamental frequency (f0) or its perceptual equivalent - pitch - is both highly salient to listeners and notoriously difficult to measure accurately. The approach followed by soundgen’s pitch tracker is to use several different estimates of f0, each of which is better suited to certain types of sounds. You can use any pitch tracker individually, but their output is also automatically integrated and postprocessed so as to generate the best overall estimate of frame-by-frame pitch. Autocorrelation is needed to calculate the harmonics-to-noise ratio (HNR) of an STFT frame, and then this information is used to adjust the expectations of the cepstral and spectral algorithms. In particular, if autocorrelation suggests that the pitch is high, confidence in cepstral estimates is attenuated; and if autocorrelation suggests that HNR is low, thresholds for spectral peak detection are raised, making spectral pitch estimates more conservative.
The plot below shows a spectrogram of the sound with overlaid pitch
candidates generated by five different methods (listed in
pitchMethods
), with a very vague prior - that is, with no
specific expectations regarding the true range of pitch values. The size
of each point shows the certainty of estimation: smaller points are
calculated with lower certainty and have less weight when all candidates
are integrated into the final pitch contour (blue line).
a = analyze(s1, samplingRate = 16000, priorSD = 24,
pitchMethods = c('autocor', 'cep', 'dom', 'spec', 'hps', 'zc'),
plot = TRUE, ylim = c(0, 4))
Different pitch tracking methods have their own pros and cons. Cepstrum is helpful for speech but pretty useless for high-frequency whistles or screams, harmonic product spectrum (hps) is easily misled by subharmonics (as in this example), lowest dominant frequency band (dom) can’t handle low-frequency wind noise or missing fundamental, etc. The default is to use “dom” and “autocor” as the most generally applicable, but you can experiment with all methods and check which ones perform best with the specific type of audio that you are analyzing. Each method can also be fine-tuned (see below), but first it is worth considering the general pitch-related settings.
analyze
has a few arguments that affect all methods of
pitch tracking:
entropyThres
: all non-silent frames are analyzed to
produce basic spectral descriptives. However, pitch tracking is both
computationally costly and can be misleading if applied to obviously
voiceless frames. To define what an “obviously voiceless” frame is, we
set some cutoff value of Weiner entropy, above which we don’t want to
even try pitch tracking. To disable this feature and track pitch in all
non-silent frames, set entropyThres = NULL
.pitchFloor
, pitchCeiling
: absolute
thresholds for pitch candidates. No values outside these bounds will be
considered.priorMean
and priorSD
specify the mean and
sd of prior distribution describing our prior knowledge about the most
likely pitch values. If priorAdapt = TRUE
, this
user-defined prior is only applied to the first pass of optimizing pitch
contours, after which the first-pass contour becomes the new prior for
the second pass. Both of these priors work by scaling the certainties
associated with particular pitch candidates. If you are working with a
single type of sound, such as speech by a male speaker or cricket
sounds, specifying a strong prior can greatly improve the quality of the
resulting pitch contour. When batch-processing a large number of sounds,
the recommended approach is to set a vague, but still mildly informative
prior.Note that priorMean
is specified in Hz and
priorSD
in semitones. For example, if we expect f0 values
of about 300 Hz plus-minus half an octave (6 semitones), a prior can be
defined as priorMean = 300, priorSD = 6
. For convenience,
the prior can be plotted with getPrior
:
par(mfrow = c(1, 2))
# default prior in soundgen
getPrior(priorMean = 300, priorSD = 6)
# narrow peak at 2 kHz
getPrior(priorMean = 2000, priorSD = 1)
par(mfrow = c(1, 1))
TIP The final pitch contour can still pass through low-certainty
candidates, so the prior is a soft alternative (or addition) to the
inflexible bounds of pitchFloor
and
pitchCeiling
But the prior has a major impact on pitch
tracking, so it is by default shown in every plot
nCands
: maximum number of pitch candidates to use per
method. This doesn’t affects dom
pitch candidates (only a
single value of the lowest dominant frequency is used regardless).minVoicedCands
: minimum number of pitch candidates that
have to be defined to consider a frame voiced. It defaults to ‘autom’,
which means 2 if dom
is among the candidates and 1
otherwise. The reason is that dom
is usually defined, even
if the frame is clearly voiceless, so we want another pitch candidate in
addition to dom
before we classify the frame as
voiced.Having looked at the general settings, it is time to consider the
theoretical principles behind each pitch tracking method, together with
arguments to analyze
that can be used to tweak each
one.
Time domain: pitch by autocorrelation, pitchAutocor
.
This is an R implementation of the algorithm used in the popular
open-source program PRAAT (Boersma, 1993). The basic idea is that a
harmonic signal correlates with itself most strongly at a delay equal to
the period of its fundamental frequency (f0). Peaks in the
autocorrelation function are thus treated as potential pitch candidates.
The main trick is to choose an appropriate windowing function and adjust
for its own autocorrelation. Compared to other methods implemented in
soundgen, pitch estimates based on autocorrelation appear to be
particularly accurate for relatively high values of f0. The settings
that control pitchAutocor
are:
autocorThres
: voicing threshold, defaults to 0.7. This
means that peaks in the autocorrelation function have to be at least 0.7
in height (1 = perfect autocorrelation). A lower threshold produces more
false positives (f0 is detected in voiceless, noisy frames), whereas a
higher threshold produces more accurate values f0 at the expense of
failing to detect f0 in noisier frames.autocorSmooth
: the width of smoothing interval (in
bins) for finding peaks in the autocorrelation function. If left NULL,
it defaults to 7 for sampling rate 44100 and smaller odd numbers for
lower sampling rate.autocorUpsample
: upsamples the autocorrelation function
in high frequencies in order to improve the resolution of analysis.autocorBestPeak
: amplitude of the lowest best candidate
relative to the absolute maximum of the autocorrelation function.To use only autocorrelation pitch tracking, but with lower-than-default voicing threshold and more candidates, we can do something like this (prior is disabled so as not to influence the certainties of different pitch candidates):
Frequency domain: the lowest dominant frequency band,
dom
.
If the sound is harmonic and relatively noise-free, the spectrum of a frame typically has little energy below f0. It is therefore likely that the first spectral peak is in fact f0, and all we have to do is choose a reasonable threshold to define what counts as a peak. Naturally, there are cases of missing f0 and misleading low-frequency noises. Nevertheless, this simple estimate is often surprisingly accurate, and it may be our best shot when the vocal cords are vibrating in a chaotic fashion (deterministic chaos). For example, sounds such as roars lack clear harmonics but are perceived as voiced, and the lowest dominant frequency band (which may also be the first or second formant) often corresponds to perceived pitch.
The settings that control dom
are:
domThres
(defaults to 0.1, range 0 to 1): to find the
lowest dominant frequency band, we look for the lowest frequency with
amplitude at least domThres
. This key setting has to be
high enough to exclude accidental low-frequency noises, but low enough
not to miss f0. As a result, the optimal level depends a lot on the type
of sound analyzed and recording conditions.domSmooth
(defaults to 220 Hz): the width of smoothing
interval (Hz) for finding the lowest spectral peak. The idea is that we
are less likely to hit upon some accidental spectral noise and find the
lowest harmonic (or the lowest spectral band with significant power) if
we apply some smoothing to the spectrum of an STFT frame, in this case a
moving median.For the sound we are trying to analyze, we can increase
domSmooth
and/or raise domThres
to ignore the
subharmonics and trace the true pitch contour:
Frequency domain: pitch by cepstrum, pitchCep
.
Cepstrum is the inverse FFT (or the real part of FFT) of log-spectrum. It may be a bit challenging to wrap one’s head around, but the main idea is quite simple: just as FFT is a way to find periodicity in a signal, cepstrum is a way to find periodicity in the spectrum. In other words, if the spectrum contains regularly spaced harmonics, its FFT will contain a peak corresponding to this regularity. And since the distance between harmonics equals the fundamental frequency, this cepstral peak gives us f0. Cepstrum is not very useful when f0 is so high that the spectrum contains only a few harmonics, so soundgen automatically discounts the contribution of high-frequency cepstral estimates.
In addition to pitch tracking, cepstrum is useful for obtaining a
measure of voice quality, especially breathiness, called Cepstral Peak
Prominence or CPP
. This is the ratio of the highest
cepstral peak (presumably corresponding to f0) to the trend line over
cepstrum - basically, it shows whether cepstrum has a clear peak.
CPP
is measured in dB and can be used as a measure of voice
quality along with HNR
, harmHeight
, and
harmEnergy
. CPP
is only returned if cepstral
pitch tracking is performed - that is, if cep
is among
pitchMethods
. The trend line can be calculated in many
ways; soundgen fits a simple least squares straight line in the
frequency-cepstrum space (hyperbola in the quefrency-cepstrum space).
Parabolic interpolation of cepstral peaks is used to improve frequency
resolution.
The settings that control pitchCep
are:
cepThres
: voicing threshold (defaults to 0.7).cepZp
(defaults to 0): zero-padding of the spectrum
used for cepstral pitch detection (points). Zero-padding may improve the
precision of cepstral pitch detection, but it also slows down the
algorithm.Cepstral pitch tracking works best in low-pitched sounds with a lot of visible harmonics - for example, in speech. It doesn’t perform well when there are strong subharmonics or when f0 goes too high (as below).
Frequency domain: common factor of harmonics or ratios of harmonics,
pitchSpec
.
All harmonics are multiples of the fundamental frequency. If we can
detect several harmonics, we can therefore find f0 as the highest common
ratio of harmonics (method = “commonFactor”). Alternatively, we can
exploit the fact that the ratio of two neighboring harmonics is
predictably related to their rank relative to f0 (method = “BaNa”). For
example, (3 * f0) / (2 * f0) = 1.5
, so if we find two
harmonics in the spectrum that have a ratio of exactly 1.5, it is likely
that f0 is half the lower one (Ba et al., 2012).
The settings that control pitchSpec
are:
specMethod
: “commonFactor” = highest common factor of
putative harmonics, “BaNa” = ratio of putative harmonicsspecRatios
: for method = “commonFactor”, the number of
harmonics AND integer fractions to considerspecMerge
(0 to Inf semitones, defaults to 1): pitch
candidates within specMerge
semitones are merged with
boosted certainty. Since the idea behind the spectral pitch tracker is
that multiple harmonic ratios should converge on the same f0, we have to
decide what counts as “the same” f0specThres
(0 to 1, defaults to 0.1): voicing threshold
for pitch candidates suggested by the spectral method. The scale is 0 to
1, as usual, but it is the result of a rather arbitrary normalization.
The “strength” of spectral pitch candidates is basically calculated as a
sigmoid function of the number of common factors or harmonic ratios that
imply (nearly) the same f0 value. Setting specThres
too low
may produce garbage, while setting it too high makes the spectral method
excessively conservative.specPeak
(0 to 1, defaults to 0.25),
specHNRslope
(0 to Inf, defaults to 0.8): when looking for
putative harmonics in the spectrum, the threshold for peak detection is
calculated as specPeak * (1 - HNR * specHNRslope)
. For
noisy sounds the threshold is high to avoid false sumharmonics, while
for tonal sounds it is low to catch weak harmonics. If HNR
(harmonics-to-noise ratio) is not known, say if we have disabled the
autocorrelation pitch tracker or if it returns NA for a frame, then the
threshold defaults to simply specPeak
. This key parameter
strongly affects how many pitch candidates the spectral method
suggests.specSmooth
(0 to Inf, defaults to 150 Hz): the width of
window for detecting peaks in the spectrum, in Hz. You may want to
adjust it if you are working with sounds with a specific f0 range,
especially if it is unusually high or low compared to human sounds.specSinglePeakCert
: (0 to 1, defaults to 0.4) if
apitchSpec
candidate is calculated based on a single
harmonic ratio (as opposed to several ratios converging on the same
candidate), its weight (certainty) is taken to be
specSinglePeakCert
. This mainly has implications for how
much we trust spectral vs. other pitch estimates.Frequency domain: pitchHps
.
This is a simple spectral method based on downsampling the spectrum several times and multiplying the resulting spectra. This emphasizes the lowest harmonic present in the signal, which is hopefully f0. By definition, this method is easily misled by subharmonics (additional harmonics between the main harmonics of f0), but it can be useful in situations when the subharmonic frequency is actually of interest.
The settings that control pitchHps
are:
hpsThres
(0 to 1, defaults to 0.3): voicing threshold
for pitch candidates suggested by hps
methodhpsNum
(defaults to 5): the number of times the
spectrum is downsampled. Increasing the number improves sensitivity in
the sense that the method converges on the lowest harmonic, which is
generally (but not always) desirablehpsNorm
: the amount of inflation of hps pitch certainty
(0 = none). Because the downsampled spectra are multiplied, the height
of the resulting peak tends to be rather low; hpsNorm
(defaults to 2, 0 = none) compensates for it, otherwise this method
would have very low confidence compared to other pitch trackershpsPenalty
(defaults to 2, 0 = none): the amount of
penalizing hps candidates in low frequencies (0 = none). HPS doesn’t
perform very well at low frequencies, so the certainty in low-frequency
candidates is attenuatedTime domain: pitchZc
.
This is a very simple and fast time-domain method based on detecting
zero crossings after bandpass-filtering the audio. This method only
works if f0 is strong enough (e.g., well below the first formant in
speech) and can be suitable for rapidly analyzing long speech
recordings. Note that a relatively low pitchCeiling
must be
specified, essentially just above the upper values of f0.
The settings that control pitchZc
are:
zcThres
(0 to 1, defaults to 0.1): pitch candidates
with certainty below this value are treated as noise and set to NAzcWin
(odd integer > 3): certainty in pitch
candidates depends on how stable pitch is over zcWin glottal cyclesa = analyze(s1, samplingRate = 16000,
plot = TRUE, ylim = c(0, 2), osc = FALSE,
priorMean = NA, priorAdapt = FALSE,
pitchMethods = 'zc',
pitchCeiling = 600, # higher pitch values blend with F1
pitchZc = list(zcWin = 3, zcThres = 0))
TIP Most pitch tracking methods can be tweaked to produce
reasonable results for a particular sound (read: to agree with human
intuition). The real trick is to find settings that are accurate on
average, across a wide range of sounds and recording conditions. The
default settings in analyze
are the result of optimization
against manually verified pitch measurements in human non-linguistic
vocalizations. For other types of sounds, you may need to perform your
own manual tweaking and/or formal optimization.
The perception of pitch does not depend on the presence of the lowest
partial corresponding to the actual fundamental frequency, but this
“missing fundamental” problem may affect pitch tracking. This situation
can arise if a recording is high-pass filtered or if there is a lot of
low-frequency noise such as wind in the microphone. By definition, the
“dom” estimate of pitch cannot function when the lowest partial is
missing (it works by literally tracking the lowest dominant frequency
band), and this is also a problem for pitchMethods = "zc"
(zero crossings). Other pitch tracking methods have no problem dealing
with a missing fundamental frequency because they take the entire
spectrum into account, not only the lowest partial. For
pitchMethods = "spec"
, however, remember to switch to BaNa
algorithm (see example below).
A sound with implicit pitch = 300 Hz based on partials at 600 Hz, 900 Hz, and 1200 Hz (f0 at 300 Hz is missing):
s_withoutf0 = soundgen(sylLen = 600, pitch = 300,
rolloffExact = c(0, 1, 1, 1), formants = NULL, lipRad = 0)
# playme(s_withoutf0) # you can clearly hear the difference
seewave::meanspec(s_withoutf0, f = 16000, dB = 'max0', flim = c(0, 3))
No problem with pitch tracking (except for the dom
and
zc
methods):
a_withoutf0 = analyze(s_withoutf0, 16000, priorMean = NA, priorAdapt = FALSE,
pitchMethods = c('autocor', 'dom', 'cep', 'spec', 'hps', 'zc'),
pitchSpec = list(specMethod = 'BaNa'),
plot = TRUE, ylim = c(0, 2), dynamicRange = 60, osc = FALSE)
The implications are as follows: if the lower part of your signal is degraded (wind noise, an engine running, somebody else talking in the background, etc.), you can apply a high-pass filter to remove low frequencies. Even if you filter out the first partial by doing so, pitch tracking will still be possible.
BUT: do NOT use the “dom” or “zc” pitch estimates if f0 is either filtered out or invisible because of noise!
Pitch postprocessing in soundgen includes a whole battery of distinct
operations through which the pitch candidates generated by one or more
tracking methods are integrated into the final pitch contour. We will
look at them one by one, in the order in which they are performed in
analyze
. But first of all, here is how to disable them
all:
a = analyze(
s1,
samplingRate = 16000, plot = TRUE, ylim = c(0, 4), priorMean = NA,
shortestSyl = 0, # any length of voiced fragments
interpol = NULL, # don't interpolate missing f0 values
pathfinding = 'none', # don't look for optimal path through candidates
snakeStep = 0, # don't run the snake
smooth = 0 # don't run median smoothing
)
When the sound is not too tricky and enough pitch candidates are available, postprocessing actually makes little difference. In terms of the accuracy of median estimate of f0, you are likely to get a good result even with postprocessing is completely disabled. However, if you are interested in the actual intonation contours, not just the global average, postprocessing can help a lot.
The first stage of postprocessing is to divide the sound into
continuous voiced fragments or “syllables” based on their minimum
expected length (shortestSyl
) and spacing
(shortestPause
). Voiced fragments shorter than
shortestSyl
get discarded (assumed to be unvoiced). If two
voiced fragments are separated by less than shortestPause
,
they are merged.
To make sure we have at least one pitch candidate for every frame in
the supposedly continuous voiced fragment, we interpolate to fill in any
missing values. The same algorithm also adds new pitch candidates with
certainty interpol$cert
if a frame has no pitch candidates
within interpol$tol
of the median of the “center of
gravity” estimate over plus-minus interpol$win
frames. The
frequency of new candidates is equal to this median. For example, if
interpol$tol = 0.05
, new candidates are calculated if there
are none within 0.95 to 1.05 times the median over the interpolation
window. Interpolation thus has the effect of smoothing a pitch contour,
avoiding discontinuities or sudden jumps. You can also enable
interpolation to fill in unvoiced frames, but without adding new pitch
candidates in voiced frames. To do so, set
interpol$tol = Inf
.
Here is an example (interpolated segments are shown with a dotted line)
a1 = analyze(s1, samplingRate = 16000, step = 10, priorMean = NA,
pitchMethods = 'cep', pitchCep = list(cepThres = .8),
interpol = NULL, # disable interpolation
pathfinding = 'none',
snakeStep = 0, smooth = 0, plot = FALSE)
a2 = analyze(s1, samplingRate = 16000, step = 10, priorMean = NA,
pitchMethods = 'cep', pitchCep = list(cepThres = .8),
interpol = list(win = 100, tol = .1),
pathfinding = 'none',
snakeStep = 0, smooth = 0, plot = FALSE)
plot(a1$detailed$time, a1$detailed$pitch, type = 'l', main = 'Interpolation',
xlab = 'Time, ms', ylab = 'Pitch, Hz')
points(a2$detailed$time, a2$detailed$pitch, type = 'l',
col = 'red', lty = 3)
At the next stage, an optimal pitch contour is found through pitch
candidates within each voiced syllable. There may be multiple pitch
candidates for each frame, and each candidate comes with a particular
degree of certainty. We want to find a good path through these
candidates - that is, a pitch contour that both passes close to the
strongest candidates and minimizes pitch jumps, producing a relatively
smooth contour. The default is to use a modification of the Viterbi
algorithm - see ?soundgen:::pathfinder()
for details. A
complementary or alternative approach is to run a “snake” algorithm,
which wiggles the pitch contour under a weighted combination of elastic
forces and the pull of high-certainty pitch candidates. Note that the
snake can add new values different from existing candidates, and it is
disabled by default.
The final postprocessing stage is median smoothing. It is
conceptually similar to interpolation, except that by now there is only
a single f0 value left per frame, so we can forget about the multiple
candidates and their certainties. If smooth
is a positive
number, contours of the variables in smoothVars
are
smoothed using a customized version of median smoothing. This modifies
only the values that deviate considerably from the moving median and
preserves all other values (so this is a bit different from applying a
moving median or kernel smoothing). smooth
controls both
the tolerated deviance and the size of the window for calculating a
moving median. smooth = 1
(the default) corresponds to a
window of ~100 ms and tolerated deviation of ~4 semitones. Smoothing can
be applied to any measured value, not only the final pitch contour; the
default is smoothVars = c('pitch', 'dom')
. To turn off
median smoothing, set smooth = 0
or
smoothVars = NULL
.
a1 = analyze(s1, samplingRate = 16000, priorMean = NA,
pitchMethods = 'cep', pitchCep = list(cepThres = .7), nCands = 3,
pathfinding = 'none', snakeStep = 0, interpolTol = Inf,
smooth = 0, # no smoothing
summaryFun = NULL, plot = FALSE)
a1$detailed$medianSmooth = soundgen:::medianSmoother(
data.frame(pitch = a1$detailed$pitch), smoothing_ww = 3, smoothingThres = 1
)$pitch
plot(pitch ~ time, data = a1$detailed, type = 'l', main = 'Median smoothing',
xlab = 'Time, ms', ylab = 'Pitch, Hz')
points(medianSmooth ~ time, data = a1$detailed, type = 'l',
col = 'red', lty = 3, lwd = 2)
# dotted line = with median smoothing
TIP Pathfinding is the only postprocessing module that does not deviate from pitch candidates actually returned by pitch tracking algorithms. Interpolation, snake, and median smoothing produce new pitch values per frame, which may be quite different from any actual candidates
Naturally, the extracted pitch contours can then be smoothed outside
analyze()
using any algorithm you like, such as GAM,
low-pass filter, etc. For users of PRAAT, there is a function that
implements its smoothing algorithm:
a1$detailed$ps = pitchSmoothPraat(
a1$detailed$pitch,
bandwidth = 5, # cutoff above 5 Hz (5 pitch values per s)
samplingRate = 1000/25) # 1/step = number of pitch values per s
plot(pitch ~ time, data = a1$detailed, type = 'l', main = 'Low-pass smoothing',
xlab = 'Time, ms', ylab = 'Pitch, Hz')
points(ps ~ time, data = a1$detailed, type = 'l', col = 'red', lty = 3, lwd = 2)
# dotted line = after low-pass smoothing
When analyzing a sound, and even when batch-processing an entire
folder, it is often helpful to plot both the final pitch contour -
perhaps overlaid on a spectrogram - and individual pitch candidates. You
can easily do so from analyze
, as in all the examples
above. The default plotting parameters can also be customized, for
example:
a = analyze(
s1, samplingRate = 16000, plot = TRUE, priorMean = NA,
# options for spectrogram(): see ?spectrogram
xlab = 'Time (ms)',
main = 'My spectrogram',
dynamicRange = 90,
contrast = .5,
brightness = -0.3,
colorTheme = 'seewave',
ylim = c(0, 4),
# + other pars passed to soundgen:::filled.contour.mod()
# options for oscillogram
osc = 'dB',
heights = c(3, 1),
# options for plotting the final pitch contour (line)
pitchPlot = list(
col = 'black',
lwd = 5,
lty = 3
# + other pars passed to base::lines()
),
# options for plotting pitch candidates (points)
pitchAutocor = list(col = rgb(0, 1, 0, .5), pch = 16, cex = 2),
pitchDom = list(col = 'red', cex = 4)
)
You can also suppress plotting any of these three components: the
spectrogram, the final pitch contour, or individual pitch candidates. To
plot pitch candidates but not the spectrogram, set
brightness = 1
. To suppress plotting the pitch contour, set
pitchPlot = list(lwd = 0)
. To suppress plotting pitch
candidates, set their cex to 0, eg
pitchAutocor = list(cex = 0)
.
To save the plots, specify a valid path (’’ = the same folder as where input audio files live). This is mostly useful when you are batch-processing an entire folder full of audio files and want to save the plots for manual checking of extracted plot contours. Note that the file ‘00_clickablePlots_analyze.html’ offers an easy way to inspect all the plots together in a web browser, and clicking each plot plays the corresponding audio. This works even if the plots are not saved in the same folder as audio.
TIP Processing time varies a lot depending on the exact settings
and input, but expect up to a few seconds of machine time per second of
audio. The surest way to speed things up is to reduce
samplingRate
(but not too low, otherwise the resolution of
pitch estimation begins to suffer) and to use a relatively short
windowLength
with little overlap
. Also avoid
pathfinding = 'slow'
(other types of postprocessing have
very little effect on processing time)
Just like soundgen_app()
is an interactive shiny app for
sound synthesis, pitch_app()
provides a wrapper around
analyze()
with an option to step in and manually correct
the intonation contour. The app runs in a browser; you can load one or
more wav or mp3 files, adjust the settings, verify and correct the pitch
contours, and save the output as a .csv file. Please see
?pitch_app
for more information.
TIP pitch_app()
is meant for extracting pitch
contours. Although its output is similar to that of
analyze()
, it doesn’t offer options for measuring formants,
loudness, etc. Suggested workflow: extract manually corrected pitch
contours first, and then run
analyze(..., pitchManual = output.csv)
, where output.csv is
the dataset produced by pitch_app
By default, analyze()
and pitch_app()
summarize contours of estimated pitch and other variables using simple
summary functions such as mean, median, SD, etc. These summaries can be
extended with user-defined summaryFun
, but then the same
descriptives are applied to every single output of
analyze()
, potentially resulting in very large output
matrices with a lot of descriptives no-one is ever going to need. To
avoid this clutter, it can be convenient to summarize one variable at a
time - normally pitch - and extract a much larger number of summary
measures of the average value, its variability, slope, inflections, etc.
This task can be accomplished with pitchDescriptives()
. For
example:
pd = pitchDescriptives(
a1$detailed$pitch, step = a1$detailed$time[2] - a1$detailed$time[1],
timeUnit = 'ms',
smoothBW = c(10, 1), # original + smoothed at 10 Hz and 1 Hz
inflThres = .2, # min amplitude that counts as inflections
plot = TRUE
)
## [1] "Completed 1 iterations in 7 ms."
## [1] "file" "duration" "durDefined"
## [4] "propDefined" "start_10" "start_oct_10"
## [7] "end_10" "end_oct_10" "mean_10"
## [10] "mean_oct_10" "median_10" "median_oct_10"
## [13] "min_10" "min_oct_10" "time_min_10"
## [16] "max_10" "max_oct_10" "time_max_10"
## [19] "range_10" "range_sem_10" "sd_10"
## [22] "sd_sem_10" "CV_10" "meanSlope_10"
## [25] "meanSlope_sem_10" "meanAbsSlope_10" "meanAbsSlope_sem_10"
## [28] "maxAbsSlope_10" "maxAbsSlope_sem_10" "inflex_10"
## [31] "start_1" "start_oct_1" "end_1"
## [34] "end_oct_1" "mean_1" "mean_oct_1"
## [37] "median_1" "median_oct_1" "min_1"
## [40] "min_oct_1" "time_min_1" "max_1"
## [43] "max_oct_1" "time_max_1" "range_1"
## [46] "range_sem_1" "sd_1" "sd_sem_1"
## [49] "CV_1" "meanSlope_1" "meanSlope_sem_1"
## [52] "meanAbsSlope_1" "meanAbsSlope_sem_1" "maxAbsSlope_1"
## [55] "maxAbsSlope_sem_1" "inflex_1"
Formants are the natural resonances of the vocal tract - that is,
frequency bands that are amplified as the sound passes through the vocal
tract. In speech, formants are responsible for differences between vowel
sounds such as /a/, /i/, /u/, etc., but formants are also present in
many animal calls. The most common way to estimate formant frequencies
and bandwidth is by using Linear Predictive Coding (LPC), which is
implemented in phonTools::findformants()
. Because formant
measurement is quite error-prone, however, any serious analysis should
include manual verification of LPC measurements. Soundgen offers a web
app for this task: formant_app()
. Just like
pitch_app()
, it runs in the system-default web browser; you
can add annotations to audio files and manually verify and, if needed,
correct the formant measurements for each annotated segment.
TIP: linear predictive coding (LPC) is implemented in
phonTools::lpc
and phonTools::findformants
. As
a convenience, soundgen::analyze
shows the output of
phonTools::findformants
, but for serious formant analysis
you might want to use formant_app or another interactive program like
PRAAT and check everything manually. Also note that pitch_app and
formant_app are meant to work in the Firefox browser. A bug in Chrome
interferes with correct audio playback; Safari and IE may or may not
work.
In addition to measuring spectral characteristics and fundamental frequency, it is often important to analyze the temporal structure of a sound. In particular, it is often helpful to divide a sound into separate “syllables” - continuous acoustic fragments separated by what we consider to be “silence”. If this “silence” was not full of background noise and breathing, the task would be trivial: we could simply define syllables as continuous segments with ampiltude above some threshold. As it is, with non-studio material it is problematic to find a single threshold that would accurately coincide with the beginning and end of syllables in different sounds (presuming that we are interested in batch processing multiple sounds without manually adjusting the settings for each sound). Soundgen attempts to solve this problem by using an adaptive algorithm sensitive to the nature of background noise, signal-to-noise ratio, and acoustic novelty to achieve flexible and fast syllable segmentation.
Sometimes we are more interested in the rate of syllables per second and in their regularity, rather than the absolute duration of each syllable. In this case it makes more sense to look for bursts of acoustic energy or sudden changes relative to background noise. The spacing between bursts - the interburst interval - is a perceptually salient characteristic of a bout of vocalizing.
Soundgen package contains a function called segment
,
which uses both these approaches: it looks for continuous syllables and
for acoustic bursts. These two algorithms are not independent: syllables
are found first, and then the median length of a syllable becomes the
expected interburst interval, guiding burst detection.
segment
operates with ordinary or mel-transformed
spectrograms or with amplitude envelopes - smoothed contours of sound
intensity.
# for info on using soundgen() function, see the vignette on sound synthesis
s2 = soundgen(nSyl = 8, sylLen = 50, pauseLen = 70, temperature = 0,
pitch = c(368, 284), amplGlobal = c(0, -20))
# add noise so SNR decreases from 20 to 0 dB from syl1 to syl8
s2 = s2 + runif(length(s2), -10 ^ (-20 / 20), 10 ^ (-20 / 20))
# playme(s2, samplingRate = 16000)
a = segment(s2, samplingRate = 16000, plot = TRUE)
## propNoise set to 0.417; reset manually if needed
## SNR set to 2.7; reset manually if needed
This synthesized laugh-like sound contains 8 syllables, each 50 ms
long and separated by 70 ms of unvoiced fragments with some overlapping
aspiration noise (NOT silence!). With default settings,
segment
finds 7 syllables of median length
median(a$syllables$sylLen, na.rm = TRUE)
= 48 ms separated
by median(a$syllables$pauseLen, na.rm = TRUE)
= 72 ms. The
syllables are shown by blue line segments in the plot above. The last
syllables are missed either because they are below the default detection
threshold SNR
or because their apparent duration falls
below the default shortestSyl
of 40 ms. All 8 bursts are
correctly detected. The interburst interval is estimated to be
median(a$bursts$interburst, na.rm = TRUE)
= 120 ms (the
correct number is 70 + 50 = 120 ms), with SD =
sd(a$bursts$interburst, na.rm = TRUE)
= 3 ms (the correct
number is 0 since we set temperature to 0). Note that, just as with many
real-life sounds, the question of when each syllable starts and ends is
pretty subjective, and segmentation can be performed on different time
scales, eg looking for bouts of laughter surrounded by noise or for
individual syllables within each laugh.
Some other settings worth mentioning are:
propNoise
, SNR
: two key settings that
enable the algorithm to separate signal from noise. If you know how much
of your recording(s) is likely to be noise, specify “propNoise”, then
use “SNR” to fine-tune the balance between false negatives (missing
faint signals) and false positives (hallucinating signals that are
actually noise).## propNoise set to 0.417; reset manually if needed
a2 = suppressWarnings(segment(s2, samplingRate = 16000, plot = TRUE,
SNR = 14, main = 'SNR very high'))
## propNoise set to 0.417; reset manually if needed
windowLength
, overlap
: length (ms) and
overlap (%) of the smoothing window used to produce the signal envelope.
Setting the overlap too low makes the enveloped jagged and imprecise,
while setting the window length too low produces insufficient
smoothing:a1 = segment(s2, samplingRate = 16000, plot = TRUE,
windowLength = 40, overlap = 0, main = 'overlap too low')
## propNoise set to 0.481; reset manually if needed
## SNR set to 3.8; reset manually if needed
a2 = suppressWarnings(segment(s2, samplingRate = 16000, plot = TRUE,
windowLength = 5, overlap = 80, main = 'window too short'))
## propNoise set to 0.313; reset manually if needed
## SNR set to 1.6; reset manually if needed
a3 = segment(s2, samplingRate = 16000, plot = TRUE,
windowLength = 150, overlap = 80, main = 'window too long')
## propNoise set to 0.344; reset manually if needed
## SNR set to 3.6; reset manually if needed
shortestSyl
and shortestPause
incorporate
our prior knowledge by expecting the syllables and pauses between them
to be at least 40 ms long (by default). Setting shortestSyl
and shortestPause
too low may inflate the number of
discovered syllables, but burst detection should not be affected as
much. If shortestSyl
is too high, excluding shorter
fragments, we won’t find any syllables, but burst detection can still
succeed. The most damaging mistake is to set shortestPause
too high, because separate syllables are then merged and the expected
interburst interval becomes inflated, preventing the algorithm from
recognizing closely spaced bursts (see the example below).# too long, but at least bursts are detected
a1 = segment(s2, samplingRate = 16000, plot = TRUE,
shortestSyl = 80, main = 'shortestSyl too long')
## propNoise set to 0.417; reset manually if needed
## SNR set to 2.7; reset manually if needed
# merges syllables
a2 = segment(s2, samplingRate = 16000, plot = TRUE,
shortestPause = 80, main = 'shortestPause too long')
## propNoise set to 0.417; reset manually if needed
## SNR set to 2.7; reset manually if needed
peakToTrough
controls how high a burst has to be
compared to the local minimum. The size of the analysis window for
finding local maxima and minima is controlled by
interburst
, which defaults to the median length of detected
syllables + pause between them. By default the local minimum is
calculated on both left and right sides of the candidate burst
(troughLocation = 'either'
). The left-right distinction is
irrelevant with this artificial, perfectly symmetrical example, but
setting troughLocation = 'left'
may improve performance
with real-life sounds with an assymmetric attack (sharp onset and
gradual decay).TIP For batch processing without manual adjustments, and particularly when the focus is on describing the overall rate of acoustic events rather than on actually segmenting a recording into separate vocalizations, bursts may be more robust than syllables
The digital representation of a sound is a long vector of numbers on some arbitrary scale, say [-1, 1]. Values further from zero correspond to a higher amplitude - in physical terms, to greater pertubations of sound pressure level caused by the propagating sound wave. A smoothed line following peak amplitude values is known as an amplitude envelope. However, there is no simple correspondence between the absolute height of amplitude peaks and the subjectively experienced loudness of the corresponding sound. A commonly reported measure of sound intensity is its root mean square (RMS) amplitude, which takes into account the average value of sound pressure, and not only the height of peaks. More sophisticated estimates of loudness also take into account the relative sensitivity of human hearing to different frequencies, masking of adjacent tones in the time and frequency domains, etc.
To illustrate the differences between these estimates, let’s look at a pure tone sweeping with fixed absolute amplitude from 100 to 4000 Hz over 2 s:
dur = .5 # .5 s duration
samplingRate = 16000
f0 = getSmoothContour(c(70, 8000), len = samplingRate * dur, thisIsPitch = TRUE)
sweep = sin(2 * pi * cumsum(f0) / samplingRate)
# playme(sweep)
# spectrogram(sweep, 16000)
# plot(sweep, type = 'l')
Compare the flat contour of RMS amplitude per STFT frame
vs. loudness, both returned by analyze()
:
a = analyze(sweep, samplingRate = samplingRate, pitchMethods = NULL, plot = FALSE)
par(mfrow = c(1, 2))
plot(a$detailed$time, a$detailed$ampl, type = 'l', ylim = c(0, 1), main = 'RMS')
plot(a$detailed$time, a$detailed$loudness, type = 'l', main = 'Loudness')
Soundgen also has a dedicated function for calculating the loudness
and plotting the output, getLoudness()
. Loudness values are
overlaid on the spectrogram - observe how the loudness peaks as f0
reaches about 2-3 kHz and then drops. The absolute values in sone are
only an approximation, since they are dictated by the playback device
(e.g. your headphones), but the change of loudness within one sound, or
across different sounds analyzed with the same settings, is
informative.
Acoustic analysis in the context of phonetic or bioacoustic research tends to begin with a spectrogram. However, there is a complementary perspective on visualizing and analyzing sounds, which looks at the joint distribution of temporal and spectral modulation known as “modulation spectrum”. The principal insight is the idea that a sound can be represented as a sum of sinusoidal spectrotemporal ripples. In case this concept is difficult to grasp, here are two intuitive explanations (for a rigourous scientific presentation, please refer to Singh & Theunissen, 2003). The first way to understand the modulation spectrum is to think of it as a two-dimensional Fourier transform of the spectrogram. For those familiar with image processing, this is the standard way of representing any two-dimensional image as a sum of sinusoidal components (ripples) that differ in their frequency and direction. When the image in question is a spectro-temporal representation of a sound, such as a spectrogram or wavelet transform, the result is the modulation spectrum. The second basic insight is to link the modulation spectrum to spectro-temporal receptive fields (STRFs) of auditory neurons. Some neurons respond to rapid upward sweeps, others to broadband noise pulses at a particular frequency, and so on. The modulation spectrum maps nicely onto STRFs, in the sense that it shows which spectro-temporal patterns are prevalent in a particular sound.
The basic function for obtaining a spectro-temporal modulation
spectrum in soundgen is modulationSpectrum
. It accepts one
or more numeric vectors or one or more audio files as input, prepares a
spectrogram via STFT or filter banks via audSpectrogram(), takes its
two-dimensional FFT, and optionally performs some additional
post-processing and plotting. For multiple inputs, a separate modulation
spectrum is initially extracted for each one, and then they are averaged
into a single output. For processing a number of audio files separately
and saving their individual modulation spectra, please use the function
modulationSpectrumFolder
. An alternative method is to take
a 1D FFT of each frequency band, which makes sense if we are mostly
interested in temporal modulation (e.g., for estimating roughness).
The first and crucial thing to remember when working with modulation spectra produced via STFT is that they are extremely sensitive to the initial step of transforming a sound into a spectrogram, especially the window length and step used in STFT. In order to detect temporal modulation at 20 Hz, for example, the step of a spectrogram has to be under 50 ms (1/20 s). Likewise, spectral modulation of 10 cycles/KHz cannot be resolved unless the window length is long enough (~20 ms). Here is a simple illustration:
s = soundgen(pitch = 70, amFreq = 25, amDep = 80, rolloff = -15)
ms = modulationSpectrum(s, samplingRate = 16000,
windowLength = 25, step = 25, amRes = NULL)
## Warning in .modulationSpectrum(audio = list(sound = c(0, 0, 0, 0, 0, 0, :
## roughRange outside the analyzed range of temporal modulation frequencies;
## increase overlap / decrease step to improve temporal resolution, or else look
## for roughness in a lower range
In this example we missed both spectral modulation at ~14 cycles/KHz corresponding to the f0 of 70 Hz (1000/70 = 14.3) and temporal modulation at 25 Hz. To increase the resolution in both directions, we can simultaneously increase the window length (to capture spectral modulations) and decrease STFT step (to capture temporal modulations):
Now we can see both spectral modulations at 14 Hz (f0) and temporal modulations at 25 Hz, as expected. When working with sounds for which you don’t know the ground truth, however, the point is not to take the first modulation spectrum you happen to produce at face value - play around with different settings and compare the results. This is really the same indeterminacy principle as the more familiar tradeoff between frequency and time resolution in spectrograms.
Beyond these basic settings, there are several other options:
ms = modulationSpectrum(
s, samplingRate = 16000, windowLength = 15, step = 5,
specSource = 'STFT', # STFT or auditory spectrogram
msType = '2D', # 2D = joint spectro-temporal modulations
amRes = NULL, # analyze the entire sound at once (see TIP below)
logSpec = FALSE, # log-transform the spectrogram before 2D FFT?
power = 2, # square amplitudes in modulation spectrum ("power" spectrum)
roughRange = c(30, 150), # temporal modulations in the "roughness" range
amRange = c(10, 70), # AM frequencies of interest
logWarpX = c(2, 2), # log-transform the temporal modulations axis for plotting
kernelSize = 7, # apply Gaussian blur for smoothing
quantiles = c(.5, .8, .95, .99), # customize contour lines
colorTheme = 'terrain.colors' # alternative palette
)
ms[c('roughness', 'amMsFreq', 'amMsPurity')]
## $roughness
## [1] 13.8419
##
## $amMsFreq
## [1] 25.25253
##
## $amMsPurity
## [1] -16.96504
TIP Roughness and amplitude modulation can be calculated for the
entire sound (set amRes = NULL
) or for as many frames as
possible given the duration of audio and the desired resolution for the
analysis of temporal modulation (one frame for the analysis of roughness
comprises several STFT frames, so we measure about amRes
values per second)
The basic idea behind self-similarity matrices (SSMs) is to compare
different parts of the same sound with each other and present their
pairwise similarity indices as a square matrix with time along both
dimensions. The diagonal going from bottom-left to top-right represents
the similarity of each fragment with itself. For example, at (100, 100)
we have the similarity of the fragment at 100 ms with itself, while at
(100, 200) we have the similarity of the fragments at 100 ms and 200 ms.
Let’s begin with an example taken from ?ssm
:
s3 = c(soundgen(), soundgen(nSyl = 4, sylLen = 50, pauseLen = 70,
formants = NA, pitch = c(500, 330)))
# playme(s3, 16000)
m = ssm(s3, samplingRate = 16000)
Look at the upper panel first. There are three main regions: the large red square in the lower left corner, which corresponds to the first bout; the red square in the center, which corresponds to the long pause between the bouts; and the checkered field in the upper right corner, which corresponds to the second bout. Observe the characteristic diagonal. The SSM clearly shows periodicity within the second bout, which confirms that these syllables are very similar to each other in terms of their spectral characteristics. Now look at the lower panel of the plot. This is a spectrogram overlaid with so-called “novelty”: peaks show when something changes in the sound, based on the SSM.
The results of SSM analysis can vary dramatically, depending on how it is performed. We begin by extracting some form of spectrogram of the entire sound, then we somehow compare each frame to each other frame, and finally we calculate a measure of novelty. At each step there are crucial decisions to make, which can dramatically alter the result. Let’s look at the settings that control each step. As you play with these settings, don’t be surprised if your SSM suddenly looks completely different!
Spectrogram extraction is controlled by some already familiar
settings such as windowLength
and overlap
(or
step
if you prefer). However, it can often be helpful to
base the SSM on some other spectral characteristics than the usual
spectrogram. Under the hood, ssm
calls
tuneR::melfcc
to return the power spectrum (as in a
spectrogram, input = 'spectrum'
), mel-transformed spectrum
(auditory spectrum, which is supposed to be closer to how humans
perceive sounds, input = 'audiogram'
), or mel-frequency
cepstral coefficients (MFCCs - a popular choice of input for automatic
sound classification, as in speech recognition software,
input = 'mfcc'
). There are also several arguments that are
passed to tuneR::melfcc
: maxFreq
controls the
highest frequency analyzed, MFCC
controls the number of
mel-frequency cepstral coefficients to extract, and nBands
basically controls the frequency resolution.
SSM calculation starts with the spectrogram (or the MFCCs matrix)
produced at the previous stage. This input is divided into windows, each
of which can be one or several STFT frames in length, and then each
window is compared to each other window using either cosine similarity
(simil = 'cosine'
) or Pearson’s correlation
(simil = 'cor'
). Prior to this the input matrix can be
normalized column-wise, so as to level out the differences in sound
intensity across frames (norm = TRUE
). The size of analysis
window is controlled by the argument ssmWin
, ms. Note that
as you increase ssmWin
, the resulting SSM literally shrinks
in size, reducing its resolution. This is a great feature if your sound
is very long, but it is seldom needed for shorter sounds. As you
decrease ssmWin
, it ultimately reaches the size of a single
STFT frame (step
) and can’t go down any further, unless you
improve time resolution of the spectrogram itself by reducing
windowLength
and/or increasing
overlap
.
Novelty calculation starts with the SSM computed at the previous
step. Here the idea is to move along the diagonal and multiply the SSM
by a Gaussian checkerboard matrix, so as to detect changes in the
spectral structure (see Foote, 2000). The settings that control this
algorithm are kernelLen
and kernelSD
. The
crucial parameter is kernelLen
(ms): a large kernel is good
for detecting global changes in the sound, such as transitions between
different vocalizations, whereas a small kernel is good for detecting
rapid changes, such as individual bursts in a laugh. If you are curious
about the kernel itself, take a look at
?soundgen:::getCheckerboardKernel()
.
par(mfrow = c(2, 1))
m1 = ssm(s3, samplingRate = 16000,
input = 'melspec', simil = 'cor', norm = FALSE,
ssmWin = 10, kernelLen = 150) # detailed, local features
m2 = ssm(s3, samplingRate = 16000,
input = 'mfcc', simil = 'cosine', norm = TRUE,
ssmWin = 50, kernelLen = 600) # more global
par(mfrow = c(1, 1))
TIP analyze() internally calls ssm() to calculate novelty
The results of acoustic analysis are sensitive to the chosen
settings: the size and overlap of Fourier windows, the method(s) of
pitch tracking and their parameters, etc. The default settings of the
two main functions presented in this vignette, analyze
and
segment
, have been optimized for human non-linguistic
vocalizations. If you analyze other types of sounds, such as human
speech or whale songs, you can probably improve the accuracy by changing
some settings. In many cases the required changes are obvious: for
example, to analyze f0 in ultrasonic whistles of dolphins or rodents you
will obviously want to record at a much higher sampling rate, use
shorter FFT windows, raise pitch_ceiling
and / or
prior_mean
, etc. Other settings are not so obvious or even
downright opaque, inviting some form of automatic optimization.
All that you need in order to run optimization is a training sample - that is, a few hundred sounds for which you know the right answer (average pitch, the number of syllables, or whatever it is that you wish to measure). You can start by analyzing this training sample with some reasonable settings, and then you can check the measurements manually, correcting any mistakes. This gives you a “key”, and then you can fiddle with the settings trying to reproduce this key as closely as possible, but now without manual interventions. Once you have found the optimal settings, you can use them to analyze future samples of acoustically similar material. More sophisticated methods of optimization involving cross-validation are also possible, but even the simplest check against manual measurements can cause a dramatic improvement in the quality of your measurements.
R has excellent optimization tools, notably
stats::optim
. It can also be done manually; for example,
you can specify a grid with combinations of several control parameters
and repeat the analysis for each combination, each time comparing the
result with your “key”. Since I have gone through this process when
optimizing the default settings in analyze
and
segment
, I believe it may be helpful to share some code and
tips specific to this particular optimization problem.
Soundgen package contains two vectors containing “keys” for the
optimization of segmentation (segmentManual
) and pitch
tracking (pitchManual
) of 260 sounds in the corpus of human
vocalizations described in Anikin & Persson (2017).
segmentManual
contains manually verified syllable counts,
and pitchManual
the average pitch of these 260 sounds.
These numbers are by no means intended to represent the absolute truth:
in many cases it is very hard to objectively count the syllables or even
determine the “true” pitch (if you doubt it, listen to the sounds, which
can be downloaded from
http://cogsci.se/publications/2017_corpus.html).
Nevertheless, these manual measurements should not be too far off the
mark, so they are acceptable targets for training the algorithm.
The easiest, “out-of-the-box” solution is to use the soundgen
function optimizePars
, which is essentially a wrapper
around stats::optim
tailored to this particular task. There
are extensive examples in the documentation for this function: see
?optimizePars
.
TIP The parameters of segment
can be optimized
within an hour or two with 260 sounds (total audio duration ~9 min), but
pitch tracking is really too slow to be optimized head-on, except maybe
for one or two parameters at a time using grid optimization instead of
optim
If you prefer to optimize manually, without calling
opitmizePars
, here are a couple of examples.
# checking combinations of pitch tracking methods
myfolder = 'path.to.260.wav.files'
key = log(pitchManual)
p = c('autocor', 'cep', 'spec', 'dom')
pp = c(list(p),
combn(p, 3, simplify = FALSE),
combn(p, 2, simplify = FALSE),
combn(p, 1, simplify = FALSE))
out = list()
res = data.frame('pars' = sapply(pp, function(x) paste(x, collapse = ',')),
cor1 = rep(NA, length(pp)),
cor2 = rep(NA, length(pp)))
# repeating the analysis for each combination of methods in pp
for (i in 1:length(pp)) {
out[[i]] = analyze(myfolder, plot = FALSE, step = 50,
pitchMethods = pp[[i]])$summary$pitch_median
res$cor1[i] = cor(log(out[[i]]), log(pitchManual), use = 'pairwise.complete.obs')
res$cor2[i] = cor(log(out[[i]]), log(pitchManual), use = 'pairwise.complete.obs') *
(1 - mean(is.na(out[[i]]) & !is.na(key)))
print(res[i, ])
}
res[order(res$cor1, decreasing = TRUE), ] # max correlation regardless of NA
res[order(res$cor2, decreasing = TRUE), ] # max correlation penalized for NA
And another example, for a grid of two parameters of
analyze
:
myfolder = 'path.to.260.wav.files'
key = log(pitchManual)
out = list()
pars = expand.grid(windowLength = c(17, 35, 50),
smooth = c(0, 1, 2))
for (i in 1:nrow(pars)) {
out[[i]] = suppressWarnings(analyze(myfolder, plot = FALSE,
step = 25,
pitchMethods = c('autocor','dom'),
windowLength = pars$windowLength[i],
smooth = pars$smooth[i]))$summary$pitch_median
print(cor(log(out[[i]]), key, use = 'pairwise.complete.obs'))
print(cor(log(out[[i]]), key, use = 'pairwise.complete.obs') *
(1 - mean(is.na(out[[i]]) & !is.na(key))))
}
pars$r1 = sapply(out, function(x) {
cor(log(x), key, use = 'pairwise.complete.obs')
})
pars$r2 = sapply(out, function(x) {
cor(log(x), key, use = 'pairwise.complete.obs') *
(1 - mean(is.na(x) & !is.na(key)))
})
pars
v = 6 # pick some combination of par values to explore
trial = log(out[[v]])
cor(key, trial, use = 'pairwise.complete.obs')
cor(key, trial, use = 'pairwise.complete.obs') *
(1 - mean(is.na(trial) & !is.na(key)))
plot (key, trial)
abline(a=0, b=1, col='red')
Anikin, A. & Persson, T. (2017). Non-linguistic vocalizations from online amateur videos for emotion research: a validated corpus. Behavior Research Methods, 49(2): 758-771. Text and sounds available here
Ba, H., Yang, N., Demirkol, I., & Heinzelman, W. (2012, August). BaNa: A hybrid approach for noise resilient pitch detection. In Statistical Signal Processing Workshop (SSP), 2012 IEEE (pp. 369-372).
Boersma, P. (1993). Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound. In Proceedings of the institute of phonetic sciences (Vol. 17, No. 1193, pp. 97-110).
Foote, J. (2000). “Automatic Audio Segmentation using a measure of audio novelty.” In Proceedings of IEEE International Conference on Multimedia and Expo, vol. I, pp. 452-455.
Singh, N. C., & Theunissen, F. E. (2003). Modulation spectra of natural sounds and ethological theories of auditory processing. The Journal of the Acoustical Society of America, 114(6), 3394-3411.
Sueur, J. (2018). Sound analysis and synthesis with R. Heidelberg, Germany: Springer.