Reproducing a target sound with soundgen

A tutorial on manually matching control parameters to make a synthetic copy of an existing vocalization

Andrey Anikin

2020-09-17

1 Purpose

It is often useful to reproduce an existing vocalization, in effect obtaining its parametric model. The task is to find the soundgen settings that will achieve the desired result. For example, let’s take one of the sounds included in the online demo of human vocalizations, a female moan or sigh of disgust called “moan_180”.

Here is the original recording:

And here is its synthetic version, which is a close (though obviously imperfect) approximation of the original:

The R code that generates the synthetic version in the demo is as follows:

s = soundgen(
  sylLen = 380,
  attackLen = c(15, 50),
  pitch = list(time = c(0, 85, 320, 380),
               value = c(280, 310, 240, 250)),
  rolloff = c(-2, -7), rolloffOct = -1,
  jitterDep = list(time = c(0, 50, 380), value = c(3, .05, .05)),
  formants = list(f1 = list(freq = 720, width = 150), 
                  f2 = 1215, f3 = 2900, 
                  f4 = 4100, f5 = 6000, f6 = 7300),
  formantCeiling = 5, formantWidth = 1.5, formantDep = 1.25,
  mouth = c(.5, .5, .5, .7),
  noise = list(time = c(0, 250, 400, 680),
               value = c(-35, -25, -20, -40)),
  rolloffNoise = 0,
  temperature = 0.001,
  addSilence = 0, samplingRate = 22050, pitchSamplingRate = 22050,
  play = T, plot = T, osc = T, ylim = c(0, 10)
)

Now the tricky question is: how do we get from the original sound to this code? As briefly described in the vignette on sound synthesis, I recommend doing so manually. The rest of this tutorial shows how.

DISCLAIMER: the following is a very personal take on acoustic analysis with a narrow focus on finding the most appropriate soundgen parameters. Please do not treat it as a definitive guide or a way to reveal the “ground truth” of the mentioned acoustic characteristics!

2 Example 1: breathy moan

Let’s take this same moan as the first walk-through example.

2.1 Know thy enemy

First let’s take a long, hard look at the sound we are trying to recreate. Open it in some interactive audio editor. I prefer Audacity, but any program will do, as long as it supports both waveform and spectrogram views. In Audacity, switch to spectrogram view (window size 1024 is about right for a sound with a sampling rate of 44100 Hz - ~23 ms, narrow-band spectrogram), zoom in on the most interesting region of the frequency axis (here 0 to 8 kHz), and adjust the time zoom level to your liking. Here is our target, then! Let’s listen to it a few times and examine the spectrogram and oscillogram:

library(soundgen)
## Loading required package: shinyBS
## Soundgen 1.8.0.9000. Tips & demos on project's homepage: http://cogsci.se/soundgen.html
spectrogram('files/moan_180.wav', ylim = c(0, 8),
            osc_dB = T, heights = c(1, 1)) 

# Download link: cogsci.se/soundgen/matching/files/moan_180.wav

The first thing to notice is that this moan consists of two parts: the first ~400 ms are voiced (note the clearly visible harmonics), and then for ~300 ms we have a voiceless breathing sound (only formants are visible - no harmonics). These visible formants in the voiceless fragment are going to make our job much easier. In fact, we have two causes for celebration: (1) the formant structure is roughly comparable in the voiced and voiceless fragments (i.e., we are dealing with a relaxed [h] sound originating close to the glottis), so it is not absolutely necessary to specify the formants separately for the voiced and voiceless components; and (2) while formants may be hard to detect in purely harmonic sounds, the presence of spectral noise makes them much more prominent.

2.2 Temporal structure

With such a simple target vocalization, we can probably create a decent synthetic approximation in one go, without breaking it into several separately synthesized bouts, overlaid acoustic layers, or post-processing. A good way to start is to measure syllable length - separately for the voiced and unvoiced component. The length of the voiced component is about 360-380 ms (we can add a bit for the fade-in/out):

TIP: switch to waveform view in Audacity to make precise temporal measurements - don’t trust the spectrogram! You can go back and forth between waveform and spectrogram views to make sure you have identified the right fragment.

Since the sylLen parameter in soundgen regulates the length of voiced fragments only, we set it to 380 ms, and then we use the noise parameter to show when turbulent noise (breathing) should be added. The unvoiced fragment seems to start about 20 ms before the voiced fragment, but we’ll ignore it, since it could be an artifact. There is a burst of noise on the spectrogram as the voicing begins, but it seems to be related to irregularities in the voiced component, not breathing. Then the sound is mostly tonal, but the breathing noise grows gradually towards the end of the voiced fragment, jumps rather abruptly to about -20 dB at 350-360 ms, and then there is a linear fade-out to -40 dB at 360 + 340 = 700 ms. Where did the numbers -20 and -40 dB come from? From a visual inspection of the oscillogram on a dB scale, plotted directly in R or in Audacity:

osc_dB('files/moan_180.wav') 
## osc_dB is deprecated; please use osc(dB = TRUE) instead
abline(h = 60, col = 'blue')
abline(h = 40, col = 'red')

For now, we can use these values in dB - and a bit of guesswork - for setting the preliminary noise. We can always readjust them later, once other settings have been finalized. Finally, the sound starts with the release of a glottal stop, making the attack very sharp. So the first approximation might be something like this:

s1_1 = soundgen(
  # MISC
  temperature = .001,  # for reproducibility in this example
  addSilence = 0,  # easier to synchronize with target
  samplingRate = 44100,  # same as in target for ease of comparison
  play = T, plot = T,   # play & plot the result
  osc_dB = T, heights = c(1, 1), ylim = c(0, 8),  # plotting pars
  
  # TEMPORAL
  sylLen = 380,  # length of the voiced fragment
  noise = list(
    time = c(0, 250, 400, 680),    # noise amplitude defined at these time points
    value = c(-35, -25, -20, -40)  # noise amplitude, dB
  ),  
  attackLen = c(15, 50)  # 15 ms at onset, default 50 ms at offset
)

# seewave::savewav(s1_1, f = 44100, filename = 's1_1.wav')
# system('ffmpeg -y -i s1_1.wav files/s1_1.ogg')

Not too bad for a start. But why does that sound like a man? Well, we didn’t specify an intonation contour and formants, so soundgen falls back on the default values. And these default values happen to be based on a recording of the voice of the developer, who happens to be a man. To sound like the woman in the original recording, we have to match her intonation (fundamental frequency, pitch) and formant structure (transfer function of the vocal tract).

TIP: don’t be in a hurry to use the ampl parameter in order to force the synthetic sound into the right amplitude envelope. Set formant transitions and dynamic rolloff changes first, because the envelope also depends on them

2.3 Intonation (pitch)

I discuss some methods of automatic pitch tracking - and its perils - in the vignette on acoustic analysis. You are welcome to use your favorite pitch tracker, like the one in PRAAT, or run soundgen::analyze() and get pitch values from its output. As before, I believe a manual approach is safer and ultimately faster. For example, in this case automatic algorithms may struggle with the initial rapid rise of f0:

a = analyze('files/moan_180.wav', 
            windowLength = 30, overlap = 80,
            pitchMethods = c('autocor', 'dom'),
            ylim = c(0, 2))
a$pitch

Instead of trying to fine-tune pitch trackers, we can just estimate a few values of the fundamental frequency (f0) - enough to define a reasonable approximation of the original curve - directly from the spectrogram. Say, we might begin by finding the approximate peak of f0. In this case, it seems like harmonics peak at about 100 ms. In Audacity, we make a small selection around this time point and click “Analyze / Plot spectrum…” to obtain a spectral slice:

Harmonics are clearly visible, and the frequency of the first one is about 313 Hz. If we want more precision or if f0 is masked by noise, we can measure the average distance between any two adjacent harmonics or take a higher harmonic and divide by its number. For example, 4 * f0 is 1249 Hz, so f0 is about 312 Hz:
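To do the same measurement directly in R, here is a minimal sketch using seewave (assuming, as above, that the recording sits in files/): take a spectral slice at ~100 ms and find the frequency of the strongest peak below 500 Hz, which should land close to the ~312 Hz we just read off the spectrum by hand.

w = tuneR::readWave('files/moan_180.wav')
sl = seewave::spec(w, wl = 1024, at = 0.1, plot = FALSE)  # spectral slice at t = 100 ms
lf = sl[sl[, 1] < 0.5, ]           # keep frequencies below 0.5 kHz (column 1 is in kHz)
lf[which.max(lf[, 2]), 1] * 1000   # frequency of the strongest peak, Hz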

In any case, f0 peaks at about 312 Hz, and we can similarly estimate its values at a few other time points and then plug these values into our initial call to soundgen defined above, for example:

s1_2 = soundgen(
  # MISC
  temperature = .001,  # for reproducibility in this example
  addSilence = 0,  # easier to synchronize with target
  samplingRate = 44100,  # same as in target for ease of comparison
  play = T, plot = T,   # play & plot the result
  osc_dB = T, heights = c(1, 1), ylim = c(0, 8),  # plotting pars
  
  # TEMPORAL
  sylLen = 380,  # length of the voiced fragment
  noise = list(
    time = c(0, 250, 400, 680),    # noise amplitude defined at these time points
    value = c(-35, -25, -20, -40)  # noise amplitude, dB
  ),  
  attackLen = c(15, 50),  # 15 ms at onset, default 50 ms at offset
  
  # PITCH
  pitch = list(time = c(0, 85, 320, 380),
               value = c(280, 310, 240, 250))
)

# seewave::savewav(s1_2, f = 44100, filename = 's1_2.wav')
# system('ffmpeg -y -i s1_2.wav files/s1_2.ogg')

We are getting closer. The intonation is very similar to the original, but the vowel quality is not right - more like an [a], while the target is more like an [o]. To improve the result, the next step is to estimate the speaker’s formant frequencies.

2.4 Filter (formants)

The standard method of measuring formant frequencies is called linear predictive coding (LPC). As with pitch, feel free to experiment with your favorite algorithm, say PRAAT or, working in R, phonTools::findformants(), which is also the method used by soundgen::analyze():

a = analyze('files/moan_180.wav', nFormants = 7, pitchMethods = NULL, plot = FALSE)
columns = c('f1_freq', 'f2_freq', 'f3_freq', 'f4_freq', 'f5_freq', 'f6_freq', 'f7_freq')
ff = round(as.vector(apply(a[, columns], 2, median, na.rm = TRUE)))
cat(ff)
## 767 1245 2960 3966 4710 6073 7261

In this case the result is actually reasonable and very close to the values I would pick manually, although we are ignoring formant transitions visible on the spectrogram and just taking the median values. Very often, however, formants are either hard to detect or highly variable. I don’t know any foolproof way to measure their frequencies, but here are two tricks I find useful.

First, look at the noisiest voiced fragments of the spectrogram (the unvoiced fragments may well have a completely different formant structure, even though in this particular example the differences are slight). For our target moan, the beginning and end of the voiced fragment are fairly noisy, emphasizing the formant structure on the spectrogram. Eyeballing the spectrogram is a good start, and then for more precision we can look for peaks in the (relatively smoothed, broad-band) spectrum of the first ~50 ms of the voiced fragment in the target:
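In R, a similar smoothed spectrum can be obtained with seewave (a sketch; the short analysis window, wl = 256, blurs individual harmonics, giving a broad-band spectrum in which formant peaks stand out):

seewave::meanspec(tuneR::readWave('files/moan_180.wav', from = 0, to = 0.05,
                                  units = 'seconds'),
                  wl = 256,  # short window = smoothed, broad-band spectrum
                  flim = c(0, 8), dB = 'max0')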

Second, run a sanity check: do your formant values imply that the speaker’s vocal tract is a meter long? Unless your target sound was produced by an elephant, something is clearly a bit off. For humans, vocal tract length (VTL) is normally between 13 and 18 cm. To estimate VTL from candidate formant frequencies, run:

schwa(ff)
## $vtl_apparent
## [1] 16.06523
## 
## $formantDispersion
## [1] 1101.758
## 
## $ff_measured
## [1]  767 1245 2960 3966 4710 6073 7261
## 
## $ff_schwa
## [1]  550.8791 1652.6374 2754.3956 3856.1538 4957.9121 6059.6703 7161.4286
## 
## $ff_relative
## [1]  39.2319968 -24.6658687   7.4645921   2.8485937  -5.0003325   0.2199735
## [7]   1.3903850
## 
## $ff_relative_semitones
## [1]  5.72988953 -4.90349337  1.24633671  0.48626480 -0.88806756  0.03804074
## [7]  0.23905015

A VTL of 16 cm is a lot for a female speaker, suggesting that the larynx might be actively lowered (which makes sense in a moan of disgust), but this VTL is not completely outlandish. If you do get an outlandish VTL, the likely explanation is that you have too many formants (i.e., you may have counted a single formant as two) or too few (i.e., you missed one). In this case, I don’t think there is a formant at 4700 Hz, and based on auditory feedback (which really means “try some numbers and listen to the result”), I would set formant frequencies at c(720, 1215, 2900, 4100, 6000, 7300), which implies a VTL of 14 cm - more reasonable for a female:

estimateVTL(c(720, 1215, 2900, 4100, 6000, 7300))
## [1] 13.91747

Apart from central frequencies, formants are characterized by their bandwidth and strength (amplitude). In breathy voices, formant bandwidths are often greater than soundgen assumes by default, and this is easy to see by comparing the spectrograms and average spectra of the target and synthetic sounds. After some tinkering, using formantWidth = 1.5, formantDep = 1.25 produces a good result. formantCeiling = 5 tells soundgen to calculate formants up to 5 times the Nyquist frequency, improving the quality of the spectral filter in the upper-frequency range. F1 seems to be particularly broad, so we can specify its bandwidth manually (which is not strictly necessary - only worthwhile if we are aiming to get really close to the original).

Finally, so far we have assumed that all formants are stationary. However, notice that the formants in the target rise slightly, especially at the end of the unvoiced segment. This type of parallel formant transition can easily be coded with the mouth parameter:

s1_3 = soundgen(
  # MISC
  temperature = .001,  # for reproducibility in this example
  addSilence = 0,  # easier to synchronize with target
  samplingRate = 44100,  # same as in target for ease of comparison
  play = T, plot = T,   # play & plot the result
  osc_dB = T, heights = c(1, 1), ylim = c(0, 8),  # plotting pars
  
  # TEMPORAL
  sylLen = 380,  # length of the voiced fragment
  noise = list(
    time = c(0, 250, 400, 680),    # noise amplitude defined at these time points
    value = c(-35, -25, -20, -40)  # noise amplitude, dB
  ),  
  attackLen = c(15, 50),  # 15 ms at onset, default 50 ms at offset
  
  # PITCH
  pitch = list(time = c(0, 85, 320, 380),
               value = c(280, 310, 240, 250)),
  
  # FORMANTS
  # formants = c(720, 1215, 2900, 4100, 6000, 7300),  # simplified version
  formants = list(f1 = list(freq = 720, width = 150), # manually adjusted F1 bandwidth
                  f2 = 1215, f3 = 2900, 
                  f4 = 4100, f5 = 6000, f6 = 7300),
  formantWidth = 1.5, formantDep = 1.25,  # increase width and amplitude of all formants
  formantCeiling = 5,   # high quality synthesis
  mouth = c(.5, .5, .5, .7)  # mostly neutral (.5), then formants rise gently at the end
)

# seewave::savewav(s1_3, f = 44100, filename = 's1_3.wav')
# system('ffmpeg -y -i s1_3.wav files/s1_3.ogg')

2.5 Source

There is something voodoo about controlling both the vocal tract transfer function (formants) and the glottal source (rolloff of harmonics in the voiced component) without measuring the two separately in the target. In other words, how do we know how much we should adjust formants and how much rolloff, when both can achieve similar results in terms of matching the spectra of the synthetic sound and the original?

Unless you are prepared to do inverse filtering (which requires making a lot of assumptions and may or may not work in any case), there is no analytical solution to this dilemma. For human sounds, a reasonable compromise is often to specify formant frequencies only (i.e., leaving the bandwidths and amplitudes at default values) and then to finish the job by adjusting the rolloff as needed. The goal is to ensure that the spectral envelopes of the target and its synthetic version are similar at each time point. To illustrate what I mean by that, let us compare spectrograms of the target and our latest synthetic version (s1_3):

spectrogram('files/moan_180.wav', main = 'Target', ylim = c(0, 8))

spectrogram(s1_3, samplingRate = 44100, main = 'Synthetic', ylim = c(0, 8))

If you compare these two spectrograms and play both sounds repeatedly, you may notice that the synthetic version is still too even and tonal: it lacks the harsh, jittery voice quality at the onset of the original, and its harmonics remain too strong relative to the noise towards the end of the voiced fragment.
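One quick way to A/B the two sounds in R is a pair of calls to playme(), which accepts both file paths and numeric vectors:

playme('files/moan_180.wav')
playme(s1_3, samplingRate = 44100)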

Putting it all together, here is one solution, by no means the only one. Note that jitter is present mostly at the very beginning (the first 50 ms). The strength of harmonics decreases throughout the voiced fragment.

s1_4 = soundgen(
  # MISC
  temperature = .001,  # for reproducibility in this example
  addSilence = 0,  # easier to synchronize with target
  samplingRate = 44100,  # same as in target for ease of comparison
  play = T, plot = T,   # play & plot the result
  osc_dB = T, heights = c(1, 1), ylim = c(0, 8),  # plotting pars
  
  # TEMPORAL
  sylLen = 380,  # length of the voiced fragment
  noise = list(
    time = c(0, 250, 400, 680),    # noise amplitude defined at these time points
    value = c(-35, -25, -20, -40)  # noise amplitude, dB
  ),  
  attackLen = c(15, 50),  # 15 ms at onset, default 50 ms at offset
  
  # PITCH
  pitch = list(time = c(0, 85, 320, 380),
               value = c(280, 310, 240, 250)),
  
  # FORMANTS
  # formants = c(720, 1215, 2900, 4100, 6000, 7300),  # simplified version
  formants = list(f1 = list(freq = 720, width = 150), # manually adjusted F1 bandwidth
                  f2 = 1215, f3 = 2900, 
                  f4 = 4100, f5 = 6000, f6 = 7300),
  formantWidth = 1.5, formantDep = 1.25,  # increase width and amplitude of all formants
  formantCeiling = 5,   # high quality synthesis
  mouth = c(.5, .5, .5, .7),  # mostly neutral (.5), then formants rise gently at the end
  
  # SOURCE
  rolloff = c(-2, -7),  # steeper rolloff over time (from -2 to -7 dB/octave)
  rolloffOct = -1,      # steeper rolloff in high- vs low-frequency range
  rolloffNoise = 0,     # simple white noise with a flat spectrum
  jitterDep = list(time = c(0, 50, 380), value = c(3, .05, .05))  # first 50 ms of voiced
)

# seewave::savewav(s1_4, f = 44100, filename = 's1_4.wav')
# system('ffmpeg -y -i s1_4.wav files/s1_4.ogg')

Let’s compare the spectrograms again, this time with oscillograms to ensure that amplitudes are also correct:

spectrogram('files/moan_180.wav', main = 'Target', ylim = c(0, 8), 
            osc_dB = T, heights = c(1, 1))

spectrogram(s1_4, samplingRate = 44100, main = 'Synthetic', ylim = c(0, 8), 
            osc_dB = T, heights = c(1, 1))

It is also very instructive to compare the average spectra. Note the key characteristics of the target spectrum that we would like to match: f0 is about 20 dB weaker than the harmonic close to F2, F3 (3 kHz) is at about -25 dB, and F6 (6 kHz) at about -40 dB:

par(mfrow = c(1, 2))
seewave::meanspec(tuneR::readWave('files/moan_180.wav'), flim = c(0, 8), 
                  alim = c(-80, 20), dB = 'max0', main = 'Target')
seewave::meanspec(s1_4, f = 44100, flim = c(0, 8), 
                  alim = c(-80, 20), dB = 'max0', main = 'Synthetic')

par(mfrow = c(1, 1))

The fit is reasonable but imperfect around ~4 kHz. Further improvements could be achieved by tinkering with the filter. Then again, this entire exercise in spectrum-matching is of questionable validity, since we still don’t know anything about the conditions in which the target sound was recorded (microphone response, background noise, etc.). For most purposes, I would say that the code for the source in s1_4 is already overkill.

And that’s about it. If you still feel that there is something magical about choosing these settings, you are perfectly right, and the reason is that there are no accessible (or perhaps even reliable) analytical solutions for measuring these characteristics. In fact, there is no unique solution, no “true” soundgen parameters that will reproduce the target with perfect accuracy. Try listening to the target and synthetic sounds repeatedly, comparing their spectrograms and average spectra. If you have the patience, it can also be helpful to look at spectral slices at a few key points. You can do this in R (try seewave::spec()), or you can save the output as a .wav file with seewave::savewav(), open both files in Audacity, and check their spectra at several time points. Then rinse and repeat until satisfied - unfortunately, this final polishing can be pretty tedious work.
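For example, here is one possible way to compare spectral slices at the same time point (say, 100 ms), assuming the latest synthetic version s1_4 is still in the workspace:

par(mfrow = c(1, 2))
seewave::spec(tuneR::readWave('files/moan_180.wav'), at = 0.1, 
              flim = c(0, 8), dB = 'max0', main = 'Target')
seewave::spec(s1_4, f = 44100, at = 0.1, 
              flim = c(0, 8), dB = 'max0', main = 'Synthetic')
par(mfrow = c(1, 1))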

TIP: there may be no limit to perfection, but there sure are limits to one’s personal time. Stop when satisfied, not when you can no longer tell the real and synthetic sounds apart!

2.6 Summary

So, just to recapitulate the sequence, here is the progression of our synthetic sounds.

Temporal structure:

Pitch:

Formants:

Source:

Target:

3 Example 2: stochastic scream

By “stochastic” I mean that this target sound was produced in a highly unstable mode of phonation, with nonlinearities like pitch jumps and deterministic chaos. One way to model such sounds is to embrace the stochasticity and leave the synthetic version underdetermined. In other words, we may be happy with a code that generates a different sound every time it is executed, capturing the “spirit” of the original sound - its mode of production seen as a generative model that could potentially have produced a variety of stochastic sounds, including the actually recorded version as only one instantiation. Alternatively, if we really do want to reproduce the target sound closely, we can create a controlled imitation of stochastic behavior, so that repeatedly executing the code would produce very similar, albeit random-sounding, vocalizations. Soundgen is designed to make both approaches straightforward, as described below.

So, here is our target:

spectrogram('files/scream_044.wav', windowLength = 25, ylim = c(0, 12),  osc = T)

# Download link: cogsci.se/soundgen/matching/files/scream_044.wav

Observe the subharmonics at ~100-300 ms, a pitch jump at ~300 ms and possibly another at ~950 ms, and an episode of deterministic chaos starting abruptly at ~300 ms and petering out by ~800 ms. These are the nonlinear phenomena we somehow need to reproduce. There is also a bit of clipping, and the extraneous click at ~315 ms is followed by some metallic noise. Naturally, we don’t want to reproduce these background noises, but we’ll need to keep them in mind, since they influence the target spectrum.
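If the subharmonics are hard to make out at this resolution, a narrow-band view with a longer window and a zoomed-in frequency axis (a sketch) makes the extra bands between the harmonics easier to see:

spectrogram('files/scream_044.wav', windowLength = 50, ylim = c(0, 3))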

I’m not going to discuss matching the target’s temporal structure. The intonation is relatively straightforward as well, except for the little trick of adding pitch jumps. Formants are very difficult to see in the spectrogram, although the vowel quality can be heard to change from i-like to a-like (and the downward sweep of F2 is actually visible). The sound is too noisy to see the harmonics clearly, so we cannot do anything very clever with the rolloff settings. Prior to adding nonlinearities, a good working model of our sound might look something like this:

s2_1 = soundgen(
  # MISC
  temperature = .001,  # for reproducibility in this example
  addSilence = 0,  # easier to synchronize with target
  samplingRate = 44100,  # same as in target for ease of comparison
  pitchSamplingRate = 44100,  # better for high-pitched vocalizations
  play = T, plot = T,   # play & plot the result
  osc = T, ylim = c(0, 8),  # plotting pars
  
  # TEMPORAL
  sylLen = 970,  # length of the voiced fragment
  
  # PITCH
  pitch = list(time = c(0,    130, 300, 301,  550,  840, 900, 901, 970),
               value = c(440, 640, 720, 1020, 1230, 830, 660, 550, 490)),
  
  # FORMANTS
  formants = list(f1 = c(400, 1100, 1100), 
                  f2 = c(2500, 1700, 1700), 
                  f3 = 3600),
  mouth = c(.45, .65, .5, .3),
  
  # SOURCE
  rolloff = c(-10, -6, -7, -14)
)

# seewave::savewav(s2_1, f = 44100, filename = 's2_1.wav')
# system('ffmpeg -y -i s2_1.wav files/s2_1.ogg')

3.1 Reproducing precisely

Now for nonlinear effects. First let’s add them at exactly the same time as in the target. To do so, we keep nonlinBalance at 100% (the default, which allows nonlinear effects throughout the vocalization) and specifically note the timing of onset and offset of subharmonics and chaos:

s2_2 = soundgen(
  # MISC
  temperature = .001,  # for reproducibility in this example
  addSilence = 0,  # easier to synchronize with target
  samplingRate = 44100,  # same as in target for ease of comparison
  pitchSamplingRate = 44100,  # better for high-pitched vocalizations
  play = T, plot = T,   # play & plot the result
  osc = T, ylim = c(0, 8),  # plotting pars
  
  # TEMPORAL
  sylLen = 970,  # length of the voiced fragment
  
  # PITCH
  pitch = list(time = c(0,    130, 300, 301,  550,  840, 900, 901, 970),
               value = c(440, 640, 720, 1020, 1230, 830, 660, 550, 490)),
  
  # FORMANTS
  formants = list(f1 = c(400, 1100, 1100), 
                  f2 = c(2500, 1700, 1700), 
                  f3 = 3600),
  mouth = c(.45, .65, .5, .3),
  
  # SOURCE
  rolloff = c(-10, -6, -7, -14),
  subFreq = 300,  # at or just below f0 / 2, so that g0 snaps to f0 / 2 (a single subharmonic)
  subDep = list(time =  c(0, 120, 130, 320, 330, 970), 
                value = c(0, 10, 10, 20, 0,   0)),  # trial & error
  shortestEpoch = 100,  # to enable short episodes of subh
  jitterDep = list(time = c(0, 260, 261, 790, 850, 970),
                   value = c(0, 0,  2,   1,  .1,   0)),
  shimmerDep = c(0, 20, 5)
)

# seewave::savewav(s2_2, f = 44100, filename = 's2_2.wav')
# system('ffmpeg -y -i s2_2.wav files/s2_2.ogg')

For subharmonics, we can use the “anchor format” for subDep to set the timing. As for g0 (subFreq), it is clearly visible on the spectrogram that there is a single subharmonic, i.e., g0 = f0 / 2. For short episodes of subharmonics, it may be useful to set shortestEpoch to a smaller value, like 50-100 ms.

To imitate deterministic chaos, we can use jitter and shimmer. Jitter is more perceptually relevant, so we trigger it precisely with the “anchor format”, obtaining a very abrupt onset of jitter at 260 ms, followed by a gradual weakening and return to tonal phonation by ~850 ms.

3.2 Reproducing stochastically

As an alternative to describing precisely when each nonlinear effect should be present, we can set nonlinBalance to some value under 100%, indicating that we want some nonlinear behavior, but letting soundgen determine the timing and amount stochastically. In this case the synthesized sound will be different every time the same code is executed. By definition, we no longer expect to match the target precisely, so we might as well increase the temperature, introducing some additional variation. Here is a way to generate and save several copies:

for (i in 1:5) {
  s2_3 = soundgen(
    # MISC
    temperature = .2,  # don't care about reproducibility anymore
    addSilence = 0,  # easier to synchronize with target
    samplingRate = 44100,  # same as in target for ease of comparison
    pitchSamplingRate = 44100,  # better for high-pitched vocalizations
    play = T, plot = T,   # play & plot the result
    osc = T, ylim = c(0, 8),  # plotting pars
    
    # TEMPORAL
    sylLen = 970,  # length of the voiced fragment
    
    # PITCH
    pitch = list(time = c(0,    130, 300, 301,  550,  840, 900, 901, 970),
                 value = c(440, 640, 720, 1020, 1230, 830, 660, 550, 490)),
    
    # FORMANTS
    formants = list(f1 = c(400, 1100, 1100), 
                    f2 = c(2500, 1700, 1700), 
                    f3 = 3600),
    mouth = c(.45, .65, .5, .3),
    
    # SOURCE
    rolloff = c(-10, -6, -7, -14),
    nonlinBalance = 70,  
    subFreq = 300, subDep = c(30, 0, 0),
    jitterDep = 1, shimmerDep = 15
  )
  # seewave::savewav(s2_3, f = 44100, filename = paste0('s2_3_', i, '.wav'))
  # call_sys = paste0('ffmpeg -y -i s2_3_', i, '.wav files/s2_3_', i, '.ogg')
  # system(call_sys)
}

Here are some examples of possible output:

4 Example 3: bout of laughter

The final example demonstrates how an entire bout consisting of several repetitive syllables can be synthesized with a single call to soundgen. Let me emphasize that this approach is only worth pursuing when syllables are repetitive - similar in structure, although not identical. When syllables differ significantly in intonation or voice quality, it’s better to synthesize them one at a time.

The target:

spectrogram('files/laugh_309.wav', ylim = c(0, 8),
            osc_dB = T, heights = c(2, 1)) 

# Download link: cogsci.se/soundgen/matching/files/laugh_309.wav

4.1 Temporal structure

There are seven voiced syllables with breathing noises in between them. The easiest way to generate them is to provide the average length of syllables and pauses and have soundgen create each syllable stochastically:

s = soundgen(nSyl = 7, sylLen = 60, pauseLen = 155, 
             temperature = .1,  # want stochastic behavior
             plot = TRUE, play = TRUE, osc = TRUE, heights = c(1, 1))

For more precision, we can measure the actual lengths (e.g., in Audacity using the Waveform view):

s = soundgen(
  nSyl = 7, 
  sylLen = c(55, 60, 55, 55, 60, 60, 80), 
  pauseLen = c(125, 140, 170, 160, 160, 175),
  temperature = .001,  # don't want highly stochastic behavior
  plot = TRUE, play = TRUE, osc = TRUE, heights = c(1, 1))

4.2 Intonation

The intonation of polysyllabic bouts is regulated with two arguments to soundgen: pitch gives the intonation contour of a typical, “average” syllable; pitchGlobal specifies how the intonation of each syllable deviates from this average by shifting the entire f0 contour up or down.

In this case, all seven syllables have a chevron-shaped intonation contour, which is first shifted slightly upwards on the second syllable and then goes down towards the end of the bout. Transforming this qualitative description into f0 values by analyzing spectral slices in Audacity (see example 1), we obtain the following parameters:

s3_1 = soundgen(
  # MISC
  temperature = .001,  # don't want highly stochastic behavior
  samplingRate = 44100,  # same as target
  pitchSamplingRate = 44100,  # high precision
  play = TRUE, plot = TRUE, osc_dB = TRUE, heights = c(2, 1), ylim = c(0, 8),
  
  # TEMPORAL
  nSyl = 7, 
  sylLen = c(40, 55, 55, 55, 55, 60, 80), 
  pauseLen = c(105, 140, 170, 160, 165, 175),
  attackLen = 10,
  
  # INTONATION
  pitch = c(280, 380, 280),  # f0 (Hz) for a typical syllable
  pitchGlobal = c(-2, 0, 0, 0, -1)  # deviation from typical contour, semitones
)

# seewave::savewav(s3_1, f = 44100, filename = 's3_1.wav')
# system('ffmpeg -y -i s3_1.wav files/s3_1.ogg')

4.3 Formants

The key to capturing the formants is to notice that the mouth is opened only once for the entire bout (in the first syllable), and then there is a slow parallel downward shift of all formants as the speaker gradually closes the mouth. This is a textbook example of using the nSyl argument rather than repeatBout, since with nSyl the specified formant contour applies across multiple syllables. All we need to do is measure (static) formants in the noise between syllables 1-2 or 2-3 and then use the mouth setting to create parallel formant transitions across syllables, for example:

s3_2 = soundgen(
  # MISC
  temperature = .001,  # don't want highly stochastic behavior
  samplingRate = 44100,  # same as target
  pitchSamplingRate = 44100,  # high precision
  play = TRUE, plot = TRUE, osc_dB = TRUE, heights = c(2, 1), ylim = c(0, 8),
  
  # TEMPORAL
  nSyl = 7, 
  sylLen = c(40, 55, 55, 55, 55, 60, 80), 
  pauseLen = c(105, 140, 170, 160, 165, 175),
  attackLen = 10,
  
  # INTONATION
  pitch = c(280, 380, 280),  # f0 (Hz) for a typical syllable
  pitchGlobal = c(-2, 0, 0, 0, -1),  # deviation from typical contour, semitones
  
  # FORMANTS
  formants = c(1200, 1850, 2600, 4200, 4900, 6000, 6800, 8300),
  mouth = c(.2, .6, .5, .4, .3)
)

# seewave::savewav(s3_2, f = 44100, filename = 's3_2.wav')
# system('ffmpeg -y -i s3_2.wav files/s3_2.ogg')

4.4 Noise

With sounds like laughs, in which the noise component is very loud and perceptually relevant, it is crucial to get the type and timing of noise just right. We can specify the timing of noise bursts for the first syllable, and then these noise anchors are scaled in proportion to the length of each following syllable. We have to compromise a bit, since the noise is about 10 dB louder and “darker” around the first two syllables compared to the rest of the bout. There is no way to vary the noise settings from one syllable to the next within a single bout, so we would have to break the bout into two to achieve this.

Also note that in this case the formant structure of the noise component is not identical to that of the voiced component, so we use a separate formantsNoise argument.

s3_3 = soundgen(
  # MISC
  temperature = .001,  # don't want highly stochastic behavior
  samplingRate = 44100,  # same as target
  pitchSamplingRate = 44100,  # high precision
  play = TRUE, plot = TRUE, osc_dB = TRUE, heights = c(2, 1), ylim = c(0, 8),
  
  # TEMPORAL
  nSyl = 7, 
  sylLen = c(40, 55, 55, 55, 55, 60, 80), 
  pauseLen = c(105, 140, 170, 160, 165, 175),
  attackLen = 10,
  
  # INTONATION
  pitch = c(280, 380, 280),  # f0 (Hz) for a typical syllable
  pitchGlobal = c(-2, 0, 0, 0, -1),  # deviation from typical contour, semitones
  
  # FORMANTS
  formants = c(1200, 1850, 2600, 4200, 4900, 6000, 6800, 8300),
  mouth = c(.2, .6, .5, .4, .3),
  
  # NOISE
  noise = list(
    time = c(-160, -80, -10, 10, 35, 45, 230),  # scaled for syllable 1 (40 ms)
    value = c(-40, -20, -40, -30, -30, -40, -50)),
  formantsNoise = c(1200, 2300, 2650, 4200, 4900, 6000, 6800, 8300),  # ≠formants
  rolloffNoise = -6
)

# seewave::savewav(s3_3, f = 44100, filename = 's3_3.wav')
# system('ffmpeg -y -i s3_3.wav files/s3_3.ogg')

4.5 Source

As before, rolloff settings are a pain and have to be set by trial and error, making sure that the average spectrum of the target is approximately reproduced. Nonlinearities can be added in a very simple manner, since the voice is somewhat rough throughout the bout. We could also set nonlinBalance to a value under 100%, and then some syllables would be rougher than others.

s3_4 = soundgen(
  # MISC
  temperature = .001,  # don't want highly stochastic behavior
  samplingRate = 44100,  # same as target
  pitchSamplingRate = 44100,  # high precision
  play = TRUE, plot = TRUE, osc_dB = TRUE, heights = c(2, 1), ylim = c(0, 8),
  
  # TEMPORAL
  nSyl = 7, 
  sylLen = c(40, 55, 55, 55, 55, 60, 80), 
  pauseLen = c(105, 140, 170, 160, 165, 175),
  attackLen = 10,
  
  # INTONATION
  pitch = c(280, 380, 280),  # f0 (Hz) for a typical syllable
  pitchGlobal = c(-2, 0, 0, 0, -1),  # deviation from typical contour, semitones
  
  # FORMANTS
  formants = c(1200, 1850, 2600, 4200, 4900, 6000, 6800, 8300),
  mouth = c(.2, .6, .5, .4, .3),
  
  # NOISE
  noise = list(
    time = c(-160, -80, -10, 10, 35, 45, 230),  # scaled for syllable 1 (40 ms)
    value = c(-40, -20, -40, -30, -30, -40, -50)),
  formantsNoise = c(1200, 2300, 2650, 4200, 4900, 6000, 6800, 8300),
  rolloffNoise = -6,
  
  # SOURCE
  rolloff = -12,
  jitterDep = 1, shimmerDep = 35
)

# seewave::savewav(s3_4, f = 44100, filename = 's3_4.wav')
# system('ffmpeg -y -i s3_4.wav files/s3_4.ogg')

4.6 Summary

Here is the progression of our synthetic sounds for Example 3.

Temporal structure and pitch:

Formants:

Noise:

Source:

Target:

A final word of caution: note how the synthetic laugh is noticeably different from the target, even after we did our best to match the source spectrum, formants, noise, and temporal structure. Ultimately, the problem is that there are complex spectral changes within each syllable, not only across the entire bout. In addition, the noise component contains wheezing and gurgling noises, which we ignored. The synthetic version is thus a gross simplification. It could be improved by synthesizing each syllable with a separate call to soundgen(), but this would require laboriously matching each individual syllable. The bottom line is that bouts of noisy, short syllables - such as in laughs - are among the most challenging non-speech sounds to synthesize.