Reproducing a target sound with soundgen

A tutorial on manually matching control parameters to make a synthetic copy of an existing vocalization

Andrey Anikin


1 Purpose

It is often useful to reproduce an existing vocalization, in effect obtaining its parametric model. The task is to find the soundgen settings that will achieve the desired result. For example, let’s take one of the sounds included in the online demo of human vocalizations, a female moan or sigh of disgust called “moan_180”.

Here is the original recording:

And here is its synthetic version, which is a close (though obviously imperfect) approximation of the original:

The R code that generates the synthetic version in the demo is as follows:

s = soundgen(
  sylLen = 380,
  attackLen = c(15, 50),
  pitch = list(time = c(0, 85, 320, 380),
               value = c(280, 310, 240, 250)),
  rolloff = c(-2, -7), rolloffOct = -1,
  jitterDep = list(time = c(0, 50, 380), value = c(3, .05, .05)),
  formants = list(f1 = list(freq = 720, width = 150), 
                  f2 = 1215, f3 = 2900, 
                  f4 = 4100, f5 = 6000, f6 = 7300),
  formantCeiling = 5, formantWidth = 1.5, formantDep = 1.25,
  mouth = c(.5, .5, .5, .7),
  noise = list(time = c(0, 250, 400, 680),
               value = c(-35, -25, -20, -40)),
  rolloffNoise = 0,
  temperature = 0.001,
  addSilence = 0, samplingRate = 22050, pitchSamplingRate = 22050,
  play = T, plot = T, osc = T, ylim = c(0, 10)
)

Now the tricky question is: how do we get from the original sound to this code? As briefly described in the vignette on sound synthesis, I recommend doing so manually. The rest of this tutorial shows how.

DISCLAIMER: the following is a very personal take on acoustic analysis with a narrow focus on finding the most appropriate soundgen parameters. Please do not treat it as a definitive guide or a way to reveal the “ground truth” of the mentioned acoustic characteristics!

2 Example 1: breathy moan

Let’s take this same moan as the first walk-through example.

2.1 Know thy enemy

First let’s take a long, hard look at the sound we are trying to recreate. Open it in some interactive audio editor. I prefer Audacity, but any program will do, as long as it supports both waveform and spectrogram views. In Audacity, switch to spectrogram view (window size 1024 is about right for a sound with a sampling rate of 44100 Hz - ~23 ms, narrow-band spectrogram), zoom in on the most interesting region of the frequency axis (here 0 to 8 kHz), and adjust the time zoom level to your liking. Here is our target, then! Let’s listen to it a few times and examine the spectrogram and oscillogram:

spectrogram('files/moan_180.wav', ylim = c(0, 8),
            osc_dB = T, heights = c(1, 1)) 

# Download link:
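
The ~23 ms figure quoted above for a 1024-point window follows directly from the sampling rate. As a minimal sketch (window_ms() is a hypothetical helper written for this illustration, not a soundgen function), the conversion can be checked in base R:

```r
# Convert an FFT window size in samples to its duration in milliseconds.
# window_ms() is a hypothetical helper for illustration, not part of soundgen.
window_ms = function(windowSamples, samplingRate) {
  windowSamples / samplingRate * 1000
}

window_ms(1024, 44100)  # about 23.2 ms: the narrow-band setting mentioned above
window_ms(512, 22050)   # the same duration at half the sampling rate
```

If the target was recorded at a different sampling rate, scaling the window size in proportion keeps the time resolution of the spectrogram comparable.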

The first thing to notice is that this moan consists of two parts: the first ~400 ms are voiced (note the clearly visible harmonics), and then for ~300 ms we have a voiceless breathing sound (only formants are visible - no harmonics). These visible formants in the voiceless fragment are going to make our job much easier. In fact, we have two causes for celebration: (1) the formant structure is roughly comparable in the voiced and voiceless fragments (i.e., we are dealing with a relaxed [h] sound originating close to the glottis), so it is not absolutely necessary to specify the formants separately for the voiced and voiceless components; and (2) while formants may be hard to detect in purely harmonic sounds, the presence of spectral noise makes them much more prominent.

2.2 Temporal structure

With such a simple target vocalization, we can probably create a decent synthetic approximation in one go, without breaking it into several separately synthesized bouts, overlaid acoustic layers, or post-processing. A good way to start is to measure syllable length - separately for the voiced and unvoiced component. The length of the voiced component is about 360-380 ms (we can add a bit for the fade-in/out):

TIP: switch to waveform view in Audacity to make precise temporal measurements; don’t trust the spectrogram! You can go back and forth between waveform and spectrogram to make sure you have identified the right fragment.

Since the sylLen parameter in soundgen regulates the length of voiced fragments only, we set it to 380 ms, and then we use the noise parameter to specify when turbulent noise (breathing) should be added. The unvoiced fragment seems to start about 20 ms before the voiced fragment, but we’ll ignore that, since it could be an artifact. There is a burst of noise on the spectrogram as the voicing begins, but it seems to be related to irregularities in the voiced component, not breathing. After that the sound is mostly tonal, but the breathing noise grows gradually towards the end of the voiced fragment, jumps rather abruptly to about -20 dB at 350-360 ms, and then fades out linearly to -40 dB at 360 + 340 = 700 ms. Where did the numbers -20 and -40 dB come from? From a visual inspection of the waveform on a dB scale on the oscillogram, plotted directly in R or in Audacity:

## osc_dB is deprecated; please use osc(dB = TRUE) isntead
abline(h = 60, col = 'blue')
abline(h = 40, col = 'red')
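
To get a feel for what such dB values mean, it can help to convert them to linear amplitude ratios and to see what level a set of anchors implies between the measured points. The sketch below uses the noise anchors from the demo code in section 1. Both dB2amp() and noise_at() are hypothetical helpers written for this illustration, and the straight-line interpolation is an assumption: soundgen's own smoothing between anchors may well differ.

```r
# Noise anchors from the demo code: time in ms, level in dB
anchors = data.frame(
  time = c(0, 250, 400, 680),
  value = c(-35, -25, -20, -40)
)

# dB to linear amplitude ratio (illustrative helper, not a soundgen function)
dB2amp = function(dB) 10 ^ (dB / 20)
anchors$amplitude = dB2amp(anchors$value)
anchors  # -20 dB is 1/10 of full amplitude, -40 dB is 1/100

# Assumption: straight-line interpolation in dB between neighboring anchors
noise_at = function(t) approx(anchors$time, anchors$value, xout = t)$y
noise_at(540)  # noise level halfway through the final fade-out
```

In other words, the fade-out from -20 to -40 dB corresponds to a tenfold drop in amplitude, which is why the tail of the breathing noise is barely visible on a linear waveform.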

For now, we can use these values in dB - and a bit of guesswork - for setting the preliminary noise. We can always readjust them later, once other settings have been finalized. Finally, the sound starts with the release of a glottal stop, making the attack very sharp. So the first approximation might be something like this:

s1_1 = soundgen(
  # MISC
  temperature = .001,  # for reproducibility in this example
  addSilence = 0,  # easier to synchronize with target
  samplingRate = 44100,  # same as in target for ease of comparison
  play = T, plot = T,   # play & plot the result
  osc_dB = T, heights = c(1, 1), ylim = c(0, 8),  # plotting pars
  sylLen = 380,  # length of the voiced fragment
  noise = list(
    time = c(0, 250, 400, 680),    # noise amplitude defined at these time points
    value = c(-35, -25, -20, -40)  # noise amplitude, dB
  ),
  attackLen = c(15, 50)  # 15 ms at onset, default 50 ms at offset
)

# seewave::savewav(s1_1, f = 44100, filename = 's1_1.wav')
# system('ffmpeg -y -i s1_1.wav files/s1_1.ogg')

Not too bad for a start. But why does that sound like a man? Well, we didn’t specify an intonation contour and formants, so soundgen falls back on the default values. And these default values happen to be based on a recording of the voice of the developer, who happens to be a man. To sound like the woman in the original recording, we have to match her intonation (fundamental frequency, pitch) and formant structure (transfer function of the vocal tract).

TIP: don’t be in a hurry to use the ampl parameter in order to force the synthetic sound into the right amplitude envelope. Set formant transitions and dynamic rolloff changes first, because the envelope also depends on them

2.3 Intonation (pitch)

I discuss some methods of automatic pitch tracking - and its perils - in the vignette on acoustic analysis. You are welcome to use your favorite pitch tracker, like the one in Praat, or run soundgen::analyze() and get pitch values from its output. As before, I believe a manual approach is safer and ultimately faster. For example, in this case automatic algorithms may struggle with the initial rapid rise of f0:

a = analyze('files/moan_180.wav', 
            windowLength = 30, overlap = 80,
            pitchMethods = c('autocor', 'dom'),
            ylim = c(0, 2))

Instead of trying to fine-tune pitch trackers, we can just estimate a few values of the fundamental frequency (f0) - enough to define a reasonable approximation of the original curve - directly from the spectrogram. Say, we might begin by finding the approximate peak of f0. In this case, it seems like harmonics peak at about 100 ms. In Audacity, we make a small selection around this time point and click “Analyze / Plot spectrum…” to obtain a spectral slice: