# Intro

## Purpose

The function soundgen is intended for the synthesis of animal vocalizations, including human non-linguistic vocalizations like sighs, moans, screams, etc. It can also create non-biological sounds that require precise control over spectral and temporal modulations, such as special sound effects in computer games or acoustic stimuli for scientific experiments. Soundgen is NOT meant to be used for text-to-speech conversion. It can be adapted for this purpose, but existing specialized tools will probably serve better.

Soundgen uses a parametric algorithm, which means that sounds are synthesized de novo, and the output is completely determined by the values of control parameters, as opposed to concatenating or modifying existing audio recordings. Under the hood, the current version of soundgen generates and filters two sources of excitation: sine waves and white noise.

The rest of this vignette will unpack this last statement and demonstrate how soundgen can be used in practice. To simplify setting the control parameters and visualizing the output, the soundgen library includes an interactive Shiny app. To start the app, type soundgen_app() from R or try it online at cogsci.se/soundgen.html. To generate sounds from the console, use the function soundgen. Each section of the vignette focuses on a particular aspect of sound generation, both describing the relevant arguments of soundgen and explaining how they can be set in the Shiny app.

## Before you proceed: consider the alternatives

In R, there are at least three other packages that offer sound synthesis: tuneR, seewave, and phonTools. Both seewave and tuneR implement straightforward ways to synthesize pulses and square, triangular, or sine waves as well as noise with adjustable (linear) spectral slope. You can also create multiple harmonics with both amplitude and frequency modulation using seewave::synth() and seewave::synth2(). There is even a function available for adding formants and thus creating different vowels: phonTools::vowelsynth(). If this is ample for your needs, try these packages first.

So why bother with soundgen? First, it takes customization and flexibility of sound synthesis much further. You will appreciate this flexibility if your aim is to produce convincing biological sounds. And second, it’s a higher-level tool with dedicated separate subroutines for things like controlling the rolloff (relative energy of different harmonics), adding moving formants and antiformants, mixing harmonic and noise components, controlling voice changes over multiple syllables, adding stochasticity to imitate unpredictable voice changes common in biological sound production, and more. In other words, soundgen offers powerful control over low-level acoustic characteristics of synthesized sounds with the benefit of also offering transparent, meaningful high-level parameters intended for rapid and straightforward specification of whole bouts of vocalizing.

Because of this high-level control, you don’t really have to think about the math of sound synthesis in order to use soundgen (although if you do, that helps). This vignette also assumes that the reader has some training in phonetics or bioacoustics, particularly for sections on formants and subharmonics.

## Basic principles of sound synthesis in soundgen

Feel free to skip this section if you are only interested in using soundgen, not in how it works under the hood.

Soundgen’s credo is to start with a few control parameters (e.g., the intonation contour, the amount of noise, the number of syllables and their duration) and to generate a corresponding audio stream, which will sound like a biological vocalization (a bark, a laugh, etc.). The core algorithm for generating a single voiced segment implements the standard source-filter model (Fant, 1971). The voiced component is generated as a sum of sine waves and the noise component as filtered white noise, and both components are then passed through a frequency filter simulating the effect of the vocal tract. This process can be conceptually divided into three stages:

1. Generation of the harmonic component (glottal source). At this crucial stage, we “paint” the spectrogram of the glottal source based on the desired intonation contour and spectral envelope by specifying the frequencies, phases, and amplitudes of a number of sine waves, one for each harmonic of the fundamental frequency. If needed, we also add stochastic and non-linear effects at this stage: jitter and shimmer (random fluctuation in frequency and amplitude), subharmonics, slower random drift of control parameters, etc. Once the spectrogram “painting” is complete, we synthesize the corresponding waveform by generating and adding up as many sine waves as there are harmonics in the spectrum.

Note that soundgen currently implements only sine wave synthesis. This is different from modeling glottal cycles themselves, as in phonetic models and some popular text-to-speech engines (e.g. Klatt, 1980). In future versions of soundgen there may be an option to use a particular parametric model of the glottal cycle as excitation source as an alternative to generating a separate sine wave for each harmonic.

2. Generation of the noise component (aspiration, hissing, etc.). In addition to harmonic oscillations of the vocal cords, there are other sources of excitation, which may be synthesized as some form of noise. For example, aspiration noise may be synthesized as white noise with rolloff -6 dB/octave (Klatt & Klatt, 1990) and added to the glottal source before formant filtering. It is similarly straightforward to add other types of noise, which may originate higher up in the vocal tract and thus display a different formant structure from the glottal source (e.g., high-frequency hissing, broadband clicks for tongue smacking, etc.).

Some form of noise is synthesized in most sound generators. In soundgen noise is created in the frequency domain (i.e., as a spectrogram) and then converted into a time series via inverse FFT.

3. Spectral filtering (formants and lip radiation). The vocal tract acts as a resonator that modifies the source spectrum by amplifying certain frequencies and dampening others. In speech, time-varying resonance frequencies (formants) are responsible for the distinctions between different vowels, but formants are also ubiquitous in animal vocalizations. Just as we “painted” a spectrogram for the acoustic source in (1), we now “paint” a spectral filter with a specified number of stationary or moving formants. We then take a fast Fourier transform (FFT) of the generated waveform to convert it back to a spectrogram, multiply the latter by our filter, and then take an inverse FFT to go back to the time domain. This filtering can be applied to harmonic and noise components separately or - for noise sources close to the glottis - the harmonic component and the noise component can be added first and then filtered together.

Note that this FFT-mediated method of adding formants is different from the more traditional convolution, but with multiple formants it is both considerably faster and (arguably) more intuitive. If you are wondering why we should bother to do iFFT and then again FFT before filtering the voiced component, rather than simply applying the filter to the rolloff matrix before the iFFT, this is an annoying consequence of some complexities of the temporal structure of a bout, especially of applying non-stationary filters (moving formants) that span multiple syllables. With noise, however, this extra step can be avoided, and we only do iFFT once.
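The three stages above can be sketched in a few lines of base R. This is a simplified illustration under stated assumptions, not soundgen's actual code: the fundamental frequency is constant, the whole sound is filtered with a single FFT rather than frame by frame, formants are approximated by Gaussian bumps on a dB scale, and all variable names and parameter values are made up for the example.

```r
# Minimal source-filter sketch in base R (illustrative only, not soundgen's code)
set.seed(1)
samplingRate <- 16000                  # Hz
n <- round(samplingRate * 0.3)         # 300 ms = 4800 samples
tSec <- (seq_len(n) - 1) / samplingRate

# Stage 1: harmonic source = sum of sine waves, one per harmonic
f0 <- 200                              # fundamental frequency, Hz (kept constant)
rolloffDb <- -12                       # assumed rolloff, dB/octave
nHarm <- floor((samplingRate / 2 - 1) / f0)  # harmonics below Nyquist
voiced <- numeric(n)
for (h in seq_len(nHarm)) {
  ampDb <- rolloffDb * log2(h)         # harmonic h lies log2(h) octaves above f0
  voiced <- voiced + 10 ^ (ampDb / 20) * sin(2 * pi * f0 * h * tSec)
}

# Stage 2: noise source built in the frequency domain, then inverse FFT
freqs <- (seq_len(n) - 1) * samplingRate / n       # frequency of each FFT bin
freqsFolded <- pmin(freqs, samplingRate - freqs)   # mirror bins above Nyquist
noiseAmp <- 10 ^ (-6 * log2(pmax(freqsFolded, 100) / 100) / 20)  # -6 dB/octave above 100 Hz
noiseSpec <- noiseAmp * exp(1i * runif(n, 0, 2 * pi))            # random phases
noiseSpec[1] <- 0                                  # drop the DC component
noise <- Re(fft(noiseSpec, inverse = TRUE)) / n    # real part taken as a shortcut

# Mix the two sources (noise close to the glottis is filtered together with voicing)
source <- voiced / max(abs(voiced)) + 0.1 * noise / max(abs(noise))

# Stage 3: spectral filter with three stationary formants
formantFreqs <- c(800, 1300, 2900)     # Hz, assumed values
formantWidth <- 100                    # Hz, assumed
filterDb <- rep(0, n)
for (f in formantFreqs) {
  filterDb <- filterDb + 20 * exp(-(freqsFolded - f) ^ 2 / (2 * formantWidth ^ 2))
}
filterAmp <- 10 ^ (filterDb / 20)      # symmetric around Nyquist, so the output stays real
filtered <- Re(fft(fft(source) * filterAmp, inverse = TRUE)) / n
```

The vector `filtered` is a waveform with 20 dB peaks at the three formant frequencies. Moving formants would require a time-varying filter applied to successive frames, which is where the extra FFT round trip described above comes in.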

Having briefly looked at the fundamental principles of sound generation, we proceed to control parameters. The aim of the following presentation is to offer practical tips on using soundgen. For further information on more fundamental principles of acoustics and sound synthesis, you may find the vignettes in seewave very helpful, and look out for the upcoming book on sound synthesis in R by Jerome Sueur, the author of the seewave package. Some essential references are also listed at the end of this vignette, especially those sources that inspired particular routines in soundgen.

# Using soundgen

## Where to start

To generate a sound, you can either type soundgen_app() to open an interactive Shiny app or call soundgen() from the R console with manually specified parameters. The object presets contains a collection of presets that demonstrate some of the possibilities.

## Audio playback

Audio playback may fail, depending on your platform and installed software. Soundgen relies on the tuneR library for audio playback, via a wrapper function called playme() that accepts both Wave objects and simple numeric vectors. If soundgen(play = TRUE) throws an error, make sure the audio can be played before you proceed with using soundgen. To do so, save some sound as a vector first: sound = soundgen(play = FALSE) or even simply sound = sin(1:10000). Then try to find a way to play this vector sound. You may need to change the default player in tuneR or install additional software. See the seewave vignette on sound input/output for an in-depth discussion of audio playback in R. Some tips are also available here.

Because of possible errors, audio playback is turned off by default in the rest of this vignette. To turn it on without changing any code, simply set the variable playback = TRUE:

playback = c(TRUE, FALSE)[2]

## From the console

The basic workflow from the R console is as follows:

library(soundgen)
s = soundgen(play = playback)  # default sound: a short [a] by a male speaker
# 's' is a numeric vector - the waveform. You can save it, play it, plot it, ...

# names(presets)  # speakers

# Combining two sounds

To achieve a complex vocalization, sometimes it may be necessary - or easier - to synthesize two or more sounds separately and then combine them. If the components are strictly consecutive, you can simply concatenate them with c(). If there is no silence in between, it is safer to use crossFade(), otherwise there can be transients like clicks between the two sounds:

par(mfrow = c(1, 2))
sound1 = sin(2 * pi * 1:5000 * 100 / 16000) # pure tone, 100 Hz
sound2 = sin(2 * pi * 1:5000 * 200 / 16000) # pure tone, 200 Hz

# simple concatenation
comb1 = c(sound1, sound2)
# playme(comb1)  # note the click
plot(comb1[4000:5500], type = 'l')  # note the abrupt transition, which creates the click
# spectrogram(comb1, 16000)

# cross-fade
comb2 = crossFade(sound1, sound2, samplingRate = 16000, crossLen = 50)
# playme(comb2)  # no click
plot(comb2[4000:5500], type = 'l')  # gradual transition
# spectrogram(comb2, 16000)
par(mfrow = c(1, 1))
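To see why cross-fading removes the click, here is a minimal base-R sketch of a linear cross-fade. The helper crossFadeSketch() is hypothetical and is not soundgen's crossFade(); in particular, it takes the overlap length in samples, which is an assumption made for this example.

```r
# Hypothetical helper: linear cross-fade over crossLen samples (not soundgen's crossFade)
crossFadeSketch <- function(x, y, crossLen) {
  stopifnot(crossLen >= 1, crossLen <= length(x), crossLen <= length(y))
  fadeOut <- seq(1, 0, length.out = crossLen)  # weight of x over the overlap
  fadeIn <- 1 - fadeOut                        # weight of y over the overlap
  nx <- length(x)
  overlap <- x[(nx - crossLen + 1):nx] * fadeOut + y[1:crossLen] * fadeIn
  c(x[seq_len(nx - crossLen)],                 # untouched start of x
    overlap,                                   # weighted mix of both sounds
    y[seq_len(length(y) - crossLen) + crossLen])  # untouched rest of y
}

tone1 <- sin(2 * pi * 1:5000 * 100 / 16000)  # pure tone, 100 Hz
tone2 <- sin(2 * pi * 1:5000 * 200 / 16000)  # pure tone, 200 Hz
joined <- c(tone1, tone2)                       # abrupt jump at the boundary
faded <- crossFadeSketch(tone1, tone2, crossLen = 800)  # 800 samples = 50 ms at 16 kHz
# the largest sample-to-sample jump is much smaller after cross-fading
c(max(abs(diff(joined))), max(abs(diff(faded))))
```

The cross-faded vector is shorter than the plain concatenation by crossLen samples, because the two sounds overlap during the transition.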

If you want the two sounds to overlap, you can use addVectors(). Note that in this case cross-fading is not appropriate, so it may be safer to fade the sounds in and out to soften the attack. For example, here is how to add chirping of birds in the background:

sound1 = soundgen(sylLen = 700, pitchAnchors = 250:180, formants = 'aaao',
addSilence = 100, play = playback)
sound2 = soundgen(nSyl = 2, sylLen = 150, pitchAnchors = 4300:2200, attackLen = 10,
formants = NA, temperature = 0, addSilence = 0, play = playback)

insertionTime = .1 + .15  # silence + 150 ms
samplingRate = 16000
insertionPoint = insertionTime * samplingRate
comb = addVectors(sound1,
sound2 * .05,  # to make sound2 quieter relative to sound1
insertionPoint = insertionPoint)
# NB: soundgen softens attack by default, so no clicks are produced by overlapping
# playme(comb)
spectrogram(comb, 16000, windowLength = 10, ylim = c(0, 5), contrast = .5, colorTheme = 'seewave')
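The idea of overlaying one sound on another at a given time point, with fades to avoid clicks, can be sketched in base R as follows. Both helpers, overlaySketch() and fadeInOut(), are hypothetical illustrations written for this example; they are not soundgen functions and may differ from addVectors() in their details.

```r
# Hypothetical helper: add y to x, with y starting at sample insertionPoint of x
overlaySketch <- function(x, y, insertionPoint) {
  insertionPoint <- round(insertionPoint)
  stopifnot(insertionPoint >= 1)
  out <- numeric(max(length(x), insertionPoint + length(y) - 1))
  out[seq_along(x)] <- x
  idx <- insertionPoint + seq_along(y) - 1
  out[idx] <- out[idx] + y
  out
}

# Hypothetical helper: linear fade-in and fade-out over fadeLen samples at each end
fadeInOut <- function(x, fadeLen) {
  stopifnot(fadeLen >= 1, 2 * fadeLen <= length(x))
  ramp <- seq(0, 1, length.out = fadeLen)
  n <- length(x)
  x[1:fadeLen] <- x[1:fadeLen] * ramp
  x[(n - fadeLen + 1):n] <- x[(n - fadeLen + 1):n] * rev(ramp)
  x
}

sr <- 16000
background <- sin(2 * pi * 1:8000 * 150 / sr)              # 500 ms, 150 Hz
chirp <- fadeInOut(sin(2 * pi * 1:1600 * 3000 / sr), 160)  # 100 ms, 3 kHz, 10 ms fades
mix <- overlaySketch(background, 0.05 * chirp, insertionPoint = 0.25 * sr)
```

Because the overlaid sound starts and ends at zero amplitude, the sum has no abrupt transitions at the insertion point.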

# Morphing two sounds

Sometimes it is desirable to combine characteristics of two different stimuli, producing some kind of intermediate form - a hybrid or blend. This technique is called morphing, and it is employed regularly and successfully with visual stimuli, but not so often with sounds, because it turns out to be rather tricky to morph audio. Since soundgen creates sounds parametrically, however, morphing becomes much more straightforward: all we need to do is define the rules for interpolating between all control parameters. For example, say we have sound A (100 ms) and sound B (500 ms), which only differ in their duration. To morph them, we could generate five otherwise identical sounds that are 100, 200, 300, 400, and 500 ms long, giving us the originals and three equidistant intermediate forms - that is, if we assume that linear interpolation is the natural way to take perceptually equal steps between parameter values.

In practice this assumption is often unwarranted. For example, the natural scale for pitch is log-transformed: the perceived distance between 100 Hz and 200 Hz is 12 semitones, while from 200 Hz to 300 Hz it is only 7 semitones. To make pitch values equidistant, we would need to think in terms of semitones, not Hz. For other soundgen parameters it is hard to make an educated guess about the natural scale, so the most appropriate interpolation rules remain obscure. For best results, morphing should be performed by hand, pre-testing each parameter of interest and creating the appropriate formulas for each morph. However, for a “quick fix” there is a built-in function, morph.
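The semitone arithmetic is easy to check in base R. The helper hzToSemitones() is a hypothetical name introduced for this example; the formula is the standard definition of a semitone as one twelfth of an octave.

```r
# Distance in semitones between two frequencies (12 semitones per octave)
hzToSemitones <- function(f1, f2) 12 * log2(f2 / f1)

hzToSemitones(100, 200)  # 12 semitones (one octave)
hzToSemitones(200, 300)  # about 7.02 semitones

# Five pitch values between 100 and 400 Hz:
pitchLin <- seq(100, 400, length.out = 5)                  # equal steps in Hz
pitchLog <- 2 ^ seq(log2(100), log2(400), length.out = 5)  # equal steps in semitones
round(pitchLin)  # 100 175 250 325 400
round(pitchLog)  # 100 141 200 283 400
```

The log-spaced values are 6 semitones apart at every step, whereas the linearly spaced values take progressively smaller perceptual steps as pitch rises.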

morph takes two calls to soundgen (as a character string or a list of arguments) and creates several morphs using linear interpolation for all parameters except pitch and formant frequencies, which are log-transformed prior to interpolation and then exponentiated to go back to Hz. The morphing algorithm can also deal with arbitrary contours, either by taking a weighted mean of each curve (method = 'smooth') or by attempting to match and morph individual anchors (method = 'perAnchor'):

a = data.frame(time=c(0, .2, .9, 1), value=c(100, 110, 180, 110))
b = data.frame(time=c(0, .3, .5, .8, 1), value=c(300, 220, 190, 400, 350))
par(mfrow = c(1, 3))
plot (a, type = 'b', ylim = c(100, 400), main = 'Original curves')
points (b, type = 'b', col = 'blue')
m = soundgen:::morphDF(a, b, nMorphs = 15, method = 'smooth',
plot = TRUE, main = 'Morphing curves')
m = soundgen:::morphDF(a, b, nMorphs = 15, method = 'perAnchor',
plot = TRUE, main = 'Morphing anchors')
par(mfrow = c(1, 1))
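For intuition, the weighted-mean approach can be sketched in base R. The function morphCurveSketch() below is a hypothetical simplification, not the code behind morphDF(): it resamples both contours on a common time grid with approx(), interpolates on a log scale (as described above for pitch), and returns one contour per morph.

```r
# Hypothetical sketch of morphing two contours by taking a weighted mean
morphCurveSketch <- function(a, b, nMorphs = 5, nPoints = 50) {
  tGrid <- seq(0, 1, length.out = nPoints)
  # resample both contours on the same time grid, on a log scale
  la <- approx(a$time, log2(a$value), xout = tGrid)$y
  lb <- approx(b$time, log2(b$value), xout = tGrid)$y
  weights <- seq(0, 1, length.out = nMorphs)  # 0 = pure a, 1 = pure b
  lapply(weights, function(w) {
    data.frame(time = tGrid, value = 2 ^ ((1 - w) * la + w * lb))
  })
}

curveA <- data.frame(time = c(0, .2, .9, 1), value = c(100, 110, 180, 110))
curveB <- data.frame(time = c(0, .3, .5, .8, 1), value = c(300, 220, 190, 400, 350))
morphs <- morphCurveSketch(curveA, curveB, nMorphs = 5)
# starting values of the five morphs: from 100 Hz to 300 Hz in equal log steps
sapply(morphs, function(m) m$value[1])
```

The middle morph starts at the geometric mean of the two starting values (about 173 Hz) rather than at the arithmetic mean of 200 Hz, which is the effect of interpolating on a log scale.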

Here is an example of morphing the default neutral [a] into a dog’s bark:

m = morph(formula1 = list(repeatBout = 2),
# equivalently: formula1 = 'soundgen(repeatBout = 2)',
formula2 = presets$Misc$Dog_bark,
nMorphs = 5, playMorphs = playback)
# use $formulas to access formulas for each morph, $sounds for waveforms
# m$formulas[[4]]
# playme(m$sounds[[3]])

TIP Morphing a completely unvoiced sound with a voiced sound is not properly implemented. Add a very quiet voiced component to avoid glitches.

# Matching an existing sound

When synthesizing a new sound with the function soundgen(), a serious challenge is to find the values of all its many arguments that will together produce the result you want. If the sound you are trying to create exists only in your imagination, there is nothing for it but to tinker with the controls until a satisfactory result is achieved. However, if you have an existing audio recording that you wish to duplicate, there are two ways to simplify the task of finding the optimal values of control parameters: (1) perform acoustic analysis of the target sound to guide the choice of soundgen settings, and (2) automatically optimize some soundgen settings to match the target. Below are some tools and tips for doing this.

DISCLAIMER: what follows is work in progress, not guaranteed to produce the desired results. Above all, don’t expect a magic bullet that will completely solve the matching problem without any manual intervention.

## Matching by acoustic analysis

The first thing you might want to do with your target audio recording is to analyze it acoustically and extract precise measurements of syllable number and duration, pitch contour, and formant structure. You can use any tool of your choice to do this, including soundgen’s functions segment and analyze, which are described in the vignette on acoustic analysis. Once you have the measurements, you can convert them into appropriate values of soundgen arguments. An even easier solution is to use the function matchPars without optimization (maxIter = 0), which will perform a quick acoustic analysis and translate the results into soundgen settings, as follows:

target = soundgen(repeatBout = 3, sylLen = 120, pauseLen = 70,
pitchAnchors = data.frame(time = c(0, 1), value = c(300, 200)),
rolloff = -5, play = playback)  # we hope to reproduce this sound

m1 = matchPars(target = target,
samplingRate = 16000,
maxIter = 0)  # no optimization, only acoustic analysis
## [1] "Failed to improve fit to target! Try increasing maxIter."
# ignore the warning about failing to improve the fit: we don't want to optimize yet

# m1$pars contains a list of soundgen settings
cand1 = do.call(soundgen, c(m1$pars, list(play = playback, temperature = 0)))

Without optimization, we simply match soundgen parameters based on acoustic analysis. In particular, matchPars() calls segment() and analyze() to get some basic descriptives of the target sound and to choose the appropriate settings for soundgen based on these measurements. If you are very lucky, this might in fact accurately match the temporal structure, pitch, and (stationary) formants of your target.

At this point I would really recommend copy-pasting your call to soundgen into the Shiny app and adjusting these settings in an interactive environment, rather than from the console. For example, to use the parameters in m1$pars, type call('soundgen', m1$pars), remove the “list()” part from the output, and you have your formula:

call('soundgen', m1$pars)
# remove "list(...)" to get your call to soundgen():
soundgen(samplingRate = 16000, nSyl = 3, sylLen = 79, pauseLen = 114,
  pitchAnchors = list(time = c(0, 0.5, 1), value = c(274, 253, 216)),
  formants = list(f1 = list(time = 0, freq = 821, amp = 30, width = 122),
  f2 = list(time = 0, freq = 1266, amp = 30, width = 36),
  f3 = list(time = 0, freq = 2888, amp = 30, width = 117)))

Load this formula into the Shiny app. To do so, run soundgen_app(), click “Load new preset” on the right-hand side of the screen, copy-paste the formula above (no quotes), and click “Update sliders”. If all goes well, all the settings should be updated, so that clicking “Generate” should produce the same sound as cand1 above. Now you can tinker with the settings in the app, improving them further.

TIP It can be very helpful to have the Shiny app running, while also having access to R console. Start two R sessions to achieve that.

## Matching by optimization

Let’s assume that you have a working version of your candidate sound, which resembles the target in terms of its temporal structure, pitch contour, and perhaps even the formant structure. You can also add some non-tonal noise manually in the app, experiment with effects like subharmonics and jitter, and make other modifications. But the number of possible combinations of soundgen settings is enormous, making the process of matching the target sound very time-consuming. You can sometimes speed things up by using formal optimization. The same function as above, matchPars, offers a simple way to optimize several parameters by randomly varying their values, generating the corresponding sound, and comparing it with the target. The currently implemented version uses simple hill climbing and is best regarded as experimental.

m2 = matchPars(target = target,
  samplingRate = 16000,
  pars = 'rolloff',
  maxIter = 100)

# rolloff should be moving from default (-12) to target (-5):
sapply(m2$history, function(x) {
  paste('Rolloff:', round(x$pars$rolloff, 1),
  '; fit to target:', round(x$sim, 2))
})
cand2 = do.call(soundgen, c(m2$pars, list(play = playback, temperature = 0)))
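The hill-climbing idea itself can be illustrated with a toy example in base R. This is a sketch under simplifying assumptions, not the implementation of matchPars: the "synthesizer" harmonicAmps() and the similarity measure fit() are hypothetical stand-ins that compare harmonic amplitudes directly instead of generating and analyzing audio.

```r
set.seed(42)

# Toy "synthesizer": amplitudes of 10 harmonics for a given rolloff (dB/octave)
harmonicAmps <- function(rolloff) 10 ^ (rolloff * log2(1:10) / 20)
targetAmps <- harmonicAmps(-5)  # the target we pretend not to know

# Similarity to target: negative squared error on a dB scale (higher is better)
fit <- function(rolloff) {
  -sum((20 * log10(harmonicAmps(rolloff)) - 20 * log10(targetAmps)) ^ 2)
}

current <- -12                  # starting value of the parameter
currentFit <- fit(current)
history <- current
for (i in 1:100) {
  candidate <- current + rnorm(1, sd = 1)  # random perturbation
  candidateFit <- fit(candidate)
  if (candidateFit > currentFit) {         # keep the change only if the fit improves
    current <- candidate
    currentFit <- candidateFit
    history <- c(history, current)
  }
}
round(history, 1)  # accepted values, drifting from -12 towards -5
```

With a single parameter and a smooth similarity measure, this procedure converges quickly. With several interacting parameters and a noisy measure of similarity between real sounds, hill climbing can easily get stuck, which is one reason to treat the automatic optimization as experimental.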

DISCLAIMER: my preferred method of matching soundgen parameters is manual. I open the target sound in Audacity (check both waveform and spectrogram view, plot spectrum, adjust spectrogram settings) and work in soundgen_app(), adjusting first syllable length, then pitch contour, formants, rolloff, noise, amplitude envelope, nonlinear effects, …

# References

Fant, G. (1971). Acoustic theory of speech production: with calculations based on X-ray studies of Russian articulations (Vol. 2). Walter de Gruyter.

Klatt, D. H. (1980). Software for a cascade/parallel formant synthesizer. The Journal of the Acoustical Society of America, 67(3), 971-995.

Klatt, D. H., & Klatt, L. C. (1990). Analysis, synthesis, and perception of voice quality variations among female and male talkers. The Journal of the Acoustical Society of America, 87(2), 820-857.

Fitch, W. T., Neubauer, J., & Herzel, H. (2002). Calls out of chaos: the adaptive significance of nonlinear phenomena in mammalian vocal production. Animal Behaviour, 63(3), 407-418.

Moore, R. K. (2016). A Real-Time Parametric General-Purpose Mammalian Vocal Synthesiser. In INTERSPEECH (pp. 2636-2640).

Sueur, J. (Forthcoming). Sound in R. Springer.

Wilden, I., Herzel, H., Peters, G., & Tembrock, G. (1998). Subharmonics, biphonation, and deterministic chaos in mammal vocalization. Bioacoustics, 9(3), 171-196.