How soundgen works (briefly)

Acoustic analysis

The purpose of acoustic analysis is to describe the sound with a number of objective measures such as its intensity, fundamental frequency (perceived as pitch), spectral characteristics (perceived as voice quality), etc. The main function for acoustic analysis in soundgen is "analyze()". The core algorithm behind it is the short-time Fourier transform (STFT), which divides the sound into many short windowed frames and analyzes the spectrum of each frame. Soundgen also has functions for detecting temporal regularities ("segment()"), extracting the modulation spectrum ("modulationSpectrum()"), self-similarity matrices ("ssm()"), etc. Please see the vignette on acoustic analysis for details.
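
To make the idea of the STFT concrete, here is a minimal sketch in plain Python (not soundgen code, and not how "analyze()" is implemented). The function names, the Hann window, the window length, and the step size are all assumptions chosen for illustration. The signal is cut into overlapping frames, each frame is multiplied by a window, and a magnitude spectrum is computed for each frame.

```python
import cmath
import math


def hann(n):
    """Symmetric Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def dft_magnitude(frame):
    """Magnitude spectrum of one frame (naive DFT, bins 0..n/2)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def stft(signal, win_len=256, step=128):
    """Return one magnitude spectrum per overlapping windowed frame."""
    window = hann(win_len)
    frames = []
    for start in range(0, len(signal) - win_len + 1, step):
        frame = [signal[start + i] * window[i] for i in range(win_len)]
        frames.append(dft_magnitude(frame))
    return frames


# Demo: a 1 kHz sine sampled at 8 kHz (illustrative values)
sr = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sr) for t in range(2048)]
spec = stft(tone)

# The strongest bin of the first frame, converted to Hz
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sr / 256
print(peak_hz)  # 1000.0
```

Each row of `spec` describes one frame, so measures such as the frequency of the strongest spectral peak can be tracked over time. A naive DFT is used here only to keep the sketch short; a real implementation would use a fast Fourier transform.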

Sound editing

There is plenty of dedicated software for audio editing, such as Audacity, but when you are working with sound in R, it can be very convenient to edit your sounds using scripts. Soundgen offers basic functionality for tasks like fading in/out (manipulating the attack) or cross-fading two sounds, as well as more advanced features such as transplanting the amplitude envelope or the spectral envelope (formants) from one sound to another and filtering based on the modulation spectrum. There is currently no vignette on sound editing in soundgen, but all these functions are well documented, with code examples.
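
As an illustration of what the basic operations do to the samples, here is a minimal sketch of fading and cross-fading in plain Python. It is not soundgen's implementation: the function names `fade` and `cross_fade`, their arguments, and the linear fade shape are assumptions made for this example.

```python
def fade(samples, fade_in=0, fade_out=0):
    """Apply a linear fade-in and fade-out over the given numbers of samples."""
    out = list(samples)
    n = len(out)
    for i in range(min(fade_in, n)):
        out[i] *= i / fade_in            # gain rises from 0 towards 1
    for i in range(min(fade_out, n)):
        out[n - 1 - i] *= i / fade_out   # gain falls to 0 at the last sample
    return out


def cross_fade(a, b, overlap):
    """Join a and b, mixing the last `overlap` samples of a
    with the first `overlap` samples of b."""
    if overlap > min(len(a), len(b)):
        raise ValueError("overlap is longer than one of the sounds")
    mixed = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)      # weight of b rises across the overlap
        mixed.append((1 - w) * a[len(a) - overlap + i] + w * b[i])
    return list(a[:len(a) - overlap]) + mixed + list(b[overlap:])


# Demo on constant signals, so the gain curves are easy to read
faded = fade([1.0] * 10, fade_in=4, fade_out=4)
joined = cross_fade([1.0] * 8, [0.0] * 8, overlap=4)
print(faded)
print(joined)
```

The cross-faded result is shorter than the two inputs combined by `overlap` samples, because the overlapping region is shared by both sounds.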

Sound synthesis

Finally, soundgen offers parametric voice synthesis with the function "soundgen()". The purpose is to start with a few control parameters (e.g., the intonation contour, the amount of noise, the number of syllables and their duration) and to generate a corresponding audio stream. Ignoring dependencies between control parameters and the procedure for creating polysyllabic vocalizations, the algorithm for generating a single voiced segment implements the standard source-filter model. The voiced component is generated as a sum of sine waves, one for each harmonic, and the noise component is generated as filtered white noise. Both components are then passed through a frequency filter simulating the effect of the vocal tract. This process can be conceptually divided into three stages:

1. Generation of the harmonic component (glottal source). We "paint" the spectrogram of the glottal source based on the desired intonation contour and spectral envelope by specifying the frequencies, phases, and amplitudes of a number of sine waves, one for each harmonic of the fundamental frequency. If needed, we also add stochastic and non-linear effects at this stage: jitter and shimmer (random fluctuation in frequency and amplitude), subharmonics, slower random drift of control parameters, etc. Once the spectrogram is complete, we synthesize the corresponding waveform by generating and adding up as many sine waves as there are harmonics in the spectrum.
2. Generation of the noise component. In addition to harmonic oscillations of the vocal cords, there are other sources of excitation, which may be generated as some form of noise. For example, aspiration noise is synthesized as white noise with some basic rolloff and added to the glottal source before formant filtering. It is similarly straightforward to add other types of noise, which may originate higher up in the vocal tract and thus display a different formant structure (e.g., high-frequency hissing or broadband clicks).
3. Spectral filtering (formants and lip radiation). The vocal tract acts as a resonator that modifies the source spectrum by amplifying certain frequencies and damping others. Just as we "painted" a spectrogram for the acoustic source in (1), we now "paint" a spectral filter with a specified number of stationary or moving formants. We then take a fast Fourier transform of the generated waveform to convert it back to a spectrogram, multiply the latter by our filter, and take an inverse fast Fourier transform to go back to the time domain. This filtering can be applied to the harmonic and noise components separately, or, for noise sources close to the glottis, the two components can be added first and then filtered together.
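
The three stages can be sketched in a few dozen lines of plain Python. This is a heavily simplified illustration, not soundgen's actual code, and every name and number in it is an assumption: the fundamental frequency is constant, the rolloff is fixed at -6 dB per octave, the noise is unfiltered white noise, the formants are stationary peaks with a simple resonance shape, and the filter is applied with a single FFT over the whole signal rather than frame by frame.

```python
import cmath
import math
import random


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def ifft(spectrum):
    """Inverse FFT computed with the conjugate trick."""
    n = len(spectrum)
    return [v.conjugate() / n for v in fft([s.conjugate() for s in spectrum])]


def harmonic_source(f0, sr, n, rolloff_db=-6.0):
    """Stage 1: one sine wave per harmonic below the Nyquist frequency,
    with amplitude falling by rolloff_db per octave (assumed value)."""
    out = [0.0] * n
    h = 1
    while h * f0 < sr / 2:
        amp = 10 ** (rolloff_db * math.log2(h) / 20)
        for i in range(n):
            out[i] += amp * math.sin(2 * math.pi * h * f0 * i / sr)
        h += 1
    return out


def noise_source(n, level=0.05, seed=1):
    """Stage 2: white noise standing in for aspiration (no rolloff here)."""
    rng = random.Random(seed)
    return [level * rng.uniform(-1, 1) for _ in range(n)]


def formant_filter(n, sr, formants):
    """Stage 3: one gain per FFT bin. Each (frequency, bandwidth) pair adds
    a resonance peak; gains mirror around Nyquist so the output stays real."""
    gains = []
    for k in range(n):
        f = min(k, n - k) * sr / n
        g = 0.05  # small floor so that no frequency is removed completely
        for freq, bw in formants:
            g += 1 / (1 + ((f - freq) / (bw / 2)) ** 2)
        gains.append(g)
    return gains


def synthesize(f0=125.0, sr=16000, n=2048,
               formants=((750, 100), (1250, 150))):
    """Add harmonics and noise, then filter them together in the frequency domain."""
    source = [h + z for h, z in
              zip(harmonic_source(f0, sr, n), noise_source(n))]
    gains = formant_filter(n, sr, formants)
    filtered = [s * g for s, g in zip(fft(source), gains)]
    return [v.real for v in ifft(filtered)]


wave = synthesize()
```

Here the noise is added to the harmonic component before filtering, which corresponds to the case of a noise source close to the glottis. To give the noise its own formant structure, one would filter the two components with different gain curves and add them afterwards.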