How soundgen works (briefly)

The purpose is to start with a few control parameters (e.g., the intonation contour, the amount of noise, the number of syllables and their duration, etc.) and to generate a corresponding audio stream. Ignoring dependencies between control parameters and the procedure for the creation of polysyllabic vocalizations, the algorithm for generating a single voiced segment basically implements the standard source-filter model. The voiced component is generated as a sum of sine waves, one for each harmonic, and the noise component is generated as filtered white noise. Both components are then passed through a frequency filter simulating the effect of the vocal tract. This process can be conceptually divided into three stages:

  1. Generation of the harmonic component (glottal source). We "paint" the spectrogram of the glottal source based on the desired intonation contour and spectral envelope by specifying the frequencies, phases, and amplitudes of a number of sine waves, one for each harmonic of the fundamental frequency. If needed, we also add stochastic and non-linear effects at this stage: jitter and shimmer (random fluctuation in frequency and amplitude), subharmonics, slower random drift of control parameters, etc. Once the spectrogram is complete, we synthesize the corresponding waveform by generating and adding up as many sine waves as there are harmonics in the spectrum.
  2. Generation of the noise component. In addition to harmonic oscillations of the vocal cords, there are other sources of excitation, which may be generated as some form of noise. For example, aspiration noise is synthesized as white noise with some basic rolloff and added to the glottal source before formant filtering. It is similarly straightforward to add other types of noise, which may originate higher up in the vocal tract and thus display a different formant structure (e.g., high-frequency hissing, broadband clicks, etc.)
  3. Spectral filtering (formants and lip radiation). The vocal tract acts as a resonator that modifies the source spectrum by amplifying certain frequencies and dampening others. Just as we "painted" a spectrogram for the acoustic source in (1), we now "paint" a spectral filter with a specified number of stationary or moving formants. We then take a Fast Fourier transform of the generated waveform to convert it back to a spectrogram, multiply the latter by our filter, and then take an inverse Fast Fourier transform to go back to the time domain. This filtering can be applied to harmonic and noise components separately or - for noise sources close to the glottis - the harmonic component and the noise component can be added first and then filtered together.

    This page was last updated on Apr 29, 2018.