Soundgen is an open-source algorithm for synthesizing human non-speech sounds and animal vocalizations in a user-friendly way, using a limited number of acoustically meaningful parameters. It is primarily intended for psychological and biological research, but it may also be useful for human-machine interaction, since it provides a straightforward way to reproduce existing emotional vocalizations or to generate a wide variety of new ones de novo. Soundgen is published on CRAN as an R package, which also offers tools for acoustic analysis that were used in all my publications.

Bug reports, criticisms and suggestions are all warmly welcome! Soundgen is written and maintained by Andrey Anikin. You can reach me at rty.anik / at / rambler.ru



How to install it

The current version is written in R, which is a popular general-purpose programming language. Before installing soundgen, make sure you have installed R and RStudio. To install soundgen, run install.packages('soundgen') in the R console.

You can also install the most recent development version, or download the source code, from GitHub. Installing from GitHub requires devtools and soundgen's dependencies, which you can install by running:
install.packages(c("devtools", "tuneR", "seewave", "phonTools", "zoo", "shiny", "shinyBS", "reshape2", "mvtnorm", "plyr", "dtw"))

Known issues: the crucial dependency "seewave" may be tricky to install on Mac OS X. In case of issues with seewave, please refer to http://rug.mnhn.fr/seewave/

How to use it

The easiest way to get a feel for the program is to use the interactive Shiny app, which you can open in a browser by running library(soundgen) followed by soundgen_app(). NB: there is no sound in Safari! Please use Firefox or Chrome. The app is also available online. Server time is currently very limited; for extensive use, please install the package and run it locally!

A detailed tutorial on sound generation with soundgen is provided in the vignette. To generate sounds from the R console, call the soundgen() function with the desired parameters. Soundgen relies on the "tuneR" package for audio playback, which may initially fail, depending on your platform and installed software. If "soundgen(play = TRUE)" throws an error, you may need to change the default player in "tuneR" or install additional software. See the seewave vignette on sound input/output for an in-depth discussion of audio playback in R. Some tips are also available here.
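If playback fails, one workaround is to generate the sound without playing it and then either point tuneR at a command-line player or save the waveform to a WAV file. This is a minimal sketch that assumes soundgen() accepts the samplingRate and play arguments and returns the waveform as a numeric vector. The player names are assumptions about your system ("afplay" ships with macOS, "aplay" comes with ALSA on many Linux distributions), and the output file name is hypothetical.

```r
library(soundgen)
library(tuneR)

# Generate a sound without attempting playback
# (assumes soundgen() returns the waveform as a numeric vector)
s <- soundgen(samplingRate = 16000, play = FALSE)

# Rescale to the 16-bit integer range expected by a 16-bit Wave object
wav <- Wave(left = round(s / max(abs(s)) * 32767), samp.rate = 16000, bit = 16)

# Option A: tell tuneR which command-line player to use, then play
# (player name is a system-dependent assumption)
# setWavPlayer("afplay"); play(wav)

# Option B: save as WAV and open the file in any audio player
wavFile <- file.path(tempdir(), "soundgen_test.wav")   # hypothetical file name
writeWave(wav, wavFile)
```

Option B sidesteps the playback back end entirely, so it is a quick way to check whether the problem lies in synthesis or only in audio output.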

How to cite it

Anikin, A. (2018) soundgen: Parametric Voice Synthesis. R package.


@Manual{,
    title = {soundgen: Parametric Voice Synthesis},
    author = {Andrey Anikin},
    year = {2018},
    note = {R package},
}
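Once the package is installed, R can print its citation directly. This is a minimal example using the standard citation() and toBibtex() functions. Note that citation() reports the entry shipped with the installed version of soundgen, which may differ from the 2018 entry above.

```r
# Citation recorded in the installed version of soundgen
ct <- citation("soundgen")
print(ct)

# The same entry formatted as BibTeX, e.g. for a reference manager
toBibtex(ct)
```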

How it works (briefly)

The purpose is to start with a few control parameters (e.g., the intonation contour, the amount of noise, the number of syllables and their duration) and to generate a corresponding audio stream. Setting aside the dependencies between control parameters and the procedure for creating polysyllabic vocalizations, the algorithm for generating a single voiced segment implements the standard source-filter model. The voiced component is generated as a sum of sine waves, one for each harmonic, and the noise component as filtered white noise. Both components are then passed through a frequency filter that simulates the effect of the vocal tract. This process can be conceptually divided into three stages:

  1. Generation of the harmonic component (glottal source). We "paint" the spectrogram of the glottal source based on the desired intonation contour and spectral envelope by specifying the frequencies, phases, and amplitudes of a number of sine waves, one for each harmonic of the fundamental frequency. If needed, we also add stochastic and non-linear effects at this stage: jitter and shimmer (random fluctuation in frequency and amplitude), subharmonics, slower random drift of control parameters, etc. Once the spectrogram is complete, we synthesize the corresponding waveform by generating and adding up as many sine waves as there are harmonics in the spectrum.
  2. Generation of the noise component. In addition to harmonic oscillations of the vocal cords, there are other sources of excitation, which can be modeled as some form of noise. For example, aspiration noise is synthesized as white noise with some basic rolloff and added to the glottal source before formant filtering. It is similarly straightforward to add other types of noise, which may originate higher up in the vocal tract and thus display a different formant structure (e.g., high-frequency hissing or broadband clicks).
  3. Spectral filtering (formants and lip radiation). The vocal tract acts as a resonator that modifies the source spectrum by amplifying certain frequencies and attenuating others. Just as we "painted" a spectrogram for the acoustic source in (1), we now "paint" a spectral filter with a specified number of stationary or moving formants. We then take a fast Fourier transform (FFT) of the generated waveform to convert it back to a spectrogram, multiply the latter by our filter, and take an inverse FFT to return to the time domain. This filtering can be applied to the harmonic and noise components separately or, for noise sources close to the glottis, the two components can be added first and then filtered together.
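The three stages above can be sketched in a few lines of base R. This is a minimal illustration of the source-filter idea, not soundgen's own code. The pitch contour, rolloff values, formant frequencies and bandwidths, the Lorentzian peak shape used for formants, and the simple +6 dB/octave lip-radiation tilt are all illustrative assumptions. Stochastic effects such as jitter and shimmer are omitted, and the filter is applied with a single FFT over the whole signal rather than frame by frame on a spectrogram.

```r
# Minimal source-filter sketch in base R; NOT soundgen's implementation.
# All numeric settings are illustrative assumptions.
set.seed(1)
samplingRate <- 16000                      # Hz
dur <- 0.5                                 # duration, s
n <- round(dur * samplingRate)             # number of samples

# Absolute frequency (Hz) of each FFT bin, mirrored above Nyquist so that any
# gain defined on it is conjugate-symmetric and the filtered signal stays real
absF <- pmin(0:(n - 1), n - 0:(n - 1)) * samplingRate / n

# Multiply the spectrum of x by a real gain curve and return to the time domain
# (R's inverse fft is unnormalized, hence the division by length(x))
applyGain <- function(x, gain) {
  Re(fft(fft(x) * gain, inverse = TRUE)) / length(x)
}

## Stage 1: harmonic (glottal) source
# Intonation contour: a linear rise from 200 to 300 Hz (assumed)
f0 <- approx(x = c(0, 1), y = c(200, 300), n = n)$y
# Instantaneous phase of the fundamental = running integral of f0
phase <- 2 * pi * cumsum(f0) / samplingRate
# Keep only harmonics that stay below the Nyquist frequency
nHarm <- floor(samplingRate / 2 / max(f0))
rolloffDb <- -12                           # dB per octave (assumed)
glottal <- numeric(n)
for (h in seq_len(nHarm)) {
  amp <- 10^(rolloffDb * log2(h) / 20)     # spectral envelope of the source
  glottal <- glottal + amp * sin(h * phase) # one sine wave per harmonic
}

## Stage 2: aspiration noise
# White noise with a -6 dB/octave rolloff above 1 kHz (assumed)
noiseGain <- 10^(-6 * log2(pmax(absF, 1000) / 1000) / 20)
noise <- applyGain(rnorm(n), noiseGain)
# Add noise to the glottal source before formant filtering,
# with the noise peak at one tenth of the source peak (assumed)
excitation <- glottal / max(abs(glottal)) + 0.1 * noise / max(abs(noise))

## Stage 3: formants and lip radiation
formantFreqs <- c(800, 1200, 2500)         # Hz (assumed)
formantBw    <- c(100, 120, 150)           # Hz (assumed)
# Sum of Lorentzian resonance peaks plus a small floor between formants
formantGain <- 0.05 + Reduce(`+`, Map(
  function(fr, bw) 1 / (1 + ((absF - fr) / (bw / 2))^2),
  formantFreqs, formantBw
))
lipGain <- absF / 1000                     # gain proportional to f: +6 dB/octave
output <- applyGain(excitation, formantGain * lipGain)
output <- output / max(abs(output))        # normalize to [-1, 1]
```

Because the formants in this sketch are stationary, one FFT over the whole signal is enough. Moving formants, as described in stage 3, call for filtering short overlapping frames of a spectrogram instead.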

    This page was last updated on Jan 24, 2018.