Making spectrograms with soundgen

Andrey Anikin

2025-09-29

library(soundgen)
## Loading required package: shinyBS
## Soundgen 2.7.4
## Tips & demos on project's homepage: http://cogsci.se/soundgen.html
## Please cite as: Anikin, A. (2019). Soundgen: an open-source tool for synthesizing nonverbal vocalizations. Behavior Research Methods, 51(2), 778-792.

1 STFT spectrograms

Here is a default spectrogram of a few seconds of speech, produced with Short-Time Fourier Transform:

spectrogram('spectrograms_audio/speechEx.mp3')

TIP: input to ‘spectrogram()’ and most other soundgen functions can be a numeric vector with a specified sample rate, a single audio file (make sure you specify the path correctly, e.g. “home/user/myfile.wav”), or a path to a folder, in which case all the audio files in this folder are processed in one go. For example, copy a few recordings into ‘~/Downloads/temp’ and run spectrogram('~/Downloads/temp', savePlots = ''); a separate .png file with the spectrogram will then be saved for each recording in the same folder.
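
As a minimal sketch of the first input type, here is a synthetic tone passed as a numeric vector together with its sample rate. The tone, its frequency, and the variable names are made up for illustration, and the call is guarded so the snippet also runs where soundgen is not installed:

```r
# One second of a 440 Hz tone as a plain numeric vector (illustrative values)
samplingRate <- 16000
time_s <- (1:samplingRate) / samplingRate
tone <- sin(2 * pi * 440 * time_s)

# A numeric vector needs an explicit sample rate, unlike an audio file
if (requireNamespace('soundgen', quietly = TRUE)) {
  soundgen::spectrogram(tone, samplingRate = samplingRate, ylim = c(0, 2))
}
```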

There are many ways to make this visual representation more useful or visually pleasing, depending on what kind of sound you are working with and what you are interested in (pitch contours, resonance frequencies, fine temporal structure, modulation, etc.). For a start, here are some basic modifications of the ordinary STFT spectrogram.

1.1 Window length and step

If the analysis window is long, we get good frequency resolution, but poor time resolution. Focusing on just a second of audio (from 0.5 to 1.5 s), let’s visualize the intonation contour with a relatively long window of 50 ms. Keeping the step small means we have a lot of overlap between analysis frames, somewhat improving the time resolution:

spectrogram('spectrograms_audio/speechEx.mp3', from = 0.5, to = 1.5, 
            windowLength = 50, step = 5,  # or use the "overlap" argument instead of "step"
            ylim = c(0, 4), osc = FALSE, main = 'Narrow-band')

The opposite approach is to keep windows short, providing good time resolution but poor frequency resolution. This is very common in phonetics because it shows formants and rapid temporal changes like consonants:

spectrogram('spectrograms_audio/speechEx.mp3', from = 0.5, to = 1.5,
            windowLength = 5, ylim = c(0, 4), main = 'Broad-band', heights = c(2, 1))
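
To put rough numbers on this trade-off, here is a back-of-the-envelope sketch. It assumes that frequency detail finer than about 1 / (window duration) gets smeared; the exact figure depends on the window shape, so treat these as orders of magnitude rather than what soundgen reports:

```r
# Approximate frequency resolution for the two window lengths used above
windowLength_ms <- c(narrow = 50, broad = 5)
freqRes_Hz <- 1000 / windowLength_ms   # 20 Hz vs 200 Hz
print(freqRes_Hz)

# Overlap implied by windowLength = 50 and step = 5 in the narrow-band call
overlap_pct <- 100 * (1 - 5 / 50)      # 90%
print(overlap_pct)
```

With harmonics of a speaking voice typically spaced 100 to 250 Hz apart, a resolution of about 20 Hz separates them, while about 200 Hz merges them into broad formant bands.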

Drawing inspiration from image enhancement techniques, multi-resolution spectrograms attempt to improve their time-frequency resolution by combining spectrograms with different window lengths. Soundgen offers the simplest possible approach: averaging log-spectrograms produced with different window lengths and steps. Just specify several window lengths simultaneously:

spectrogram('spectrograms_audio/speechEx.mp3', from = 0.5, to = 1.5,
            windowLength = c(5, 15, 50), step = c(1, 5, 10),
            ylim = c(0, 4),  main = 'Multi-resolution', heights = c(2, 1))

The harmonic structure is as clear as with the long window of 50 ms, but we also see individual glottal pulses, and the click-like burst at 750 ms is not blurred in time (as it is with “windowLength = 50” - see above). There is no limit on how many spectrograms can be combined, but the processing obviously takes longer as their number increases.
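
The averaging idea can be sketched in base R. This is a simplified stand-in, not soundgen's actual code: the helper logSpec(), the test signal, and the time-frequency grid are all hypothetical. Each spectrogram is evaluated on the same grid, so the log-magnitudes can be averaged cell by cell:

```r
# Log-magnitude spectrum (dB) on a fixed grid of times (s) and frequencies (Hz),
# so that spectrograms made with different windows can be averaged cell by cell
logSpec <- function(x, sr, win_ms, times, freqs) {
  n <- round(win_ms / 1000 * sr)                           # window length, samples
  win <- 0.5 - 0.5 * cos(2 * pi * (0:(n - 1)) / (n - 1))   # Hann window
  basis <- exp(-2i * pi * outer(0:(n - 1), freqs) / sr)    # DFT at chosen freqs
  xPad <- c(rep(0, n), x, rep(0, n))                       # room for edge frames
  sapply(times, function(tc) {
    start <- n + round(tc * sr) - n %/% 2                  # center frame on tc
    frame <- xPad[start + seq_len(n)] * win
    20 * log10(abs(frame %*% basis)[1, ] / sum(win) + 1e-6)
  })                                                       # rows = freqs, cols = times
}

# Test signal: a steady 500 Hz tone plus a single click at 0.5 s
sr <- 8000
x <- sin(2 * pi * 500 * (1:sr) / sr)
x[4000] <- 5

times <- seq(0, 1, by = 0.01)
freqs <- seq(0, 2000, by = 25)
specs <- lapply(c(5, 15, 50), function(w) logSpec(x, sr, w, times, freqs))
multiRes <- Reduce(`+`, specs) / length(specs)   # average of log-spectrograms

image(times, freqs, t(multiRes), xlab = 'Time, s', ylab = 'Frequency, Hz')
```

In this toy example the long window contributes the sharp horizontal line of the tone, while the short window keeps the click narrow in time.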

1.2 Frequency scale

The default frequency scale is linear, but it is often useful to compress it in higher frequencies. The simplest approach is just to convert Hz to log-Hz (logarithmic or musical scale):

spectrogram('spectrograms_audio/speechEx.mp3', yScale = 'log', 
            ylim = c(0.06, 4), main = 'log')

There are also three popular psychoacoustic scales that approximate the frequency resolution in the human auditory periphery, which is nearly linear at low frequencies and logarithmic at high frequencies. All three are pretty similar:

spectrogram('spectrograms_audio/speechEx.mp3', yScale = 'ERB', main = 'ERB')

spectrogram('spectrograms_audio/speechEx.mp3', yScale = 'bark', main = 'bark')

spectrogram('spectrograms_audio/speechEx.mp3', yScale = 'mel', main = 'mel')

TIP: regardless of “yScale”, the labels and “ylim” are always given in (k)Hz.
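
To see how similar the three scales are, here is a sketch using commonly cited conversion formulas. These are textbook variants chosen for illustration; soundgen may use slightly different ones internally, and the helper names are hypothetical:

```r
# Commonly cited Hz-to-scale conversions (assumed variants, not soundgen's code)
hz2mel  <- function(f) 2595 * log10(1 + f / 700)
hz2erb  <- function(f) 21.4 * log10(1 + 0.00437 * f)   # ERB-rate
hz2bark <- function(f) 26.81 * f / (1960 + f) - 0.53

# Frequencies one octave apart: equal steps on a log scale
f <- c(100, 200, 400, 800, 1600, 3200)
print(round(data.frame(Hz = f, mel = hz2mel(f), ERB = hz2erb(f),
                       bark = hz2bark(f)), 1))
```

On all three scales the step per octave is small at low frequencies and grows toward the top of the range, which is the compression of high frequencies described above.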

1.3 The amount of visual detail

We can increase or decrease the dynamic range to show or hide parts of the signal with little energy:

spectrogram('spectrograms_audio/speechEx.mp3', from = 0.5, to = 1.5, 
            windowLength = 5, dynamicRange = 40, 
            ylim = c(0, 5), osc = FALSE, main = 'Dynamic range = 40 dB')
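
Conceptually, limiting the dynamic range means discarding everything that lies more than a set number of dB below the loudest point. The toy matrix below illustrates this under that assumption; it is not soundgen's internal code:

```r
# Toy magnitude "spectrogram": rows = frequency bins, columns = frames
set.seed(1)
mag <- matrix(10 ^ runif(12, min = -4, max = 0), nrow = 3)

dB <- 20 * log10(mag / max(mag))   # 0 dB = loudest cell, the rest negative
dynamicRange <- 40

clipped <- mag
clipped[dB < -dynamicRange] <- 0   # cells more than 40 dB down are blanked
print(round(dB))
print(clipped > 0)
```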

A more flexible way to achieve the same effect is to adjust visual contrast and brightness, just like you would for any other image:

spectrogram('spectrograms_audio/speechEx.mp3', from = 0.5, to = 1.5, 
            windowLength = 5, contrast = .6, brightness = -.2,
            ylim = c(0, 5), osc = FALSE, main = 'Increase contrast, reduce brightness')

We can also blur or de-blur our spectrogram…

spectrogram('spectrograms_audio/speechEx.mp3', from = 0.5, to = 1.5, 
            windowLength = 5, contrast = .6, brightness = -.2,
            blur = c(-100, 20),
            ylim = c(0, 5), osc = FALSE, main = 'Sharpen in frequency, blur in time')

…or attempt to denoise by subtracting the spectrum of noisy parts:

spectrogram('spectrograms_audio/speechEx.mp3', from = 0.5, to = 1.5, 
            windowLength = 5,  noiseReduction = 0.75, percentNoise = 5,
            ylim = c(0, 5), osc = FALSE, main = 'Denoise')
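
The general idea of this kind of denoising (spectral subtraction) can be sketched on a toy matrix. Here I assume that the noisiest frames are those with the flattest spectrum, i.e. the highest spectral entropy; the data, the threshold, and the variable names are illustrative and do not reproduce soundgen's implementation:

```r
# Toy spectrogram: a noise floor everywhere plus a "tone" in bin 5, frames 30-70
set.seed(42)
nFreq <- 20
nFrames <- 100
spec <- matrix(runif(nFreq * nFrames, 0.05, 0.15), nrow = nFreq)
spec[5, 30:70] <- spec[5, 30:70] + 1

# Spectral entropy per frame: high for flat, noise-like spectra
entropy <- apply(spec, 2, function(s) {
  p <- s / sum(s)
  -sum(p * log(p))
})

# Average spectrum of the 10% most noise-like frames
percentNoise <- 10
noisyFrames <- which(entropy >= quantile(entropy, 1 - percentNoise / 100))
noiseSpectrum <- rowMeans(spec[, noisyFrames, drop = FALSE])

# Subtract a scaled noise spectrum from every frame, flooring at zero
noiseReduction <- 0.75
denoised <- pmax(spec - noiseReduction * noiseSpectrum, 0)
```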

1.4 Bells and whistles

We can also adjust the colors, labels, tick marks, etc.

spectrogram(
  'spectrograms_audio/speechEx.mp3',
  windowLength = c(5, 15, 50), step = c(1, 5, 10),
  osc = 'dB',  # plot oscillogram in dB
  heights = c(2, 1),  # spectro/osc height ratio
  colorTheme = 'matlab', # pick color theme - see also ?hcl.colors
  cex.lab = .75, cex.axis = .75,  # text size and other base graphics pars
  grid = 2,  # lines per kHz; to customize, add manually with graphics::grid()
  ylim = c(0, 5),  # always in kHz
  xaxp = c(0, 5, 20), # specify location of tick marks for time: c(start, end, n) in s
  yaxp = c(0, 5, 10),  # same for frequency: c(start, end, n) in kHz
  xlab = 'Time axis', ylab = 'Frequency axis',
  main = 'A highly customized spectrogram' # title
  # + other graphical parameters  - see ?par() for base graphics
)
## Plotting with reduced resolution; increase maxPoints or set to NULL to override

Or just specify the actual colors:

spectrogram('spectrograms_audio/speechEx.mp3', yScale = 'ERB', 
            main = 'Custom color palette',
            col = colorRampPalette(c('white', 'blue', 'yellow', 'red'))(50)
)

Remove all labels (see ?par for details on base R graphics):

spectrogram('spectrograms_audio/speechEx.mp3', osc = FALSE,
            xaxt = 'n', yaxt = 'n', ylab = '', main = '')

Completely custom axes and box (see ?axis, ?par, ?box, ?rect):

par(bg = 'lightgreen')  # background color for the whole plot
spectrogram('spectrograms_audio/speechEx.mp3', 
            osc = FALSE,  # NB: this won't work with an oscillogram!
            xaxt = 'n', yaxt = 'n', ylab = '', main = '')
box('plot', lty = 3, col = 'black', lwd = 2)  # box around the spectrogram
rect(0, 0, 10, 15, col = rgb(1, 1, 0, .25))  # background color for the spectrogram area
axis(1, at = c(0, 0.5, 1.5, 2.5, 3.5), labels = c('', '0.5 s', '1.5 s', '2.5 s', ''), 
     lwd = 2, col = 'blue', col.ticks = 'green', cex.axis = 0.75, family = 'mono', font = 2)
axis(2, at = c(0, 2, 4, 6, 8), labels = c('', '2 kHz', '', '6 kHz', ''), lwd = 2, 
     col = 'red', col.ticks = 'orange', family = 'serif', font = 3)

2 Reassigned spectrograms

spectrogram('spectrograms_audio/speechEx.mp3', from = 0.5, to = 1.5, 
            specType = 'reassigned',  windowLength = c(1, 2.5, 5, 10), 
            step = NULL, overlap = 95, rasterize = TRUE,
            ylim = c(0, 5), osc = FALSE, main = 'Reassigned')

This doesn’t look too alien, but a lot of interesting things happen under the hood. The technique of time-frequency reassignment utilizes not just the magnitudes, but also the phases from the complex FFT to improve the time-frequency resolution. The result can be rasterized (made into a regular rectangular grid), as above, but the actual output is a matrix of time-frequency points that no longer fall on a regular grid. Reassigned spectrograms may be particularly useful for visualizing formant transitions and very rapid frequency modulation. For example, let’s create a synthetic sound with a rapid frequency sweep and a vibrato at 60 Hz:

s1 = soundgen(sylLen = 800, pitch = c(100, 1100, 90, 1200, 110),
              formants = NULL, lipRad = 0, rolloff = -30, 
              vibratoFreq = 60, vibratoDep = 2, invalidArgAction = 'ignore',
              addSilence = 10, temperature = .001)
## Warning in validatePars(p, gp, permittedValues, invalidArgAction): 
## vibratoFreq should be between 1 and 20; override with caution

TIP: see the vignette on sound synthesis to learn how the “soundgen()” function works.

This kind of frequency modulation is very hard to capture with a conventional spectrogram because the fundamental frequency sweeps over a very large range, plus we have a vibrato with a period of just 1000/60 ≈ 17 ms:

spectrogram(s1, 16000, windowLength = 7, step = NULL, overlap = 90, yScale = 'ERB')

We can see the vibrato in the second harmonic, but the fundamental itself is barely resolved in the lower range. Now compare it to a reassigned spectrogram with the same window length:

spectrogram(s1, 16000, windowLength = 7, step = NULL, overlap = 90, 
            yScale = 'ERB', specType = 'reassigned')

3 Auditory spectrograms

Ordinary spectrograms are based on Short-Time Fourier Transform. An alternative approach inspired by auditory perception is to pass the signal through a bank of bandpass filters. This is closer to how animals perceive sound and naturally provides better time resolution in high frequencies and better frequency resolution in low frequencies, like in some types of multi-resolution spectrograms.

audSpectrogram('spectrograms_audio/speechEx.mp3', from = 0.5, to = 1.5, 
               main = 'Basic auditory spectrogram')

The main parameters for controlling auditory spectrograms are the number of filters and their bandwidths. Here are a few variations:

audSpectrogram('spectrograms_audio/speechEx.mp3', from = 0.5, to = 1.5, 
               nFilters_oct = 24, # 24 filters per octave
               bandwidth = 1/12,  # bandwidth 1/12 of an octave (1 semitone)
               main = 'Good frequency resolution, slower')

audSpectrogram('spectrograms_audio/speechEx.mp3', from = 0.5, to = 1.5, 
               nFilters_oct = 4, bandwidth = 1/3,
               main = 'Lower frequency resolution, faster')
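
To get a feel for what these two settings imply, here is a sketch of a log-spaced filter bank. The frequency range of 50 to 8000 Hz and the helper names are assumptions made for illustration; audSpectrogram() has its own defaults and may place its filters differently:

```r
# Hypothetical frequency range for the filter bank
fmin <- 50
fmax <- 8000
nOctaves <- log2(fmax / fmin)

# Center frequencies spaced at nFilters_oct filters per octave
centers <- function(nFilters_oct) {
  fmin * 2 ^ seq(0, nOctaves, by = 1 / nFilters_oct)
}

# Band edges for a bandwidth given in octaves, symmetric on a log scale
bandEdges <- function(fc, bandwidth) {
  cbind(low = fc * 2 ^ (-bandwidth / 2), high = fc * 2 ^ (bandwidth / 2))
}

print(c(dense = length(centers(24)), sparse = length(centers(4))))
print(round(bandEdges(centers(4)[1:3], bandwidth = 1 / 3), 1))
```

In both variations above, the bandwidth is wider than the spacing between neighboring filters (1/12 vs 1/24 of an octave, and 1/3 vs 1/4), so under these assumptions adjacent bands overlap. The denser bank has about six times as many filters to compute, which is consistent with it being slower.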

4 How to save pretty spectrograms

For a paper or report, it’s nice to save the spectrograms as high-resolution images in the correct format. I would normally just use built-in R functions for saving any graphics, namely png(), jpeg(), tiff(), etc. Like this:

png(filename = 'path_to_folder/my_spectrogram.png', width = 15, height = 7, 
    units = 'cm', res = 300)
spectrogram(...)
dev.off()

You may need to experiment with the size of your image (in pixels or cm), the font size for axis labels, etc. Increase the image size if R complains about margins.
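
Here is a self-contained version of the same recipe that can be run as is. It writes to a temporary folder and uses a random image as a stand-in for the spectrogram, so swap in your own spectrogram(...) call and file path; the margin and font settings are just a starting point:

```r
# 15 x 7 cm at 300 dpi comes to about 1772 x 827 pixels
outFile <- file.path(tempdir(), 'my_spectrogram.png')
png(filename = outFile, width = 15, height = 7, units = 'cm', res = 300)

# Smaller margins and labels help on a small canvas
par(mar = c(4, 4, 2, 1), cex.lab = 0.8, cex.axis = 0.8)

# Stand-in plot; replace with a spectrogram(...) call on your own audio
image(matrix(runif(2000), nrow = 50), xlab = 'Time', ylab = 'Frequency')

dev.off()  # closes the device and writes the file
print(file.exists(outFile))
```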