The Multi-Scale Short-Time Fourier Transform

Overview

This is the companion page to the paper "Enhancing the Quality of Audio Transformations Using The Multi-Scale Short-Time Fourier Transform". It contains various musical excerpts illustrating digital audio effects using the multi-scale short-time Fourier transform (MS-STFT), and comparing it with the standard short-time Fourier transform (STFT).

Background - The Time-Frequency Trade-off

Several audio effects can only achieve high quality by processing the audio signal in the frequency domain. The Short-Time Fourier Transform (STFT) is commonly used to convert an audio signal to and from the frequency domain. However, frequency-domain processing is subject to the time-frequency trade-off: it is not possible to have arbitrarily good time and frequency resolutions at the same time. Time resolution can only be increased at the expense of frequency resolution, and vice versa. The STFT lets one choose among various time-frequency resolutions, but the chosen resolution then applies throughout the whole processing and at all frequencies.

The use of a constant time-frequency resolution (with the STFT) introduces artifacts in the audio signal as soon as nontrivial audio effects are applied: high frequency resolution (such as with 4096-point DFTs) smears the transients (drums, attacks, etc.); high time resolution (such as with 1024-point DFTs) makes steady tones sound dirty; and the best trade-off (such as with 2048-point DFTs) usually introduces both problems.
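To make the trade-off concrete, the short Python sketch below computes the spacing between DFT bins and the duration of one analysis frame for the three DFT sizes mentioned above. The 44.1 kHz sample rate is an assumption (this page does not state one), and `stft_resolution` is a hypothetical helper name used only for illustration.

```python
# Assumed sample rate; the page does not state the rate of the excerpts.
SAMPLE_RATE_HZ = 44100


def stft_resolution(dft_size, sample_rate_hz=SAMPLE_RATE_HZ):
    """Return (bin spacing in Hz, frame duration in ms) for one DFT size."""
    bin_spacing_hz = sample_rate_hz / dft_size
    frame_duration_ms = 1000.0 * dft_size / sample_rate_hz
    return bin_spacing_hz, frame_duration_ms


for n in (4096, 2048, 1024):
    df, dt = stft_resolution(n)
    print(f"{n}-point DFT: {df:.1f} Hz per bin, {dt:.1f} ms per frame")
```

Under this assumption, a 4096-point DFT separates bins by about 10.8 Hz but spans about 93 ms, whereas a 1024-point DFT spans only about 23 ms with bins about 43 Hz apart. The product of the two quantities is constant, which is the trade-off in numerical form.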

The Multi-Scale Short-Time Fourier Transform (MS-STFT) aims at:

Paper and presentation

The paper is published by ACTA Press.
Here you can download the paper.
Here you can download the presentation (PowerPoint slides with music examples).

Implementations

The PitchTech collection of LADSPA and VST plugins features several effects implemented with either the STFT or the MS-STFT: these are the effects that have a "Quality" parameter. Quality values below the middle value use the STFT, while quality values greater than or equal to the middle value use the MS-STFT.
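The rule above can be sketched as a small selection function. This is only an illustration: the function name, the parameter names, and the reading of "middle value" as the midpoint of the parameter range are assumptions, not the plugins' actual code.

```python
def transform_for_quality(quality, q_min, q_max):
    """Pick the transform for a "Quality" setting (hypothetical helper).

    Assumes the "middle value" is the midpoint of the range [q_min, q_max].
    """
    middle = (q_min + q_max) / 2
    return "MS-STFT" if quality >= middle else "STFT"


# Example with an assumed parameter range of 0 to 10.
for q in (0, 4, 5, 10):
    print(q, transform_for_quality(q, 0, 10))
```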

Music Excerpts

The music excerpts are taken from the following sources:
"Imagine", by John Lennon, in Imagine.
"Horror" (background music of level), by Flair, in the Oscar video game.
"Cleanin' Out My Closet", by Eminem, in The Eminem Show.

Pitch shifting using the standard phase vocoder
Original music | STFT, 4096-point DFT | STFT, 2048-point DFT | STFT, 1024-point DFT | MS-STFT
Listen | Listen | Listen | Listen | Listen
Listen | Listen | Listen | Listen | Listen
Listen | Listen | Listen | Listen | Listen

Pitch shifting using the phase-locked vocoder
Original music | STFT, 4096-point DFT | STFT, 2048-point DFT | STFT, 1024-point DFT | MS-STFT | Multiresolution
Listen | Listen | Listen | Listen | Listen
Listen | Listen | Listen | Listen | Listen
Listen | Listen | Listen | Listen | Listen | Listen

Time stretching using the phase-locked vocoder
Original music | STFT, 4096-point DFT | STFT, 2048-point DFT | STFT, 1024-point DFT | MS-STFT | Multiresolution
Listen | Listen | Listen | Listen | Listen
Listen | Listen | Listen | Listen | Listen
Listen | Listen | Listen | Listen | Listen | Listen

Decomposition into sinusoids + noise
Original music | STFT, 4096-point DFT | STFT, 2048-point DFT | STFT, 1024-point DFT | MS-STFT
Listen | Sinusoids, Noise | Sinusoids, Noise | Sinusoids, Noise | Sinusoids, Noise

Center channel removal
Original music | STFT, 4096-point DFT | STFT, 2048-point DFT | STFT, 1024-point DFT | MS-STFT
Listen | Listen | Listen | Listen | Listen

Whisperization
Whisperization is implemented using a vocoder, where the source is white noise and the filter is built of the formants of the signal.
Original music | STFT, 4096-point DFT | STFT, 2048-point DFT | STFT, 1024-point DFT | MS-STFT
Listen | Listen | Listen | Listen | Listen
Listen | Listen | Listen | Listen | Listen
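As a rough illustration of the vocoder idea behind whisperization, the sketch below processes a single frame: it estimates a smooth spectral envelope (a crude stand-in for the formants) and imposes it on a source with random phases, which is an idealized form of white noise. Everything here is an assumption made for illustration: the function names, the moving-average envelope, the naive DFT, and the absence of windowing and overlap-add. It is not the implementation used for the excerpts above.

```python
import cmath
import math
import random


def dft(x):
    """Naive O(n^2) DFT, adequate for a short demonstration frame."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft_real(spec):
    """Inverse DFT of a conjugate-symmetric spectrum, returning real samples."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def spectral_envelope(mags, half_width):
    """Moving average over neighbouring bins (assumed formant estimate)."""
    n = len(mags)
    env = []
    for k in range(n):
        lo, hi = max(0, k - half_width), min(n, k + half_width + 1)
        env.append(sum(mags[lo:hi]) / (hi - lo))
    return env


def whisperize_frame(frame, half_width=2, rng=random):
    """Replace the frame's excitation by random-phase noise shaped by its envelope."""
    n = len(frame)
    half = n // 2
    mags = [abs(c) for c in dft(frame)[:half + 1]]
    env = spectral_envelope(mags, half_width)
    spec = [0j] * n
    for k in range(half + 1):
        if k == 0 or (n % 2 == 0 and k == half):
            # DC and Nyquist bins must stay real: use a random sign only.
            spec[k] = complex(env[k] * rng.choice((-1.0, 1.0)), 0.0)
        else:
            phase = rng.uniform(-math.pi, math.pi)
            spec[k] = cmath.rect(env[k], phase)
            spec[n - k] = spec[k].conjugate()  # keep the output real
    return idft_real(spec)


# Usage: whisperize one 32-sample frame of a pure tone.
tone = [math.sin(2 * math.pi * 4 * t / 32) for t in range(32)]
whisper = whisperize_frame(tone, rng=random.Random(0))
```

A full effect would apply this frame by frame with windowing and overlap-add, and the choice of frame size is exactly where the time-frequency trade-off discussed earlier comes in.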

Noise gate
Original music | STFT, 4096-point DFT | STFT, 2048-point DFT | STFT, 1024-point DFT | MS-STFT
Listen | Listen | Listen | Listen | Listen

Metallization
Metallization is implemented using a very narrow comb-like filter. The result is close to robotization.
Original music | STFT, 4096-point DFT | STFT, 2048-point DFT | STFT, 1024-point DFT | MS-STFT
Listen | Listen | Listen | Listen | Listen
Listen | Listen | Listen | Listen | Listen
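One simple way to picture a narrow comb-like filter in the frequency domain is a set of per-bin gains that pass only the bins at (or very near) multiples of a chosen spacing. The sketch below is a hypothetical illustration of that idea; the function names, the binary gains, and the `spacing` and `half_width` parameters are assumptions, not the filter used for the excerpts above.

```python
def comb_gains(num_bins, spacing, half_width=0):
    """Gain 1.0 within half_width bins of a multiple of spacing, else 0.0."""
    gains = []
    for k in range(num_bins):
        offset = k % spacing
        distance = min(offset, spacing - offset)  # distance to nearest tooth
        gains.append(1.0 if distance <= half_width else 0.0)
    return gains


def apply_gains(half_spectrum, gains):
    """Scale each bin of the non-negative-frequency half of a spectrum."""
    return [c * g for c, g in zip(half_spectrum, gains)]


# Usage: keep every 4th bin of a 9-bin half spectrum.
spectrum = [complex(k, 1) for k in range(9)]
filtered = apply_gains(spectrum, comb_gains(len(spectrum), spacing=4))
```

With `half_width=0` only the bins exactly on the comb teeth survive, which forces a fixed pitch structure onto the signal regardless of its original content.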

Harmonizing with the phase-locked vocoder
Original music | STFT, 4096-point DFT | STFT, 2048-point DFT | STFT, 1024-point DFT | MS-STFT
Listen | Listen | Listen | Listen | Listen

Decomposition into layers
This illustrates the first two steps of the algorithm: the decomposition into layers corresponding to degrees of transience.
Original music | 1st layer (p=0, most transient) | 2nd layer (p=1) | 3rd layer (p=2, least transient)
Listen | Listen | Listen | Listen

Related work

For the special case of audio time stretching, a more recent and faster technique is available.
