The Multi-Scale Short-Time Fourier Transform
This is the companion page to the paper "Enhancing the Quality of Audio Transformations Using The
Multi-Scale Short-Time Fourier Transform". It contains various musical excerpts illustrating
digital audio effects using the multi-scale short-time Fourier transform (MS-STFT), and
comparing it with the standard short-time Fourier transform (STFT).
Background - The Time-Frequency Trade-off
Several audio effects can only achieve high quality by processing the audio signal in the frequency domain.
To convert an audio signal to and from the frequency domain, the Short-Time Fourier Transform (STFT) is commonly
used. However, any such time-frequency representation is subject to the time-frequency trade-off: it is not possible to
have arbitrarily good time and frequency resolution at the same time. Time resolution can only be increased at the
expense of frequency resolution and vice versa. The STFT can be configured for various time-frequency resolutions, but
the chosen resolution then applies during the whole processing and at all frequencies.
The use of a constant time-frequency resolution (with the STFT) introduces artifacts in the audio signal as soon as
non-trivial audio effects are applied: high frequency resolution (such as with 4096-point DFTs) smears the transients
(drums, attacks, etc.); high time resolution (such as with 1024-point DFTs) makes steady tones sound dirty; and the best
trade-off (such as with 2048-point DFTs) usually introduces both problems.
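To put numbers on the trade-off, the short Python sketch below computes the bin spacing and the frame duration for the three DFT sizes mentioned above. The 44.1 kHz sample rate and the assumption that the analysis frame is exactly as long as the DFT (no zero padding) are illustrative choices, not values taken from the paper; `stft_resolution` is a hypothetical helper name.

```python
def stft_resolution(dft_size, sample_rate=44100):
    """Return (bin spacing in Hz, frame duration in ms) for one DFT size.

    Assumes the analysis frame is exactly dft_size samples long
    (no zero padding) and a 44.1 kHz sample rate by default.
    """
    freq_resolution_hz = sample_rate / dft_size
    time_resolution_ms = 1000.0 * dft_size / sample_rate
    return freq_resolution_hz, time_resolution_ms


# Tabulate the three DFT sizes discussed in the text.
resolutions = {n: stft_resolution(n) for n in (1024, 2048, 4096)}
for n, (df, dt) in resolutions.items():
    print(f"{n:5d}-point DFT: {df:6.2f} Hz per bin, {dt:6.2f} ms per frame")
```

Under these assumptions, the product of the two figures is the same for every DFT size: halving the bin spacing doubles the frame duration, which is the trade-off described above.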
The Multi-Scale Short-Time Fourier Transform (MS-STFT) aims to:
- Provide adaptive tiling, that is, adjusting the time-frequency resolution automatically in both time and
frequency according to the audio signal characteristics: high time resolution is automatically used for transients
and high frequency resolution is automatically used for steady sounds, mitigating the artifacts introduced by the use
of a constant time-frequency resolution (as in the STFT).
- Allow the implementation of a wide range of audio effects, including effects involving complex phase modifications.
Adaptive tiling schemes have already been proposed, but most of them cannot cope with complex phase changes. They are
hence limited to audio effects that do not affect the phase (noise reduction, filters, center channel removal, ...).
The MS-STFT can also cope with effects involving complex phase changes such as pitch shifting and time stretching.
- Allow reusing existing algorithms and implementations. Not only is the MS-STFT based on
Fourier transforms (and hence does not require the implementation of some novel, complex mathematical transformation),
but it also allows existing audio effects based on the STFT to be reused with minor or no modification.
Paper and presentation
The paper is published by ACTA Press.
Here you can download the paper.
Here you can download the presentation (PowerPoint slides with music examples).
The PitchTech collection of LADSPA and VST plugins features several
effects implemented with either the STFT or the MS-STFT. These are the effects that have a "Quality" parameter: quality
values below the middle value use the STFT, while quality values greater than or equal to the middle value use
the MS-STFT.
The music excerpts are taken from the following sources:
"Imagine", by John Lennon, in Imagine.
"Horror" (background music of level), by Flair, in the Oscar video game.
"Cleanin' Out My Closet", by Eminem, in The Eminem Show.
Pitch shifting using the standard phase vocoder
Pitch shifting using the phase-locked vocoder
Time stretching using the phase-locked vocoder
Decomposition into sinusoids + noise
Center channel removal
Whisperization is implemented using a vocoder, where the source is white noise and the filter
is built of the formants of the signal.
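As a rough illustration of this source-filter idea, the sketch below shapes white noise with a smoothed magnitude spectrum of each frame. It is not the PitchTech or paper implementation: it uses a plain single-resolution STFT, and the function name `whisperize`, the frame size, the hop size and the moving-average envelope (a crude stand-in for proper formant estimation) are all assumptions made for this example.

```python
import numpy as np


def whisperize(signal, frame_size=1024, hop=256, smooth_bins=16, seed=0):
    """Noise-excited vocoder sketch: white noise filtered by the
    smoothed spectral envelope of each frame of the input signal."""
    signal = np.asarray(signal, dtype=float)
    rng = np.random.default_rng(seed)
    window = np.hanning(frame_size)
    # Moving average across frequency bins: a crude envelope estimate
    # (assumption; a real implementation may use LPC or cepstral smoothing).
    kernel = np.ones(smooth_bins) / smooth_bins
    noise = rng.standard_normal(len(signal))
    # Scale the windowed noise so its expected power per bin is about 1.
    noise_gain = 1.0 / np.sqrt(np.sum(window ** 2))
    out = np.zeros(len(signal))
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = slice(start, start + frame_size)
        magnitude = np.abs(np.fft.rfft(signal[frame] * window))
        envelope = np.convolve(magnitude, kernel, mode="same")
        source = np.fft.rfft(noise[frame] * window) * noise_gain
        shaped = np.fft.irfft(envelope * source, n=frame_size)
        out[frame] += shaped * window  # overlap-add with a synthesis window
    # Compensate the gain of the overlapping squared Hann windows.
    return out * hop / np.sum(window ** 2)
```

Calling `whisperize(x)` on a mono array `x` returns an array of the same length in which the pitch is replaced by noise while the broad spectral shape is kept.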
Metallization is implemented using a very narrow comb-like filter. Result is close to
Harmonizing with the phase-locked vocoder
Decomposition into layers
This illustrates the first two steps of the algorithm: the decomposition into layers
corresponding to degrees of transience.
- 1st layer (p=0, most transient)
- 2nd layer (p=1)
- 3rd layer (p=2, least transient)
For the special case of audio time stretching, a more recent and faster technique is available.
PitchTech Home Page