Spectral features (spectralFeaturesProc.m)

In order to characterise the spectral content of the ear signals, a set of spectral features is available that can serve as a physical correlate to perceptual attributes, such as timbre and coloration [Peeters2011]. All spectral features summarise the spectral content of the rate-map representation across auditory filters and are computed for individual time frames. The following 14 spectral features are available:

  1. 'centroid' : The spectral centroid represents the centre of gravity of the rate-map and is one of the most frequently-used timbre parameters [Tzanetakis2002], [Jensen2004], [Peeters2011]. The centroid is normalised by the highest rate-map centre frequency to reduce the influence of the gammatone parameters.
  2. 'spread' : The spectral spread describes the average deviation of the rate-map around its centroid, which is commonly associated with the bandwidth of the signal. Noise-like signals have usually a large spectral spread, while individual tonal sounds with isolated peaks will result in a low spectral spread. Similar to the centroid, the spectral spread is normalised by the highest rate-map centre frequency, such that the feature value ranges between zero and one.
  3. 'brightness' : The brightness reflects the amount of high frequency information and is measured by relating the energy above a pre-defined cutoff frequency to the total energy. This cutoff frequency is set to sf_br_cf = 1500 Hz by default [Jensen2004], [Peeters2011]. This feature might be used to quantify the sensation of sharpness.
  4. 'high-frequency content' : The high-frequency content is another metric that measures the energy associated with high frequencies. It is derived by weighting each channel in the rate-map by its squared centre frequency and integrating this representation across all frequency channels [Jensen2004]. To reduce the sensitivity of this feature to the overall signal level, the high-frequency content feature is normalised by the rate-map integrated across-frequency.
  5. 'crest' : The SCM is defined as the ratio between the maximum value and the arithmetic mean and can be used to characterise the peakiness of the rate-map. The feature value is low for signals with a flat spectrum and high for a rate-map with a distinct spectral peak [Peeters2011], [Lerch2012].
  6. 'decrease' : The spectral decrease describes the average spectral slope of the rate-map representation, putting a stronger emphasis on the low frequencies [Peeters2011].
  7. 'entropy' : The entropy can be used to capture the peakiness of the spectral representation [Misra2004]. The resulting feature is low for a rate-map with many distinct spectral peaks and high for a flat rate-map spectrum.
  8. 'flatness' : The SFM is defined as the ratio of the geometric mean to the arithmetic mean and can be used to distinguish between harmonic (SFM is close to zero) and a noisy signals (SFM is close to one) [Peeters2011].
  9. 'irregularity' : The spectral irregularity quantifies the variations of the logarithmically-scaled rate-map across frequencies [Jensen2004].
  10. 'kurtosis' : The excess kurtosis measures whether the spectrum can be characterised by a Gaussian distribution [Lerch2012]. This feature will be zero for a Gaussian distribution.
  11. 'skewness' : The spectral skewness measures the symmetry of the spectrum around its arithmetic mean [Lerch2012]. The feature will be zero for silent segments and high for voiced speech where substantial energy is present around the fundamental frequency.
  12. 'roll-off' : Determines the frequency in Hz below which a pre-defined percentage sf_ro_perc of the total spectral energy is concentrated. Common values for this threshold are between sf_ro_perc = 0.85 [Tzanetakis2002] and sf_ro_perc = 0.95 [Scheirer1997], [Peeters2011]. The roll-off feature is normalised by the highest rate-map centre frequency and ranges between zero and one. This feature can be useful to distinguish voiced from unvoiced signals.
  13. 'flux' : The spectral flux evaluates the temporal variation of the logarithmically-scaled rate-map across adjacent frames [Lerch2012]. It has been suggested to be useful for the distinction of music and speech signals, since music has a higher rate of change [Scheirer1997].
  14. 'variation' : The spectral variation is defined as one minus the normalised correlation between two adjacent time frames of the rate-map [Peeters2011].

A list of all parameters is presented in Table 25.

Table 25 List of parameters related to 'spectral_features'.
Parameter Default Description
sf_requests 'all'

List of requested spectral features (e.g. 'flux'). Type

help spectralFeaturesProc in the Matlab command window

to display the full list of supported spectral features.

sf_br_cf 1500 Cut-off frequency in Hz for brightness feature
sf_ro_perc 0.85 Threshold (re. 1) for spectral roll-off feature

The extraction of spectral features is demonstrated by the script Demo_SpectralFeatures.m, which produces the plots shown in Fig. 27. The complete set of 14 spectral features is computed for the speech signal shown in the top left panel. Whenever the unit of the spectral feature was given in frequency, the feature is shown in black in combination with the corresponding rate-map representation.


Fig. 27 Speech signal and 14 spectral features that were extracted based on the rate-map representation.

[Jensen2004](1, 2, 3, 4) Jensen, K. and Andersen, T. H. (2004), “Real-time beat estimation using feature extraction,” in Computer Music Modeling and Retrieval, edited by U. K. Wiil, Springer, Berlin– Heidelberg, Lecture Notes in Computer Science, pp. 13–22.
[Lerch2012](1, 2, 3, 4) Lerch, A. (2012), An Introduction to Audio Content Analysis: Applications in Signal Processing and Music Informatics, John Wiley & Sons, Hoboken, NJ, USA.
[Misra2004]Misra, H., Ikbal, S., Bourlard, H., and Hermansky, H. (2004), “Spectral entropy based feature for robust ASR,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 193–196.
[Peeters2011](1, 2, 3, 4, 5, 6, 7, 8) Peeters, G., Giordano, B. L., Susini, P., Misdariis, N., and McAdams, S. (2011), “The timbre toolbox: Extracting audio descriptors from musical signals.” Journal of the Acoustical Society of America 130(5), pp. 2902–2916.
[Scheirer1997](1, 2) Scheirer, E. and Slaney, M. (1997), “Construction and evaluation of a robust multifeature speech/music discriminator,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1331–1334.
[Tzanetakis2002](1, 2) Tzanetakis, G. and Cook, P. (2002), “Musical genre classification of audio signals,” IEEE Transactions on Audio, Speech, and Language Processing 10(5), pp. 293–302.