Preprocessing

The digitised speech samples were turned into observation vectors for the algorithms by the following preprocessing steps:

1. The signal was high-pass filtered to emphasise the important higher frequencies. This was done with a first-order FIR filter with the transfer function $H(z) = 1 - 0.95 z^{-1}$.
2. A 256-point Fourier transform with Hamming windowing was calculated for short overlapping segments. Consecutive segments overlapped by half of their length.
3. The frequencies were transformed to the Mel scale to emphasise the features that are important for understanding speech. This gave a 30-component vector for each segment.
4. The logarithms of the energies on the Mel scale were used as the observations.
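
The four steps above can be sketched in NumPy as follows. This is only an illustrative sketch, not the code used in this work. Several details are assumptions that the text does not specify: the segment length is taken to equal the transform length (256 samples), "energy" is taken to be the squared magnitude spectrum, the Mel transformation is done with triangular filters spaced evenly on the Mel scale using a common analytic approximation of that scale, and a small constant is added before taking the natural logarithm. The function names `hz_to_mel`, `mel_to_hz`, `mel_filterbank` and `preprocess` are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    # Common analytic approximation of the Mel scale (an assumption).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the Mel scale (assumed design)."""
    n_bins = n_fft // 2 + 1
    mel_edges = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2)
    hz_edges = mel_to_hz(mel_edges)
    bin_freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    bank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        lo, mid, hi = hz_edges[i], hz_edges[i + 1], hz_edges[i + 2]
        rising = (bin_freqs - lo) / (mid - lo)
        falling = (hi - bin_freqs) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def preprocess(signal, sample_rate, n_fft=256, n_filters=30, preemph=0.95):
    signal = np.asarray(signal, dtype=float)
    # Step 1: H(z) = 1 - 0.95 z^-1, i.e. y[n] = x[n] - 0.95 x[n-1].
    emphasised = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    # Step 2: Hamming-windowed segments of n_fft samples with half overlap.
    hop = n_fft // 2
    n_frames = 1 + (len(emphasised) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasised[idx] * np.hamming(n_fft)
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    # Step 3: pool the spectrum into n_filters Mel-scale bands.
    mel_energy = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    # Step 4: log energies; the small floor avoids log(0) (an assumption).
    return np.log(mel_energy + 1e-10)

# Usage with synthetic noise; the 16 kHz sample rate is an assumed value.
rng = np.random.default_rng(0)
observations = preprocess(rng.standard_normal(1024), sample_rate=16000)
print(observations.shape)  # one row per segment, 30 components per row
```

The input must contain at least 256 samples, and any samples that do not fill a whole segment at the end of the signal are dropped in this sketch.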

These steps form a rather standard preprocessing procedure for speech recognition [54,33].

The Mel scale of frequencies is designed to model the frequency response of the human ear. The scale is constructed by asking naïve listeners when they perceive a sound to have double or half the frequency of a reference tone. The resulting scale is close to linear at frequencies below 1000 Hz and nearly logarithmic above that [54].
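
The two regimes can be illustrated with a widely used analytic approximation of the Mel scale. The formula below is an assumption made for illustration; the text does not state which form of the scale was used here. With $f$ the frequency in Hz and $m$ the corresponding Mel value,

$$
m(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right).
$$

For $f \ll 700$ Hz, the expansion $\log_{10}(1 + x) \approx x / \ln 10$ gives $m(f) \approx \frac{2595}{700 \ln 10} f \approx 1.61 f$, which is linear in $f$. For $f \gg 700$ Hz, $m(f) \approx 2595 \log_{10}(f / 700)$, which is logarithmic in $f$. The constants are chosen so that $m(1000) \approx 1000$.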

Figure 7.1 shows an example of what the preprocessed data looks like.

**Figure:** An example of the preprocessed spectrogram of a speech segment. Time increases from left to right and frequency from bottom to top. White areas correspond to low signal energy and dark areas to high energy. The word in the segment is "JOHTOPÄÄTÖKSIÄ", Finnish for "conclusions". Each letter of the written word corresponds to one phoneme in speech. The silent areas in the middle correspond to the consonants t, p, t and k, which reveals the segmentation of the utterance into phonemes.

Antti Honkela 2001-05-30