The digitised speech samples were turned into the observation
vectors for the algorithms with the following preprocessing steps:
- The signal was high-pass filtered to emphasise the important
higher frequencies. This was done with a first order FIR filter
having the transfer function
.
- A 256-point Fourier transform with Hamming windowing was
calculated for short overlapping segments. Consecutive segments
overlapped by half of their length.
- The frequencies were transformed to Mel-scale to
emphasise the features that are important for understanding speech.
This gave a 30-component vector for each segment.
- The logarithms of the energies on the Mel-scale were used as
observations.
These steps form a rather standard preprocessing procedure for
speech recognition [54,33].
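The steps above can be sketched in NumPy as follows. This is a minimal illustration and not the code used for the experiments. The 256-point transform, the half-segment overlap and the 30 Mel components come from the text. The sampling rate, the pre-emphasis coefficient, the triangular filter shape and the analytic Mel formula are assumptions made only to obtain a runnable example, and all function and constant names are hypothetical.

```python
import numpy as np

SAMPLE_RATE = 16000   # assumed; the text does not state the sampling rate
PREEMPH = 0.95        # assumed coefficient of the first order FIR filter
N_FFT = 256           # from the text: 256-point Fourier transform
HOP = N_FFT // 2      # from the text: consecutive segments overlap by half
N_MEL = 30            # from the text: 30 components per segment


def hz_to_mel(f):
    # Common analytic approximation of the Mel-scale (an assumption here).
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(n_mel, n_fft, sample_rate):
    """Triangular filters spaced evenly on the Mel-scale.

    Returns an array of shape (n_mel, n_fft // 2 + 1).
    """
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_mel + 2))
    bank = np.zeros((n_mel, freqs.size))
    for i in range(n_mel):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def log_mel_features(signal, sample_rate=SAMPLE_RATE):
    """Return an array of shape (n_segments, N_MEL) of log Mel energies."""
    x = np.asarray(signal, dtype=float)
    # Step 1: first order FIR high-pass filter, y[n] = x[n] - a * x[n-1].
    y = np.append(x[0], x[1:] - PREEMPH * x[:-1])
    # Step 2: half-overlapping segments, Hamming window, 256-point FFT.
    n_frames = 1 + (len(y) - N_FFT) // HOP
    idx = np.arange(N_FFT)[None, :] + HOP * np.arange(n_frames)[:, None]
    frames = y[idx] * np.hamming(N_FFT)
    power = np.abs(np.fft.rfft(frames, n=N_FFT, axis=1)) ** 2
    # Step 3: pool the spectral energies into N_MEL Mel-scale bands.
    mel_energy = power @ mel_filterbank(N_MEL, N_FFT, sample_rate).T
    # Step 4: logarithm of the energies; the small constant avoids log(0).
    return np.log(mel_energy + 1e-10)


# Usage: one second of white noise stands in for a speech recording.
rng = np.random.default_rng(0)
features = log_mel_features(rng.standard_normal(SAMPLE_RATE))
print(features.shape)
```

Each row of the returned array plays the role of one observation vector. With different assumed settings the number of segments changes, but the number of components per segment stays at 30.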
The Mel-scale of frequencies has been designed to model the frequency
response of the human ear. The scale is constructed by asking
naïve listeners to judge when a sound appears to have double or
half of the frequency of a reference tone. The resulting scale is
close to linear at frequencies below 1000 Hz and nearly logarithmic
above that [54].
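For illustration, one widely used analytic approximation of the Mel-scale maps a frequency f in Hz to a value m in mels as shown below. The text does not state which formula or filter bank was used in the experiments, so this particular form is an assumption.

```latex
\[
  m(f) = 2595 \, \log_{10}\!\left( 1 + \frac{f}{700} \right)
\]
```

With this approximation 1000 Hz maps to approximately 1000 mel. For f much smaller than 700 Hz the mapping is nearly linear in f, and for f much larger than 700 Hz it grows roughly as the logarithm of f, which agrees with the behaviour described above.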
Figure 7.1 shows an example of what the
preprocessed data looks like.
Figure 7.1:
An example of the preprocessed spectrogram of a speech
segment. Time increases from left to right and frequency from
bottom to top. White areas correspond to low energy of the signal
and dark areas to high energy. The word in the segment is
"JOHTOPÄÄTÖKSIÄ", meaning "conclusions". Every letter in
the written word corresponds to one phoneme in speech. The
silent areas in the middle correspond to the consonants t, p, t
and k, thus revealing the segmentation of the utterance into
phonemes.
Antti Honkela
2001-05-30