


Automatic Speech Recognition


  • Instructor(s): Arjan van Hessen Utrecht University, University of Twente
  • Available also as advanced: No

Automatic Speech Recognition (ASR) is the process of transforming speech into written text. For every 10 ms of audio, the ASR software estimates which phoneme is most likely present in the signal. The resulting string of phonemes is then turned into a string of words and … ready?


This non-technical tutorial is aimed at students and researchers who use (large quantities of) spoken narratives in their research and want to use Automatic Speech Recognition for transcript generation, phonetic research, or other research in which the relation between what was said and when it was said is relevant.

We will discuss the following topics:

  • ASR: how does it work and what can go wrong?
  • Making your own audio-content suitable for ASR
  • Recognising your own AV-recordings
  • ASR result: a table of words, times and confidence scores. What to do next?
  • Correcting the ASR results: into what?


Participants are invited to process their own AV-recordings. However, to avoid overloading the ASR servers, everyone is kindly requested to use a short fragment of at most 5 minutes during this tutorial. Once you know how it works, you can process larger files yourself later on.
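If your recording is already a WAV file, cutting off the first 5 minutes can be done in a few lines. The following is only a minimal sketch using Python's standard `wave` module; the function name `trim_wav` and the file names are made up for illustration, and it assumes an uncompressed (PCM) WAV file as input.

```python
import wave


def trim_wav(src_path, dst_path, max_seconds=300):
    """Copy at most the first `max_seconds` of a PCM WAV file to a new file.

    Returns the duration (in seconds) of the file that was written.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        # Number of audio frames that corresponds to max_seconds
        max_frames = int(params.framerate * max_seconds)
        n_frames = min(params.nframes, max_frames)
        frames = src.readframes(n_frames)

    with wave.open(dst_path, "wb") as dst:
        # Same channels, sample width and sample rate as the source;
        # the frame count in the header is corrected when the file is closed.
        dst.setparams(params)
        dst.writeframes(frames)

    return n_frames / params.framerate
```

For example, `trim_wav("interview.wav", "interview_5min.wav")` would write a fragment of at most 300 seconds; a recording shorter than that is copied in full.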

Audio-file conversion

Most ASR engines require the audio files in a specific format. The most common one is the uncompressed WAV format with a 16 kHz sample rate, 16-bit samples and a single (mono) channel. You can convert your audio files with Goldwave (Windows) or To-Wav-convertor (macOS).
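Before uploading, it can help to verify that a converted file really has the 16 kHz, 16-bit, mono format described above. The sketch below does this with Python's standard `wave` module; the function name `check_wav_format` is an illustrative choice, and the check only covers these three properties, not any further requirements a particular ASR engine might have.

```python
import wave


def check_wav_format(path, rate=16000, sample_bytes=2, channels=1):
    """Compare a WAV file with the target format.

    Returns a list of human-readable mismatches; an empty list means
    the file is 16 kHz, 16-bit, mono (with the default arguments).
    """
    problems = []
    # wave.open raises wave.Error for compressed or non-PCM WAV files
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != rate:
            problems.append(
                f"sample rate is {wav.getframerate()} Hz, expected {rate} Hz"
            )
        if wav.getsampwidth() != sample_bytes:
            problems.append(
                f"sample size is {8 * wav.getsampwidth()} bit, "
                f"expected {8 * sample_bytes} bit"
            )
        if wav.getnchannels() != channels:
            problems.append(
                f"file has {wav.getnchannels()} channel(s), expected {channels}"
            )
    return problems
```

A typical CD-quality recording (44.1 kHz, stereo) would be reported with two mismatches, which tells you it still needs to be resampled and mixed down to mono in your conversion tool.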


Before you can use the ASR-engines, please register yourself at:

Radboud/UTwente KALDI-ASR-Engine: https://webservices-lst.science.ru.nl/register/

The Dutch and English ASR-engines are available at:

Or for Dutch, English, German, Italian


(The OH-portal in München requires that you log-in with an academic/student account.)


To turn the table with the ASR result into something more usable, you can either convert the CSV files yourself or use FromTo, which converts the result into a karaoke view, subtitles, or CHILDES format (see link for more information).
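If you prefer to convert the CSV files yourself, the idea can be sketched as follows. This example turns a word table into SubRip (SRT) subtitles. Note that the column names `word`, `start`, `end` and `confidence`, the times in seconds, and the function names are assumptions made for illustration; the actual layout of the ASR output may differ, so adjust the names to your own files.

```python
import csv
import io


def to_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def csv_to_srt(csv_text, words_per_line=7):
    """Group the recognised words into subtitles of `words_per_line` words.

    Assumes a header row with (hypothetical) columns word, start, end,
    where start and end are given in seconds.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    blocks = []
    for i in range(0, len(rows), words_per_line):
        chunk = rows[i:i + words_per_line]
        start = to_timestamp(float(chunk[0]["start"]))   # first word begins
        end = to_timestamp(float(chunk[-1]["end"]))      # last word ends
        text = " ".join(row["word"] for row in chunk)
        blocks.append(f"{len(blocks) + 1}\n{start} --> {end}\n{text}\n")
    # SRT blocks are separated by an empty line
    return "\n".join(blocks)


example = """word,start,end,confidence
hello,0.0,0.4,0.98
world,0.5,1.25,0.91
"""
print(csv_to_srt(example))
```

Splitting on a fixed number of words is the simplest possible choice; splitting on pauses between words or on sentence boundaries usually gives more readable subtitles.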


Of course you may ask anything during (or after) the tutorial, but if you have an urgent question beforehand and/or want me to pay attention to particular ASR-related topics, please mail me at: a.j.vanhessen@utwente.nl

Additional information

More information can be found at: https://oralhistory.eu/
