Automatic Speech Recognition

Tutorial Teacher: Arjan van Hessen

Affiliation: University of Twente

Automatic Speech Recognition (ASR) is the process of transforming speech into an exact written version. Every 30 ms, the ASR software calculates the “best fitting” text for the audio signal. But… is this enough?

Tutorial

This partly non-technical tutorial is aimed at students and researchers who use (large quantities of) spoken narratives in their research and want to use Automatic Speech Recognition for transcript generation, phonetic research and/or other research in which the relation between what was said and when it was said is relevant.

We will discuss the following topics:

ASR: how it used to work, how it works nowadays, and what still goes wrong.
Making your own audio content suitable for ASR.
Recognising your own AV-recordings.
The ASR result: a fully timed text (or a table of words, times and confidence scores). What to do next?
Correcting the ASR results: into what?

DIY

Participants are invited to process their own AV-recordings. However, to avoid overloading the ASR servers, everyone is kindly requested to use a short fragment of at most 5 minutes during this tutorial. Once you know how to do it, you can process the large files yourself later on. So, bring this 5-minute AV-recording with you.
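If your recording is already an uncompressed WAV file, a 5-minute fragment can be cut with a few lines of Python. The following is a minimal sketch that uses only the standard-library `wave` module; the function name `trim_wav` and the file names are assumptions made for illustration, and other audio formats would need a different tool.

```python
import wave


def trim_wav(src_path, dst_path, max_seconds=300):
    """Copy at most the first `max_seconds` of a PCM WAV file to a new file.

    Returns the duration (in seconds) of the fragment that was written.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        # Number of frames that fit in the requested duration,
        # capped at the length of the source file.
        max_frames = min(src.getnframes(), int(max_seconds * src.getframerate()))
        frames = src.readframes(max_frames)

    with wave.open(dst_path, "wb") as dst:
        # Same channels, sample width and rate as the source;
        # the frame count in the header is corrected when the file is closed.
        dst.setparams(params)
        dst.writeframes(frames)

    return max_frames / params.framerate
```

For example, `trim_wav("interview.wav", "interview_5min.wav")` would write the first 300 seconds (or the whole file, if it is shorter) to `interview_5min.wav`.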

Audio-file conversion

The KALDI-recogniser expects an audio file in 16 kHz, 16-bit, mono format. Conversion of your audio files can be done with Goldwave (Windows) and To-Wav-convertor (MacOS). However, modern engines such as Whisper (by OpenAI) can handle more or less every audio format. So, with Whisper, audio conversion is no longer necessary.
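To check whether a WAV file already has the 16 kHz, 16-bit, mono format, the Python standard library is enough. The sketch below also builds a command line for the free `ffmpeg` tool as one possible alternative to the converters named above; `ffmpeg` is not part of this tutorial's toolset, and the function names `is_kaldi_ready` and `ffmpeg_command` are hypothetical helpers introduced here for illustration.

```python
import wave


def is_kaldi_ready(path):
    """Return True if the PCM WAV file at `path` is 16 kHz, 16-bit, mono."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == 16000   # 16 kHz sampling rate
            and wav.getsampwidth() == 2   # 2 bytes per sample = 16 bit
            and wav.getnchannels() == 1   # one channel = mono
        )


def ffmpeg_command(src_path, dst_path):
    """Build (but do not run) an ffmpeg command that converts any input
    to 16 kHz, 16-bit PCM, mono WAV. Assumes ffmpeg is installed."""
    return [
        "ffmpeg",
        "-i", src_path,        # input file in any format ffmpeg can read
        "-ar", "16000",        # resample to 16 kHz
        "-ac", "1",            # mix down to one channel
        "-c:a", "pcm_s16le",   # 16-bit little-endian PCM
        dst_path,
    ]
```

The returned list can be passed to `subprocess.run(...)`. Note that `wave.open` raises `wave.Error` for files that are not uncompressed PCM WAV, so `is_kaldi_ready` is only meaningful for such files.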

ASR-engines

KALDI

Before you can use the KALDI ASR-engines, please register yourself at:

Radboud/UTwente KALDI-ASR-Engine: https://webservices-lst.science.ru.nl/register/

The Dutch and English ASR-engines are available at:

Dutch: https://webservices-lst.science.ru.nl/oral_history/
English: https://webservices-lst.science.ru.nl/eng_ASR

Or, for Dutch, English, German and Italian:

https://clarin.phonetik.uni-muenchen.de/apps/oh-portal/

(The OH-portal in München requires that you log in with an academic/student account.)

Whisper

The modern ASR-engine Whisper can easily be used via Python or via open-source software such as SubtitleEdit (Windows only). Install SubtitleEdit before this course. Whisper can recognise more than 90 different languages!
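For those who prefer the Python route, a minimal sketch is given below. It assumes the open-source `openai-whisper` package (and the `ffmpeg` tool it relies on) is installed; the model size `"base"` and the helper names `transcribe` and `format_segments` are choices made for this illustration, not part of the tutorial material.

```python
def transcribe(path, model_name="base", language=None):
    """Transcribe an audio file with the `openai-whisper` package.

    Returns the list of timed segments; each segment is a dict that
    contains (among other things) 'start', 'end' and 'text'.
    With language=None, Whisper tries to detect the language itself.
    """
    import whisper  # imported here so format_segments works without the package

    model = whisper.load_model(model_name)
    result = model.transcribe(path, language=language)
    return result["segments"]


def format_segments(segments):
    """Turn timed segments into one readable line per segment."""
    lines = []
    for seg in segments:
        lines.append(f"[{seg['start']:.2f} - {seg['end']:.2f}] {seg['text'].strip()}")
    return "\n".join(lines)
```

A typical call would be `print(format_segments(transcribe("interview_5min.wav", language="nl")))`. Larger models are generally more accurate but slower, so the short fragment is a good place to start.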

Post-processing

To turn the table with the ASR result into something more usable, you can convert the CSV files yourself or use FromTo, which converts the result into a Karaoke view, subtitles, or CHILDES format (see link for more information).
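Converting the CSV files yourself could look like the sketch below, which groups the words into subtitle cues in the common SRT format. The column names `word`, `start` and `end` (times in seconds) are an assumption about the CSV layout and will need to be adapted to the actual output of the ASR-engine; the fixed number of words per cue is also just an illustrative choice.

```python
import csv
import io


def srt_time(seconds):
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def words_to_srt(csv_text, words_per_cue=8):
    """Convert a word table (assumed columns: word, start, end) to SRT text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    cues = []
    for i in range(0, len(rows), words_per_cue):
        chunk = rows[i:i + words_per_cue]
        start = float(chunk[0]["start"])   # start time of the first word
        end = float(chunk[-1]["end"])      # end time of the last word
        text = " ".join(row["word"] for row in chunk)
        cues.append(f"{len(cues) + 1}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n")
    return "\n".join(cues)
```

Reading the CSV file into a string (for example with `open("result.csv", encoding="utf-8").read()`) and passing it to `words_to_srt` gives text that can be saved as a `.srt` file. Splitting at pauses or sentence boundaries instead of after a fixed number of words would give more natural subtitles.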

Questions

You may of course ask anything during (or after) the tutorial. If you have an urgent question beforehand, or would like me to pay attention to particular ASR-related topics, please mail me at: a.j.vanhessen@utwente.nl

Additional information

More information can be found at: https://speechandtech.eu/

EMLaR XXI 2025
