Automatic Speech Recognition
Tutorial Teacher: Arjan van Hessen
Affiliation: University of Twente
Automatic Speech Recognition (ASR) is the process of transforming speech into an exact written version. Every 30 ms, the ASR software calculates the text that “best fits” the audio signal. But is this enough?
This partly non-technical tutorial is aimed at students and researchers who use (large quantities of) spoken narratives in their research and want to use Automatic Speech Recognition for transcript generation, phonetic research and/or other research in which the relation between what was said and when it was said is relevant.
We will discuss the following topics:
- ASR: how it used to work, how it works nowadays, and what still goes wrong.
- Making your own audio-content suitable for ASR
- Recognising your own AV-recordings
- The ASR result: a fully timed text (or a table of words, times and confidence scores). What to do next?
- Correcting the ASR results: into what?
Participants are invited to process their own AV-recordings. However, to avoid overloading the ASR servers, everyone is kindly requested to use a short fragment of at most 5 minutes during this tutorial. Once you know how it works, you can process larger files yourself later on. So, bring a 5-minute AV-recording with you.
Audio-file conversion
The KALDI recogniser expects audio files in 16 kHz, 16-bit, mono format. You can convert your audio files with Goldwave (Windows) or To-Wav-convertor (MacOS). However, modern engines such as Whisper (from OpenAI) accept more or less every audio format, so with Whisper audio conversion is no longer necessary.
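If you prefer to check a file from a script before uploading it, the sketch below may help. It uses Python's standard `wave` module to test whether a WAV file is in the 16 kHz, 16-bit, mono format described above. The second function only builds a command line for ffmpeg, which is an assumption on my side: ffmpeg is not one of the tools named in this tutorial, and the file names are hypothetical.

```python
import wave


def check_kaldi_wav(path):
    """Return a list of problems; an empty list means the WAV file
    is 16 kHz, 16-bit, mono."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != 16000:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected 16000 Hz")
        if wav.getsampwidth() != 2:  # 2 bytes per sample = 16 bits
            problems.append(f"sample width is {8 * wav.getsampwidth()} bits, expected 16 bits")
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels, expected 1 (mono)")
    return problems


def ffmpeg_command(source, target):
    """Build an ffmpeg command that resamples to 16 kHz, 16-bit PCM, mono.
    Assumption: ffmpeg is installed; it is not part of this tutorial's tool list."""
    return ["ffmpeg", "-i", source,
            "-ar", "16000",        # sample rate in Hz
            "-ac", "1",            # one channel (mono)
            "-c:a", "pcm_s16le",   # 16-bit signed PCM
            target]
```

With ffmpeg installed, something like `subprocess.run(ffmpeg_command("interview.mp3", "interview.wav"), check=True)` would do the conversion, after which `check_kaldi_wav("interview.wav")` should return an empty list.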
Before you can use the KALDI ASR-engines, please register yourself at:
Radboud/UTwente KALDI-ASR-Engine: https://webservices-lst.science.ru.nl/register/
The Dutch and English ASR-engines are available at:
- Dutch: https://webservices-lst.science.ru.nl/oral_history/
- English: https://webservices-lst.science.ru.nl/eng_ASR
Or for Dutch, English, German, Italian
(The OH-portal in München requires that you log in with an academic/student account.)
The modern ASR-engine Whisper can easily be used via Python or via Open Source software such as SubtitleEdit (Windows only). Install SubtitleEdit before this course. Whisper can recognise more than 90 different languages!
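For those who want to try the Python route, here is a minimal sketch. It assumes the open-source `openai-whisper` package is installed (`pip install openai-whisper`) and that ffmpeg is available on your system; the model name `"base"` and the file name `fragment.wav` are only examples, not settings prescribed by this tutorial.

```python
def format_timestamp(seconds):
    """Format a time in seconds as HH:MM:SS.mmm."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def transcribe(path, model_name="base", language=None):
    """Run Whisper on one file and return (start, end, text) per segment.
    language=None lets Whisper detect the language itself."""
    # Imported here so format_timestamp works without the package installed.
    import whisper  # assumption: the openai-whisper package

    model = whisper.load_model(model_name)
    result = model.transcribe(path, language=language)
    return [(seg["start"], seg["end"], seg["text"].strip())
            for seg in result["segments"]]


# Example use (hypothetical file name):
# for start, end, text in transcribe("fragment.wav"):
#     print(f"[{format_timestamp(start)} - {format_timestamp(end)}] {text}")
```

Larger models are generally more accurate but slower, so for a first try on a 5-minute fragment a small model is probably the practical choice.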
In order to convert the table with the ASR result into something more appropriate, you need to convert the CSV-files yourself or use FromTo, which converts the result into a karaoke view, subtitles, or CHILDES format (see link for more information).
Of course you may ask anything during (or after) the tutorial, but if you have an urgent question beforehand and/or want me to pay attention to particular ASR-related items, please mail me at: a.j.vanhessen@utwente.nl
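If you want to convert the CSV yourself, the sketch below shows one possible way to turn a word table into subtitles in the common SRT format. The column names (`word`, `start`, `end`, `confidence`) and the times in seconds are assumptions for illustration; the actual CSV produced by the ASR engines may be laid out differently, and this is not how FromTo works internally.

```python
import csv
import io


def srt_time(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = int(round(float(seconds) * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(csv_text, max_words=8):
    """Group the words into cues of at most max_words words each.
    Assumed (hypothetical) columns: word, start, end, confidence."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    cues = []
    for i in range(0, len(rows), max_words):
        chunk = rows[i:i + max_words]
        start = chunk[0]["start"]   # start time of the first word in the cue
        end = chunk[-1]["end"]      # end time of the last word in the cue
        text = " ".join(row["word"] for row in chunk)
        cues.append(f"{len(cues) + 1}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n")
    return "\n".join(cues)
```

Splitting on a fixed number of words is the simplest choice; splitting on pauses between words or on punctuation would usually give more readable subtitles.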
Additional information
More information can be found at: https://speechandtech.eu/