How Speech-To-Text Software Is Trained

March 17, 2023

Speech to text is a technology that is becoming increasingly common in many areas of life thanks to its great versatility. Speech recognition not only helps individual users operate their devices faster and more efficiently, but it also improves the work of many professionals, from police officers investigating phone calls from suspected criminals to call center operators whose work is made simpler by voicebots that process customer information before the customer speaks to the operator.

Back to the Beginning

Speech to text (also called audio to text or voice to text) is speech recognition software that converts spoken language to text. The possible uses of this technology are many and diverse. It is not, however, the purpose of this article to name them all, as anyone can think of numerous uses with just a little imagination.

In this article, we will go back to the beginning and describe how engineers create speech-to-text software.

Although end-to-end speech-to-text systems (in which a single model learns all the steps between the initial input and the final output) have recently started to show certain strengths in the world of AI, Phonexia, like many other technology companies, builds speech to text as a hybrid system because it gives very good results when evaluated on customer data.

Speech-to-text training consists of two main phases:

Data preparation
The machine learning process

Data, Data, and Data

Data is a key element in speech-to-text preparation, and, unlike subsequent machine learning processes, it always depends on the target language.

Data represents the greatest value of every artificial intelligence technology company. Why? The accuracy with which software performs a particular task – the image recognition of cars, plants, pupils, or, in our case, speech recognition in, for example, Spanish, Dutch, and Swahili – is highly dependent on the quality and diversity of the data.

Let us illustrate this with the example of animal photo recognition software. What problem does scarce data cause? If our software learns what a "dog" is from photos of a German Shepherd, Siberian Husky, Golden Retriever, and American Bulldog, it probably won't recognize a Chihuahua well. In fact, it may classify it as a "cat".

[Image: two dogs and a cat. Caption: Data is important for speech-to-text software training!]

And it is the same with languages. If the datasets used to train speech to text for the Spanish of the Americas come only from Cuba, or if speech recognition is expected to work for phone calls but is trained only on data recorded by a microphone in a studio, we can't expect great results.

Therefore, the more data there is, and the better it covers the conditions the software will face, the better the final model we can expect. That is the golden rule of machine learning.

What Data Is Needed to Train Speech to Text?

Three types of data are required in the training process:

Pronunciation Dictionary

The pronunciation dictionary contains between 10,000 and 50,000 words with their respective phonetic transcriptions. Each word is composed of letters, but one letter can represent several sounds: the "c" in "Cindy" is very different from the "c" in "coffee". Therefore, pronunciations are transcribed with phonemes, which represent the sounds of the language more objectively. For example, "Cindy" is transcribed as "s I n d i" and "coffee" as "k O f i".
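
As a minimal sketch of how such a dictionary might be represented in code, the snippet below parses a tiny lexicon written as one word per line followed by its phonemes. The file layout, the phoneme symbols, and the names LEXICON_TEXT and parse_lexicon are assumptions made for illustration; they are not Phonexia's actual format.

```python
# Hypothetical mini-lexicon: one word per line, followed by its phonemes.
# The phoneme symbols follow the article's example and are illustrative only.
LEXICON_TEXT = """\
cindy s I n d i
coffee k O f i
"""


def parse_lexicon(text):
    """Return a dict mapping each word to its list of phonemes."""
    lexicon = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        word, *phonemes = line.split()
        lexicon[word.lower()] = phonemes
    return lexicon


lexicon = parse_lexicon(LEXICON_TEXT)

# The same letter "c" maps to a different phoneme in each word.
print(lexicon["cindy"][0])   # s
print(lexicon["coffee"][0])  # k
```

A real dictionary would hold tens of thousands of entries and may list several pronunciations per word; this sketch keeps a single pronunciation for simplicity.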

Acoustic Data

Acoustic data consists of hundreds of hours of target-language recordings with their respective transcriptions. These datasets are the most important part of the input to the training process. They are chosen to match what users wish to transcribe with the final software: the type of speech, the device (GSM, microphone), the quality of the recordings, and even the dialects in question. That is, if speech to text is to transcribe telephone conversations in American English, these crucial parameters must be taken into account when choosing the datasets involved in training.
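
The selection step described above could be sketched as a simple filter over dataset metadata. Everything in this example is hypothetical: the Recording fields, the dataset names, the hour counts, and the select_training_data helper are invented to show the idea and do not describe a real corpus.

```python
from dataclasses import dataclass


@dataclass
class Recording:
    # All fields are hypothetical metadata, invented for illustration.
    name: str
    dialect: str   # e.g. "en-US"
    channel: str   # e.g. "telephone" or "studio"
    hours: float


def select_training_data(recordings, dialect, channel):
    """Keep only the datasets that match the target dialect and channel."""
    return [r for r in recordings if r.dialect == dialect and r.channel == channel]


available = [
    Recording("calls_a", "en-US", "telephone", 120.0),
    Recording("studio_b", "en-US", "studio", 80.0),
    Recording("calls_c", "en-GB", "telephone", 60.0),
]

# Target use case from the text: telephone conversations in American English.
selected = select_training_data(available, dialect="en-US", channel="telephone")
total_hours = sum(r.hours for r in selected)
print([r.name for r in selected], total_hours)
```

In this toy example only one of the three datasets matches the target conditions, which is exactly the situation in which more matching data would need to be collected.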

Textual Data

This last group of data is a huge collection of texts that, like the recordings, is adapted to the target speech, for example, in terms of topic and style.

Training Process

When the data is collected, reviewed, and properly prepared, the training can begin.

The process is divided into three main parts that need input data and a long series of machine-learning processes.

As a result, three models are created (G2P, an acoustic model, and a language model), which are then put together in a packaging step to create new speech-to-text software:

The acronym G2P stands for "grapheme to phoneme". It forms the first part of the training and uses the pronunciation dictionary as input data. Through several iterations and statistical methods, the system learns how to transform a word into a series of phonemes.
To train an acoustic model, acoustic data sets and G2P are needed. Thanks to the complex training system using neural networks, the acoustic model is able to convert audio into phonemes.
The last process prepares a language model, here an n-gram model, which complements the acoustic model by estimating how probable various word sequences are. To prepare this model, the third type of data is used: textual data.
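
To make the idea of an n-gram model more concrete, here is a minimal sketch of a bigram model (n = 2) estimated from a three-sentence toy corpus. The corpus, the sentence-boundary markers, and the bigram_probability helper are assumptions for illustration. Production language models are trained on far larger text collections and use smoothing, which this sketch leaves out.

```python
from collections import Counter

# Tiny made-up corpus standing in for the "textual data".
corpus = [
    "cindy drinks coffee",
    "cindy drinks tea",
    "cindy likes coffee",
]

context_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    # "<s>" and "</s>" mark the start and end of a sentence.
    words = ["<s>"] + sentence.split() + ["</s>"]
    context_counts.update(words[:-1])            # every word that has a successor
    bigram_counts.update(zip(words, words[1:]))  # every adjacent word pair


def bigram_probability(previous, word):
    """Maximum-likelihood estimate of P(word | previous), without smoothing."""
    if context_counts[previous] == 0:
        return 0.0
    return bigram_counts[(previous, word)] / context_counts[previous]


# "drinks" follows "cindy" in 2 of the 3 sentences.
print(bigram_probability("cindy", "drinks"))
# "coffee" follows "drinks" in 1 of its 2 occurrences.
print(bigram_probability("drinks", "coffee"))
```

Because the estimates come straight from counts, any word pair that never appears in the text gets probability zero, which is one reason the textual data should resemble the target speech in topic and style.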

When all models are ready, the acoustic and language models are put together in a packaging process.

And that's it. The brand-new speech-to-text model is ready!

Conclusion

The preparation of speech-to-text software is a complex process that, as we have seen, relies on the input data and the particular model creation processes.

Both the quality of the chosen input data and the technological improvements are vital for creating well-performing speech-to-text software.

At Phonexia, we pay a lot of attention to both phases of this process. With each model and speech-to-text generation, we integrate improvements that help achieve increasingly positive results.

Additionally, we are currently experimenting with end-to-end systems such as transformers and conformers, as these may offer even higher speech-to-text accuracy.
