
How to Teach Speech Technologies to Understand Klingon – Part 1 – Data Collection


April 23, 2019


Did you know that speech technologies can understand any language on the planet (and possibly even in outer space)? You just have to teach them. This is why we have chosen Klingon – the constructed language created for Star Trek – to show how today's sci-fi-looking artificial intelligence technologies can learn almost anything.

Artificial intelligence can be described as the ability of a computer to think and learn. But as we know, learning is not always easy, and we still do not have Star Trek's amazing universal translator, which can speak with every species in the galaxy. Although we are getting close to the level of technology seen in Star Trek, and we can develop high-quality speech technology more easily than ever before, there are still challenges to overcome.

 

A Piece of Cake? Not at All!

It is surprising how simple machine learning – a branch of artificial intelligence in which systems learn and improve from experience without being explicitly programmed – sometimes seems to be. It looks as simple as feeding a bunch of data into an algorithm, and suddenly there it is: world-class artificial intelligence. That might be true in many cases, but it definitely doesn't work like this with recognizing speech. Speech is much tougher because there are almost limitless variations to handle, such as background noise, echo, different accents, recording quality… and the list goes on. All of these variations have to be represented in the training dataset so that the neural network works like a champ and correctly mimics the way the human brain operates.

I am sure you have noticed that when you talk in a noisy room, you unknowingly raise the pitch of your voice to talk over the clamor. Now imagine how noisy it must be on a Klingon starship. Klingons – or we humans – have no trouble understanding each other in such an environment; neural systems, however, have to be trained on similar data to cope with it.
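One common way to cover such variations without recording every possible environment is to mix noise into clean recordings at a controlled signal-to-noise ratio. The sketch below is a minimal illustration of that idea, not our production pipeline; the signals and the `mix_with_noise` helper are made up for the example.

```python
import numpy as np

def mix_with_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a noise signal into a speech signal at a target SNR (in dB)."""
    # Tile or trim the noise so it matches the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    # Scale the noise so the mix hits the requested signal-to-noise ratio.
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    noise = noise * np.sqrt(target_noise_power / noise_power)
    return speech + noise

# Stand-ins for a clean utterance and engine-room rumble, mixed at 10 dB SNR.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)  # 1 s tone at 16 kHz
rumble = rng.normal(0.0, 1.0, 4000)
noisy = mix_with_noise(clean, rumble, snr_db=10)
```

Repeating this for several noise types and SNR levels multiplies one clean recording into many realistic training samples.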

 

It All Starts with the Right Data

When we see Star Trek's universal translator at work, we notice how it learns a new language by listening to conversations. As the amount of speech grows, the computer gradually learns the language – and building a voice recognition system that understands Klingonese works similarly. To perform at a high-end level, it is necessary to acquire a lot of training data, and there is no way to skip this step.

Collecting voice data starts by bringing in real Klingons – or even non-Klingon Star Trek characters who learned to speak Klingon, such as Jean-Luc Picard – to record conversations in different environments, dialects, and accents. These recordings are then transcribed manually, so the computer has an exact representation of the spoken text to learn from. By recording the conversations, we get a range of sounds in a variety of voices. From there, an acoustic model is built that represents the relationship between an audio signal and the phonemes that make up speech. To complete the learning, we also need a good language model, which requires accurate transcriptions, a vocabulary list, and text sources. The language model provides the context needed to tell apart words and phrases that sound similar – and since Klingons are known for their passion for opera, those librettos, for example, could serve as one of the text sources.
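A toy example can show how a language model helps choose between transcriptions that sound alike: even simple word-pair (bigram) counts over transcripts prefer word orders they have actually seen. This is a deliberately minimal sketch – the tiny "libretto" corpus and the add-one smoothing are illustrative assumptions, not our actual model.

```python
from collections import Counter, defaultdict

def train_bigram_lm(sentences):
    """Count word bigrams from transcripts; a toy stand-in for a real language model."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(words, words[1:]):
            counts[prev][word] += 1
    return counts

def score(counts, sentence):
    """Relative likelihood of a candidate transcription under the bigram counts."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        total = sum(counts[prev].values())
        prob *= (counts[prev][word] + 1) / (total + 1)  # add-one smoothing
    return prob

# A tiny made-up corpus of romanized Klingon phrases.
corpus = ["Qapla' jaj", "Heghlu'meH QaQ jajvam", "QaQ jaj"]
lm = train_bigram_lm(corpus)
# The model prefers word sequences it has actually observed in the transcripts.
print(score(lm, "QaQ jaj") > score(lm, "jaj QaQ"))  # prints True
```

A production system would use far larger text sources (opera librettos included) and a much more sophisticated model, but the principle is the same: context decides between acoustically similar candidates.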

 

The More Real the Better

A huge part of the success is having training data that is as similar as possible to the real data that will later be transcribed. That is why, in teaching the technology to fully understand Klingon, we could use, for example, a tricorder – another Star Trek invention also used for recording. Different recording devices have different characteristics, so collecting data with devices similar to the ones our customers will use is key. The training dataset also has to fulfill other requirements. To build a robust model, we need to collect around 1,000 hours of speech, ideally recorded from thousands of Klingons of various ages and genders holding spontaneous conversations with each other.
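The targets above can be checked mechanically once recordings start coming in. The sketch below is a simplified illustration of such a check; the `dataset_ready` helper and the metadata fields (`speaker`, `seconds`) are assumptions made for the example, not an existing tool.

```python
def dataset_ready(recordings, target_hours=1000, min_speakers=1000):
    """Check a corpus against rough targets: total hours and speaker diversity."""
    total_hours = sum(r["seconds"] for r in recordings) / 3600
    speakers = {r["speaker"] for r in recordings}
    return total_hours >= target_hours and len(speakers) >= min_speakers

# A toy corpus with only two speakers, clearly short of both targets.
corpus = [
    {"speaker": "Worf", "seconds": 5400},
    {"speaker": "Gowron", "seconds": 7200},
]
print(dataset_ready(corpus))  # prints False
```

In practice you would also track balance across age, gender, dialect, and recording device, since a model is only as robust as the diversity of its data.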

A short lesson to finish: if you hear a Klingon saying ’uH, you know he had a hard night – he is telling you he has a hangover, so watch out.

Live long and prosper!

