Speech processing is divided into coding, recognition and synthesis. Speech coding is used, for example, in a mobile phone to convert the signal from the microphone into a stream of bits, which is then sent over a digital channel. Synthesis creates previously unseen speech from text. Phonexia is not much involved in coding and synthesis and concentrates on recognition. Speech recognition is divided into:
speech transcription (also large vocabulary continuous speech recognition, LVCSR, or speech to text, S2T) - the system transcribes the input speech signal into human-readable words or into recognition graphs (lattices) that are useful for information retrieval systems.
keyword spotting (also KWS, keyword detection) - does not aim at a full transcription of the speech signal but concentrates on keywords defined by the user. Each keyword detection can be accompanied by time stamps (where the keyword occurred) and a confidence (how likely it is that the keyword really occurred). KWS can work in off-line mode (browsing stored material) or on-line.
speaker recognition (also speaker detection, speaker verification) - does not aim at recognizing "what was said" but rather "who said it". In speaker identification, the task is to assign a speech signal to one of N speakers. In speaker verification, the claimed identity is known and the question to be answered is "was the speaker really Mr. XYZ or an impostor?".
language identification (LID, also language recognition) - detects the language in which a particular speech segment was spoken.
gender identification (GID, also gender recognition) - determines whether the speech comes from a man or a woman. In applications like speaker recognition, this simple tool can narrow the search space by half. In other applications, gender-specific models can be used for more precise modeling.
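The difference between speaker identification and speaker verification described above can be sketched in a few lines. This is only an illustration: the function names, the speaker labels, the score values and the threshold are all hypothetical, and the sketch assumes that some recognizer has already produced one similarity score per enrolled speaker (higher meaning more similar).

```python
from typing import Dict


def identify(scores: Dict[str, float]) -> str:
    """Identification: assign the signal to the best-scoring of N enrolled speakers."""
    return max(scores, key=scores.get)


def verify(scores: Dict[str, float], claimed: str, threshold: float) -> bool:
    """Verification: accept the claimed identity only if its score reaches the threshold."""
    return scores[claimed] >= threshold


# Hypothetical similarity scores for one recording against three enrolled speakers.
scores = {"speaker_a": 0.12, "speaker_b": 0.87, "speaker_c": 0.45}

best = identify(scores)                                 # one out of N
accepted = verify(scores, "speaker_c", threshold=0.6)   # claimed speaker or impostor?
```

Identification always returns one of the N enrolled speakers, while verification answers a yes/no question about a single claimed identity, so only verification needs a decision threshold.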
The above recognition techniques provide huge amounts of metadata. All modalities produce their results in the form of output labels (for example for LID), strings or recognition lattices. To find the required information as fast as possible, Phonexia offers an indexing and search engine. In off-line mode, this engine pre-processes the available recognition results into forward and reverse indexes and evaluates the confidence of each entry. In search mode, the engine works with these indexes, so that results are available in "Google-like" fashion within a fraction of a second.
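The text does not say how the forward and reverse indexes are laid out, so the following is only a minimal sketch of the general idea under stated assumptions: recognition results are taken to be lists of (word, start time, confidence) entries per recording, and the names `build_indexes`, `search`, `call_001` and `call_002` are hypothetical. It is not a description of the Phonexia engine.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumed entry format: (word, start time in seconds, confidence in [0, 1]).
Hit = Tuple[str, float, float]


def build_indexes(results: Dict[str, List[Hit]]):
    """Off-line step: turn per-recording results into forward and reverse indexes."""
    forward = {}                 # recording id -> list of (word, start, confidence)
    reverse = defaultdict(list)  # word -> list of (recording id, start, confidence)
    for rec_id, hits in results.items():
        forward[rec_id] = list(hits)
        for word, start, conf in hits:
            reverse[word].append((rec_id, start, conf))
    # Keep the most confident occurrences first so a search can return them directly.
    for postings in reverse.values():
        postings.sort(key=lambda p: p[2], reverse=True)
    return forward, dict(reverse)


def search(reverse, word: str, min_conf: float = 0.0):
    """Search step: a dictionary lookup plus a confidence filter."""
    return [p for p in reverse.get(word, []) if p[2] >= min_conf]


# Hypothetical recognition output for two recordings.
results = {
    "call_001": [("hello", 0.4, 0.95), ("invoice", 3.1, 0.62)],
    "call_002": [("invoice", 1.7, 0.88)],
}
forward, reverse = build_indexes(results)
hits = search(reverse, "invoice", min_conf=0.7)
```

Because the expensive work happens once in `build_indexes`, a query reduces to a lookup in the reverse index, which is what makes sub-second search over large archives plausible.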
Few rule-based techniques can be used for speech processing; most of our algorithms are based on speech data. To train a recognizer, one needs a significant amount of data with appropriate transcriptions. For LVCSR, for example, exact text transcripts are needed. For speaker recognition, training speech with the correct speaker identity is needed, and so on. In case a speech database for a particular application is not available, Phonexia offers two solutions:
Phonexia has extensive experience with high performance computing (HPC) systems oriented toward massively data-parallel applications, and with the appropriate storage systems as well. We offer consultancy, analyses and support for building HPC systems based on either clusters or SMP commodity hardware and the Linux operating system, together with other open source technologies.