What Is Voice Biometrics?
Voice biometrics is a technology that utilizes the unique characteristics of the human voice for speaker identification, authentication, and forensic voice analysis.
Why is every person’s voice unique? As an audible pressure wave (typically caused by the vibration of a solid object), sound propagates through the air and modulates when it hits obstacles.
In the case of the human voice, this wave is produced when air travels from the lungs through the vocal folds (vocal cords), causing them to vibrate. The wave is then further modulated in the vocal tract by the muscles of the larynx (commonly called the voice box) and by the articulators – tongue, palate, cheeks, gums, teeth, lips, etc.
Each human voice is unique because of the individual form and size of the vocal organs and the manner in which they are used. For example, women and children usually have smaller larynxes and shorter vocal cords – that is why their voices are often higher.
The movements of the vocal organs are also unique – most of them are learned in childhood and reflect the individual manner of speech. Here is the essence of voice biometrics: If the frequency and dynamics of the evolution of the sound wave produced by human vocal organs can be analyzed and represented in mathematical form, this representation will also be unique, making it possible to identify a speaker.
The mathematical model of a human voice is called a voiceprint. Voiceprints can be stored and compared to other voiceprints. The comparison of voiceprints makes it possible to identify a person by voice, perform forensic voice analyses, and even estimate additional biological characteristics such as a person’s gender or age group (to a certain degree).
Today’s most advanced voice biometric technologies are language-, text-, accent-, and channel-independent. Thanks to artificial intelligence (AI), voice biometric technologies have become highly accurate and efficient, requiring only a few seconds of free speech to authenticate a person via their voice.
Main Use Cases of Voice Biometrics
The accuracy, efficiency, and seamless nature of voice biometric technology make it applicable to many use cases across a wide range of industries.
Call Centers of Banks, Retail Finance, Telco, Insurance, and Utility Companies
- Passwordless voice authentication
- An additional security layer without additional security questions
- Improved Customer Experience (CX)
- Identity theft and subscription fraud detection
Law Enforcement Agencies
- Speaker identification
- Speaker search in a large number of recordings
- Prevention of fake emergency calls
- Gender identification
- Age estimation
- Height estimation
- Face visualization
- Automatic forensic voice comparison
- Time-efficient voice analysis
- Unbiased forensic voice analysis for court-submissible evidence
- Voice biometric verification of remote employees
- Secure online conferences
- Prevention of internal fraud
- Secure ordering of services and goods via voice assistants (voicebots)
- Personalized voice interfaces in cars and smart homes
How Does Voice Biometrics Technology Work?
The process of voice biometric identification consists of two steps:
- Voiceprint extraction – a voice biometric system analyzes a voice sample and creates a mathematical model of the person’s voice (a voiceprint). If the system is analyzing the person’s voice for the first time, this phase is also called voice enrollment.
- Voiceprint comparison – the extracted voiceprint is compared with other stored voiceprints to find a match necessary for successful speaker verification or speaker identification.
Of these two steps, voiceprint extraction is more time-consuming while voiceprint comparison is very fast – millions of voiceprint comparisons can be performed in a second.
Voiceprint Extraction (Voice Enrollment)
How is a voiceprint extracted? An acoustic wave can be described as a waveform or represented by a spectrogram.
The spectrogram provides a more detailed analysis of the acoustic wave - the vertical axis represents frequency, the horizontal axis represents time, and the brightness describes the amplitude of the wave.
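The frequency-versus-time layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input is a synthetic 1,000 Hz tone rather than speech, the sample rate, frame size, and hop are arbitrary choices, and the naive transform is written for readability, not speed. The function name `spectrogram` is hypothetical and does not reflect how any particular voice biometric product computes its features.

```python
import cmath
import math

SAMPLE_RATE = 8000  # Hz, assumed telephone-quality audio
FRAME_SIZE = 64     # samples per analysis frame (arbitrary choice)
HOP = 32            # samples between the starts of consecutive frames
TONE_HZ = 1000      # synthetic test tone standing in for speech


def spectrogram(samples, frame_size=FRAME_SIZE, hop=HOP):
    """Return a list of frames; each frame lists one magnitude per frequency bin."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        # A Hann window reduces spectral leakage at the frame edges.
        windowed = [
            s * (0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1)))
            for n, s in enumerate(frame)
        ]
        magnitudes = []
        # Naive discrete Fourier transform; keep bins up to the Nyquist frequency.
        for k in range(frame_size // 2 + 1):
            acc = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(windowed)
            )
            magnitudes.append(abs(acc))
        frames.append(magnitudes)
    return frames


signal = [math.sin(2 * math.pi * TONE_HZ * t / SAMPLE_RATE) for t in range(512)]
spec = spectrogram(signal)

# The "brightest" bin of a frame corresponds to the dominant frequency.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_SIZE
print(len(spec), "frames,", len(spec[0]), "bins per frame, peak at", peak_hz, "Hz")
```

Each inner list is one vertical slice of the spectrogram (frequency), the outer list runs along the horizontal axis (time), and the stored magnitudes play the role of brightness.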
Based on the spectrogram analysis, a voice biometric system analyzes the characteristics and dynamics of the acoustic wave the person produces (voice) and creates a mathematical model (typically a set of floating point numbers) that represents the unique features of the person’s voice.
Statistical and AI methods are used to find the right set of numbers to represent the shapes, sizes, and movements of the person’s vocal organs. This mathematical model of a voice is called a voiceprint.
When a person’s voiceprint is being created for the first time (voice enrollment), a few tens of seconds of the person’s speech is usually required to create a robust voiceprint for future voiceprint comparisons.
Voiceprint extraction (voice enrollment) can either be active or passive:
- Active voiceprint extraction means that the person is actively taking part in the verification process, usually repeating a particular phrase or sequence of words presented by the system.
- Passive voiceprint extraction, on the other hand, extracts a person’s voiceprint seamlessly during a natural conversation with a contact center agent without any conscious effort from the person.
Voiceprints are then stored in a voiceprint database in a particular format unique to each voice biometrics company. For that reason, voiceprints are not compatible with other voice biometric systems (vendors).
It is also impossible to recreate the original speech from a voiceprint. Therefore, the content of the speech will always remain anonymous.
Voiceprint Comparison
Once the voiceprint from the enrollment process is stored in a database, it can be instantly compared with any other voiceprint extracted from just a few seconds of speech.
Voiceprints can be compared as:
- One-to-one (1:1) for speaker verification and forensic voice analysis
- One-to-many (1:N) for speaker identification, speaker search, and speaker spotting
- Many-to-many (N:M) for speaker clustering (as well as for speaker identification, speaker search, and speaker spotting)
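The three comparison modes can be illustrated with a small sketch. It assumes that voiceprints are plain lists of floating point numbers and uses cosine similarity as a stand-in scoring function; actual vendors use their own proprietary formats and scoring methods, so the function names (`verify`, `identify`, `score_matrix`) and the toy four-dimensional voiceprints below are purely hypothetical.

```python
import math


def cosine_score(a, b):
    """Cosine similarity between two voiceprints (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def verify(probe, enrolled):
    """1:1 comparison: one probe against one enrolled voiceprint."""
    return cosine_score(probe, enrolled)


def identify(probe, database):
    """1:N comparison: return (speaker_id, score) of the best-matching voiceprint."""
    return max(
        ((speaker_id, cosine_score(probe, vp)) for speaker_id, vp in database.items()),
        key=lambda pair: pair[1],
    )


def score_matrix(probes, database):
    """N:M comparison: score every probe against every enrolled voiceprint."""
    return {
        probe_id: {sid: cosine_score(p, vp) for sid, vp in database.items()}
        for probe_id, p in probes.items()
    }


# Toy voiceprints for illustration only.
database = {
    "alice": [0.9, 0.1, 0.3, 0.2],
    "bob": [0.1, 0.8, 0.2, 0.6],
}
probe = [0.85, 0.15, 0.35, 0.1]

print(verify(probe, database["alice"]))
print(identify(probe, database))
print(score_matrix({"call_1": probe}, database))
```

Because each comparison is just a handful of arithmetic operations on short numeric vectors, it is plausible that very large numbers of comparisons can be performed quickly, in line with the speed described earlier.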
The result of each voiceprint comparison is presented as a score that reflects the probability that two voiceprints match (the speaker is verified) or that the voiceprint matches one of the stored ones (the speaker is identified).
The score is a function of the ratio between two probabilities: the probability that the two estimated voiceprints belong to the same person and the probability that they belong to different people:

$$\text{score} = f\left(\frac{P(\text{same speaker})}{P(\text{different speakers})}\right)$$
Whether a speaker is verified (or identified) depends on the score acceptance threshold, which can be set individually for any particular use case.
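As a rough illustration, the score and the threshold decision might look like the sketch below. It assumes that the score is the natural logarithm of the probability ratio (the log-likelihood ratio defined in the glossary); the exact function used by a given system is not specified here, and the probabilities and thresholds are made-up values.

```python
import math


def log_likelihood_ratio(p_same, p_different):
    """Log of the ratio between the same-speaker and different-speaker probabilities."""
    return math.log(p_same / p_different)


def is_verified(score, threshold):
    """Accept the speaker when the score reaches the acceptance threshold."""
    return score >= threshold


# Hypothetical probabilities of the observed voiceprints under each hypothesis.
score = log_likelihood_ratio(p_same=0.08, p_different=0.002)

# The same score can lead to different decisions under different thresholds.
print(round(score, 2))
print(is_verified(score, threshold=2.0))  # lenient threshold
print(is_verified(score, threshold=5.0))  # strict threshold
```

A positive score means the same-speaker hypothesis is more likely than the different-speaker one; how far above zero the score must be before a speaker is accepted is a choice made for each use case.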
How Accurate Is Voice Biometrics?
There are two types of errors that can occur during the process of voice biometric authentication - False Acceptance (FA) and False Rejection (FR).
In other words, after comparing two voiceprints, the voice biometric system can:
- Incorrectly accept a speaker (an imposter, a fraudster, etc.) as a valid user (FA)
- Wrongly reject a valid user (FR)
Depending on the use case, the voice biometric system can be fine-tuned (by choosing an appropriate value of the score threshold above which a person is verified or identified) to be either more secure, with a lower False Acceptance Rate (FAR), or more lenient, with a lower False Rejection Rate (FRR).
The dependency between FAR and FRR is described by the Detection Error Tradeoff (DET) curve (the red line in the graph below):
As can be deduced from the graph above, the FAR and FRR of voice biometric systems are interdependent.
Increasing the score acceptance threshold decreases the FAR and increases the FRR, which is useful when high security is required.
Conversely, for police or law enforcement agencies, any suspicious caller might be important, so decreasing the score acceptance threshold may help detect a criminal. The FAR rises, but the FRR goes down, which lowers the chance of missing a person of interest.
The point at which the False Acceptance Rate equals the False Rejection Rate is called the Equal Error Rate (EER). This percentage value is typically used for the overall evaluation of the voice biometric system’s accuracy; the lower the EER, the more accurate the system.
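The relationship between the threshold, FAR, FRR, and EER can be made concrete with a small sketch. The scores below are invented for illustration, and the helper names `error_rates` and `approximate_eer` are hypothetical; a real evaluation would use many thousands of genuine and impostor comparisons and typically interpolate between thresholds.

```python
def error_rates(genuine_scores, impostor_scores, threshold):
    """Return (FAR, FRR) for a given acceptance threshold."""
    false_accepts = sum(1 for s in impostor_scores if s >= threshold)
    false_rejects = sum(1 for s in genuine_scores if s < threshold)
    return (
        false_accepts / len(impostor_scores),
        false_rejects / len(genuine_scores),
    )


def approximate_eer(genuine_scores, impostor_scores):
    """Try each observed score as a threshold; return the one where FAR and FRR are closest."""
    best = None
    for threshold in sorted(set(genuine_scores) | set(impostor_scores)):
        far, frr = error_rates(genuine_scores, impostor_scores, threshold)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, threshold, far, frr)
    _, threshold, far, frr = best
    return threshold, (far + frr) / 2


# Made-up comparison scores for illustration only.
genuine = [2.1, 3.4, 4.0, 4.8, 5.5, 6.1, 6.7, 7.2, 8.0, 9.3]
impostor = [-3.0, -2.2, -1.5, -0.7, 0.1, 0.9, 1.6, 2.5, 3.7, 4.3]

# Raising the threshold lowers FAR and raises FRR.
for threshold in (1.0, 3.0, 5.0):
    far, frr = error_rates(genuine, impostor, threshold)
    print(f"threshold={threshold}: FAR={far:.0%}, FRR={frr:.0%}")

eer_threshold, eer = approximate_eer(genuine, impostor)
print(f"EER is roughly {eer:.0%} near threshold {eer_threshold}")
```

Plotting the (FAR, FRR) pairs for all thresholds would trace out the kind of tradeoff curve described above, with the EER at the point where the two rates meet.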
For instance, in 2021, according to an evaluation by the Zurich Forensic Science Institute based on the forensic_eval_01 method, the Phonexia Speaker Identification system achieved a 1.2% EER after calibration, becoming the world’s most accurate voice biometrics technology for forensic voice comparisons available on the market at the time.
The accuracy of today’s latest generations of speaker recognition systems powered by Deep Neural Networks (DNNs) is extremely high.
The accuracy of voice biometric solutions can be further enhanced through calibrations that take into account the required FAR (keeping it low while ensuring FRR is within an acceptable range to offer a fine balance between security and customer experience) and also consider the unique characteristics of the voice channel and language.
How Secure Is Voice Biometrics?
There are three basic types of authentication based on:
- Something you have (e.g., an ID card, a key, a security token)
- Something you know (e.g., a password, a security question, a PIN)
- Something you are (e.g., a fingerprint, voice, face, iris)
Cards, tokens, and keys can be lost and counterfeited. Passwords and secret information can be obtained through data breaches. But it is incredibly difficult to falsify a person’s biometrics.
And this is especially true for modern voice biometric verification that can verify a person’s voice continuously in the background (regardless of language and words spoken) throughout the entire conversation.
Furthermore, under the EU’s General Data Protection Regulation (GDPR), biometric data such as a voiceprint used to identify a person belongs to the special categories of personal data and needs to be handled with additional security measures (an approach widely accepted even outside the EU).
As voiceprints are saved into a voiceprint database in a particular format unique to each voice biometrics vendor, they are incompatible with other voice biometric systems. It is also impossible to recreate the original speech or the person’s voice from the saved voiceprint (it cannot be reverse engineered).
The way voiceprints are generated naturally supports data security requirements outlined by GDPR and other similar data privacy policies.
Voice biometrics is an authentication method that improves security and customer experience simultaneously.
The Difference Between Active and Passive Voice Biometrics
There are two types of voice biometrics that can be used in a voice authentication process:
Active Voice Biometrics
The word “active” refers to the fact that a user has to actively participate in the authentication process and pay attention to it, for example, by saying a particular word or a combination of words such as “my voice is my password”.
Passive Voice Biometrics
On the contrary, passive voice biometric authentication (verification) is performed seamlessly during a natural conversation – the voice biometric system compares the voice of a person with the saved voiceprint regardless of spoken words.
This type of voice biometrics does not require any attention from the user and is also language-independent.
Below are the main differences between active and passive voice biometrics:
Active Voice Biometrics
- Requires a person to repeatedly say a particular phrase or set of words to enroll their voice. Once enrolled into the system, the person has to say this phrase or set of words to get authenticated.
- Occurs at the beginning of a conversation.
- Requires a person’s attention: voiceprint enrollment and further verification require a person’s effort and time.
Passive Voice Biometrics
- Allows enrolling a person’s voice during a natural conversation. Once the voiceprint is created, the person is then authenticated during the first few seconds of a natural conversation and can be accepted or rejected without even knowing it (no attention required).
- Can continue throughout the entire conversation to detect when someone else is speaking instead.
- Faster and more convenient for a person: a person only needs to start a natural conversation with an agent for both voice enrollment and voice verification.
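The idea of passive verification continuing throughout a conversation can be sketched as follows. The sketch assumes that the call is split into consecutive segments of a few seconds each and that each segment has already been scored against the caller’s enrolled voiceprint; the function name `monitor_conversation`, the scores, and the threshold are all hypothetical.

```python
def monitor_conversation(segment_scores, threshold):
    """Return the indices of segments whose score falls below the threshold."""
    return [i for i, score in enumerate(segment_scores) if score < threshold]


# Hypothetical scores for consecutive few-second segments of one call,
# each compared against the caller's enrolled voiceprint.
scores = [6.2, 5.8, 6.5, 1.1, 0.7, 6.0]

suspicious = monitor_conversation(scores, threshold=3.0)
print(suspicious)  # segments where someone else may be speaking
```

In this toy example, the fourth and fifth segments score well below the others, which is the kind of signal a system could use to flag that a different person has taken over the call.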
The Advantages of Using Voice Biometrics
The human voice is a natural part of every spoken conversation and, therefore, is always available for voice biometric verification.
An up-to-date voice biometrics technology (relying on hundreds of voice characteristics affected by the unique physiology and movement of a human’s vocal tract) can identify (authenticate) a person seamlessly and securely by voice.
This is especially useful for call centers of banks, retail finance, telco, insurance, and utility companies, as well as for smart homes, voice assistants, government institutions, and the healthcare industry.
Voice authentication is an easier and more secure authentication method than knowledge-based authentication – a customer does not have to share or keep any secret information that could be stolen or hacked.
Passive voice biometric authentication enables accurate identification of a person’s voice even after only a few seconds of a natural conversation with an agent (and during the whole conversation, if necessary).
Customers can access their accounts seamlessly and securely while their customer experience is greatly enhanced at the same time.
Voice biometrics makes it much harder for fraudsters to perform contact center fraud based on identity theft or fictional identities – it can detect fraudsters automatically based on voice.
As an authentication method, voice biometrics technology reduces the authentication time significantly, shortening it by more than 30 seconds per average call. An agent can use this time to take care of a customer’s request instead. This improves customer experience, saves company costs, and increases ROI.
Voice biometrics is also a great advantage for law enforcement agencies and their investigators. Whenever there is a need to identify and search for a person’s voice in a large amount of audio, voice biometrics technology can do this efficiently and automatically in real time (investigators do not need to listen to each recording manually).
Forensic experts use cutting-edge voice biometrics technology for efficient automatic forensic voice comparisons to provide fast and unbiased forensic voice analysis.
Last but not least, voice biometrics technology respects a customer’s privacy as it compares voices using voiceprints from which the original audio recording (as well as speech and voice) cannot be recreated.
What Physical Characteristics Can Be Identified with Voice Biometrics?
The characteristics of a human voice depend on the size and shape of the vocal tract (vocal cords, larynx, articulators, etc.) as well as on the person’s way of speaking – native language, accent, and some other speech characteristics are learned in childhood.
The Gender of a Speaker
For instance, the longer and thicker the vocal cords are, the lower the voice. Modern AI-powered voice biometrics systems can learn to distinguish between female and male voices with excellent accuracy.
Voice biometric gender identification can be very useful for the personalization of calls with a voice assistant (voicebot), for the automatic categorization of calls in a contact center, or for the fast filtering of audio data based on a speaker’s gender.
Estimation of a Speaker’s Age Group
Based on the physical characteristics of vocal cords and changes in the vocal organs that occur during aging, a voice biometrics system can estimate the age group of an individual to a certain degree.
Although current age estimation technologies cannot yet estimate a person’s age to within just a few years, they can still be used as a supportive voice biometric technology, for example, to automatically detect when an elderly person is talking to a virtual assistant and adjust the speed and style of the conversation to make it more comfortable for that person.
Visualization of a Speaker’s Face
Voice biometrics technology can estimate more than obvious personal characteristics such as gender and age. Surprisingly, deep learning algorithms are also able to reconstruct an approximate image of a person’s face using just a short audio recording.
The technology is based on the strong connection between speech and appearance: both correlate with age, gender, the shape of vocal organs, the structure of facial bones, and other physiological features.
A Speaker’s Height
Voice biometrics can also estimate a person’s height. Similar to gender and age estimation, deep neural networks can analyze voice for patterns that correlate with body height.
Frequently Asked Questions About Voice Biometrics
Why should I use voice biometrics?
Voice biometrics is a great alternative or addition to traditional authentication methods:
- Unlike traditional methods such as passwords, tokens, and security questions, voice cannot be forgotten or lost – it is always a natural part of every spoken conversation.
- Passive voice biometrics makes the authentication process extremely fast and seamless, saving customers’ and contact center agents’ time.
- Security and customer experience are both improved at the same time.
- Voice biometrics significantly reduces identity theft and subscription fraud in call centers.
Is voice better than fingerprints or other biometric factors?
Voice is as unique as a fingerprint and no less reliable than other biometric factors like face, retina, and iris. What distinguishes voice is that it is much easier to use – it does not require any scanner or other device except a microphone, and it can be used over the telephone line.
Can someone mimic or record my voice and get access to my account?
The human voice is unique due to the complex physiology of the human vocal tract and the individual movements of its muscles. A listener might not hear a difference between a genuine speaker and a professional impersonator, but an AI-powered voice biometric system can recognize differences in hundreds of voice characteristics that are undetectable by a human.
What if I have a cold, dental anesthesia, or other issues affecting my voice and speech?
Voiceprints are created based on hundreds of voice characteristics. Therefore, if a person’s voice is affected by a cold or illness, there is still a high probability that a voice biometrics system will recognize the person’s voice.
How does background noise affect the voice biometrics system’s performance?
Poor speech signal quality (background noise, reverberation, multiple speakers speaking over each other) can reduce the accuracy of voice biometrics. This can be mitigated to some degree by calibrating a voice biometrics system for a given voice channel (using the voice channel’s real-world data).
Do I need an external microphone to use voice biometrics?
No, you can use any microphone built into your smartphone, laptop, or headset. But for the best voice recognition accuracy, it is recommended to create a voiceprint using the same device that is usually used by the person.
Can I be verified if I use different devices?
Yes. While the best results are achieved when the same devices are used during voice enrollment and voice verification, state-of-the-art voice biometric technologies are able to compensate for the source differences.
What personal data is used for voice biometric enrollment and authentication, and how is it protected?
For voice enrollment, a short sample of a person’s speech (voice) is needed to create a voiceprint, which is then stored in a secure database. For voice biometric authentication, another voiceprint is created from just a few seconds of the person’s speech and compared to the saved one.
Every vendor uses their own technology for voiceprint creation, which makes voiceprints incompatible with other voice biometrics solutions. Therefore, if a voiceprint is stolen from a database, it is useless because it cannot be used outside the system in which it was created.
Furthermore, as voiceprints contain only a limited set of values, they cannot be used to recreate the original audio recording of a person’s speech.
Voice Biometrics Software
Phonexia Speaker Identification (SID)
This cutting-edge voice biometrics technology from Phonexia can identify a person based on voice in just a few seconds of speech, regardless of language, accent, and words spoken.
It is used for automatic speaker identification, passive voice biometric verification, and speaker search in large amounts of audio by organizations and businesses worldwide.
Phonexia Voice Inspector
Powered by Phonexia’s best voice biometrics and designed specifically for forensic experts, this unique voice analysis software performs automatic, unbiased, and highly accurate forensic voice comparisons to support investigations and provide evidence in court.
Phonexia Gender Identification
Taking advantage of advanced deep neural networks, this Phonexia voice biometrics technology can identify a speaker’s gender with high accuracy, regardless of language, accent, and words spoken.
Voice Biometrics Glossary
- Active Voice Biometrics is a voice biometric authentication technology that requires a person to say a specific phrase to be authenticated by a voice biometrics system.
- Biometrics is the measurement of people’s unique body and behavioral characteristics that can be used for identification (a fingerprint, a voiceprint, etc.).
- Conversational Voice Biometrics – see Passive Voice Biometrics.
- Detection Error Tradeoff (DET) Curve is a graph that shows the relation between the False Acceptance Rate (FAR) and False Rejection Rate (FRR) of a voice biometric technology.
- Diarization is a voice biometrics technology for the automatic segmentation of multiple speakers based on their voices.
- Equal Error Rate (EER) is an operating point of a voice biometrics system at which the False Acceptance Rate (FAR) equals the False Rejection Rate (FRR). It is commonly used as a single-number measure of the accuracy of a voice biometrics system (the lower the EER, the better).
- False Acceptance Rate (FAR) is the percentage of errors made by a voice biometric system during voiceprint comparison when it decides these voiceprints belong to the same speaker while, in fact, they belong to different speakers.
- False Alarm Rate – see False Acceptance Rate.
- False Rejection Rate (FRR) is the percentage of errors made by a voice biometric system during voiceprint comparison when it decides these voiceprints belong to different speakers while in fact they belong to the same speaker.
- Free Speech Voice Biometrics – see Passive Voice Biometrics.
- Log-Likelihood Ratio (in speaker identification) is a logarithm of the ratio between two probabilities: the probability that the speakers in two voiceprints are the same versus the probability that they are two different people.
- Passive Voice Biometrics (also Free Speech Voice Biometrics, Conversational Voice Biometrics) is a voice biometrics technology that identifies a person’s voice naturally during a conversation, regardless of language, accent, and words spoken.
- Speaker Identification (SID) (also Speaker Recognition) is a voice biometrics technology that compares the voiceprint of a speaker with other saved voiceprints in a database to answer the question, “who is speaking?”
- Speaker Recognition – see Speaker Identification.
- Speaker Verification is a voice biometrics authentication method that compares the voiceprint of a currently speaking person one-to-one with the enrolled voiceprint saved in a database to answer the question, “is the speaker really who they claim to be?”
- Spectrogram is a graphical representation of an audio wave where the vertical axis displays frequency, the horizontal axis displays time, and the brightness represents the amplitude of the wave.
- Speech2Face is a technology for the reconstruction of a facial image of a person based on their voice.
- Vocal Tract is the cavity area in the human body from the nose and the nasal cavity down to the vocal cords in the throat where the sounds are filtered.
- Voice-Based Age Estimation is a voice biometrics technology that can estimate the age group of a speaker based on voice.
- Voice-Based Gender Identification is a voice biometrics technology that can identify a person’s gender based on voice.
- Voice Biometric Authentication is a check based on voice biometrics (see Speaker Verification) to determine whether a person accessing an account is really the owner of the account.
- Voice Biometrics is the measurement of people’s unique voice characteristics that can be used for identification and the calculation of physical attributes.
- Voice Biometrics Accuracy is the accuracy of a voice biometrics technology usually measured by the percentage of errors made when comparing voiceprints (see Equal Error Rate).
- Voice Enrollment is a process during which a speaker’s voice is analyzed with a voice biometrics technology, and a unique voiceprint is then saved into a database for future voice comparisons.
- Voice Identification is a voice biometrics approach that compares a given voiceprint with voiceprints already saved in a database to find the best match.
- Voice Recognition – see Voice Verification.
- Voice Verification is a voice biometrics approach that compares a voiceprint extracted from a person’s speech with an already saved (enrolled) voiceprint of this person to determine the probability that the person is who they claim to be.
- Voiceprint is the mathematical model of a human’s voice extracted by voice biometrics technology from an audio recording of the person’s speech.
Additional Voice Biometrics Resources
- Learn about the challenges behind the implementation of voice biometrics in a call center and discover how to use it to improve customer experience: The Essential Guide to Implementing Voice Authentication in Call Centers.
- This unique anonymized case study will take you through the challenges of a European Law Enforcement Agency and how it used Phonexia’s voice biometrics to suppress organized crime: The Unit for the Suppression of Organized Crime Case Study.
- Check out the Phonexia blog for the latest articles about voice biometrics and speech technologies.
- Read our eBook about solving the biggest problems of inbound call centers with modern technologies.