The Perils of Voice Recognition Passwords

June 2, 2017

With voice recognition systems constantly evolving, some companies and banks have started to offer their clients the option of a voice-based password, promoted as the ultimate form of security. However, the BBC recently reported on a test performed by one of its reporters and his twin, which exposed the dangers of relying solely on this kind of system to prevent unauthorized access to sensitive customer data.

According to the BBC, reporter Dan Simmons deliberately attempted to fool the voice recognition software that HSBC, one of the largest banking organizations in the world, had implemented to prevent bank fraud. He let his non-identical twin call the service and mimic his voice in order to gain access to his account. Although it took eight attempts in total, the twin passed the voice verification and was allowed to access his brother’s sensitive data, including his balances and recent transactions.

Inevitable Failure

We always expected that a failure like that of the HSBC voice biometry system would happen one day. That is one of the reasons why we recommend that banks use our solution for background voice verification during a call (where the voice biometry output is presented to a call center agent as an additional verification signal) or for fraud detection, rather than simply using voice recognition as a password for their clients.

It is important to note that good use-case design is essential here, as even the best speaker ID systems have an equal error rate of about 1% (the operating point at which false acceptances and false rejections are equally likely). Unfortunately, there are two groups of people in banks: those in marketing and those in security. These two groups fight each other, and security often loses to user convenience when priorities are set. As a result, the false acceptance rate may increase (to 5-10%) purely because of concessions to user convenience.
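To put these rates next to the eight attempts from the BBC test, here is a minimal back-of-the-envelope sketch. It assumes that every attempt is an independent trial with the same false acceptance rate, which is a simplification: repeated attempts by the same impostor are correlated, and a twin is not a typical impostor. The function name is ours and purely illustrative.

```python
def cumulative_false_accept(far: float, attempts: int) -> float:
    """Probability that at least one of `attempts` impostor tries is accepted.

    Assumes independent attempts with an identical per-attempt false
    acceptance rate `far` (a simplifying assumption, not a measured model).
    """
    if not 0.0 <= far <= 1.0:
        raise ValueError("far must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    return 1.0 - (1.0 - far) ** attempts


if __name__ == "__main__":
    # Per-attempt rates taken from the text: about 1% at best, 5-10% after concessions.
    for far in (0.01, 0.05, 0.10):
        p = cumulative_false_accept(far, 8)
        print(f"FAR per attempt {far:.0%}: accepted within 8 attempts with probability {p:.1%}")
```

Under these assumptions, eight unrestricted attempts turn a 1% per-attempt rate into roughly an 8% chance of a break-in, and a 10% rate into roughly 57%. This is why limiting the number of retries matters at least as much as the raw accuracy of the speaker ID engine.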

Furthermore, while some people in the population have rather distinctive voices, others have very average voices, and voice biometry is understandably less suitable for the latter. Per-user calibration is also crucial, and we cannot be sure how the voice biometrics vendor implemented it in the HSBC system. To reach the same level of security, people with average voices should be asked to speak longer sentences or to repeat the sentence several times.
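The effect of repeating a sentence can be illustrated with a toy model. The sketch below assumes that genuine and impostor scores are Gaussian with equal variance, that repeated utterances give independent scores, and that the system averages them. Real systems violate all three assumptions to some degree, and the separation values used for the "distinctive" and "average" voices are hypothetical numbers chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def equal_error_rate(d_prime: float, repetitions: int = 1) -> float:
    """EER under a toy equal-variance Gaussian score model.

    `d_prime` is the distance between the genuine and impostor score means,
    measured in standard deviations. Averaging n independent scores shrinks
    the standard deviation by sqrt(n), so the separation grows by sqrt(n).
    """
    return NormalDist().cdf(-0.5 * d_prime * sqrt(repetitions))


def repetitions_needed(d_prime: float, target_eer: float, max_repetitions: int = 50) -> int:
    """Smallest number of averaged utterances that reaches `target_eer`."""
    for n in range(1, max_repetitions + 1):
        if equal_error_rate(d_prime, n) <= target_eer:
            return n
    raise ValueError("target EER not reachable within max_repetitions")


if __name__ == "__main__":
    distinctive_voice = 4.7  # hypothetical separation, about 1% EER from one utterance
    average_voice = 3.0      # hypothetical separation, about 7% EER from one utterance
    print(repetitions_needed(distinctive_voice, 0.01))
    print(repetitions_needed(average_voice, 0.01))
```

In this model the speaker with the distinctive voice reaches the 1% target with a single utterance, while the speaker with the average voice needs three. The exact numbers are not meaningful; the point is that the required amount of speech should be calibrated per user rather than fixed for everyone.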

Future of Voice Biometrics

Another problem undoubtedly lies in speech synthesis. While there are reliable algorithms for detecting synthesized speech, they come with severe limitations: they require full control over both the recording and the audio transmission, for example a recording made in a smartphone app and transmitted using a custom protocol (or a speaker ID system embedded in the device).

Otherwise, the signal processing performed by standard speech codecs is comparable to that of speech synthesis systems. If the signal were transmitted over the public telecommunication network, it would be very hard to detect any audio modifications. Recently, there has been progress in the development of neural-network-based speech synthesis (or speech modification) systems. We are not sure whether these techniques still use any frame-based processing (current anti-spoofing techniques usually look for artefacts left over by such processing). It will definitely be harder to distinguish synthetic or modified speech produced by these systems.

We believe that we will eventually reach a point where it becomes impossible to distinguish human speech from artificial speech, because it will be possible to create perfect mathematical models of the human vocal tract. Thus, the only way to attain truly secure voice biometry systems (without fighting speech synthesizers all the time) is to combine a description of the human vocal tract with some kind of secret personal knowledge. That may include voice-based passwords, answers to personal questions, specific pronunciation of some rare words that the person does not use in common speech, etc.
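As a rough illustration of combining the two factors, the sketch below accepts a caller only if the voice score clears a threshold and the recognized answer matches a stored secret. Everything here is an assumption made for the example: the function names, the score scale, the threshold, and the idea that a speech recognizer has already turned the spoken answer into text. It is not a description of any particular product.

```python
import hashlib
import hmac


def hash_secret(secret: str, salt: bytes) -> bytes:
    """Derive a hash of the secret answer so it is never stored in plain text."""
    # Normalize case and whitespace, since transcribed speech varies in both.
    normalized = " ".join(secret.lower().split())
    return hashlib.pbkdf2_hmac("sha256", normalized.encode("utf-8"), salt, 100_000)


def verify_caller(voice_score: float, threshold: float,
                  spoken_text: str, stored_hash: bytes, salt: bytes) -> bool:
    """Accept only when BOTH the voice match and the secret knowledge check pass."""
    voice_ok = voice_score >= threshold
    # compare_digest runs in constant time to avoid leaking match position.
    secret_ok = hmac.compare_digest(hash_secret(spoken_text, salt), stored_hash)
    return voice_ok and secret_ok


if __name__ == "__main__":
    salt = b"example-salt"  # hypothetical; use a random per-user salt in practice
    stored = hash_secret("Purple Hedgehog", salt)
    print(verify_caller(0.93, 0.80, "purple  hedgehog", stored, salt))  # good voice, right secret
    print(verify_caller(0.93, 0.80, "green hedgehog", stored, salt))    # good voice, wrong secret
    print(verify_caller(0.40, 0.80, "purple hedgehog", stored, salt))   # poor voice, right secret
```

With this design a perfect synthetic copy of the voice is not enough on its own, and neither is an overheard secret spoken in a different voice. If the two checks fail independently, the chance of a false acceptance is the product of the individual chances.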
