The question of accuracy in voice biometrics

When it comes to any form of identity verification, accuracy is a core measure of success. So, with respect to the accuracy of voice biometrics, what are people saying?

 

A cursory review of vendor messaging reveals some strikingly similar claims:

“We offer accuracy of up to 99%”

“Typical accuracy rates are in the region of 98-99%”

“We deliver a 99.9% success rate in production”

“Our software offers error rates below 0.1%”

“The system is regularly tuned to 96-99% accuracy levels”

If we take some of these statements at face value, they seem impressive. However, when we try to add some context, they become less meaningful. For example, 99% of what?

If a new system delivers up to a 50% improvement in accuracy over previous performance, what does that actually mean? If the old system was 99% accurate, is the new system 149% accurate? Of course not. Is it then 99.5% accurate? Would you interpret 90% accuracy as meaning 10% of a result is inaccurate? Probably not.
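One plausible reading of a "50% improvement" is that the error rate has been halved, rather than that anything has been added to the accuracy figure. The short sketch below works through that arithmetic; the function name and the interpretation itself are assumptions made for illustration, not something the vendors quoted above have confirmed.

```python
def accuracy_after_error_reduction(old_accuracy: float, relative_reduction: float) -> float:
    """Return the accuracy after cutting the error rate by a relative fraction.

    Assumes "improvement" means a relative reduction in errors,
    which is only one possible reading of such a claim.
    """
    old_error = 1.0 - old_accuracy
    new_error = old_error * (1.0 - relative_reduction)
    return 1.0 - new_error


# A 99% accurate system whose error rate is halved.
new_accuracy = accuracy_after_error_reduction(0.99, 0.50)
print(f"{new_accuracy:.1%}")
```

Under this reading, the 1% error rate becomes 0.5%, so the headline figure moves from 99% to 99.5%.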

What people really mean when they say 99% accuracy is that, 99 times out of 100, the result will be correct. For voice biometrics, this is a bit of a diversionary tactic. What’s more meaningful is to speak of error rates. For any security or identity verification application, the important factors are the false acceptance and false rejection rates (FAR and FRR). Accuracy, therefore, should reflect a combination of false positives and false negatives.

False acceptance refers to the percentage of impostor attempts that the system erroneously admits.

False rejection refers to the percentage of legitimate attempts that the system erroneously rejects.

The existence of both false positives and false negatives belies any claim that a system can be 100% accurate. As these measures are rates, it’s correct to refer to them in percentage terms. For example, a 0.8% FRR means that, under test conditions, eight out of 1000 attempts by legitimate users were wrongly rejected.
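To make the two rates concrete, here is a minimal sketch that derives FAR and FRR from raw test counts. The counts are invented purely for illustration (the FRR figure mirrors the 0.8% example above), and the function and parameter names are hypothetical.

```python
def error_rates(impostor_attempts: int, impostors_accepted: int,
                genuine_attempts: int, genuine_rejected: int) -> tuple[float, float]:
    """Return (FAR, FRR) as fractions of the relevant attempt counts."""
    far = impostors_accepted / impostor_attempts  # impostors wrongly admitted
    frr = genuine_rejected / genuine_attempts     # legitimate users wrongly rejected
    return far, frr


# Invented test counts, for illustration only.
far, frr = error_rates(impostor_attempts=1000, impostors_accepted=2,
                       genuine_attempts=1000, genuine_rejected=8)

# A single "accuracy" figure blends both error types and depends on the
# mix of genuine and impostor attempts in the test.
overall_accuracy = 1 - (2 + 8) / (1000 + 1000)

print(f"FAR {far:.1%}, FRR {frr:.1%}, overall accuracy {overall_accuracy:.1%}")
```

Note that each rate has its own denominator: FAR is measured against impostor attempts and FRR against legitimate attempts, which is why a single blended percentage hides more than it reveals.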

The nature of the technology means that both FAR and FRR have to be considered, since tightening up on one relaxes the other. That being the case, the best single measure of accuracy is the equal error rate (EER): the operating point at which the false acceptance rate equals the false rejection rate. The lower the EER, the better the system separates legitimate users from impostors.
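As a rough illustration of how an EER might be estimated, the sketch below sweeps a decision threshold across a set of scores and picks the point where FAR and FRR are closest. It assumes higher scores indicate a better match and that a caller is accepted when the score meets or exceeds the threshold. The scores are made up, and real evaluations would use far more trials and typically interpolate between thresholds.

```python
def far_frr_at(threshold, genuine_scores, impostor_scores):
    """Return (FAR, FRR) when callers scoring >= threshold are accepted."""
    far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
    return far, frr


def approximate_eer(genuine_scores, impostor_scores):
    """Return (threshold, EER estimate) where FAR and FRR are closest."""
    candidates = sorted(set(genuine_scores) | set(impostor_scores))

    def gap(t):
        far, frr = far_frr_at(t, genuine_scores, impostor_scores)
        return abs(far - frr)

    best = min(candidates, key=gap)
    far, frr = far_frr_at(best, genuine_scores, impostor_scores)
    return best, (far + frr) / 2


# Invented scores, for illustration only.
genuine = [0.91, 0.85, 0.78, 0.66, 0.95, 0.88, 0.72, 0.59, 0.93, 0.81]
impostor = [0.12, 0.35, 0.48, 0.61, 0.22, 0.30, 0.70, 0.41, 0.18, 0.55]

threshold, eer = approximate_eer(genuine, impostor)
print(f"threshold {threshold}, EER about {eer:.0%}")
```

With this toy data the two rates meet at a threshold of 0.66, where one impostor in ten is accepted and one legitimate caller in ten is rejected.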

 

“For a measure to be valid, testing must be conducted with a statistically relevant number of verification attempts.”
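One way to see why the number of attempts matters is to put a confidence interval around a measured error rate. The sketch below uses the Wilson score interval, a standard statistical technique chosen here for illustration rather than anything prescribed by the quote above. The trial counts are invented.

```python
import math


def wilson_interval(errors: int, attempts: int, z: float = 1.96) -> tuple[float, float]:
    """Return an approximate 95% Wilson score interval for an error rate."""
    p = errors / attempts
    denom = 1 + z * z / attempts
    centre = (p + z * z / (2 * attempts)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / attempts + z * z / (4 * attempts * attempts)
    ) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# The same observed 1% error rate, measured on small and large tests.
small_low, small_high = wilson_interval(errors=1, attempts=100)
large_low, large_high = wilson_interval(errors=100, attempts=10_000)

print(f"100 attempts:    {small_low:.2%} to {small_high:.2%}")
print(f"10,000 attempts: {large_low:.2%} to {large_high:.2%}")
```

Both tests report a 1% error rate, but the plausible range around the small test is far wider, so the headline figure tells you much less.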

But what about results?

Voice biometrics is a technology based on probability and is dependent upon statistical algorithms. The probability of me being me is 1. The probability of you being me is 0. When a system returns a result, the closer it is to 1, the greater the confidence in the caller being who they claim to be. If the result is 0.98, the person is highly likely to be who they say they are. However, a result of 0.65 doesn’t equate to 65%, and neither figure is a measure of accuracy.

Some systems present results on a scale of 0-100, which unfortunately leads to them being misrepresented as percentages. Others present results in relation to a pre-determined benchmark. Some solutions simply present call-handlers with a yes/no, red/green, or go/no-go indication.

The benefit of benchmarking or threshold setting is that it enables the security-conscious organisation to determine what level of risk is acceptable. On the assumption that there will be false positives, setting the threshold to a higher value will minimise or eliminate those. Obviously, this will come at the expense of a higher incidence of false negatives (and vice versa with a lower threshold). Real-world applications typically operate with a false acceptance rate below 0.5% and a false rejection rate below 5%.
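The trade-off can be sketched in a few lines. The example below assumes scores where higher means greater confidence (as on the 0 to 1 scale described earlier) and that a caller is accepted when the score meets or exceeds the threshold. The function name, the scores, and the FAR targets are all hypothetical.

```python
def pick_threshold(genuine_scores, impostor_scores, max_far):
    """Return (threshold, FAR, FRR) for the lowest threshold meeting max_far.

    Returns None if no candidate threshold keeps FAR within the target.
    """
    for t in sorted(set(genuine_scores) | set(impostor_scores)):
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        if far <= max_far:
            frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
            return t, far, frr
    return None


# Invented scores, for illustration only.
genuine = [0.91, 0.85, 0.78, 0.66, 0.95, 0.88, 0.72, 0.59, 0.93, 0.81]
impostor = [0.12, 0.35, 0.48, 0.61, 0.22, 0.30, 0.70, 0.41, 0.18, 0.55]

relaxed = pick_threshold(genuine, impostor, max_far=0.10)
strict = pick_threshold(genuine, impostor, max_far=0.0)

print("relaxed (threshold, FAR, FRR):", relaxed)
print("strict  (threshold, FAR, FRR):", strict)
```

With this toy data, tolerating a 10% FAR allows a threshold of 0.66 and a 10% FRR, while insisting on no false acceptances pushes the threshold to 0.72 and doubles the FRR to 20%.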

The problem with “accuracy” as a measure is that if those were your published tolerances, and the measured performance was within this range, you could claim 100% accuracy.

The only viable method of determining the reliability (or acceptability) of a system is to test it in the environment in which it will be deployed, with data derived from the real-world user population.
