Lies, damn lies and statistics

I’ve got a great product, but do I need to manufacture hype in order to sell it? There’s nothing wrong with using a little ‘artistic license’ to get an otherwise complex concept across, but you should draw the line at misleading your audience. Using big numbers to beguile people is as bad as using big words to disguise the true meaning of something.

The need to describe their products has marketing professionals producing everything from datasheets to application notes, via white papers and case studies. That’s before you count the more technically focused documents, such as API and user guides, which are the province of their engineering colleagues. Of course, everyone is searching for the best way to illustrate their product’s unique selling points (USPs).

Recently, I’ve noticed a trend in the realm of voice biometrics that I’ve found rather irritating. It centres around what, on the face of it, seems like a legitimate differentiator. It involves claims such as “my product is better, because it measures (or analyses) more characteristics than any other product.”

The significance of this becomes apparent when you examine some of the core principles of voice biometrics. Essentially, the technology works by analysing samples of audio from a speaker and producing a reference model for that person (a process known as enrolment). Thereafter, when seeking to verify the identity of that person, a fresh audio sample is captured, analysed and compared to the reference model. The resulting statistical output is a measure of confidence in the speaker being who they claim to be.
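To make the enrol-then-verify flow concrete, here is a deliberately simplified sketch. Everything in it is an assumption for illustration: the function names, the three-value feature vectors, the use of an averaged vector as the ‘reference model’, cosine similarity as the score and the 0.9 threshold are all made up. Real systems use far richer statistical or neural models, so treat this as a picture of the workflow rather than of any actual product.

```python
import math

def enrol(feature_vectors):
    """Build a toy reference model: the element-wise mean of the vectors."""
    n = len(feature_vectors)
    dim = len(feature_vectors[0])
    return [sum(v[i] for v in feature_vectors) / n for i in range(dim)]

def score(reference_model, feature_vectors):
    """Cosine similarity between the reference model and a fresh sample."""
    sample = enrol(feature_vectors)  # summarise the fresh sample the same way
    dot = sum(a * b for a, b in zip(reference_model, sample))
    norm_ref = math.sqrt(sum(a * a for a in reference_model))
    norm_sample = math.sqrt(sum(b * b for b in sample))
    return dot / (norm_ref * norm_sample)

def verify(reference_model, feature_vectors, threshold=0.9):
    """Accept the identity claim only if the score clears the threshold."""
    return score(reference_model, feature_vectors) >= threshold

# Made-up feature vectors, purely for illustration.
enrolment_audio = [[1.0, 2.0, 3.0], [1.2, 1.8, 3.1]]
same_speaker = [[1.1, 2.1, 2.9]]
different_speaker = [[3.0, 0.5, 0.2]]

model = enrol(enrolment_audio)
print(verify(model, same_speaker))       # similar vector, claim accepted
print(verify(model, different_speaker))  # dissimilar vector, claim rejected
```

Note that the output is a single confidence score compared against a threshold. The number of values that went into the model does not appear anywhere in the decision.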

That being the case, you could be forgiven for thinking that the more points of comparison available for analysis, the better the result. Well, not really. You see, more is not necessarily better. If you can achieve the same result (statistical likelihood) from fewer points of reference, surely that’s a better outcome for everyone?

Let’s look at another biometric methodology as a comparison. Did you know that the FBI standard for fingerprint identification requires just 10 out of a possible 36 minutiae to establish a match? That raises the question: if there were more than 36 points of comparison, would the accuracy be any greater?

Here are some examples of what I mean: “We measure more than 100 unique characteristics to match someone’s voice.” Or, “Our next generation system analyses 10 times the number of audio features.” And my personal favourite: “Our product measures nearly 4000 voice characteristics per second.”

At first glance these seem like impressive numbers, but in truth they are meaningless statistics. The “nearly 4000” figure quoted above sounds remarkably like a basic ‘textbook’ telephone bandwidth voice biometric system. Such systems commonly use 13 Mel Frequency Cepstral Coefficients (MFCCs), each one supplemented with two dynamic coefficients (39 values in total). Normally, those are calculated every 10 milliseconds (100 times per second) which, coincidentally, gives a total of 3,900 values per second.
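The arithmetic behind that figure is easy to reproduce. The helper below is a hypothetical illustration (the function and parameter names are mine, not part of any product or library); it simply multiplies the values per frame by the frames per second.

```python
def values_per_second(static_coefficients, dynamic_per_static=2, frame_step_ms=10):
    """Count feature values produced per second of speech.

    static_coefficients: number of MFCCs per frame
    dynamic_per_static:  dynamic coefficients added per MFCC
    frame_step_ms:       how often a new frame is calculated
    """
    values_per_frame = static_coefficients * (1 + dynamic_per_static)
    frames_per_second = 1000 // frame_step_ms
    return values_per_frame * frames_per_second

# The 'textbook' configuration described above: 13 MFCCs, two dynamic
# coefficients each, one frame every 10 milliseconds.
print(values_per_second(13))  # 13 * 3 * 100 = 3900

# Halving the frame step doubles the count without telling you anything
# new about the speaker.
print(values_per_second(13, frame_step_ms=5))  # 13 * 3 * 200 = 7800
```

In other words, the headline number is a by-product of design parameters such as frame rate, not a measure of how much the system knows about a voice.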

However, it is misleading to refer to those as 3,900 ‘characteristics’, because none of them on their own is in any way characteristic of a particular speaker. If it’s all about the numbers, why not use the baseline 2016 National Institute of Standards and Technology (NIST) speaker recognition evaluation (SRE) system figure? That used 20 MFCCs, plus dynamic coefficients (60 values per frame), giving 6,000 ‘characteristics’ per second. That’s half as good again, right?

Commercial voice biometric systems use multiple MFCCs, proprietary coefficient values and prosodic features, all calculated at varying rates. The result is many thousands of characteristics per second of active speech. However, this number has no significance to an end user.

An MFCC is no more a characteristic of a speech signal than the original time domain samples of the audio waveform. This idea that bigger is better could be stretched to extreme lengths, with vendors choosing to quote individual bits as characteristics, arriving at ridiculous figures such as 256,000 per second.
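To show how easily such a figure can be conjured up, here is the raw bit count for two audio formats. The sample rates and bit depths are my assumptions for illustration: 16 kHz, 16-bit audio is one combination that produces 256,000 bits per second, and 8 kHz, 8-bit is a typical telephone-quality format.

```python
def bits_per_second(sample_rate_hz, bits_per_sample):
    """Raw bit rate of uncompressed single-channel audio."""
    return sample_rate_hz * bits_per_sample

# Assumed wideband format: 16,000 samples per second, 16 bits each.
print(bits_per_second(16_000, 16))  # 256000

# Assumed telephone-quality format: 8,000 samples per second, 8 bits each.
print(bits_per_second(8_000, 8))    # 64000
```

Neither number says anything about how well a system can tell one speaker from another.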

What matters to anyone wishing to delve into the detail is the definition of the characteristic features, the algorithms used to calculate them, and how they are used in the rest of the system.

It’s a bit like digital photography. Just because you have more pixels, it doesn’t mean you’ll get a better-quality image. Quality output is more about the lens, the sensor and the engineering than it is about the sheer number of pixels.

Be wary of vendors who claim there are thousands of unique characteristics being measured, and that those include both physical and behavioural voice characteristics. Physical characteristics, like the size and shape of the larynx or nasal cavity, play an essential role in our voice identity. Exactly how is a system measuring these?

When it comes to assessing the performance of voice biometrics solutions, don’t get distracted by the numbers. As with many things in life, look for quality over quantity.

If you are thinking of introducing voice biometrics for authentication and verification in your business, and would like to discuss your requirements, contact one of our consultants today.