If you have a speaker verification system (or plan to implement one) and haven’t decided on a passphrase yet, the following may be of interest.
You’ve probably seen the advert where the little kid says “my voice is my password” to access his phone. For secure applications, this isn’t necessarily the best idea. Commonly spoken phrases, by their very nature, are easily predicted and susceptible to spoofing attacks by hackers.
Similarly, using the same passphrase for all your applications (contact centre solutions, IVR systems, mobile apps etc) is akin to having the same password for everything. Not a good idea.
Granted, it’s not quite as bad as having “Password” as your password. The biometric component of speaker verification adds a much-needed degree of added security. Of course, it’s unlikely that your corporate security policy would permit use of the same passphrase by every user, across multiple systems. Ideally, your speaker verification system should be flexible enough to allow users to choose their own passphrase.
In a text-dependent system (i.e. one that is reliant on a specific phrase or sequence of words), you should be looking for a unique arrangement of words that takes about 2-3 seconds to utter. We recommend a phrase with a minimum of four syllables; something that is both easy to say, and easy to remember. This is enough to create a viable voiceprint when enrolment involves the analysis of several repeats of the same passphrase.
From a security point of view, try to avoid a passphrase that might be easily associated with you as an individual. Social engineering is a popular identity theft tool, so avoid something predictable like your home address.
In a text-dependent system, very few phonemes (the distinct, audible sounds that comprise a language) are needed, if you have enough samples. NB: although the English language has 26 letters in its alphabet, it has 44 phonemes. If you are enrolling via repetition, repeating the same sounds within the passphrase is not a bad thing, because we do not always pronounce them in the same way.
If you are using a text-dependent passphrase for verification, it makes sense to use the same phrase for enrolment. However, if you are implementing a text-independent system, the ideal enrolment would involve attaining a greater degree of phonemic content (sounds), repeated in many different contexts.
In the case of a text-independent system, enrolment data will need to be detailed enough to cover all the sounds expected to be encountered during verification. Recordings with lots of syllables will generally produce more precise models and better verification accuracy.
It is rare for one party in a telephone conversation to speak continuously for more than 10 to 20 seconds. So, enrolment recordings should be comprised of multiple, shorter passages; captured throughout the duration of a conversation.
Verification is then achieved during extended dialogue between the caller and an agent or through spoken responses to an IVR system, rather than by repeating a specific passphrase. If you are implementing a text-prompted system, where the response will also be recognised using ASR, several examples of each possible prompt will be required for enrolment.
If your prompt is to be a random, 4-digit sequence of numbers between 1 and 9, enrolment should consist of repeating each number several times. Counting from one to nine and back again in separate recordings will suffice.
Your choice of active or passive enrolment is likely to be determined by what is most practical at the time of enrolment. A passive enrolment, where enough audio is captured during a conversation, may be more natural experience but a less practical one. Multiple repetitions of a four second active passphrase or number sequence is a more artificial, but efficient process.
Find out more information on speaker verification and authentication, check out our Look who’s talking white paper.