Voice recognition software
by Chris Woodford. Last updated: December 8, 2014.
The idea of talking to your computer and having it understand you sounds like pure science fiction, but it's not so far-fetched. I wrote the words you're reading now not by typing them laboriously on my keyboard but by reading them into a microphone hooked up to my computer. A voice recognition program (sometimes called speech recognition or dictation software) identified my words and typed them out for me automatically. Not bad, eh? But how exactly does it work? How accurate is it? Is it worth a go? Let's take a closer look!
Photo: GIGO—"garbage in, garbage out"—is a basic rule of computing: if you feed a computer bad input to start with, don't expect it to produce great output. When it comes to voice recognition, the quality of the sound going into your computer is vitally important. You'll get much better results with a headset microphone like this, positioned close to your mouth, than with a desktop microphone or one built into your computer, which will readily pick up background noises that may be misinterpreted as speech.
Language in pieces
If you want to understand how a car engine works, you could take it to bits and study the pieces. Language is like this too. Our brains are incredibly good at processing words—so good, in fact, that we don't even think about how we do it most of the time. Consider for a moment the way you're reading this page. Your eyes are scanning the text line by line, recognizing small meaningful chunks of information (the words) sometimes just from the way they look and sometimes by reading the letters that make them up. (We've seen most words so many times that we don't need to study each letter before we recognize them.) The basic building blocks of written language are the 26 letters of the alphabet.
When it comes to understanding spoken information, things work in a similar way. Instead of looking for words, we listen for them. Ideally, we can tell where one word ends and another begins by listening for a short period of silence in between them. Just as there are two ways of recognizing written text (word-by-word or letter-by-letter), so there are two ways of recognizing spoken words. One is to identify the sound of an entire word. If someone shouts your name ("Bob!"), you don't have to think to yourself "B-O-B: that sounds like my name"; you just get the word immediately. Another way is to identify the component sounds that make up a word and then deduce from them what the whole word is. This is a bit like reading words by identifying the letters that make them up.
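The idea of finding word boundaries by listening for silence can be sketched in a few lines of Python. This is only a toy illustration, not how any particular product works: the function name `split_on_silence`, the loudness `threshold`, and the `min_gap` setting are all assumptions made up for this example, and real speech rarely has such tidy pauses between words.

```python
def split_on_silence(samples, threshold=0.1, min_gap=3):
    """Return (start, end) index pairs for runs of sound separated by silence.

    samples   -- a list of loudness values (hypothetical, one per time step)
    threshold -- anything quieter than this counts as silence (assumed value)
    min_gap   -- how many quiet steps in a row mark the end of a word (assumed)
    The end index is exclusive, so samples[start:end] is the word's sound.
    """
    segments = []
    start = None   # index where the current word began, or None if in silence
    quiet = 0      # number of consecutive quiet samples seen so far
    for i, level in enumerate(samples):
        if abs(level) >= threshold:
            if start is None:
                start = i          # a new word begins here
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= min_gap:   # the pause is long enough: the word has ended
                segments.append((start, i - quiet + 1))
                start = None
                quiet = 0
    if start is not None:          # the recording ended in the middle of a word
        segments.append((start, len(samples) - quiet))
    return segments


# Two bursts of sound with a pause in between.
recording = [0, 0, 0.5, 0.6, 0.4, 0, 0, 0, 0, 0.7, 0.8, 0, 0]
print(split_on_silence(recording))
```

Each pair in the result marks one stretch of sound that a later stage could try to match against a known word.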
Voice recognition programs can recognize people's speech through a combination of these techniques. Just as written words in English are made up of 26 possible letters, so spoken words are made up of about 40 possible sounds known as phonemes. Crudely speaking, phonemes are the individual sounds that make up syllables and words, so the word "cat" is made of three of them (there's a bit more to phonemes than that, but that's an easy way to think of it). In theory, a computer could understand anything you said if you trained it to recognize the 40-odd basic phonemes. Put another way, if you spoke any word, all your computer would have to do would be to split the overall word sound into its component phonemes, identify what letter sounds those phonemes represented, and then it would be able to figure out the word.
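Here's a minimal sketch of that "in theory" approach in Python, assuming the hard part (turning sound into a list of phonemes) has already been done. The tiny `PRONUNCIATIONS` table and the `recognize` function are hypothetical, invented for this example; the phoneme symbols are loosely based on the ARPAbet notation used by some pronunciation dictionaries.

```python
# A made-up, four-word pronunciation dictionary: word -> its phonemes.
PRONUNCIATIONS = {
    "bob":   ("B", "AA", "B"),
    "thank": ("TH", "AE", "NG", "K"),
    "you":   ("Y", "UW"),
    "dew":   ("D", "UW"),
}

# Flip the table around so we can look up a word from its phonemes.
WORD_FOR = {phonemes: word for word, phonemes in PRONUNCIATIONS.items()}


def recognize(phonemes):
    """Return the word spelled out by these phonemes, or None if unknown."""
    return WORD_FOR.get(tuple(phonemes))


print(recognize(["B", "AA", "B"]))   # bob
print(recognize(["Z", "IY"]))        # None (not in our tiny dictionary)
```

Notice how fragile this is: "you" and "dew" differ by a single phoneme, so one misheard sound gives you the wrong word.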
Screenshot: Dragon NaturallySpeaking is a typical voice dictation program—and this is what it looks like as you use it. As you speak words, they appear in a window in front of you and then go straight into your word processor. If your words are mis-recognized, it's important to go back and correct the computer or it will get worse at recognizing you, not better!
Voice recognition software is quite a bit more sophisticated than this suggests. For example, it has to be able to cope with our sloppiness as speakers. We don't always say words exactly the same way. Our voices rise and fall in pitch and volume to convey different meanings or emotions. Sometimes we talk quickly, sometimes slowly. If you're as logical as a computer and you have to analyze something as variable as the human voice, things can get pretty tricky.
So voice recognition programs use a variety of statistical pattern-recognition techniques to help them. The more sophisticated programs are "taught" about the grammatical structure of language so they know which sorts of sentences make sense. Most of them have built-in dictionaries and know which words tend to follow one another. For example, if you say "thank" and then a short word that sounds like "dew", the computer can guess that you mean "you" because "thank" and "you" often go together. Some programs also learn to recognize words you say often so they can guess that you mean "Constantinople" and not "Can't stand the noble". Computer programs can pull off these neat tricks using mathematical "guessing processes" called hidden Markov models. A more sophisticated technique is to use a neural network, a computer program that learns to recognize recurring patterns in a broadly similar way to the human brain.
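The "thank you" trick can be sketched with a simple table of word pairs. To be clear, this is a much simpler stand-in (a word-pair, or bigram, lookup) and not a full hidden Markov model or neural network. The counts in `BIGRAM_COUNTS`, the "sounds-like" scores, and the `pick_word` function are all invented for illustration.

```python
# Made-up counts of how often the second word follows the first.
BIGRAM_COUNTS = {
    ("thank", "you"): 950,
    ("thank", "dew"): 1,
    ("morning", "dew"): 40,
    ("morning", "you"): 5,
}


def pick_word(previous, candidates):
    """Choose the most plausible word given the word that came before it.

    candidates maps each possible word to a "sounds-like" score between
    0 and 1 (hypothetical numbers from an imaginary sound-matching stage).
    """
    def score(word):
        # Add 1 to every count so unseen word pairs are unlikely, not impossible.
        return candidates[word] * (BIGRAM_COUNTS.get((previous, word), 0) + 1)

    return max(candidates, key=score)


heard = {"dew": 0.6, "you": 0.4}     # the sound alone slightly favors "dew"
print(pick_word("thank", heard))     # you
print(pick_word("morning", heard))   # dew
```

The same sound gets a different answer depending on the word before it, which is the essence of using context to fix what the ear alone gets wrong.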
Voice recognition in the real world
Even if you've never dictated text to your computer, you've probably used voice recognition. Many cell phones have a feature called voice-activated dialing. The first time you enter a friend's telephone number into your phone's address book, you might be asked to speak their name into the phone at the same time. Then, in future, whenever you want to call them, you just speak their name instead of dialing their number. You just say "Call Bob" and the phone uses voice matching software to recognize who you want. Many telephone and utility companies with automated switchboards also use voice recognition. Instead of pressing numbers on your telephone keypad to select different menu options, you can speak the option you want aloud. Voice recognition software built into the telephone system has been trained to recognize the voices of hundreds of different people, so it knows you mean option "three" and not option "nine".
Photo: Left: The latest iPods and iPhones have built-in voice recognition, but (on my machine at least) it's not very accurate or successful: you can't train it to the specific sound of your voice or correct the mistakes it makes. Right: "Say name now": Most decent cellphones have had voice recognition built-in for years. Even this old phone, from about 2001, has voice-activated dialing: you can dial a number simply by speaking a person's name. But you have to "train" your phone to recognize the name first by speaking it aloud and associating it with the details stored in your address book.
One of the great challenges of off-the-shelf voice recognition programs is to be able to recognize all kinds of different voices: male and female, old and young, and people of all different nationalities and races. Most programs are trained so they can recognize a wide range of different voices. Before you can use a voice recognition program properly, you generally have to put it through a short period of training (also known as enrollment) so it gets used to the peculiarities of your own voice. That involves reading out a piece of text that the program already knows and correcting any mistakes that it makes. Sometimes just a few minutes of training is enough to get your voice dictation program up and running. The more you use the program, the better it gets (provided you always remember to correct its mistakes).
The challenge for an ordinary, off-the-shelf voice recognition program is to understand the similarities between voices that may be very different. But there is another important application of voice recognition where the differences—not the similarities—are vitally important. In criminal investigation, police officers often have evidence in the form of telephone messages or recorded conversations. If they later find a suspect whom they need to match to this evidence, they can analyze the sound pattern of that person's voice and compare it to the sound pattern of the recorded evidence that they have already. If the two sound patterns match, they can be reasonably confident the suspect made the original recording.
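Comparing two "sound patterns" can be sketched as comparing two lists of numbers. In this rough Python illustration, each voice is assumed to have been boiled down already to a short list of measurements (the values, the `same_speaker` function, and the 0.95 cutoff are all hypothetical). Cosine similarity is just one common way to measure how alike two such lists are; real forensic voice comparison is far more involved.

```python
import math


def cosine_similarity(a, b):
    """Return how alike two equal-length, non-zero lists of numbers are.

    1.0 means they point in exactly the same direction; values near 0
    mean they have little in common.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def same_speaker(pattern_a, pattern_b, threshold=0.95):
    """Guess whether two voice patterns match (the threshold is an assumption)."""
    return cosine_similarity(pattern_a, pattern_b) >= threshold


# Made-up voice measurements for illustration only.
evidence = [0.90, 0.20, 0.40]   # from the recorded message
suspect = [0.88, 0.22, 0.41]    # from the suspect's voice sample
stranger = [0.10, 0.90, 0.30]   # from somebody else entirely

print(same_speaker(evidence, suspect))    # True
print(same_speaker(evidence, stranger))   # False
```

Where you set the cutoff matters: too low and different people appear to match, too high and the same person fails to match themselves on a bad phone line.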