Voice recognition software

by Chris Woodford. Last updated: December 8, 2014.

The idea of talking to your computer and having it understand you sounds like pure science fiction, but it's not so far-fetched. I wrote the words you're reading now not by typing them laboriously on my keyboard but by reading them into a microphone hooked up to my computer. A voice recognition program (sometimes called speech recognition or dictation software) identified my words and typed them out for me automatically. Not bad, eh? But how exactly does it work? How accurate is it? Is it worth a go? Let's take a closer look!

Photo: GIGO—"garbage in, garbage out"—is a basic rule of computing: if you feed a computer bad input to start with, don't expect it to produce great output. When it comes to voice recognition, the quality of the sound going into your computer is vitally important. You'll get much better results with a headset microphone like this, positioned close to your mouth, than with a desktop microphone or one built into your computer, which will readily pick up background noises that may be misinterpreted as speech.

Language in pieces

If you want to understand how a car engine works, you could take it to bits and study the pieces. Language is like this too. Our brains are incredibly good at processing words—so good, in fact, that we don't even think about how we do it most of the time. Consider for a moment the way you're reading this page. Your eyes are scanning the text line by line, recognizing small meaningful chunks of information (the words) sometimes just from the way they look and sometimes by reading the letters that make them up. (We've seen most words so many times that we don't need to study each letter before we recognize them.) The basic building blocks of written language are the 26 letters of the alphabet.

When it comes to understanding spoken information, things work a similar way. Instead of looking for words, we listen for them. Ideally, we can tell where one word ends and another begins by listening for a short period of silence in between them. Just as there are two ways of recognizing written text (word-by-word or letter-by-letter), so there are two ways of recognizing spoken text. One is to identify the sound of an entire word. If someone shouts your name ("Bob!"), you don't have to think to yourself "B-O-B—that sounds like my name": you just get the word immediately. Another way is to identify the component sounds that make up a word and then deduce from them what the whole word is. This is a bit like reading words by identifying the letters that make them up.

An example of dictating text into a computer using Dragon Dictate voice recognition software.

Voice recognition programs can recognize people's speech through a combination of these techniques. Just as written words in English are made up of 26 possible letters, so spoken words are made up of about 40 possible sounds known as phonemes. Crudely speaking, phonemes are the individual sounds that letters or groups of letters stand for, and they combine to make the syllables in words (there's a bit more to phonemes than that, but that's an easy way to think of it). In theory, a computer could understand anything you said if you trained it to recognize the 40-odd basic phonemes. Put another way, if you spoke any word, all your computer would have to do is split the overall word sound into its component phonemes, identify what letter sounds those phonemes represented, and then figure out the word.
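To make that last step concrete, here is a toy sketch in Python of looking up a word from its phonemes. It assumes the hard part (splitting the sound into phonemes) has already been done, and the tiny dictionary and its ARPAbet-style phoneme symbols are illustrative assumptions, not the workings of any real dictation program.

```python
# Toy pronunciation dictionary: each phoneme sequence (written with
# ARPAbet-style symbols, chosen here purely for illustration) maps to a word.
PRONUNCIATIONS = {
    ("B", "AA", "B"): "Bob",
    ("TH", "AE", "NG", "K"): "thank",
    ("Y", "UW"): "you",
    ("D", "UW"): "dew",
}

def word_from_phonemes(phonemes):
    """Return the word spelled out by a phoneme sequence, or None if unknown."""
    return PRONUNCIATIONS.get(tuple(phonemes))

print(word_from_phonemes(["B", "AA", "B"]))  # Bob
print(word_from_phonemes(["Z", "Z"]))        # None
```

A real program would have tens of thousands of entries and would have to cope with phonemes it wasn't sure about, which is where the statistical tricks come in.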

Screenshot: Dragon NaturallySpeaking is a typical voice dictation program—and this is what it looks like as you use it. As you speak words, they appear in a window in front of you and then go straight into your word processor. If your words are mis-recognized, it's important to go back and correct the computer or it will get worse at recognizing you, not better!

Finding patterns

Real voice recognition software is quite a bit more sophisticated than this suggests. For example, it has to be able to cope with our sloppiness as speakers. We don't always say words exactly the same way. Our voices rise and fall in pitch and volume to convey different meanings or emotions. Sometimes we talk quickly, sometimes slowly. If you're as logical as a computer and you have to analyze something as variable as the human voice, things can get pretty tricky.

So voice recognition programs use a variety of statistical pattern-recognition techniques to help them. The more sophisticated programs are "taught" about the grammatical structure of language so they know which sorts of sentences make sense. Most of them have built-in dictionaries and know which words tend to follow one another. For example, if you say "thank" and then a short word that sounds like "dew", the computer can guess that you mean "you" because "thank" and "you" often go together. Some programs also learn to recognize words you say often so they can guess that you mean "Constantinople" and not "Can't stand the noble". Computer programs can pull off these neat tricks using mathematical "guessing processes" called hidden Markov models. A more sophisticated technique is to use a neural network: a computer program that learns to recognize recurring patterns in a broadly similar way to the human brain.
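The "thank you" trick can be sketched in a few lines of Python. This is a deliberately simplified illustration that just counts which word pairs appear in a tiny, made-up training text; it is not a hidden Markov model or a neural network, and the function names are my own.

```python
from collections import Counter

# Tiny made-up training text; a real program would learn from far more.
TRAINING_TEXT = (
    "thank you very much . thank you for coming . "
    "morning dew on grass . thank you"
)

def count_word_pairs(text):
    """Count how often each pair of neighbouring words appears in the text."""
    words = text.split()
    return Counter(zip(words, words[1:]))

def pick_next_word(previous, candidates, pair_counts):
    """Choose the candidate that most often followed `previous` in training."""
    return max(candidates, key=lambda word: pair_counts[(previous, word)])

pair_counts = count_word_pairs(TRAINING_TEXT)
print(pick_next_word("thank", ["dew", "you"], pair_counts))  # you
```

Here "you" follows "thank" three times in the training text and "dew" never does, so "you" wins even though the two words sound alike.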

Voice recognition in the real world

Voice dialing on a Motorola cellphone.

Controlling an iPod with its built-in voice recognition.

Even if you've never dictated text to your computer, you've probably used voice recognition. Many cell phones have a feature called voice-activated dialing. The first time you enter a friend's telephone number into your phone's address book, you might be asked to speak their name into the phone at the same time. Then, in future, whenever you want to call them, you just speak their name instead of dialing their number. You just say "Call Bob" and the phone uses voice matching software to recognize who you want. Many telephone and utility companies with automated switchboards also use voice recognition. Instead of pressing numbers on your telephone keypad to select different menu options, you can speak the option you want aloud. Voice recognition software built into the telephone system has been trained to recognize the voices of hundreds of different people, so it knows you mean option "three" and not option "nine".

Photo: Left: The latest iPods and iPhones have built-in voice recognition, but (on my machine at least) it's not very accurate or successful: you can't train it to the specific sound of your voice or correct the mistakes it makes. Right: "Say name now": Most decent cellphones have had voice recognition built-in for years. Even this old phone, from about 2001, has voice-activated dialing: you can dial a number simply by speaking a person's name. But you have to "train" your phone to recognize the name first by speaking it aloud and associating it with the details stored in your address book.

One of the great challenges of off-the-shelf voice recognition programs is to be able to recognize all kinds of different voices: male and female, old and young, and people of all different nationalities and races. Most programs are trained so they can recognize a wide range of different voices. Before you can use a voice recognition program properly, you generally have to put it through a short period of training (also known as enrollment) so it gets used to the peculiarities of your own voice. That involves reading out a piece of text that the program already knows and correcting any mistakes that it makes. Sometimes just a few minutes' training is enough to get your voice dictation program up and running. The more you use the program, the better it gets (provided you always remember to correct its mistakes).

The challenge for an ordinary, off-the-shelf voice recognition program is to understand the similarities between voices that may be very different. But there is another important application of voice recognition where the differences—not the similarities—are vitally important. In criminal investigation, police officers often have evidence in the form of telephone messages or recorded conversations. If they later find a suspect whom they need to match to this evidence, they can analyze the sound pattern of that person's voice and compare it to the sound pattern of the recorded evidence that they have already. If the two sound patterns match, they can be reasonably confident the suspect made the original recording.

Finding phonemes

Have you ever recorded your voice onto a tape recorder or computer that shows the sound level of the recording? If so, you'll know that the words you speak vary predictably in volume. Each word starts off quietly (in a phase called "attack"). After reaching a maximum level, the volume drops (or "decays") to a certain level, stays at that level (known as "sustain") for a while, and then falls off to silence (a final phase called "release"). The pattern of attack, decay, sustain and release (or ADSR) that a sound follows is called an envelope. The shape of the envelope is part of what makes the sound of your voice different from the sound of a piano playing a note at the same pitch (or the sound of any other instrument, for that matter).
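If it helps to see the four phases as numbers, here is a minimal Python sketch that builds an idealized ADSR envelope from straight-line segments. Real voices are far messier than this; the function name, its parameters, and the sampling rate of 100 readings per second are all assumptions made for the illustration.

```python
def adsr_envelope(attack, decay, sustain_level, sustain_time, release, rate=100):
    """Return a list of volume levels (0.0 to 1.0), `rate` readings per second.

    attack, decay, sustain_time and release are durations in seconds;
    sustain_level is the volume held during the sustain phase.
    """
    def ramp(start, end, seconds):
        # Straight line from `start` towards `end`, stopping one step short.
        steps = max(1, round(seconds * rate))
        return [start + (end - start) * i / steps for i in range(steps)]

    return (
        ramp(0.0, 1.0, attack)                            # attack: rise to the peak
        + ramp(1.0, sustain_level, decay)                 # decay: drop to sustain level
        + [sustain_level] * round(sustain_time * rate)    # sustain: hold steady
        + ramp(sustain_level, 0.0, release)               # release: fade away
        + [0.0]                                           # finish in silence
    )

envelope = adsr_envelope(attack=0.1, decay=0.1, sustain_level=0.5,
                         sustain_time=0.2, release=0.1)
print(len(envelope), max(envelope), envelope[-1])
```

Plotting the list (volume against time) would give the same kind of rise, drop, plateau and fade described above.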

If I say the word rhinoceros very slowly, there are four separate envelopes separated by short periods of silence. Each of these envelopes matches one of the syllables in the word:

Voiceprint for the word rhinoceros spoken slowly

What you see here is a graph showing how the volume of the sound (the vertical axis) changes with time (the horizontal axis).
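A computer could pick out those separate envelopes by looking for the stretches of sound between the silences. Here is a rough Python sketch of that idea; the volume readings are invented for the example (they are not measurements from my recording), and the silence threshold of 0.05 is an arbitrary assumption.

```python
def split_on_silence(levels, threshold=0.05):
    """Return (start, end) index pairs for each run of readings above threshold.

    `end` is the index of the first quiet reading after the run.
    """
    segments = []
    start = None
    for i, level in enumerate(levels):
        if level > threshold and start is None:
            start = i                        # a new burst of sound begins
        elif level <= threshold and start is not None:
            segments.append((start, i))      # the burst has just ended
            start = None
    if start is not None:                    # sound ran to the end of the recording
        segments.append((start, len(levels)))
    return segments

# Made-up volume readings: four bursts of sound separated by silence.
levels = [0, 0.6, 0.8, 0.3, 0, 0, 0.5, 0.4, 0, 0.7, 0.9, 0.2, 0, 0.4, 0.3, 0]
print(len(split_on_silence(levels)))  # 4
```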

If I say the word at normal speed, the four envelopes run together into something of a blur. Note that there is no gap between the syllables when I say the word properly. Also notice how the word starts off really loud and ends really quietly:

Voiceprint for the word rhinoceros spoken at normal speed

Now imagine you're a computer trying to recognize what I say. You could recognize the complete pattern of the word "rhinoceros"—but that would mean you have to learn the sound pattern of every possible word in the dictionary. Or you could try to identify the separate syllables and figure out the word that way. Although there are no distinct gaps between the syllables when I speak quickly, you can still identify four clear parts of the sound pattern—and guess that these must be the separate syllables in the word. So looking for syllables (or phonemes) is a pretty good way of understanding speech if you don't know all the words.
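When there are no silent gaps to rely on, one simple approach is to count the bumps in the volume instead. The Python sketch below counts local peaks in a list of volume readings; the readings are invented for the example, the minimum height of 0.2 is an arbitrary assumption, and real recognizers use much more robust methods than this.

```python
def count_peaks(levels, min_height=0.2):
    """Count readings that are higher than their neighbours and above min_height."""
    peaks = 0
    for i in range(1, len(levels) - 1):
        rises = levels[i - 1] < levels[i]      # louder than the reading before
        falls = levels[i] >= levels[i + 1]     # at least as loud as the one after
        if levels[i] > min_height and rises and falls:
            peaks += 1
    return peaks

# Made-up readings for a word that starts loud, ends quietly, and never
# drops to silence in between.
levels = [0.1, 0.9, 0.6, 0.7, 0.5, 0.55, 0.4, 0.45, 0.2, 0.1]
print(count_peaks(levels))  # 4
```

Four peaks suggest four parts to the word, even though the sound never stops in between.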

Artworks: I made these "rhinoceros" images on an ordinary laptop using an excellent, free sound-recording program called Audacity.

Please do NOT copy our articles onto blogs and other websites

Text copyright © Chris Woodford 2006, 2012. All rights reserved. Full copyright notice and terms of use.

Cite this page

Woodford, Chris. (2006) Voice recognition software. Retrieved from http://www.explainthatstuff.com/voicerecognition.html. [Accessed (Insert date here)]
