Voice recognition software

Last updated: December 10, 2006.
The idea of talking to your computer and having it understand you
sounds like pure science fiction, but it's not so far-fetched. I wrote
the words you're reading now not by typing them laboriously on my
keyboard but by reading them into a microphone hooked up to my
computer. A voice recognition program (sometimes called a speech
recognition program or language recognition software) identified my
words and typed them out for me automatically. Not bad, eh? But how
exactly does it work?
Language in pieces
If you want to understand how a car engine works, you could take it
to bits and study the pieces. Language is like this too. Our brains are
incredibly good at processing words—so good, in fact, that we don't
even think about how we do it most of the time. Consider for a moment
the way you're reading this page. Your eyes are scanning the text line
by line, recognizing small meaningful chunks of information (the words)
sometimes just from the way they look and sometimes by reading the
letters that make them up. (We've seen most words so many times that we
don't need to study each letter before we recognize them.) The basic
building blocks of written language are the 26 letters of the alphabet.
When it comes to understanding spoken information, things work a
similar way. Instead of looking for words, we listen for them. Ideally,
we can tell where one word ends and another begins by listening for a
short period of silence in between them. Just as there are two ways of
recognizing written text (word-by-word or letter-by-letter), so there
are two ways of recognizing spoken text. One is to identify the sound
of an entire word. If someone shouts your name ("Bob!"), you don't have
to think to yourself "B-O-B—that sounds like my name": you just get the
word immediately. Another way is to identify the component sounds that
make up a word and then deduce from them what the whole word is. This
is a bit like reading words by identifying the letters that make them
up.
Voice recognition programs can recognize people's speech through a
combination of these techniques. Just as written words in English are
made up of 26 possible letters, so spoken English words are made up of
about 44 possible sounds known as phonemes (the exact number depends on
the accent). Crudely speaking, phonemes are to spoken words what letters
are to written ones: the "c", "a", and "t" sounds in "cat" are three
phonemes that together make up a single syllable. In theory, a
computer could understand anything you said if you trained it to
recognize these basic phonemes. Put another way, if you spoke any
word, all your computer would have to do would be to split the overall
word sound into its component phonemes, identify what letter sounds
those phonemes represented, and then it would be able to figure out the
word.
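Here is a minimal sketch of that last step: looking up a word once its
phonemes have been identified. The tiny dictionary and the informal
phoneme labels are made up for illustration; a real program would use a
far larger pronunciation dictionary and would first have to extract the
phonemes from the sound itself.

```python
# Hypothetical pronunciation dictionary: phoneme sequences -> words.
# The phoneme labels are informal and chosen only for illustration.
PRONUNCIATIONS = {
    ("B", "AA", "B"): "Bob",
    ("TH", "AE", "NG", "K"): "thank",
    ("Y", "UW"): "you",
    ("D", "UW"): "dew",
}


def phonemes_to_word(phonemes):
    """Return the word matching a phoneme sequence, or None if unknown."""
    return PRONUNCIATIONS.get(tuple(phonemes))


print(phonemes_to_word(["B", "AA", "B"]))   # Bob
print(phonemes_to_word(["Z", "IY"]))        # None (not in the dictionary)
```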
Finding patterns
Language recognition software is quite a bit more sophisticated than
this suggests. For example, it has to be able to cope with our
sloppiness as speakers. We don't always say words exactly the same way.
Our voices rise and fall in pitch and volume to convey different
meanings or emotions. Sometimes we talk quickly, sometimes slowly. If
you're as logical as a computer and you have to analyze something as
variable as the human voice, things can get pretty tricky.
So voice recognition programs use a variety of statistical
pattern-recognition techniques to help them. The more sophisticated programs
are "taught" about the grammatical structure of language so they know
which sorts of sentences make sense. Most of them have built-in
dictionaries and know which words tend to follow one another. For
example, if you say "thank" and then a short word that sounds like
"dew", the computer can guess that you mean "you" because "thank" and
"you" often go together. Some programs also learn to recognize words
you say often so they can guess that you mean "Constantinople" and not
"Can't stand the noble". Computer programs can pull off these neat
tricks using mathematical "guessing processes" called hidden Markov
models, which estimate the most likely sequence of words given the
sounds that were heard. A more sophisticated technique is to use a
neural network: a computer program that learns to recognize recurring
patterns in a broadly similar way to the human brain.
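The "thank you" guess described above can be sketched with nothing more
than counts of which words tend to follow which. This is a simple
word-pair (bigram) lookup, not a full hidden Markov model, and the
counts are invented for illustration.

```python
# Hypothetical counts of how often one word follows another
# in some body of training text.
WORD_PAIR_COUNTS = {
    ("thank", "you"): 950,
    ("thank", "dew"): 1,
    ("morning", "dew"): 40,
    ("morning", "you"): 5,
}


def pick_next_word(previous_word, candidates):
    """Choose the candidate seen most often after previous_word."""
    return max(
        candidates,
        key=lambda word: WORD_PAIR_COUNTS.get((previous_word, word), 0),
    )


# Both candidates sound alike; the previous word settles the matter.
print(pick_next_word("thank", ["dew", "you"]))    # you
print(pick_next_word("morning", ["dew", "you"]))  # dew
```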
Voice recognition in the real world
Even if you've never dictated text to your computer, you've probably
used voice recognition. Many cell phones have a feature called
voice-activated dialing. The first time you enter a friend's telephone
number into your phone's address book, you might be asked to speak their
name into the phone at the same time. Then, in future, whenever you want
to call them, you just say "Call Bob" instead of dialling their number,
and the phone uses voice matching software to recognize who you want.
Many telephone and utility companies with
automated switchboards also use voice recognition. Instead of pressing
numbers on your telephone keypad to select different menu options, you
can speak the option you want aloud. Voice recognition software built
into the telephone system has been trained to recognize the voices of
hundreds of different people—so it knows you mean option "three" and
not option "nine".
One of the great challenges of off-the-shelf voice recognition
programs is to be able to recognize all kinds of different voices—male
and female, old and young, and people of all different nationalities
and races. Most programs are trained so they can recognize a wide range
of different voices. Before you can use a voice recognition program
properly, you generally have to put it through a short period of
training so it gets used to the peculiarities of your own voice. That
involves reading out a piece of text that the program already knows and
correcting any mistakes that it makes. Sometimes just a few minutes of
training is enough to get your voice dictation program up and running.
The more you use the program, the better it gets (provided you always
remember to correct its mistakes).
The challenge for an ordinary, off-the-shelf voice recognition
program is to understand the similarities between voices that may be
very different. But there is another important application of voice
recognition where the differences—not the similarities—are vitally
important. In criminal investigation, police officers often have
evidence in the form of telephone messages or recorded conversations.
If they later find a suspect whom they need to match to this evidence,
they can analyse the sound pattern of that person's voice and compare
it to the sound pattern of the recorded evidence that they have
already. If the two sound patterns match, they can be reasonably
confident the suspect made the original recording.
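As a toy illustration of comparing two sound patterns, suppose each
recording has already been boiled down to a short list of numbers
describing the voice. The numbers below are made up, and cosine
similarity is just one simple way to score how alike two lists are; it
is an assumption for this sketch, not a description of how forensic
voice comparison is actually done.

```python
import math


def cosine_similarity(a, b):
    """Score how alike two equal-length lists of numbers are (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical voice measurements (invented values).
evidence = [0.8, 0.1, 0.3]
suspect = [0.7, 0.2, 0.3]
someone_else = [0.1, 0.9, 0.2]

# A higher score means the two patterns are more alike.
print(cosine_similarity(evidence, suspect) > cosine_similarity(evidence, someone_else))
```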
Finding phonemes
Have you ever recorded your voice onto a tape recorder or computer that
shows the sound level of the recording? If so, you'll know that the
words you speak vary predictably in volume. Each word starts off
quietly and quickly gets louder (a phase called "attack"). After
reaching a maximum level,
the volume drops (or "decays") to a certain level, stays at that level
(known as "sustain") for a while, and then falls off to silence (a
final phase called "release"). The pattern of attack, decay, sustain
and release (or ADSR) that a sound follows is called an envelope. The
shape of the envelope is part of what makes the sound of your voice
different from the sound of a piano playing a note at the same pitch (or
the sound of any other instrument, for that matter).
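To make the four phases concrete, here is a small sketch that builds an
idealized ADSR envelope as a list of volume levels. The straight-line
ramps and the parameter names are simplifying assumptions; real speech
envelopes are much less tidy.

```python
def adsr_envelope(attack, decay, sustain_level, sustain, release):
    """Return volume levels between 0.0 and 1.0, one per time step.

    attack, decay, sustain and release are lengths in time steps;
    sustain_level is the volume held during the sustain phase.
    """
    envelope = []
    # Attack: rise from silence to full volume.
    envelope += [step / attack for step in range(1, attack + 1)]
    # Decay: fall from full volume to the sustain level.
    envelope += [
        1.0 - (1.0 - sustain_level) * step / decay
        for step in range(1, decay + 1)
    ]
    # Sustain: hold steady.
    envelope += [sustain_level] * sustain
    # Release: fade to silence.
    envelope += [
        sustain_level * (1 - step / release)
        for step in range(1, release + 1)
    ]
    return envelope


levels = adsr_envelope(attack=4, decay=2, sustain_level=0.5, sustain=3, release=5)
print(len(levels))  # 14 time steps in total
```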
If I say the word rhinoceros very slowly, there are four separate
envelopes split up by short periods of silence. Each of these
envelopes matches one of the syllables in the word:
What you see here is a graph showing how the volume of the sound
(the vertical axis) changes with time (the horizontal axis).
If I say the word at normal speed, the four envelopes run together into
something of a blur. Note that there is no gap between the syllables
when I say the word properly. Also notice how the word starts off
really loud and ends really quietly:
Now imagine you're a computer trying to recognize what I say. You
could recognize the complete pattern of the word "rhinoceros"—but that
would mean you have to learn the sound pattern of every possible word
in the dictionary. Or you could try to identify the separate syllables
and figure out the word that way. Although there are no distinct gaps
between the syllables when I speak quickly, you can still identify four
clear parts of the sound pattern—and guess that these must be the
separate syllables in the word. So looking for smaller units of sound
(syllables, or the phonemes that make them up) is a pretty good way of
understanding speech if you don't know all the words.
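For the slowly spoken version of the word, where the parts are separated
by silence, splitting the sound up can be sketched as a search for runs
of volume above a threshold. The volume readings and the threshold value
here are invented for illustration, and this simple approach would not
work on speech at normal speed, where there are no silent gaps.

```python
def find_sound_bursts(volumes, silence_threshold=0.1):
    """Return (start, end) index pairs for runs louder than the threshold."""
    bursts = []
    start = None
    for index, volume in enumerate(volumes):
        if volume > silence_threshold and start is None:
            start = index                      # a burst of sound begins
        elif volume <= silence_threshold and start is not None:
            bursts.append((start, index))      # the burst ends at silence
            start = None
    if start is not None:
        bursts.append((start, len(volumes)))   # sound ran to the end
    return bursts


# Made-up volume readings for a word spoken slowly in four parts.
volumes = [0.0, 0.6, 0.9, 0.4, 0.0, 0.5, 0.7, 0.0, 0.0, 0.8, 0.3, 0.0, 0.4, 0.2]
bursts = find_sound_bursts(volumes)
print(len(bursts))  # 4
```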
We made these "rhinoceros" images on an ordinary laptop using an
excellent, free sound-recording program called Audacity.