How long will it be before your computer
gazes deep in your eyes and, with all
the electronic sincerity it can muster, mutters those three little
words that mean so much: "I love you"? In theory, it could happen
right this minute: virtually every modern Windows PC has a speech
synthesizer (a computerized voice that turns written text into
speech) built in, mostly to help people with visual disabilities who
can't read tiny text printed on a screen. How exactly do speech
synthesizers go about converting written language into spoken? Let's take a closer look!
Photo: Humans don't communicate by printing words on their foreheads for other people to read, so why should computers? Thanks to smartphone agents like Siri, Cortana, and "OK Google," people are slowly getting used to
the idea of speaking commands to a computer and getting back spoken replies.
What is speech synthesis?
Computers do their jobs in three distinct stages called input (where you feed
information in, often with a keyboard or
mouse), processing (where
the computer responds to your input, say, by adding up some numbers
you typed in or enhancing the colors on a photo you scanned), and
output (where you get to see how the computer has processed your
input, typically on a screen or printed out on paper). Speech
synthesis is simply a form of output where a computer or other
machine reads words to you out loud in a real or simulated voice
played through a loudspeaker; the technology is often called text-to-speech (TTS).
Talking machines are nothing new—somewhat surprisingly, they date back to
the 18th century—but computers that routinely speak to their
operators are still extremely uncommon. True, we drive our cars with
the help of computerized navigators, engage with computerized
switchboards when we phone utility companies, and listen to
computerized apologies at railroad stations when our trains are
running late. But hardly any of us talk to our computers (with voice recognition)
or sit around waiting for them to reply. Professor Stephen Hawking
was a truly unique individual, in more ways than one: can you think
of any other person who's famous for talking with a computerized voice?
All that may change in the future as computer-generated speech becomes
less robotic and more human.
How does speech synthesis work?
Let's say you have a paragraph of written text that you want your computer
to speak aloud. How does it turn the written words into ones you can
actually hear? There are essentially three stages involved, which
I'll refer to as text to words, words to phonemes, and phonemes to sound.
1. Text to words
Reading words sounds easy, but if you've ever listened to a young child reading
a book that was just too hard for them, you'll know it's not as
trivial as it seems. The main problem is that written text
is ambiguous: the same written information can often mean more than
one thing and usually you have to understand the meaning or make an educated guess to read it correctly.
So the initial stage in speech synthesis, which is generally called
preprocessing or normalization, is all about reducing ambiguity:
it's about narrowing down the many different ways you could read a piece of text into
the one that's the most appropriate.
Preprocessing involves going
through the text and cleaning it up so the computer makes fewer
mistakes when it actually reads the words aloud. Things like numbers, dates, times,
abbreviations, acronyms, and special characters (currency symbols and so on)
need to be turned into words—and that's harder than it sounds.
The number 843 might refer to a quantity of items ("eight hundred
and forty three"), a year or a time ("eight forty three"), or a
padlock combination ("eight four three"), each of which is read
out slightly differently. While humans follow the sense of what's
written and figure out the pronunciation that way, computers
generally don't have the power to do that, so they have to use
statistical probability techniques (typically Hidden Markov Models) or neural networks (computer programs structured
like arrays of brain cells that learn to recognize patterns) to arrive at the most
likely pronunciation instead. So if the word "year" occurs in the same sentence as "843,"
it might be reasonable to guess this is a date and pronounce it "eight forty three."
If there were a decimal point before the numbers ("0.843"), they would need to be read differently as "eight four three."
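To make the "843" example concrete, here is a minimal sketch of number normalization in Python. It is not how a real synthesizer works: instead of Hidden Markov Models or neural networks, it swaps in a crude keyword rule, and the cue-word lists and function names (read_number, as_year, and so on) are assumptions made up for this illustration. It only handles whole numbers from 0 to 999 and simple decimals.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

# Hypothetical cue words: a stand-in for the statistical models in the text.
YEAR_CUES = {"year", "ad", "century"}
CODE_CUES = {"combination", "code", "pin"}


def two_digits(n):
    """Spell out a number from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])


def as_quantity(n):
    """Read 0-999 as a quantity: 843 -> 'eight hundred and forty three'."""
    hundreds, rest = divmod(n, 100)
    if hundreds == 0:
        return two_digits(rest)
    words = ONES[hundreds] + " hundred"
    return words if rest == 0 else words + " and " + two_digits(rest)


def as_year(n):
    """Read 100-999 as a year or time: 843 -> 'eight forty three'."""
    hundreds, rest = divmod(n, 100)
    if rest == 0:
        return ONES[hundreds] + " hundred"
    if rest < 10:
        return ONES[hundreds] + " oh " + ONES[rest]
    return ONES[hundreds] + " " + two_digits(rest)


def as_digits(digits):
    """Read a digit string one digit at a time: '843' -> 'eight four three'."""
    return " ".join(ONES[int(d)] for d in digits)


def read_number(token, sentence):
    """Pick a reading for the number `token` using words in `sentence`."""
    if "." in token:
        whole, fraction = token.split(".")
        return as_quantity(int(whole)) + " point " + as_digits(fraction)
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    if words & YEAR_CUES:
        return as_year(int(token))
    if words & CODE_CUES:
        return as_digits(token)
    return as_quantity(int(token))
```

Calling read_number("843", "In the year 843 a treaty was signed") picks the year reading, while the same token in "We sold 843 tickets" falls through to the quantity reading. A real system would weigh many more clues than a single keyword.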
Preprocessing also has to tackle homographs, words pronounced in different ways
according to what they mean. The word "read" can be pronounced
either "red" or "reed," so a sentence such as "I read the
book" is immediately problematic for a speech synthesizer. But if
it can figure out that the preceding text is entirely in the past
tense, by recognizing past-tense verbs ("I got up... I took a
shower... I had breakfast... I read a book..."), it can make a
reasonable guess that "I read [red] a book" is probably correct.
Likewise, if the preceding text is "I get up... I take a shower...
I have breakfast..." the smart money should be on "I read [reed] a book."
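The tense-guessing idea for "read" can be sketched in a few lines of Python. This is only a toy: the two verb lists are tiny, hand-picked assumptions, and the function name pronounce_read is invented for the example. A real synthesizer would use a proper part-of-speech tagger or a trained model rather than counting words.

```python
import re

# Tiny, hand-made verb lists (illustrative assumptions, nowhere near complete).
PAST_VERBS = {"got", "took", "had", "went", "was", "were", "did", "ate",
              "made", "said"}
PRESENT_VERBS = {"get", "take", "have", "go", "am", "is", "are", "do", "eat",
                 "make", "say"}


def pronounce_read(preceding_text):
    """Guess how to say 'read' from the tense of the text that comes before it.

    Returns "red" if past-tense verbs outnumber present-tense ones,
    otherwise "reed" (ties and empty context default to the present tense).
    """
    words = re.findall(r"[a-z']+", preceding_text.lower())
    past = sum(word in PAST_VERBS for word in words)
    present = sum(word in PRESENT_VERBS for word in words)
    return "red" if past > present else "reed"
```

Given "I got up... I took a shower... I had breakfast...", the function counts three past-tense verbs and chooses "red"; given the present-tense version of the same text, it chooses "reed".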
2. Words to phonemes
Having figured out the words that need to be said, the speech synthesizer
now has to generate the speech sounds that make up those words. In
theory, this is a simple problem: all the computer needs is a huge
alphabetical list of words and details of how to pronounce each one
(much as you'd find in a typical dictionary, where the pronunciation
is listed before or after the definition). For each word, we'd need a
list of the phonemes that make up its sound.
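As a rough illustration of the dictionary idea, here is a minimal Python sketch. The four-word dictionary and the informal phoneme symbols are assumptions chosen for this example; real pronunciation dictionaries hold many thousands of entries and use a standard phonetic notation.

```python
# A toy pronunciation dictionary: each word maps to its list of phonemes.
# The symbols are informal placeholders, not a standard phonetic alphabet.
PRONUNCIATIONS = {
    "cat": ["k", "a", "t"],
    "act": ["a", "k", "t"],
    "tack": ["t", "a", "k"],
    "pad": ["p", "a", "d"],
}


def words_to_phonemes(text):
    """Look up each word of `text` in the dictionary.

    Returns a list of (word, phonemes) pairs; phonemes is None when the
    word is not in the dictionary, which is exactly where a pure lookup
    approach runs out of road.
    """
    result = []
    for raw_word in text.lower().split():
        word = raw_word.strip(".,!?;:\"'")
        if not word:
            continue  # the token was only punctuation
        result.append((word, PRONUNCIATIONS.get(word)))
    return result
```

Looking up "Cat tack zebra" returns the phoneme lists for the first two words and None for "zebra", because the lookup has no way to handle a word it has never seen.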
What are phonemes?
Crudely speaking, phonemes are to spoken language what letters are to written
language: they're the atoms of spoken sound—the sound
components from which you can make any spoken word you like.
The word cat consists of three phonemes making the sounds /k/ (as in
can), /a/ (as in pad), and /t/ (as in tusk). Rearrange the
order of the phonemes and you could make the words "act" or "tack."
There are only 26 letters in the English alphabet, but over 40 phonemes. That's
because some letters and letter groups can be read in multiple ways
(a, for example, can be read differently, as in 'pad' or 'paid'),
so instead of one phoneme per letter, there are phonemes
for all the different letter sounds. Some languages need more
or fewer phonemes than others (typically 20-60).
Okay, that's phonemes in a nutshell—a hugely simplified definition that will do us for starters.
If you want to know more, start with the article on phonemes in Wikipedia.
In theory, if a computer has a dictionary of words and phonemes, all it
needs to do to read a word is look it up in the list and then
read out the corresponding phonemes, right? In practice, it's harder than it sounds.
As any good actor can demonstrate, a single sentence can be read out in many different ways according to
the meaning of the text, the person speaking, and the emotions they want to convey (in linguistics, this idea is known as
prosody and it's one
of the hardest problems for speech synthesizers to address). Within a sentence, even a single word (like "read") can be read
in multiple ways (as "red"/"reed") because it has multiple meanings. And even within a word,
a given phoneme will sound different according to the phonemes that come before and after it.
An alternative approach involves breaking written words into their graphemes
(written component units, typically made from the individual letters or syllables that make up a word) and then
generating phonemes that correspond to them using a set of simple rules. This is a bit like a child attempting to read words he or she has never
previously encountered (the reading method called phonics
is similar). The advantage of doing that is that the computer can make a reasonable attempt at reading any word, whether
or not it's a real word stored in the dictionary, a foreign word, or
an unusual name or technical term. The disadvantage is that languages
such as English have large numbers of irregular words that are
pronounced in a very different way from how they're written
(such as "colonel," which we say as "kernel" and not "coll-o-nell"; and "yacht," which is pronounced "yot" and not "yach-t")
—exactly the sorts of words that cause problems for children learning to read and people
with what's known as surface dyslexia (also called orthographic or visual dyslexia).
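The rule-based approach, and the way irregular words trip it up, can be sketched like this in Python. The handful of letter-to-sound rules, the exception list, and the phoneme symbols are all simplified assumptions for illustration; real English letter-to-sound rules are far larger and more subtle.

```python
# A few toy letter-to-sound rules. Two-letter graphemes are tried first;
# any letter without a rule is assumed to sound like itself.
GRAPHEME_RULES = {
    "sh": "sh", "ch": "ch", "th": "th", "ck": "k", "ee": "ee", "oo": "oo",
    "c": "k", "q": "k",
}

# Irregular words that the rules get wrong are stored whole (assumed entries).
EXCEPTIONS = {
    "colonel": ["k", "er", "n", "l"],
    "yacht": ["y", "o", "t"],
}


def letters_to_phonemes(word, use_exceptions=True):
    """Convert a written word to phonemes using simple rules.

    With use_exceptions=False the irregular-word list is ignored, which
    shows what the rules alone would produce.
    """
    word = word.lower()
    if use_exceptions and word in EXCEPTIONS:
        return list(EXCEPTIONS[word])
    phonemes = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in GRAPHEME_RULES:
            phonemes.append(GRAPHEME_RULES[pair])
            i += 2
        else:
            letter = word[i]
            phonemes.append(GRAPHEME_RULES.get(letter, letter))
            i += 1
    return phonemes
```

The rules cope with regular words such as "tack" or "sheep", and even with invented words. For "yacht", though, the rules alone give the "yach-t" reading, and only the exception list rescues the correct one.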
3. Phonemes to sound
Okay, so now we've converted our text (our sequence of written words) into a list of phonemes (a sequence of sounds
that need speaking). But where do we get the basic phonemes that the computer reads out loud when it's turning
text into speech? There are three different approaches. One is to use recordings of humans saying the phonemes, another is for the
computer to generate the phonemes itself by generating basic sound frequencies (a bit like a
music synthesizer), and a third approach is to mimic the mechanism of the human voice.
Speech synthesizers that use recorded human voices have to be preloaded with
little snippets of human sound they can rearrange. In other words,
a programmer has to record lots of examples of a person saying
different things, break the spoken sentences into words and the words
into phonemes. If there are enough speech samples, the computer can
rearrange the bits in any number of different ways to create entirely
new words and sentences. This type of speech synthesis is called
concatenative (from Latin words that simply mean to link bits
together in a series or chain). Since it's based on human recordings,
concatenation is the most natural-sounding type of speech synthesis
and it's widely used by machines that have only limited things to say
(for example, corporate telephone switchboards). Its main drawback is that it's limited to a single voice (a single
speaker of a single sex) and (generally) a single language.
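Here is a minimal Python sketch of the concatenative idea. Since no real recordings are available here, short sine tones stand in for the recorded snippets; the tone frequencies, the UNIT_DATABASE name, and the crossfade length are all assumptions for illustration. The point is only the mechanism: fetch a stored snippet for each phoneme and join the snippets end to end, blending the joins so they don't click.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed for this sketch)


def fake_recording(freq_hz, num_samples=800):
    """Stand-in for a recorded snippet: a short sine tone, not real speech."""
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
            for i in range(num_samples)]


# Hypothetical unit database: one stored snippet per phoneme.
UNIT_DATABASE = {
    "k": fake_recording(300),
    "a": fake_recording(220),
    "t": fake_recording(400),
}


def concatenate(phonemes, crossfade=80):
    """Join the stored snippets for `phonemes` into one list of samples.

    Neighboring snippets overlap by `crossfade` samples and are blended
    linearly across the join. `crossfade` must be shorter than every snippet.
    """
    output = []
    for phoneme in phonemes:
        unit = UNIT_DATABASE[phoneme]
        if output and crossfade > 0:
            # Fade out the tail of the audio so far while fading in the new unit.
            for i in range(crossfade):
                weight = i / crossfade
                index = -crossfade + i
                output[index] = (1 - weight) * output[index] + weight * unit[i]
            output.extend(unit[crossfade:])
        else:
            output.extend(unit)
    return output
```

Calling concatenate(["k", "a", "t"]) gives one continuous run of samples for "cat", and concatenate(["t", "a", "k"]) reuses the same three snippets for "tack". The limitation described above is visible too: the output can only ever sound like whatever was stored in the database.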
If you consider that speech is just a pattern of sound that varies in pitch
(frequency) and volume (amplitude)—like the noise coming out of a
musical instrument—it ought to be possible to make an electronic
device that can generate whatever speech sounds it needs from scratch,
like a music synthesizer. This type of speech synthesis is known
as formant, because formants are the 3–5 key (resonant) frequencies of sound that
the human vocal apparatus generates and combines to make the sound of speech or singing. Unlike speech synthesizers that use
concatenation, which are limited to rearranging prerecorded sounds, formant
speech synthesizers can say absolutely anything—even words that don't exist
or foreign words they've never encountered. That makes formant synthesizers a good choice
for GPS satellite (navigation) computers, which need to be able to read out many thousands
of different (and often unusual) place names that would be hard to memorize. In theory, formant synthesizers can easily switch from a male to a female voice (by roughly doubling the frequency) or to a child's voice (by trebling it),
and they can speak in any language. In practice, concatenation synthesizers now use
huge libraries of sounds so they can say pretty much anything too. A
more obvious difference is that concatenation synthesizers sound much
more natural than formant ones, which still tend to sound relatively
artificial and robotic.
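A very rough Python sketch shows how a vowel-like sound can be built from scratch around formant frequencies. This is a crude additive approximation, not a production formant synthesizer: it sums the harmonics of a chosen pitch and boosts those that fall near each formant. The formant values for an "ah"-like vowel are approximate, and the function names, bandwidth, and sample rate are assumptions for this example.

```python
import math
import struct
import wave

SAMPLE_RATE = 16000  # samples per second (assumed)

# Approximate formant frequencies in Hz for an "ah"-like vowel (illustrative).
AH_FORMANTS = [730, 1090, 2440]


def resonance_gain(freq, formants, bandwidth=100.0):
    """How strongly a frequency is boosted by simple peaks at the formants."""
    return sum(1.0 / (1.0 + ((freq - f) / bandwidth) ** 2) for f in formants)


def synthesize_vowel(formants, pitch_hz=120.0, num_samples=4000):
    """Build a vowel-like tone as a list of samples between -1 and 1.

    Every harmonic of `pitch_hz` below 4000 Hz is included, weighted by
    how close it lies to one of the formant frequencies.
    """
    harmonics = []
    k = 1
    while k * pitch_hz < 4000:
        freq = k * pitch_hz
        harmonics.append((freq, resonance_gain(freq, formants)))
        k += 1
    samples = []
    for n in range(num_samples):
        t = n / SAMPLE_RATE
        samples.append(sum(gain * math.sin(2 * math.pi * freq * t)
                           for freq, gain in harmonics))
    peak = max(abs(s) for s in samples) or 1.0
    return [s / peak for s in samples]  # normalize so the loudest sample is 1


def save_wav(path, samples):
    """Write samples to a mono 16-bit WAV file so you can listen to them."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(SAMPLE_RATE)
        frames = b"".join(struct.pack("<h", int(s * 32767)) for s in samples)
        wav_file.writeframes(frames)
```

To hear the result, you could call save_wav("ah.wav", synthesize_vowel(AH_FORMANTS)) and play the file. Passing pitch_hz=240 roughly doubles the pitch, along the lines of the male-to-female switch described above. Expect it to sound like a buzzy, robotic hum rather than a person.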
The most complex approach to generating sounds is called articulatory synthesis, and it means making computers speak by modeling the amazingly intricate human vocal apparatus. In theory, that should give the most realistic and humanlike voice of
all three methods. Although numerous researchers have experimented with mimicking the human voicebox, articulatory synthesis is still by far the least explored method, largely because of its complexity. The most elaborate form of articulatory synthesis would be to engineer a "talking head" robot with a moving mouth that produces sound in a similar way to a person by combining
mechanical, electrical, and electronic components, as necessary.
What are speech synthesizers used for?
Work your way through a typical day and you might encounter all kinds of
recorded voices, but as technology advances, it's getting harder to
figure out whether you're listening to a simple recording or
a speech synthesizer. You might have an alarm clock that wakes you up by speaking the time, probably
using crude, formant speech synthesis. If you have a speaking GPS
system in your car, that might use either concatenated speech
synthesis (if it has only a relatively limited vocabulary) or
formant synthesis (if the voice is adjustable and it can read place names).
If you have an ebook reader, maybe you have one with a built-in
narrator? If you're visually impaired, you might use a screen reader
that speaks the words aloud from your computer screen (most modern
Windows computers have a program called Narrator that you can switch
on to do just this). Whether you use it or not,
it's likely your cellphone
has the ability to listen to your questions and
reply through an intelligent personal assistant—Siri (iPhone), Cortana (Microsoft),
or Google Assistant/Now (Android). If you're out and about on public
transportation, you'll hear recorded voices all the time speaking
safety and security announcements or telling you what trains and
buses are coming along next. Are they simple recordings of humans... or are they using
concatenated, synthesized speech? See if you can figure it out! One really
interesting use of speech synthesis is in teaching foreign languages.
Speech synthesizers are now so realistic that they're good enough for
language students to use by way of practice.
Who invented speech synthesis?
Talking computers sound like something out of science fiction—and indeed,
the most famous example of speech synthesis is exactly that. In
Stanley Kubrick's groundbreaking movie 2001: A Space Odyssey
(based on the novel by Arthur C. Clarke), a computer called HAL
famously chatters away in a humanlike voice and, at the end of the
story, breaks into a doleful rendition of the song Daisy Bell (A
Bicycle Built for Two) as an astronaut takes it apart.
Here's a whistle-stop tour through the history of speech synthesis:
1769: Austro-Hungarian inventor Wolfgang von Kempelen develops one of the world's first mechanical speaking machines,
which uses bellows and bagpipe components to produce crude noises similar to a human voice. It's an early
example of articulatory speech synthesis.
1770s: Around the same time, Danish scientist Christian Kratzenstein, working in Russia, builds a mechanical version
of the human vocal system, using modified organ pipes, that can
speak the five vowels.
1791: Von Kempelen describes his speaking machine in a book titled
Mechanismus der menschlichen Sprache nebst Beschreibung einer
sprechenden Maschine (Mechanism of Human Language with a Description of a Speaking Machine).
1837: English physicist and prolific inventor Charles Wheatstone, long fascinated by musical instruments and sound, rediscovers
and popularizes an improved version of the von Kempelen speaking machine.
1928: Working at Bell Laboratories, American scientist
Homer W. Dudley
develops an electronic speech analyzer called the Vocoder
(not to be confused with the famous voice-altering Vocoder
used in many electronic pop records in the 1970s). Dudley develops the Vocoder into the Voder, an electronic speech
synthesizer operated through a keyboard. A writer from The New
York Times sees the device demonstrated at the 1939 World's Fair
and declares "My God it talks!"
1940s: Another American scientist, Frank Cooper of Haskins Laboratories,
develops a system called Pattern Playback that can generate speech sounds from their frequency spectrum.
1953: American scientist Walter Lawrence makes PAT (Parametric Artificial Talker), the first formant synthesizer, which makes speech sounds by combining four, six, and later eight formant frequencies.
1958: MIT scientist George Rosen develops a pioneering articulatory synthesizer called DAVO (Dynamic Analog of the VOcal tract).
1960s/1970s: Back at Bell Laboratories, Cecil Coker
works on better methods of articulatory synthesis, while Joseph P. Olive
develops concatenative synthesis.
1978: Texas Instruments releases its TMC0281 speech synthesizer chip and launches a handheld electronic toy called
Speak & Spell, which uses crude formant speech synthesis as a teaching aid.
1984: The Apple Macintosh computer ships with the built-in MacInTalk speech
synthesizer, widely used in popular songs such as Radiohead's Fitter Happier and Paranoid Android.
2001: AT&T introduces Natural Voices, a natural-sounding concatenative
speech synthesizer based on a huge database of sound samples recorded from real people. The system is widely used in online applications, such as websites that can read emails aloud.
2011: Apple adds Siri, a voice-powered "intelligent agent," to its iPhone (smartphone).
2014: Microsoft announces Skype Translator, which can automatically translate a spoken conversation from one language into one of 40 others. The same year, Microsoft demonstrates Cortana, its own version of Siri.
2015: Amazon Echo, a personal assistant featuring voice software called Alexa, goes on general release.
2016: Google joins the club by releasing Google Assistant, its answer to Siri and Cortana, later incorporating it into Google Home.
Experiment for yourself!
Why not experience a bit of speech synthesis for yourself? Here are two examples of what the first sentence of this
article sounds like read out by Microsoft Sam (a formant speech synthesizer built into Windows XP) and Microsoft Anna (a more natural-sounding,
formant synthesizer in Windows Vista and Windows 7). Notice how much the technology improved just in the five years or so between those different speech
synthesizers being released.
If you have a modern computer (Windows or Mac), it's almost certainly got a speech synthesizer lurking in it somewhere:
Windows: The built-in text-to-speech program is called Narrator.