
A computer speaks the word "Hi."

Speech synthesizers

How long will it be before your computer gazes deep in your eyes and, with all the electronic sincerity it can muster, mutters those three little words that mean so much: "I love you"! In theory, it could happen right this minute: virtually every modern Windows PC has a speech synthesizer (a computerized voice that turns written text into speech) built in, mostly to help people with visual disabilities who can't read tiny text printed on a screen. How exactly do speech synthesizers go about converting written language into spoken? Let's take a closer look!

Artwork: Humans don't communicate by printing words on their foreheads for other people to read, so why should computers? Thanks to smartphone agents like Siri, Cortana, and "OK Google," people are slowly getting used to the idea of speaking commands to a computer and getting back spoken replies.


Contents

  1. What is speech synthesis?
  2. How does speech synthesis work?
  3. What are phonemes?
  4. What are speech synthesizers used for?
  5. Who invented speech synthesis?
  6. Experiment for yourself!
  7. Find out more

What is speech synthesis?

Computers do their jobs in three distinct stages called input (where you feed information in, often with a keyboard or mouse), processing (where the computer responds to your input, say, by adding up some numbers you typed in or enhancing the colors on a photo you scanned), and output (where you get to see how the computer has processed your input, typically on a screen or printed out on paper). Speech synthesis is simply a form of output where a computer or other machine reads words to you out loud in a real or simulated voice played through a loudspeaker; the technology is often called text-to-speech (TTS).

Talking machines are nothing new—somewhat surprisingly, they date back to the 18th century—but computers that routinely speak to their operators are still extremely uncommon. True, we drive our cars with the help of computerized navigators, engage with computerized switchboards when we phone utility companies, and listen to computerized apologies at railroad stations when our trains are running late. But hardly any of us talk to our computers (with voice recognition) or sit around waiting for them to reply. Professor Stephen Hawking was a truly unique individual—in more ways than one: can you think of any other person famous for talking with a computerized voice? All that may change in future as computer-generated speech becomes less robotic and more human.


How does speech synthesis work?

Let's say you have a paragraph of written text that you want your computer to speak aloud. How does it turn the written words into ones you can actually hear? There are essentially three stages involved, which I'll refer to as text to words, words to phonemes, and phonemes to sound.

1. Text to words

Reading words sounds easy, but if you've ever listened to a young child reading a book that was just too hard for them, you'll know it's not as trivial as it seems. The main problem is that written text is ambiguous: the same written information can often mean more than one thing, and usually you have to understand the meaning or make an educated guess to read it correctly. So the initial stage in speech synthesis, which is generally called pre-processing or normalization, is all about reducing ambiguity: narrowing down the many different ways you could read a piece of text into the one that's most appropriate. [1]

Preprocessing involves going through the text and cleaning it up so the computer makes fewer mistakes when it actually reads the words aloud. Things like numbers, dates, times, abbreviations, acronyms, and special characters (currency symbols and so on) need to be turned into words—and that's harder than it sounds. The number 1843 might refer to a quantity of items ("one thousand eight hundred and forty three"), a year or a time ("eighteen forty three"), or a padlock combination ("one eight four three"), each of which is read out slightly differently. While humans follow the sense of what's written and figure out the pronunciation that way, computers generally don't have the power to do that, so they have to use statistical probability techniques (typically Hidden Markov Models) or neural networks (computer programs structured like arrays of brain cells that learn to recognize patterns) to arrive at the most likely pronunciation instead. So if the word "year" occurs in the same sentence as "1843," it might be reasonable to guess this is a date and pronounce it "eighteen forty three." If there were a decimal point before the numbers (".843"), they would need to be read differently, as "point eight four three."


Artwork: Context matters: A speech synthesizer needs some understanding of what it's reading.
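To make that idea concrete, here's a tiny, rule-based sketch (in Python) of the kind of decision a normalizer has to make. Real systems rely on statistical models or neural networks rather than hand-written rules, so the function names and context clues below are invented purely for illustration:

```python
# A minimal, rule-based sketch of pre-processing (normalization).
# Real synthesizers use statistical models or neural networks instead of
# crude keyword heuristics like these.

DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
TENS = ["", "ten", "twenty", "thirty", "forty",
        "fifty", "sixty", "seventy", "eighty", "ninety"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]

def say_two_digits(n):
    """Read a number from 0 to 99 aloud (43 -> 'forty three')."""
    if n < 10:
        return DIGITS[n]
    if n < 20:
        return TEENS[n - 10]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {DIGITS[ones]}"

def normalize_number(token, sentence):
    """Choose a reading for a four-digit number from crude context clues."""
    if "year" in sentence:
        # Treat it as a date: '1843' -> 'eighteen forty three'
        return f"{say_two_digits(int(token[:2]))} {say_two_digits(int(token[2:]))}"
    if "combination" in sentence:
        # Read it digit by digit: '1843' -> 'one eight four three'
        return " ".join(DIGITS[int(d)] for d in token)
    return token  # otherwise leave it for a later rule to handle

print(normalize_number("1843", "the year 1843 was eventful"))    # eighteen forty three
print(normalize_number("1843", "the padlock combination 1843"))  # one eight four three
```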

Preprocessing also has to tackle homographs, words pronounced in different ways according to what they mean. The word "read" can be pronounced either "red" or "reed," so a sentence such as "I read the book" is immediately problematic for a speech synthesizer. But if it can figure out that the preceding text is entirely in the past tense, by recognizing past-tense verbs ("I got up... I took a shower... I had breakfast... I read a book..."), it can make a reasonable guess that "I read [red] a book" is probably correct. Likewise, if the preceding text is "I get up... I take a shower... I have breakfast..." the smart money should be on "I read [reed] a book."
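Here's an equally simple sketch of that homograph guesswork, using nothing more than a short list of telltale past-tense and present-tense verbs. A real synthesizer would use a part-of-speech tagger or a statistical language model; the word lists and function below are invented for illustration:

```python
# A toy sketch of homograph disambiguation: guess the pronunciation of
# "read" from the tense of the surrounding verbs. Real systems use
# part-of-speech taggers and language models, not tiny word lists.

PAST_TENSE_CLUES = {"got", "took", "had", "was", "were", "did", "went"}
PRESENT_TENSE_CLUES = {"get", "take", "have", "am", "is", "are", "do", "go"}

def pronounce_read(preceding_text):
    """Return 'red' or 'reed' depending on which tense clues dominate."""
    words = {w.strip(".,!?") for w in preceding_text.lower().split()}
    past = len(words & PAST_TENSE_CLUES)
    present = len(words & PRESENT_TENSE_CLUES)
    return "red" if past > present else "reed"

print(pronounce_read("I got up. I took a shower. I had breakfast."))   # red
print(pronounce_read("I get up. I take a shower. I have breakfast."))  # reed
```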

2. Words to phonemes

Having figured out the words that need to be said, the speech synthesizer now has to generate the speech sounds that make up those words. In theory, this is a simple problem: all the computer needs is a huge alphabetical list of words and details of how to pronounce each one (much as you'd find in a typical dictionary, where the pronunciation is listed before or after the definition). For each word, we'd need a list of the phonemes that make up its sound.

What are phonemes?


Crudely speaking, phonemes are to spoken language what letters are to written language: they're the atoms of spoken sound—the sound components from which you can make any spoken word you like. The word cat consists of three phonemes making the sounds /k/ (as in can), /a/ (as in pad), and /t/ (as in tusk). Rearrange the order of the phonemes and you could make the words "act" or "tack."

There are only 26 letters in the English alphabet, but over 40 phonemes. That's because some letters and letter groups can be read in multiple ways (a, for example, can be read differently, as in 'pad' or 'paid'), so instead of one phoneme per letter, there are phonemes for all the different letter sounds. Some languages need more or fewer phonemes than others (typically 20-60).

Okay, that's phonemes in a nutshell—a hugely simplified definition that will do us for starters. If you want to know more, start with the article on phonemes in Wikipedia.

In theory, if a computer has a dictionary of words and phonemes, all it needs to do to read a word is look it up in the list and then read out the corresponding phonemes, right? In practice, it's harder than it sounds. As any good actor can demonstrate, a single sentence can be read out in many different ways according to the meaning of the text, the person speaking, and the emotions they want to convey (in linguistics, this idea is known as prosody and it's one of the hardest problems for speech synthesizers to address). Within a sentence, even a single word (like "read") can be read in multiple ways (as "red"/"reed") because it has multiple meanings. And even within a word, a given phoneme will sound different according to the phonemes that come before and after it.
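Still, the basic lookup idea is easy to picture. The toy dictionary below maps each word to a list of phonemes; the entries and phoneme symbols are made up, loosely in the style of a real pronouncing dictionary such as the freely available CMU Pronouncing Dictionary:

```python
# A minimal sketch of a pronouncing dictionary: each word maps to a list
# of phonemes. A real dictionary would hold tens of thousands of entries.

PRONOUNCING_DICT = {
    "cat":  ["k", "a", "t"],
    "act":  ["a", "k", "t"],
    "tack": ["t", "a", "k"],
    "the":  ["dh", "ax"],
    "sat":  ["s", "a", "t"],
}

def words_to_phonemes(words):
    """Look up each word and join the phoneme lists into one sequence."""
    phonemes = []
    for word in words:
        phonemes.extend(PRONOUNCING_DICT[word.lower()])
    return phonemes

print(words_to_phonemes(["the", "cat", "sat"]))
# ['dh', 'ax', 'k', 'a', 't', 's', 'a', 't']
```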

An alternative approach involves breaking written words into their graphemes (written component units, typically made from the individual letters or syllables that make up a word) and then generating phonemes that correspond to them using a set of simple rules. This is a bit like a child attempting to read words he or she has never previously encountered (the reading method called phonics is similar). The advantage of doing that is that the computer can make a reasonable attempt at reading any word, whether or not it's stored in the dictionary: a made-up word, a foreign word, or an unusual name or technical term. The disadvantage is that languages such as English have large numbers of irregular words that are pronounced in a very different way from how they're written (such as "colonel," which we say as "kernel" and not "coll-o-nell"; and "yacht," which is pronounced "yot" and not "yach-t")—exactly the sorts of words that cause problems for children learning to read and people with what's known as surface dyslexia (also called orthographic or visual dyslexia).
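Here's what a deliberately crude letter-to-sound sketch might look like. The rules are invented for illustration and cover only a handful of spellings; notice how an irregular word like "yacht" defeats them, exactly as described above:

```python
# A toy grapheme-to-phoneme converter: scan the word left to right,
# trying digraph rules ('ch', 'ck', ...) before single letters.
# The rule list is invented and far smaller than a real one would be.

RULES = [
    ("ch", "ch"), ("sh", "sh"), ("th", "th"), ("ck", "k"),  # digraphs first
    ("a", "a"), ("b", "b"), ("c", "k"), ("d", "d"), ("e", "e"),
    ("f", "f"), ("g", "g"), ("h", "h"), ("i", "i"), ("k", "k"),
    ("l", "l"), ("m", "m"), ("n", "n"), ("o", "o"), ("p", "p"),
    ("r", "r"), ("s", "s"), ("t", "t"), ("u", "u"), ("y", "y"),
]

def graphemes_to_phonemes(word):
    """Match the first applicable rule at each position in the word."""
    phonemes, i = [], 0
    while i < len(word):
        for grapheme, phoneme in RULES:
            if word[i:].startswith(grapheme):
                phonemes.append(phoneme)
                i += len(grapheme)
                break
        else:
            i += 1  # skip any letter we have no rule for
    return phonemes

print(graphemes_to_phonemes("tack"))   # ['t', 'a', 'k'] -- correct
print(graphemes_to_phonemes("yacht"))  # ['y', 'a', 'ch', 't'] -- "yach-t", wrong, as the text warns!
```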

3. Phonemes to sound

Okay, so now we've converted our text (our sequence of written words) into a list of phonemes (a sequence of sounds that need speaking). But where do we get the basic phonemes that the computer reads out loud when it's turning text into speech? There are three different approaches. One is to use recordings of humans saying the phonemes, another is for the computer to generate the phonemes itself from basic sound frequencies (a bit like a music synthesizer), and a third is to mimic the mechanism of the human voice.

Concatenative

Speech synthesizers that use recorded human voices have to be preloaded with little snippets of human sound they can rearrange. In other words, a programmer has to record lots of examples of a person saying different things, then break the spoken sentences into words and the words into phonemes. If there are enough speech samples, the computer can rearrange the bits in any number of different ways to create entirely new words and sentences. This type of speech synthesis is called concatenative (from Latin words that simply mean to link bits together in a series or chain). Since it's based on human recordings, concatenation is the most natural-sounding type of speech synthesis and it's widely used by machines that have only limited things to say (for example, corporate telephone switchboards). Its main drawback is that it's limited to a single voice (a single speaker of a single sex) and (generally) a single language.
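If you imagine the prerecorded snippets stored as one small audio file per phoneme, concatenation really is little more than sticking files end to end. The sketch below assumes hypothetical recordings named k.wav, a.wav, t.wav, and so on, all saved with the same sample rate and format:

```python
# A bare-bones sketch of concatenative synthesis: glue together
# prerecorded snippets, one per phoneme. Assumes files like 'k.wav',
# 'a.wav', 't.wav' already exist (hypothetical recordings).

import wave

def concatenate_phonemes(phonemes, output_path="word.wav"):
    """Append the raw audio frames of each phoneme recording, in order."""
    frames, params = [], None
    for phoneme in phonemes:
        with wave.open(f"{phoneme}.wav", "rb") as snippet:
            params = snippet.getparams()  # sample rate, channels, and so on
            frames.append(snippet.readframes(snippet.getnframes()))
    with wave.open(output_path, "wb") as out:
        out.setparams(params)
        for chunk in frames:
            out.writeframes(chunk)

# 'cat' spoken by stitching three recorded phonemes together
concatenate_phonemes(["k", "a", "t"])
```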

Formant

If you consider that speech is just a pattern of sound that varies in pitch (frequency) and volume (amplitude)—like the noise coming out of a musical instrument—it ought to be possible to make an electronic device that can generate whatever speech sounds it needs from scratch, like a music synthesizer. This type of speech synthesis is known as formant, because formants are the 3–5 key (resonant) frequencies of sound that the human vocal apparatus generates and combines to make the sound of speech or singing.

Unlike speech synthesizers that use concatenation, which are limited to rearranging prerecorded sounds, formant speech synthesizers can say absolutely anything—even words that don't exist or foreign words they've never encountered. That makes formant synthesizers a good choice for GPS satellite (navigation) computers, which need to be able to read out many thousands of different (and often unusual) place names that would be hard to memorize. In theory, formant synthesizers can easily switch from a male to a female voice (by roughly doubling the frequency) or to a child's voice (by trebling it), and they can speak in any language. In practice, concatenation synthesizers now use huge libraries of sounds so they can say pretty much anything too. A more obvious difference is that concatenation synthesizers sound much more natural than formant ones, which still tend to sound relatively artificial and robotic. [2]


Artwork: Concatenative versus formant speech synthesis. Left: A concatenative synthesizer builds up speech from pre-stored fragments; the words it speaks are limited rearrangements of those sounds. Right: Like a music synthesizer, a formant synthesizer uses frequency generators to generate any kind of sound.
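To get a feel for the formant idea, here's a very rough sketch that builds a vowel-like tone by adding sine waves at a pitch frequency and a couple of typical formant frequencies. A real formant synthesizer shapes a source waveform with resonant filters rather than simply summing sines, so treat this purely as an illustration; the frequency values are approximate:

```python
# A crude illustration of formants: mix a fundamental pitch with a few
# resonant frequencies to make a vowel-like tone and save it as a WAV file.
# (Real formant synthesis filters a glottal source; this just sums sines.)

import math, struct, wave

SAMPLE_RATE = 16000

def vowel(formants, pitch=120, seconds=0.5):
    """Mix a pitch tone with its formant frequencies into one waveform."""
    freqs = [pitch] + list(formants)
    samples = []
    for n in range(int(SAMPLE_RATE * seconds)):
        t = n / SAMPLE_RATE
        value = sum(math.sin(2 * math.pi * f * t) for f in freqs) / len(freqs)
        samples.append(int(value * 32767))  # scale to 16-bit range
    return samples

def write_wav(samples, path):
    with wave.open(path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)            # 16-bit audio
        out.setframerate(SAMPLE_RATE)
        out.writeframes(struct.pack(f"<{len(samples)}h", *samples))

# Roughly 'ah'-like formants (about 700 Hz and 1200 Hz) versus
# 'ee'-like formants (about 300 Hz and 2300 Hz) -- approximate values.
write_wav(vowel([700, 1200]), "ah.wav")
write_wav(vowel([300, 2300]), "ee.wav")
```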

Articulatory

The most complex approach to generating sounds is called articulatory synthesis, and it means making computers speak by modeling the amazingly intricate human vocal apparatus. In theory, that should give the most realistic and humanlike voice of all three methods. Although numerous researchers have experimented with mimicking the human voicebox, articulatory synthesis is still by far the least explored method, largely because of its complexity. The most elaborate form of articulatory synthesis would be to engineer a "talking head" robot with a moving mouth that produces sound in a similar way to a person by combining mechanical, electrical, and electronic components, as necessary. [3]

What are speech synthesizers used for?

Work your way through a typical day and you might encounter all kinds of recorded voices, but as technology advances, it's getting harder to figure out whether you're listening to a simple recording or a speech synthesizer.

You might have an alarm clock that wakes you up by speaking the time, probably using crude, formant speech synthesis. If you have a speaking GPS system in your car, that might use either concatenated speech synthesis (if it has only a relatively limited vocabulary) or formant synthesis (if the voice is adjustable and it can read place names). If you have an ebook reader, maybe you have one with a built-in narrator? If you're visually impaired, you might use a screen reader that speaks the words aloud from your computer screen (most modern Windows computers have a program called Narrator that you can switch on to do just this).

Whether you use it or not, it's likely your cellphone has the ability to listen to your questions and reply through an intelligent personal assistant—Siri (iPhone), Cortana (Microsoft), or Google Assistant/Now (Android). If you're out and about on public transportation, you'll hear recorded voices all the time speaking safety and security announcements or telling you what trains and buses are coming along next. Are they simple recordings of humans... or are they using concatenated, synthesized speech? See if you can figure it out!

One really interesting use of speech synthesis is in teaching foreign languages. Speech synthesizers are now so realistic that they're good enough for language students to use by way of practice.

Photo: A radio announcer's booth at a rodeo. Will humans still speak to one another in the future? All sorts of public announcements are now made by recorded or synthesized computer-controlled voices, but there are plenty of areas where even the smartest machines would fear to tread. Imagine a computer trying to commentate on a fast-moving sports event, such as a rodeo, for example. Even if it could watch and correctly interpret the action, and even if it had all the right words to speak, could it really convey the right kind of emotion? Photo by Carol M. Highsmith, courtesy of Gates Frontiers Fund Wyoming Collection within the Carol M. Highsmith Archive, Library of Congress, Prints and Photographs Division.

Who invented speech synthesis?

Talking computers sound like something out of science fiction—and indeed, the most famous example of speech synthesis is exactly that. In Stanley Kubrick's groundbreaking movie 2001: A Space Odyssey (based on the novel by Arthur C. Clarke), a computer called HAL famously chatters away in a humanlike voice and, at the end of the story, breaks into a doleful rendition of the song Daisy Bell (A Bicycle Built for Two) as an astronaut takes it apart.


Artwork: Speak & Spell—An iconic, electronic toy from Texas Instruments that introduced a whole generation of children to speech synthesis in the late 1970s. It was built around the TI TMC0281 chip.

Here's a whistle-stop tour through the history of speech synthesis:

Experiment for yourself!

Why not experience a bit of speech synthesis for yourself? Here are three examples of what the first sentence of this article sounds like read out by Microsoft Sam (a formant speech synthesizer built into Windows XP), Microsoft Anna (a more natural-sounding formant synthesizer in Windows Vista and Windows 7), and Olivia (one of the voices in IBM's Watson Text-to-Speech deep neural network synthesizer—read how it works). The first recording uses state-of-the-art technology from the early 2000s; the second dates from about 2005–2010; and I made the third recording a decade later in 2021. Notice how much the technology has improved in two decades!

Sam (c. 2001)

Anna (c. 2005)

Olivia (c. 2020)

If you have a modern computer (Windows or Mac), it's almost certainly got a speech synthesizer lurking in it somewhere, so why not try it out for yourself?
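One easy way to try it from a program is the third-party pyttsx3 library for Python (installed with pip install pyttsx3), which simply drives whatever speech engine your operating system already has built in:

```python
# Speak a sentence using the computer's built-in speech engine
# (SAPI on Windows, NSSpeechSynthesizer on a Mac) via the pyttsx3 library.

import pyttsx3

engine = pyttsx3.init()
engine.setProperty("rate", 160)   # speaking speed, in words per minute
engine.say("How long will it be before your computer says: I love you?")
engine.runAndWait()               # block until the speech has finished
```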


Find out more


Notes and references

  1.    Pre-processing is described in more detail in "Chapter 7: Speech synthesis from textual or conceptual input" of Speech Synthesis and Recognition by Wendy Holmes, Taylor & Francis, 2002, p.93ff.
  2.    For more on concatenative synthesis, see Chapter 14 ("Synthesis by concatenation and signal-process modification") of Text-to-Speech Synthesis by Paul Taylor. Cambridge University Press, 2009, p.412ff.
  3.    For a much more detailed explanation of the difference between formant, concatenative, and articulatory synthesis, see Chapter 2 ("Low-level synthesizers: current status") of Developments in Speech Synthesis by Mark Tatham and Katherine Morton, Wiley, 2005, p.23–37.
