A computer speaks the word Hi.

Speech synthesizers

by Chris Woodford. Last updated: March 25, 2023.

How long will it be before your computer gazes deep in your eyes and, with all the electronic sincerity it can muster, mutters those three little words that mean so much: "I love you"? In theory, it could happen right this minute: virtually every modern Windows PC has a speech synthesizer (a computerized voice that turns written text into speech) built in, mostly to help people with visual disabilities who can't read tiny text printed on a screen. How exactly do speech synthesizers go about converting written language into spoken? Let's take a closer look!

Artwork: Humans don't communicate by printing words on their foreheads for other people to read, so why should computers? Thanks to smartphone agents like Siri, Cortana, and "OK Google," people are slowly getting used to the idea of speaking commands to a computer and getting back spoken replies.

What is speech synthesis?

Computers do their jobs in three distinct stages called input (where you feed information in, often with a keyboard or mouse), processing (where the computer responds to your input, say, by adding up some numbers you typed in or enhancing the colors on a photo you scanned), and output (where you get to see how the computer has processed your input, typically on a screen or printed out on paper). Speech synthesis is simply a form of output where a computer or other machine reads words to you out loud in a real or simulated voice played through a loudspeaker; the technology is often called text-to-speech (TTS).

Talking machines are nothing new (somewhat surprisingly, they date back to the 18th century), but computers that routinely speak to their operators are still extremely uncommon. True, we drive our cars with the help of computerized navigators, engage with computerized switchboards when we phone utility companies, and listen to computerized apologies at railroad stations when our trains are running late. But hardly any of us talk to our computers (with voice recognition) or sit around waiting for them to reply. Professor Stephen Hawking was a truly unique individual, in more ways than one: can you think of any other person famous for talking with a computerized voice? All that may change in future as computer-generated speech becomes less robotic and more human.

How does speech synthesis work?

Let's say you have a paragraph of written text that you want your computer to speak aloud. How does it turn the written words into ones you can actually hear? There are essentially three stages involved, which I'll refer to as text to words, words to phonemes, and phonemes to sound.

1. Text to words

Reading words sounds easy, but if you've ever listened to a young child reading a book that was just too hard for them, you'll know it's not as trivial as it seems. The main problem is that written text is ambiguous: the same written information can often mean more than one thing and usually you have to understand the meaning or make an educated guess to read it correctly. So the initial stage in speech synthesis, which is generally called pre-processing or normalization, is all about reducing ambiguity: it's about narrowing down the many different ways you could read a piece of text into the one that's the most appropriate. [1]

Preprocessing involves going through the text and cleaning it up so the computer makes fewer mistakes when it actually reads the words aloud. Things like numbers, dates, times, abbreviations, acronyms, and special characters (currency symbols and so on) need to be turned into words—and that's harder than it sounds. The number 1843 might refer to a quantity of items ("one thousand eight hundred and forty three"), a year or a time ("eighteen forty three"), or a padlock combination ("one eight four three"), each of which is read out slightly differently. While humans follow the sense of what's written and figure out the pronunciation that way, computers generally don't have the power to do that, so they have to use statistical probability techniques (typically Hidden Markov Models) or neural networks (computer programs structured like arrays of brain cells that learn to recognize patterns) to arrive at the most likely pronunciation instead. So if the word "year" occurs in the same sentence as "1843," it might be reasonable to guess this is a date and pronounce it "eighteen forty three." If there were a decimal point before the numbers (".843"), they would need to be read differently as "eight four three."
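To make the 1843 example concrete, here's a minimal sketch of how a normalizer might pick between the three readings. It's an illustration only: instead of the Hidden Markov Models or neural networks mentioned above, it uses a few hand-written keyword cues, and the names (`read_number`, `YEAR_CUES`, `DIGIT_CUES`) and cue words are my own assumptions. It also speaks the decimal point aloud as "point," which is an extra choice of mine.

```python
import re

ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty",
        "fifty", "sixty", "seventy", "eighty", "ninety"]

# Hypothetical cue words; a real system would learn these from data.
YEAR_CUES = {"year", "born", "century"}
DIGIT_CUES = {"combination", "padlock", "code", "pin"}


def two_digits(n):
    """Spell out a number from 0 to 99."""
    if n < 10:
        return ONES[n]
    if n < 20:
        return TEENS[n - 10]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])


def as_quantity(n):
    """Spell out a number from 0 to 9999 as a quantity."""
    parts = []
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    if thousands:
        parts.append(ONES[thousands] + " thousand")
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest or not parts:
        if parts:
            parts.append("and")
        parts.append(two_digits(rest))
    return " ".join(parts)


def as_year(n):
    """Read a four-digit year in pairs (fine for 1843; years such as
    1905 or 2000 would need extra rules that are left out here)."""
    high, low = divmod(n, 100)
    return two_digits(high) + " " + two_digits(low)


def as_digits(digits):
    """Read a string of digits one at a time."""
    return " ".join(ONES[int(d)] for d in digits)


def read_number(token, sentence):
    """Choose a reading for a numeric token from the words around it."""
    if token.startswith("."):
        return "point " + as_digits(token[1:])
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    if words & YEAR_CUES:
        return as_year(int(token))
    if words & DIGIT_CUES:
        return as_digits(token)
    return as_quantity(int(token))
```

With these rules, `read_number("1843", "In the year 1843 the bridge opened.")` gives "eighteen forty three," while the same token in "We sold 1843 tickets." falls through to the quantity reading.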

How context determines the sounds that a speech synthesizer reads out

Artwork: Context matters: A speech synthesizer needs some understanding of what it's reading.

Preprocessing also has to tackle homographs, words pronounced in different ways according to what they mean. The word "read" can be pronounced either "red" or "reed," so a sentence such as "I read the book" is immediately problematic for a speech synthesizer. But if it can figure out that the preceding text is entirely in the past tense, by recognizing past-tense verbs ("I got up... I took a shower... I had breakfast... I read a book..."), it can make a reasonable guess that "I read [red] a book" is probably correct. Likewise, if the preceding text is "I get up... I take a shower... I have breakfast..." the smart money should be on "I read [reed] a book."
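The tense-spotting trick described above can be sketched in a few lines. This is a toy, not how any particular synthesizer does it: the two verb lists are tiny hand-picked assumptions, and the function name `pronounce_read` is hypothetical. It simply counts past-tense and present-tense verbs in the preceding text and backs whichever is more common.

```python
import re

# Tiny hand-picked verb lists (assumptions for illustration only).
PAST_VERBS = {"got", "took", "had", "went", "was", "did"}
PRESENT_VERBS = {"get", "take", "have", "go", "am", "do"}


def pronounce_read(preceding_text):
    """Guess how to say "read" from the tense of the text before it.

    Returns "red" if past-tense verbs outnumber present-tense ones,
    otherwise "reed" (ties default to the present-tense reading).
    """
    words = re.findall(r"[a-z]+", preceding_text.lower())
    past = sum(word in PAST_VERBS for word in words)
    present = sum(word in PRESENT_VERBS for word in words)
    return "red" if past > present else "reed"
```

Feed it "I got up. I took a shower. I had breakfast." and it backs "red"; feed it "I get up. I take a shower. I have breakfast." and it backs "reed."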

2. Words to phonemes

Having figured out the words that need to be said, the speech synthesizer now has to generate the speech sounds that make up those words. In theory, this is a simple problem: all the computer needs is a huge alphabetical list of words and details of how to pronounce each one (much as you'd find in a typical dictionary, where the pronunciation is listed before or after the definition). For each word, we'd need a list of the phonemes that make up its sound.

What are phonemes?

Voiceprint

Crudely speaking, phonemes are to spoken language what letters are to written language: they're the atoms of spoken sound—the sound components from which you can make any spoken word you like. The word cat consists of three phonemes making the sounds /k/ (as in can), /a/ (as in pad), and /t/ (as in tusk). Rearrange the order of the phonemes and you could make the words "act" or "tack."

There are only 26 letters in the English alphabet, but over 40 phonemes. That's because some letters and letter groups can be read in multiple ways (a, for example, can be read differently, as in 'pad' or 'paid'), so instead of one phoneme per letter, there are phonemes for all the different letter sounds. Some languages need more or fewer phonemes than others (typically 20-60).

Okay, that's phonemes in a nutshell—a hugely simplified definition that will do us for starters. If you want to know more, start with the article on phonemes in Wikipedia.

In theory, if a computer has a dictionary of words and phonemes, all it needs to do to read a word is look it up in the list and then read out the corresponding phonemes, right? In practice, it's harder than it sounds. As any good actor can demonstrate, a single sentence can be read out in many different ways according to the meaning of the text, the person speaking, and the emotions they want to convey (in linguistics, this idea is known as prosody and it's one of the hardest problems for speech synthesizers to address). Within a sentence, even a single word (like "read") can be read in multiple ways (as "red"/"reed") because it has multiple meanings. And even within a word, a given phoneme will sound different according to the phonemes that come before and after it.
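Here's roughly what the simple dictionary idea looks like as code, and why it runs into trouble straight away. The phoneme spellings are informal ones made up for this sketch (not a standard phonetic alphabet), and `PRONUNCIATIONS` and `lookup` are hypothetical names. Each word maps to a list of possible pronunciations, so a homograph like "read" comes back with two candidates and the lookup alone can't say which is right.

```python
# A toy pronouncing dictionary: word -> list of candidate pronunciations,
# each one a list of informally spelled phonemes.
PRONUNCIATIONS = {
    "cat": [["k", "a", "t"]],
    "act": [["a", "k", "t"]],
    "tack": [["t", "a", "k"]],
    "read": [["r", "ee", "d"], ["r", "e", "d"]],
}


def lookup(word):
    """Return every known pronunciation of a word (empty if unknown)."""
    return PRONUNCIATIONS.get(word.lower(), [])
```

A word that isn't in the dictionary returns an empty list, which is exactly the gap the rule-based approach tries to fill.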

An alternative approach involves breaking written words into their graphemes (written component units, typically made from the individual letters or syllables that make up a word) and then generating phonemes that correspond to them using a set of simple rules. This is a bit like a child attempting to read words he or she has never previously encountered (the reading method called phonics is similar). The advantage of doing that is that the computer can make a reasonable attempt at reading any word, whether or not it's a real word stored in the dictionary, a foreign word, or an unusual name or technical term. The disadvantage is that languages such as English have large numbers of irregular words that are pronounced in a very different way from how they're written (such as "colonel," which we say as "kernel" and not "coll-o-nell"; and "yacht," which is pronounced "yot" and not "yach-t"). These are exactly the sorts of words that cause problems for children learning to read and people with what's known as surface dyslexia (also called orthographic or visual dyslexia).
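A rough sketch of the rule-based idea, with a small exceptions list bolted on for the irregular words: the rules, the exception entries, the informal phoneme spellings, and the function names are all assumptions made up for this example, and real letter-to-sound rule sets are far larger.

```python
# Irregular words that the rules would get wrong (illustrative entries).
EXCEPTIONS = {
    "colonel": ["k", "er", "n", "uh", "l"],
    "yacht": ["y", "o", "t"],
}

# Two-letter graphemes are tried first, then single letters.
DIGRAPHS = {"ch": "ch", "sh": "sh", "th": "th", "ck": "k", "ee": "ee"}
SINGLE_LETTERS = {"c": "k", "q": "k", "x": "ks"}


def apply_rules(word):
    """Turn letters into phonemes using the simple rules alone."""
    word = word.lower()
    phonemes = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            phonemes.append(DIGRAPHS[pair])
            i += 2
        else:
            # Letters without a rule are assumed to sound like themselves.
            phonemes.append(SINGLE_LETTERS.get(word[i], word[i]))
            i += 1
    return phonemes


def letters_to_phonemes(word):
    """Check the exceptions list first, then fall back on the rules."""
    return EXCEPTIONS.get(word.lower()) or apply_rules(word)
```

On their own, the rules read "yacht" as y, a, ch, t (the "yach-t" mistake), which is why the exceptions list is checked first. The rules do cope with words that appear in no dictionary at all, which is the whole point of having them.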

3. Phonemes to sound

Okay, so now we've converted our text (our sequence of written words) into a list of phonemes (a sequence of sounds that need speaking). But where do we get the basic phonemes that the computer reads out loud when it's turning text into speech? There are three different approaches. One is to use recordings of humans saying the phonemes, another is for the computer to generate the phonemes itself by generating basic sound frequencies (a bit like a music synthesizer), and a third approach is to mimic the mechanism of the human voice.

Concatenative

Speech synthesizers that use recorded human voices have to be preloaded with little snippets of human sound they can rearrange. In other words, a programmer has to record lots of examples of a person saying different things, then break the spoken sentences into words and the words into phonemes. If there are enough speech samples, the computer can rearrange the bits in any number of different ways to create entirely new words and sentences. This type of speech synthesis is called concatenative (from Latin words that simply mean to link bits together in a series or chain). Since it's based on human recordings, concatenation is the most natural-sounding type of speech synthesis and it's widely used by machines that have only limited things to say (for example, corporate telephone switchboards). Its main drawback is that it's limited to a single voice (a single speaker of a single sex) and (generally) a single language.
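Stripped to its bare bones, concatenation is just joining stored snippets end to end. In this sketch the "recordings" are a few made-up sample values standing in for real audio, and `UNITS` and `concatenate` are hypothetical names. A practical system would, as far as I know, also smooth the joins and often store larger chunks than single phonemes, which this sketch ignores.

```python
# Stand-ins for recorded snippets: phoneme -> list of audio samples.
# Real recordings would contain thousands of samples each.
UNITS = {
    "k": [0.0, 0.4, -0.4],
    "a": [0.2, 0.6, 0.2, -0.6],
    "t": [0.5, -0.5],
}


def concatenate(phonemes, units=UNITS):
    """Join the stored snippet for each phoneme into one waveform."""
    samples = []
    for phoneme in phonemes:
        if phoneme not in units:
            # No recording means no sound: the key limit of this method.
            raise KeyError(f"no recording for phoneme {phoneme!r}")
        samples.extend(units[phoneme])
    return samples
```

The same three snippets give "cat" in one order and "act" in another, but ask for a phoneme that was never recorded and the synthesizer has nothing to say.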

Formant

If you consider that speech is just a pattern of sound that varies in pitch (frequency) and volume (amplitude)—like the noise coming out of a musical instrument—it ought to be possible to make an electronic device that can generate whatever speech sounds it needs from scratch, like a music synthesizer. This type of speech synthesis is known as formant, because formants are the 3–5 key (resonant) frequencies of sound that the human vocal apparatus generates and combines to make the sound of speech or singing.
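Here's a very crude sketch of the idea: build a vowel-like sound by adding together sine waves at three formant frequencies. A real formant synthesizer is considerably more sophisticated than this (it shapes a buzzing source with resonant filters rather than simply adding tones), so treat this as a toy. The frequencies and loudness values for an "ah" sound are rough figures I've assumed for illustration, and the function names are hypothetical.

```python
import math
import struct
import wave

SAMPLE_RATE = 16000  # samples per second

# Assumed (frequency in Hz, relative loudness) pairs for an "ah" vowel.
AH_FORMANTS = [(730, 1.0), (1090, 0.5), (2440, 0.25)]


def formant_tone(formants, seconds=0.5, sample_rate=SAMPLE_RATE):
    """Add sine waves at the formant frequencies, scaled to peak at 1.0."""
    count = int(seconds * sample_rate)
    samples = []
    for n in range(count):
        t = n / sample_rate
        samples.append(sum(amp * math.sin(2 * math.pi * freq * t)
                           for freq, amp in formants))
    peak = max((abs(s) for s in samples), default=0.0) or 1.0
    return [s / peak for s in samples]


def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Save samples in the range -1.0 to 1.0 as a 16-bit mono WAV file."""
    frames = b"".join(struct.pack("<h", int(s * 32767)) for s in samples)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(frames)


# Example use: write_wav("ah.wav", formant_tone(AH_FORMANTS))
```

Because every sound is computed from numbers rather than played back from a recording, changing the voice is just a matter of changing the frequencies fed in.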

Unlike speech synthesizers that use concatenation, which are limited to rearranging prerecorded sounds, formant speech synthesizers can say absolutely anything—even words that don't exist or foreign words they've never encountered. That makes formant synthesizers a good choice for GPS satellite (navigation) computers, which need to be able to read out many thousands of different (and often unusual) place names that would be hard to memorize. In theory, formant synthesizers can easily switch from a male to a female voice (by roughly doubling the frequency) or to a child's voice (by trebling it), and they can speak in any language. In practice, concatenation synthesizers now use huge libraries of sounds so they can say pretty much anything too. A more obvious difference is that concatenation synthesizers sound much more natural than formant ones, which still tend to sound relatively artificial and robotic. [2]

The difference between concatenative and formant speech synthesis

Artwork: Concatenative versus formant speech synthesis. Left: A concatenative synthesizer builds up speech from pre-stored fragments; the words it speaks are limited rearrangements of those sounds. Right: Like a music synthesizer, a formant synthesizer uses frequency generators to generate any kind of sound.

Articulatory

The most complex approach to generating sounds is called articulatory synthesis, and it means making computers speak by modeling the amazingly intricate human vocal apparatus. In theory, that should give the most realistic and humanlike voice of all three methods. Although numerous researchers have experimented with mimicking the human voicebox, articulatory synthesis is still by far the least explored method, largely because of its complexity. The most elaborate form of articulatory synthesis would be to engineer a "talking head" robot with a moving mouth that produces sound in a similar way to a person by combining mechanical, electrical, and electronic components, as necessary. [3]

What are speech synthesizers used for?

Work your way through a typical day and you might encounter all kinds of recorded voices, but as technology advances, it's getting harder to figure out whether you're listening to a simple recording or a speech synthesizer.

You might have an alarm clock that wakes you up by speaking the time, probably using crude, formant speech synthesis. If you have a speaking GPS system in your car, that might use either concatenated speech synthesis (if it has only a relatively limited vocabulary) or formant synthesis (if the voice is adjustable and it can read place names). If you have an ebook reader, maybe you have one with a built-in narrator? If you're visually impaired, you might use a screen reader that speaks the words aloud from your computer screen (most modern Windows computers have a program called Narrator that you can switch on to do just this). Whether you use it or not, it's likely your cellphone has the ability to listen to your questions and reply through an intelligent personal assistant: Siri (iPhone), Cortana (Microsoft), or Google Assistant/Now (Android). If you're out and about on public transportation, you'll hear recorded voices all the time speaking safety and security announcements or telling you what trains and buses are coming along next. Are they simple recordings of humans... or are they using concatenated, synthesized speech? See if you can figure it out! One really interesting use of speech synthesis is in teaching foreign languages. Speech synthesizers are now so realistic that they're good enough for language students to use by way of practice.

The radio announcer's booth at a rodeo

Photo: Will humans still speak to one another in the future? All sorts of public announcements are now made by recorded or synthesized computer-controlled voices, but there are plenty of areas where even the smartest machines would fear to tread. Imagine a computer trying to commentate on a fast-moving sports event, such as a rodeo, for example. Even if it could watch and correctly interpret the action, and even if it had all the right words to speak, could it really convey the right kind of emotion? Photo by Carol M. Highsmith, courtesy of Gates Frontiers Fund Wyoming Collection within the Carol M. Highsmith Archive, Library of Congress, Prints and Photographs Division.

Who invented speech synthesis?

Talking computers sound like something out of science fiction—and indeed, the most famous example of speech synthesis is exactly that. In Stanley Kubrick's groundbreaking movie 2001: A Space Odyssey (based on the novel by Arthur C. Clarke), a computer called HAL famously chatters away in a humanlike voice and, at the end of the story, breaks into a doleful rendition of the song Daisy Bell (A Bicycle Built for Two) as an astronaut takes it apart.

Clip-art style illustration of Texas Instruments Speak & Spell (TM)

Artwork: Speak & Spell—An iconic, electronic toy from Texas Instruments that introduced a whole generation of children to speech synthesis in the late 1970s. It was built around the TI TMC0281 chip.

Here's a whistle-stop tour through the history of speech synthesis:

1769: Austro-Hungarian inventor Wolfgang von Kempelen develops one of the world's first mechanical speaking machines, which uses bellows and bagpipe components to produce crude noises similar to a human voice. It's an early example of articulatory speech synthesis.
1770s: Around the same time, Danish scientist Christian Kratzenstein, working in Russia, builds a mechanical version of the human vocal system, using modified organ pipes, that can speak the five vowels. In 1791, von Kempelen writes a book on the subject titled Mechanismus der menschlichen Sprache nebst Beschreibung einer sprechenden Maschine (Mechanism of Human Language with a Description of a Speaking Machine).
1837: English physicist and prolific inventor Charles Wheatstone, long fascinated by musical instruments and sound, rediscovers and popularizes an improved version of the von Kempelen speaking machine.
1928: Working at Bell Laboratories, American scientist Homer W. Dudley develops an electronic speech analyzer called the Vocoder (not to be confused with the famous voice-altering Vocoder used in many electronic pop records in the 1970s). Dudley develops the Vocoder into the Voder, an electronic speech synthesizer operated through a keyboard. A writer from The New York Times sees the device demonstrated at the 1939 World's Fair and declares "My God it talks!" Follow the link to the Bell website to hear a sample of Voder saying "Greetings everybody!"
1940s: Another American scientist, Franklin S. Cooper of Haskins Laboratories, develops a system called Pattern Playback that can generate speech sounds from their frequency spectrum.
1953: American scientist Walter Lawrence makes PAT (Parametric Artificial Talker), the first formant synthesizer, which makes speech sounds by combining four, six, and later eight formant frequencies.
1958: MIT scientist George Rosen develops a pioneering articulatory synthesizer called DAVO (Dynamic Analog of the VOcal tract).
1960s/1970s: Back at Bell Laboratories, Cecil Coker works on better methods of articulatory synthesis, while Joseph P. Olive develops concatenative synthesis.
1978: Texas Instruments releases its TMC0281 speech synthesizer chip and launches a handheld electronic toy called Speak & Spell, which uses crude formant speech synthesis as a teaching aid.
1984: Apple Macintosh computer ships with built-in MacInTalk speech synthesizer, widely used in popular songs such as Radiohead's Fitter Happier and Paranoid Android.
2001: AT&T introduces Natural Voices, a natural-sounding concatenative speech synthesizer based on a huge database of sound samples recorded from real people. The system is widely used in online applications, such as websites that can read emails aloud.
2011: Apple adds Siri, a voice-powered "intelligent agent," to its iPhone (smartphone).
2014: Microsoft announces Skype Translator, which can automatically translate a spoken conversation from one language into one of 40 others. The same year, Microsoft demonstrates Cortana, its own version of Siri.
2015: Amazon Echo, a personal assistant featuring voice software called Alexa, goes on general release.
2016: Google joins the club by releasing Google Assistant, its answer to Siri and Cortana, later incorporating it into Google Home.

Experiment for yourself!

Why not experience a bit of speech synthesis for yourself? Here are three examples of what the first sentence of this article sounds like read out by Microsoft Sam (a formant speech synthesizer built into Windows XP), Microsoft Anna (a more natural-sounding, formant synthesizer in Windows Vista and Windows 7), and Olivia (one of the voices in IBM's Watson Text-to-Speech deep neural network synthesizer). The first recording uses state-of-the-art technology from the early 2000s; the second dates from about 2005–2010; and I made the third recording a decade later in 2021. Notice how much the technology has improved in two decades!

Sam (c. 2001)

Anna (c. 2005)

Olivia (c. 2020)

If you have a modern computer (Windows or Mac), it's almost certainly got a speech synthesizer lurking in it somewhere:

Windows: The built-in text-to-speech program is called Narrator.
Mac: You'll need VoiceOver or on older Macs you could try using PlainTalk.
Linux: Experimental programs you can install include eSpeak, which is based on formant synthesis.
Web: There are various web-based synthesizers you can play with using any operating system, including Natural Reader, the Java-based FreeTTS, and a Firefox addon called Text to Speech. And don't forget IBM Watson Text-to-Speech, which is cloud-based.

Find out more

Articles

Brain Implant Can Say What You're Thinking by Megan Scudellari, IEEE Spectrum, April 24, 2019. A wonderful new piece of research could help brain-injured patients to get their voices back.
Chip Hall of Fame: Texas Instruments TMC0281 Speech Synthesizer: IEEE Spectrum, June 30, 2017. Celebrating the world's first speech synthesizer chip, released in 1978.
Forget Siri: Here's a New Way for Robots to Talk by Angelica Lim. IEEE Spectrum, October 29, 2015. It's not what you say but how you say it; new research explores how robots with personality can have more natural conversations.
Siri and Cortana Sound Like Ladies Because of Sexism by Joao Medeiros. Wired, October 2015. Why do most speech synthesizers have female voices?
How Intel Gave Stephen Hawking a Voice by Joao Medeiros. Wired, January 13, 2015. Explores the technical challenges of updating voice software for the famous physics professor.
Skype's Real-Time Translator Previews English and Spanish by Jeremy Hsu. IEEE Spectrum. December 17, 2014. A brief look at Skype Translator, which converts spoken conversation in one language into another in real-time.
Speech Synthesizer Could 'Resurrect' Dead Singers by Rachel Kaufman. Wired, December 20, 2011. Could Vocaloid software bring classic singers back from the dead?
New voice for film critic Roger Ebert by Hayley Millar, BBC News, March 2010. How a Scottish company "rebuilt an actual voice" from recordings.
Sultan of sound by Tekla S. Perry. IEEE Spectrum, May 2, 2005. Exploring the scientific contributions of James L. Flanagan to the worlds of speech recognition and synthesis.
The Quest For The Digital Chatterbox by Arik Hesseldahl, Forbes, March 2004. An overview of current speech synthesis applications.
The Mathematics of Artificial Speech by Alan Burdick, Discover magazine, January 2003. An introduction to the maths behind AT&T's Natural Voices.
Making Computers Talk by Andy Aaron, Ellen Eide and John F. Pitrelli, Scientific American, March 17, 2003. Describes IBM's efforts to develop concatenative speech synthesis.

Technical papers

A Short Introduction to Text-to-Speech Synthesis by Thierry Dutoit, TTS research team, TCTS Lab, Belgium.
A history of speech synthesis: A fascinating introduction to early mechanical speech synthesizers by Hartmut Traunmüller, Institute för lingvistik, Stockholms Universitet.
Review of Speech Synthesis Technology by Sami Lemmetty. A 1999 master's thesis produced at Helsinki University of Technology that reviews speech synthesis, including the history, theory, and applications. A good overview and introduction, though some of the information may now be dated.
System for the artificial production of vocal or other sounds by Homer W. Dudley, Bell Telephone Laboratories, US patent 2,121,142. June 21, 1938. Dudley's original patent for the Vocoder and Voder makes for interesting reading!
Speech Synthesis Bibliography Compiled by Joaquim Llisterri, Departament de Filologia Espanyola, Universitat Autònoma de Barcelona.

Books

Dutoit, Thierry. An Introduction to Text-to-Speech Synthesis. Springer Science & Business Media, 2013.
Taylor, Paul. Text-to-Speech Synthesis. Cambridge, England: Cambridge University Press, 2009.
Raphael, Lawrence J. et al. Speech Science Primer: Physiology, Acoustics, and Perception of Speech. Philadelphia, PA: Lippincott Williams & Wilkins, 2007.
Olive, Joseph P. "The Talking Computer: Text to Speech Synthesis." In David Stork (ed): HAL's Legacy: 2001's Computer as Dream and Reality. Cambridge, MA: MIT Press, 1998.
Van Santen, Jan P. H. et al (ed). Progress in Speech Synthesis: Volume 1. Berlin: Springer, 1997.

Notes and references

[1] Pre-processing is described in more detail in "Chapter 7: Speech synthesis from textual or conceptual input" of Speech Synthesis and Recognition by Wendy Holmes, Taylor & Francis, 2002, p.93ff.
[2] For more on concatenative synthesis, see Chapter 14 ("Synthesis by concatenation and signal-process modification") of Text-to-Speech Synthesis by Paul Taylor. Cambridge University Press, 2009, p.412ff.
[3] For a much more detailed explanation of the difference between formant, concatenative, and articulatory synthesis, see Chapter 2 ("Low-level synthesizers: current status") of Developments in Speech Synthesis by Mark Tatham, Katherine Morton, Wiley, 2005, p.23–37.


Cite this page

Woodford, Chris. (2011/2021) Speech synthesizers. Retrieved from https://www.explainthatstuff.com/how-speech-synthesis-works.html. [Accessed (Insert date here)]

Bibtex

@misc{woodford_speech_synthesis, author = "Woodford, Chris", title = "Speech synthesizers", publisher = "Explain that Stuff", year = "2011", url = "https://www.explainthatstuff.com/how-speech-synthesis-works.html", urldate = "2023-03-25" }
