ISMLA Bulletin — June 2021

The Rise of the Machines

Author: George Van den Bergh (thisislanguage.com)

If you would like a linguistic laugh you should head over to https://ttsmp3.com/ and paste in any poem in a language of your choice. I chose “Déjeuner du matin” not because I have always found it infinitely depressing but because I felt that its lack of punctuation would brighten up an unspeakably drab April morning. Give it a go now; you might end up crying into your coffee but hopefully they will be tears of laughter rather than anything else. (French/Léa gives a particularly chipper rendition of the final stanza.)

For those of you unfamiliar with it, TTS (text to speech) is a technology which uses “advanced deep learning technologies to synthesize natural sounding human speech” from text (source: Amazon Web Services). What this means is that any text you plug in can be read back to you in a range of accents and languages. The two biggest engines humming to this tune are Amazon’s Polly and Google’s Cloud Text-to-Speech. It is incredibly sophisticated software that delivers impressive (if sometimes eerie) results, and it is already used by many companies around the world to deliver faster (and cheaper) customer service triage and to make websites more accessible.

So why does all this make me feel uneasy? After all, I’m the CEO of a technology company. Shouldn’t I be all in favour of these kinds of advances? And think of all the money we’d save as a company not having to hire native speakers to record audio for our product…

First and foremost, I’m a languages teacher (French and Spanish at Rugby School, 2004-6 – a big salut to everyone still teaching there who showed me the ropes) and I suppose what this means is that the use of synthesised human voices for the purposes of demonstrating native speech is anathema to me as – I imagine – it is to most of you reading this article. No matter how close the underlying software gets to reproducing the human voice, it will never be the human voice. Why, therefore, have we seen a rise in the use of such computer-generated audio in numerous language-teaching platforms? The rise is even more baffling when you consider that some of these platforms are paid-for resources used by UK schools, where individual words and sometimes even whole phrases are stitched together by a computer network and held up as valuable preparation for examinations. We language teachers have a tough enough job as it is trying to create immersive environments for our students given the strictures of timetabling. Why would we waste precious time showcasing synthetic voices instead of the real thing, a practice which could cause long-lasting damage to the “ear” of our students?

I’m not sure I know the answer. Expediency and cost-cutting certainly seem to be the key from the vendors’ side of things. After all, if language companies can simply plug into a third-party application to generate – in seconds – the audio files for vocabulary lists and phrase books, why wouldn’t they do away with the need to hire actual people to record those texts for them? “A fiddly business, this language-teaching thing is, isn’t it?” you can imagine them saying. “I’m sure none of our clients will notice.”

The thing is, at thisislanguage.com we love natural speech. We have a hunger for the human voice, go gaga for the glottal stop and are the cheerleaders of chipper chat. We wax lyrical about the quirks of native-speaker audio and the subtle rhythms and emphases which make our voices human. This, after all, is what we should be preparing our students for. And if we have to pay to have 10,000+ words recorded by real humans to use in our Vocabulary Trainer, so be it.

There is, of course, a place for the use of AI in second-language acquisition and that is at the receptive end of the spectrum, where deep-learning neural network algorithms can receive language produced by the student and go a long way towards deciphering it, grading it, and encouraging the student while showing them where they went wrong. In fact, we’ve already started testing out this technology to see if it might be of use to our clients.

Perhaps I have grown long in the tooth or am being a stick-in-the-mud. I am reminded of T.S. Eliot’s “I grow old … I grow old … I shall wear the bottoms of my trousers rolled.” (Try putting that into ttsmp3.com: British English/Brian is haunting.) Does any of this matter when even my father seems to chat merrily away to Alexa, and the train station at Oxford – for I don’t know how many years now – has been using a computer-generated voice to remind me, rather anachronistically, that smoking is not permitted inside the station?

Well, I cannot speak for teachers since I am now only in the classroom vicariously. Perhaps, at the very least, vendors should be required to label audio that has been computer-generated, so that teachers and students are made aware of it? But in truth, I feel very deeply that technology companies that aim to teach languages should always try to use real, human-recorded audio for their modelling, whatever the cost. And for those who don’t, je prends ma tête dans ma main, et je pleure…