
Thoughts on voice interfaces

Talking to inanimate objects used to mean a person needed professional help. Sometimes we say nasty things to a leaking washing machine or a TV showing a football match, but we never talk to a fridge or a hairdryer. Things are slowly changing with phones and computers, but we’re still using them to talk to other people, not to the devices themselves. What happens when there isn’t a person on the other side? And why aren’t we doing it more?

Slow adoption

Despite speech recognition being available in consumer electronics for a long time, it hasn’t been used much in everyday situations. I think there are a few reasons:

  1. Technology wasn’t good enough
    I don’t remember a single piece of everyday technology that got it right. For example, I tried it out on Windows Mobile a while back. There were pre-set voice commands, but they were very limited, and if my pronunciation wasn’t perfect or if I said a non-English name, I hit a wall. To be fair, it’s a very tough problem to solve on a computer, even though humans understand each other without problems. But we’ve seen some good progress recently. The ubiquity of Internet access on many devices and powerful software running in the cloud have really made a difference in the last year or two.

  2. Foreign languages
    If people don’t speak English or some other major language, they are out of luck. Support for non-dominant languages is still not very functional.

  3. Social norms
    Many notes, commands and searches are awkward when said out loud on public transit or while waiting in a queue, even if they would be perfectly normal in the proper context. And there are some places, like classrooms, where talking is frowned upon.

  4. “Isn’t that from a sci-fi movie?”
    Believe it or not, there are people who don’t know speech recognition is functional and available on their smartphones today. It wasn’t until recently, and nobody told them something had changed. If people don’t know something is possible, they won’t try it.


I think there are many areas where voice will be the optimal input method and that will push adoption and quality.

A kid playing on an iPad.

  1. Young generations
    Kids start to talk before they can write, so this could be their second way of interacting with technology after touch. “Play cartoons with cute bears” is something a four-year-old can say, but not write. They will carry those interactions into adulthood, and talking to devices won’t feel so awkward to them.

  2. Speed
    Unless a person is typing all day for a living, speech is a faster input method than a keyboard. When we get to a negligible number of speech recognition errors, it could replace a lot of typing. Just think of transcribing interviews, meeting notes, medical examinations or chatting in real time with someone who has to receive text. Benefits like that will shift a lot of usage to voice.

  3. Accessibility
    It will be a game-changer for visually impaired people or those with physical disabilities (e.g. tremors or repetitive stress injuries). There are many others who will find it useful, of course.

  4. Context
    Driving, playing games and cooking are some activities where hands are busy and voice is free to ask and command.

A perfect example

Google Translate Android application.

I received a new version of my health insurance contract last week. It was a couple of pages in German and I could get the gist, but there were a lot of unknown words, and those tend to be important in contracts, especially in the fine print. I opened the Google Translate app and selected voice input. Every time I stumbled upon an unknown word, I would say it in German and get the English word spoken back. It was so fast that I was able to read the whole document in a matter of minutes, and it was almost like having a native speaker by my side.

Who knows where it will take us in a couple of years, but I’m sure it will be interesting.
