Inspired by the omnipresent AIs of sci-fi and emboldened by Voice on the Web, a talk by Matt Buck, I experimented with building voice interfaces for the WordPress.com API with Alexa. This is what I learned from that experience.
What Is So Interesting About Voice Interfaces?
Natural language interfaces would allow complex systems to be accessible to everyone.
– James F. Allen, from Natural Language Understanding
They improve accessibility and remove limitations inherent to GUIs by being ‘intent-first’.
Right now, computer interfaces present the user with a bunch of buttons, and the user needs to guess which one matches their intent. The user is doing all the interpreting. The user is training themselves to use the interface; the interface doesn’t change to suit them. This restricts the interface significantly: everything it can do has a one-to-one correlation with some behavior determined by developers.
At the same time, the world that Jobs built with the GUI is reaching its natural limits. Our immensely powerful onscreen interfaces require every imaginable feature to be hand-coded, to have an icon or menu option. Think about Photoshop or Excel: Both are so massively capable that using them properly requires bushwhacking through a dense jungle of keyboard shortcuts, menu trees, and impossible-to-find toolbars.
– David Pierce, We’re on the Brink of a Revolution in Crazy-Smart Digital Assistants via Wired
Voice interfaces, however, will be intent-first. They will be more conversational, so they will be able to ask users for clarification. The user and the interface will split the work of interpretation, of matching behavior with intent.
This means that users won’t have to be trained on interfaces; the voice interface will work with them. This seems immediately useful for the older crowd.
But it also removes restrictions on the interface. Since voice interfaces won’t need one-to-one correlations between behaviors and buttons, they will be able to handle complex, non-analog behavior.
The revolution in voice interfaces is right around the corner.
Voice interfaces have been around for decades, but they have only recently become mainstream. Siri was released just five years ago, and in that short time voice interfaces have improved by leaps and bounds.
Most importantly for a platform developer like myself, they have recently become abstractable, that is, usable in my own code. Voice interface APIs are now available to use in many different contexts. The goal of these APIs is that developers like me can use them without having to understand their inner workings.
The potential of voice APIs for users is that their voice interfaces will now have the same ever-increasing capabilities as their devices. New apps will be made daily for the voice interface.
Alexa is a voice assistant similar to Siri and Google Voice. It allows developers to write their own applications that work with it. Siri has this ability too; however, it restricts developers to a narrow, standardized set of commands. Additionally, Alexa comes packaged in the Amazon Echo Dot, a really cheap ($50!) hardware device.
Alexa also has documentation from Amazon and lots of code samples for connecting it with Amazon’s suite of cloud-based services. It also has several testing interfaces, one for each stage of the application, and they work well together.
Armed with an Echo Dot, AWS Lambda, and a Code Sample, I Built My First Skill
The skill I built receives a domain from the user and, if that blog is available through the WordPress.com API, returns the title of its latest post.
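To give a sense of the WordPress.com side, here is a minimal sketch of the kind of query the skill makes, written in Python with the requests library (the helper name and error handling are my own; the actual skill runs inside the Lambda script described below):

```python
import requests

def latest_post_title(domain):
    """Ask the WordPress.com REST API for the newest post on a site."""
    url = "https://public-api.wordpress.com/rest/v1.1/sites/{}/posts/".format(domain)
    resp = requests.get(url, params={"number": 1})  # only the most recent post
    resp.raise_for_status()
    posts = resp.json().get("posts", [])
    return posts[0]["title"] if posts else None

print(latest_post_title("deeptiboddapati.com"))
```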
Initially, I thought that the Alexa part would be easy; I was more worried about querying the REST API for the blog. Once I saw how easy it was to pull up a blog with the REST API, the whole thing seemed very achievable. This is how I originally thought Alexa worked:
Since Alexa was a black box, I figured that I was in the clear after I got my Lambda script to work successfully with the text-based Alexa Development Tester.
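For context, that Lambda script is shaped roughly like the sketch below. The intent and slot names are my own stand-ins, it reuses the latest_post_title helper from the earlier sketch, and the real skill has more error handling:

```python
def lambda_handler(event, context):
    """Entry point Alexa calls. Note that it hands us parsed text, never audio."""
    request = event["request"]
    if request["type"] == "IntentRequest" and request["intent"]["name"] == "GetLatestPostIntent":
        domain = request["intent"]["slots"]["Domain"]["value"]  # e.g. "deeptiboddapati.com"
        title = latest_post_title(domain)
        speech = "The latest post on {} is {}.".format(domain, title)
    else:
        speech = "Ask me to read a post from a blog."
    # Standard Alexa Skills Kit response: plain-text speech, then end the session.
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech},
            "shouldEndSession": True,
        },
    }
```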
However, I was in for a shock. Though the code worked just fine with text commands, it yielded comical results when I used voice commands. When I was saying this:
read a post from deeptiboddapati.com
get me a post from tommcfarlin.com
She was hearing:
read a post from d. t. butter potty dot com
get me a post from todd mcfarlane dot com
Butter potty…Ouch, reminds me of third grade.
Why was it working so well with the text-based interface and so badly with the voice-based one?
My initial assumption about Alexa as a black box was wrong. The text processing and the voice processing can’t be lumped together. This is how she really works:
Though I was visualizing Alexa receiving a bunch of text for input, this is what it actually receives:
These waveforms are turned into text by the Automatic Speech Recognition (ASR) system, as shown in the diagram. I have little control over the ASR system. Since the expected input definitions are text, they get sent to the Natural Language Understanding (NLU) system, not the ASR system.
This is why the text-based Alexa Development Tester gave the all clear even though the voice interface didn’t work.
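To make that concrete: by the time my Lambda script sees a request, the slot value is already the ASR’s best textual guess. Here is a rough sketch of the relevant part of an incoming event, using my stand-in intent and slot names and a value from my own tests:

```python
# The slot already contains text, not audio. Nothing in my code (or in the
# expected input definitions the NLU uses) can "re-hear" what the user said.
event = {
    "request": {
        "type": "IntentRequest",
        "intent": {
            "name": "GetLatestPostIntent",
            "slots": {
                "Domain": {
                    "name": "Domain",
                    "value": "d. t. butter potty dot com",  # ASR's guess at "deeptiboddapati.com"
                }
            },
        },
    }
}
```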
So what would I say to any other platform developer working on a Voice Interface integration now?
Keep the limitations of the Speech Recognition system in mind.
The speech recognition system relies on probability to guess what is being said. It’s easy to be accurate with common words, but long, unusual strings like domain names are very difficult. People’s names are likewise hard, especially if they aren’t English in origin; Indian names like mine prove very difficult.
So, for now, developers building apps for voice interfaces need to make them more specific and specialized. When I modified my skill to pull from a select number of blogs and paired each domain with an easier, short-form version, it worked great. If you want a voice interface to work for new things, you need to do some work narrowing down the context. By limiting the possible blogs to a handful, I made the interface usable and useful.
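In practice the fix was just a small lookup table: a few easy-to-say names paired with the full domains, with anything else rejected instead of guessed at. A sketch along those lines (the names and structure are illustrative, not the skill’s actual code):

```python
# A handful of easy-to-say names, each paired with the real domain the skill queries.
KNOWN_BLOGS = {
    "deepti": "deeptiboddapati.com",
    "tom": "tommcfarlin.com",
}

def resolve_blog(heard):
    """Map whatever Alexa heard onto one of the supported blogs, or None."""
    return KNOWN_BLOGS.get(heard.strip().lower())

print(resolve_blog("Deepti"))        # -> "deeptiboddapati.com"
print(resolve_blog("butter potty"))  # -> None: unknown names are rejected, not guessed at
```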
If you want to learn more about creating and submitting Alexa Skills, check out this great post: