Trying to create natural conversations in your Skill was very, very, VERY cumbersome. I would say humans don’t think straight (luckily, as this makes us way more interesting!) and we certainly aren’t comfortable following scripts unless we are call center operators armed with the appropriate tools. And we are genius synonym finders, as long as we’re not writing an essay; then we get writer’s block! Defining the utterances in your Skill was a combinatorial exercise with an overwhelming number of combinations. I guess every developer had a strategy to deal with it. For me it was a relational database containing the possible values for expressions, plus some SQL queries that would generate all the possible permutations and populate the required JSON files on the fly, ready to paste into the Amazon Developer console.
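To give a flavour of that exercise, here’s a rough sketch of such a permutation generator, written in Python with itertools instead of a database and SQL; the phrase fragments are made-up placeholders and the JSON shape is simplified compared to the real interaction model.

```python
from itertools import product
import json

# Hypothetical building blocks; in my case these lived in a relational database.
openers    = ["I want", "I'd like", "Can I have", "Give me"]
quantities = ["a batch of", "two dozen", "24"]
treats     = ["cookies", "chocolate chip cookies", "biscuits"]

# Every combination of the fragments becomes a sample utterance.
utterances = [" ".join(parts) for parts in product(openers, quantities, treats)]

# Simplified stand-in for the interaction-model JSON fragment to paste
# into the developer console (the real schema has more fields).
print(json.dumps({"samples": utterances}, indent=2))
print(len(utterances), "sample utterances")  # 4 * 3 * 3 = 36
```

Four openers, three quantities and three nouns already give 36 sample utterances; add fillers, politeness markers and actual slot values, and the numbers explode very quickly.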
Keeping track of state in the Skill was also hard. Say we need to gather information to populate three slots. We can ask for slot 1, then slot 2, then slot 3. Or slot 1, then slot 3, then slot 2. Or, in this order, slots 2, 1, 3. Or 2, 3, 1. Or 3, 1, 2. Or 3, 2, 1. That’s six orderings. But users could also decide to give us the information for two slots unprompted, in a single utterance: 1 and 2 together, then 3; or 1 and 3, then 2; or 2 and 3, then 1. Or… Surely you get the idea: managing this by hand was too hard. The alternative was a fixed order in the dialog with the user, which could feel stiff to free spirits, or boring and robotic if the Skill is used frequently.
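To make the combinatorics concrete, here is a small, purely illustrative enumeration in Python: it lists every sequence of user turns that fills all the slots, allowing a single turn to provide more than one slot value at once.

```python
from itertools import combinations

def dialog_paths(slots):
    """Yield every sequence of user turns that eventually fills all slots,
    where a single turn may provide one or more slot values at once."""
    if not slots:
        yield []
        return
    for size in range(1, len(slots) + 1):
        for turn in combinations(slots, size):
            remaining = [s for s in slots if s not in turn]
            for tail in dialog_paths(remaining):
                yield [set(turn)] + tail

paths = list(dialog_paths([1, 2, 3]))
print(len(paths))      # 13 possible paths for three slots
for path in paths:
    print(path)        # e.g. [{2, 3}, {1}]
```

For three slots there are 13 distinct paths; for four slots there are already 75 (these are the ordered Bell numbers), which is exactly why tracking it all by hand stopped being fun.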
Dialog Management was quite a step change from the early days of Skill creation; nevertheless, the only way to leverage it was through code.
It’s no surprise I was very happy to read about the July 22nd announcement of Alexa Conversations. As the name suggests, it places the conversation at the core of the Skill creator experience (note I did not say “developer”). The definition of what a Skill is becomes clearer: a set of possible dialogs between Alexa and the user through which enough information is gathered for Alexa to provide some useful functionality to the user.
The cornerstone of Skill development is the creation of the possible dialogs, specified, well, like the dialog in a novel: in turns.
- User: I want cookies.
- Alexa: What type of cookie do you want?
- User: Round ones.
- Alexa: Do you like cinnamon?
- User: No, I hate it.
- Alexa: Then I will bake a batch of chocolate chip cookies. Are you very hungry?
- User: Oh yes.
- Alexa: Then I’ll bake 24 chocolate chip cookies.
You should create as many dialogs as there are different combinations of turns (hint: you don’t need to cover every possible ordering, though). Another possible dialog for the example above could be:
- User: I would like to have 24 chocolate chip cookies.
- Alexa: OK, I’ll bake 24 chocolate chip cookies.
Once you are happy with your dialogs, you need to annotate each turn: what role the turn plays in the conversation (is Alexa asking for parameters? Is the user providing information? Is the user confirming what Alexa is suggesting?), what other ways you expect the user to say the same thing (these are called Utterance sets), what other ways Alexa could say the same thing (so that the Skill doesn’t feel monotonous if it’s used frequently), etc.
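Purely as an illustration (this is plain data of my own making, not the actual Alexa Conversations annotation format), the annotations on two of the turns above might capture information like this:

```python
# Alexa's turn: asking for a parameter, with response variations
# so the Skill doesn't sound monotonous.
request_cookie_type = {
    "dialog_act": "Request",
    "slot": "cookieType",
    "alexa_prompts": [
        "What type of cookie do you want?",
        "Which kind of cookie should I bake?",
    ],
}

# The user's turn: providing information, with an utterance set listing
# other ways the same thing could be said.
inform_cookie_type = {
    "dialog_act": "Inform",
    "utterance_set": [
        "round ones",
        "I want {cookieType}",
        "make them {cookieType} please",
    ],
}
```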
The second key concept is that of an API definition. The API is the provider of the service rendered by Alexa and it will be implemented programmatically. The API definition is just its representation: what parameters it takes as input, what kind of data it returns as output, and so on.
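In the cookie example, the API definition boils down to a contract along these lines (the names and types are my own invention to match the dialog above):

```python
from typing import TypedDict

class BakeCookiesResult(TypedDict):
    cookieType: str   # what will actually be baked
    quantity: int     # how many cookies

# Only the contract matters at this stage: the name, the input parameters
# and the shape of the output. The body gets implemented separately.
def bake_cookies(cookieType: str, likesCinnamon: bool, veryHungry: bool) -> BakeCookiesResult:
    ...
```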
To build the Skill, Amazon will use its computational power to “train the model”. This is the creation of a program that takes into account all the possible combinations of how a conversation could happen: both the dialog sequences and the language elements (vocabulary, synonyms, grammatically equivalent sentences, etc.). Here’s a comprehensive explanation of the AI techniques involved. Tens of thousands of permutations are typically created, which greatly increases the probability of Alexa reacting correctly to user input.
So the bulk of the work shifts from development to a friendly user interface where the designer can focus on the voice experience rather than excessive technicalities.
Development of the API itself becomes a simple business of receiving parameters, probably invoking a mature API that is already being used by other clients (web, apps…), and passing the result back to the Skill.
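A minimal sketch of what that implementation could look like, assuming a hypothetical bakery backend that already exists (the URL and the payload shape are made up):

```python
import json
import urllib.request

BAKERY_ENDPOINT = "https://example.com/api/bake"  # hypothetical existing backend

def bake_cookies(cookieType: str, likesCinnamon: bool, veryHungry: bool) -> dict:
    """Receive the parameters gathered by the conversation, call the
    existing backend, and hand its answer back to the Skill."""
    payload = json.dumps({
        "type": cookieType,
        "cinnamon": likesCinnamon,
        "quantity": 24 if veryHungry else 12,
    }).encode("utf-8")
    request = urllib.request.Request(
        BAKERY_ENDPOINT,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

The Skill then uses whatever comes back to phrase Alexa’s confirmation (“Then I’ll bake 24 chocolate chip cookies”).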
If you’re a one-woman band (my case), you greatly appreciate the speed with which you can bring your concept to reality. If you’re a company, you can better leverage the skills of your associates by having dedicated roles. Psychologists, marketeers, product owners, almost anyone who is web savvy and a subject matter expert can work on the dialog curation part of building a Skill.