Alexa Conversations is here!

Photo by https://www.pexels.com/@jopwell

Trying to create natural conversations in your skill used to be very, very, VERY cumbersome. Humans don’t think in straight lines (luckily, as this makes us way more interesting!) and we certainly aren’t comfortable following scripts, unless we are call center operators armed with the appropriate tools. We are also genius synonym finders, as long as we’re not writing an essay: then we get writer’s block! Defining the utterances in your skill was a combinatorial exercise, and the sheer number of combinations was daunting. I guess every developer had a strategy to deal with it. Mine was a relational database containing the possible values for expressions, plus some SQL queries that generated all the possible permutations and populated the required JSON files on the fly, which I would then paste into the Amazon Developer console.

Keeping track of state in the skill was also hard. Say we need to gather information to populate three slots. We can ask for slot 1, then slot 2, then slot 3. Or we could ask for slot 1, then slot 3, then slot 2. Or, in this order, slots 2, 1, 3. Or slots 2, 3, 1. Or slots 3, 1, 2. Or slots 3, 2, 1. That’s six combinations. But users could also decide to give us the information for two slots unprompted, in a single utterance. So it could be 1 and 2 together, then 3; or 1 and 3, then 2; or 2 and 3, then 1. Or… Surely you get the idea: managing this by hand was too hard. The alternative was a fixed order in the dialog with the user, which could feel stiff for the free spirits, or boring and robotic if the skill is used frequently.
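The blow-up is easy to quantify with a few lines of plain Python (nothing Alexa-specific here):

```python
from itertools import permutations
from math import comb

def one_at_a_time(n):
    """Number of orders in which a user can fill n slots, one value per turn."""
    return len(list(permutations(range(n))))

def any_grouping(n):
    """Number of orders when the user may also bundle several values into a
    single utterance: the ordered set partitions (Fubini numbers)."""
    if n == 0:
        return 1
    # Choose which k slots arrive in the first utterance, then recurse.
    return sum(comb(n, k) * any_grouping(n - k) for k in range(1, n + 1))

print(one_at_a_time(3))  # 6 orders, as enumerated above
print(any_grouping(3))   # 13 once multi-slot utterances are allowed
```

With three slots there are already 13 paths to handle by hand; with four, 75. This is exactly the state-tracking burden the rest of the section describes.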

Dialog Management was quite a step change from the early days of Skill creation; nevertheless, the only way to leverage it was through code.

It’s no surprise I was very happy to read about the July 22nd announcement of Alexa Conversations. As the name suggests, it places the conversation at the core of the Skill creator experience (note I did not say “developer”). The definition of what a Skill is gets clearer: a set of possible dialogs between Alexa and the user whereby enough information is gathered, then Alexa provides some useful functionality to the user.

The cornerstone of Skills development is the creation of possible dialogs, specified, well, as a dialog in a novel: in turns.

  • User: I want cookies.
  • Alexa: What type of cookie do you want?
  • User: Round ones.
  • Alexa: Do you like cinnamon?
  • User: No, I hate it.
  • Alexa: Then I will bake a batch of chocolate chip cookies. Are you very hungry?
  • User: Oh yes.
  • Alexa: Then I’ll bake 24 chocolate chip cookies.

You should create as many dialogs as there are different combinations of turns (hint: you don’t need to cover every possible order, though). Another possible dialog for the example above could be:

  • User: I would like to have 24 chocolate chip cookies.
  • Alexa: OK, I’ll bake 24 chocolate chip cookies.

Once you are happy with your dialogs, you need to annotate each turn: what role in the conversation does the turn have (is Alexa asking for parameters? Is the user providing information? Is the user confirming what Alexa is suggesting?), what are other ways that you expect the user to say the same thing (these are called Utterance sets), what are other ways for Alexa to say the same thing (so that the Skill doesn’t feel monotonous if it’s used frequently), etc.

The second key concept is that of an API definition. The API is the provider of the service rendered by Alexa and it will be implemented programmatically. The API definition is just its representation: what parameters it takes as input, what kind of data it returns as output, and so on.

To build the Skill, Amazon will use its computational power to “train the model”. This is the creation of a program that takes into account all the possible combinations of how a conversation could happen: both the dialog sequences and the language elements (vocabulary, synonyms, grammatically equivalent sentences, etc.). Here’s a comprehensive explanation of the AI techniques involved. Tens of thousands of permutations are typically generated, and this greatly increases the probability of Alexa reacting correctly to user input.

So the bulk of the work shifts from development to a friendly user interface where the designer can focus on the voice experience rather than excessive technicalities.

Development of the API itself becomes a simple business of receiving parameters, probably invoking a mature API that is already being used by other clients (web, apps…), and passing the result back to the Skill.
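As a sketch of how thin that layer can be, here is a hypothetical handler for the cookie dialog above. The function name, arguments, and returned fields are my own invention for illustration, not part of any Alexa API:

```python
def bake_cookies_api(cookie_type, quantity):
    """Hypothetical API handler: receives the arguments the conversation
    gathered and returns structured data the Skill can speak back."""
    # In production this would likely delegate to the same backend your
    # web or mobile clients already use, then reshape the response.
    return {"type": cookie_type, "quantity": quantity, "status": "baking"}

# Alexa Conversations would invoke this with the values collected in the dialog:
print(bake_cookies_api("chocolate chip", 24))
```

The point is that the dialog model does the hard conversational work; the API is left with plain parameters in, plain data out.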

If you’re a one-woman band (my case), you greatly appreciate the speed with which you can bring your concept to reality. If you’re a company, you can better leverage the skills of your associates by having dedicated roles. Psychologists, marketeers, product owners: almost anyone who is web savvy and is a subject matter expert can work on the dialog curation part of building a Skill.

Willkommen! Amazon Echo and Alexa now speak “the Queen’s English”… and German!

German and British flags

In September 2016 (time flies…), Amazon announced that the Amazon Echo and therefore Amazon Alexa would be made available in the UK and in Germany.

One would think that this affects two geographic areas and only one language, but nothing could be further from the truth. Compared with making Alexa understand Geordie or Scouse, German sounds crystal clear.

So, from now on, there are three languages that you should consider when you define your skill: English (US), English (GB) and German.

It’s very important to realise that geography and language are different things, and you have to make decisions in both areas. For example, you can publish a Skill in Germany in English (US) and German, or you can decide that your Skill won’t apply to expats and publish it for Germany in German only. When you define the Interaction Model, you define as many models as there are languages you wish to implement. When you provide publishing information, you decide on the geography.
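To make the separation concrete, here is an illustrative configuration written as a Python dict. The field names are simplified, loosely inspired by a skill manifest, and not the exact format Amazon uses:

```python
# Language and geography are configured independently. This dict is purely
# illustrative (field names simplified; not the exact manifest format):
skill_config = {
    "interaction_models": ["de-DE", "en-US"],  # one Interaction Model per language
    "distribution_countries": ["DE"],          # where the Skill is available
}

# A Skill distributed only in Germany can still serve English-speaking expats:
print("en-US" in skill_config["interaction_models"])  # True
```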

 

In our next post we will solve the following riddles: What happens to my “functionality”? Do I need to create one version per language (hint: don’t do it!!!)? What are the implications of limiting my Skill to a certain geography? Then we will write a bit about predefined Slot types and their multilanguage implications.

Creating “The Functionality” Part 1: Introduction and “Existing Functionality”

In the post with the overall description of the magic formula for Skills, we broke them down into two parts: the Interaction Model and “the Functionality”. My usage of quotation marks is not just for fun. In the documentation provided by Amazon, the interaction model is mentioned by name all the time. The other part, not so much. So I decided to coin the term myself. Any kind readers with a better suggestion, please leave it in the comments!

So, it’s now time to discuss “the Functionality”. This is what I’ve already said about the matter:

“The functionality” can be an application that already exists (Fitbit, Uber, etc. were happy systems with millions of users before Alexa was invented), or one you make now to be used specifically with Alexa. In the first case, the developers for that existing system will have to develop an interface that uses the AVS API. Well, actually, a product manager will have to identify the functionality that will be used via Alexa, then the developers will encapsulate that functionality in a way that can be exposed to the AVS API so that Alexa can use it. In some cases the developer and the product manager are the same person!

If you’re creating a Skill from scratch, then Amazon recommends that you build and host “the functionality” with Amazon Web Services and they suggest you do it as a Lambda Function. We’ll speak a lot about this soon, stay tuned!

 

I don’t know if it was clear enough, so here it goes. “The Functionality” is the stuff that the Skill actually does (telling the time, the horoscope, telling you the status of a flight, telling you how to prepare a mojito, suggesting which wines go well with pasta, etc.) And of course this functionality can already exist and is currently used through a different format (smartphone app, Web application, wearable, plain old desktop application, etc.) or you can create something totally new.

Existing functionality: Focus is Key

This will be the most frequent case. Your bank decides to offer its services through Alexa. Your fitness tracker adds voice as a new way to interact with you. And the list goes on. Every month, new Skills built on existing functionality are published in the Alexa Skills list.

So, how does it work? You already have a working system with zillions of users. How do you add it to the list of stuff that Alexa is capable of doing? Well, first of all you need to define what functionality you want to expose to Alexa (“expose” here means “make available to”). Imagine you’re a bank. What do you want your customers to be able to do with voice? You have to take a lot of things into consideration: things that work differently with voice than with other interfaces. This list is not exhaustive:

  • Security & privacy considerations: anyone in the house can give instructions to Alexa, so it’s probably not a good idea to allow bank transfers via voice. And everyone around will hear what Alexa says: is it okay for your account balance to be read aloud? Don’t even think of protecting transactions with passwords, either. For many user personas, the whole point of Alexa is that voice is their only channel, and the possibility of eavesdropping makes saying passwords aloud a no-no.
  • Ergonomics: okay, this is the realm of the Voice Interaction Designer, but you really need her input to decide what will fly and what will never work. Imagine you want to interact with your fitness tracker via Alexa. Will there be any value in hearing the list of your heart rates minute by minute? Will you remember it, will you apprehend it? The amount of information that a human can process depends on the sense being used. Sight is good for browsing and for finding a needle of information in a visual haystack. Hearing is not.
  • Value and coherence: you want to implement things that are useful to the user and that bring value to your organization. And you want to paint a coherent picture for your user, who should not get frustrated because the things you’ve implemented lead them to believe that similar or related things, ones that seem equally important to them, are also implemented when they are not.

Does this sound daunting? No, not really. It’s just a lot of work. This is why you need a Product Owner, or you need to be able to act like one and devote enough time to it, when designing any kind of system. You need someone who knows the needs of the organization and of the user well, and who is capable of understanding the possibilities and limitations of the technologies being used.

Okay, imagine you’ve done all of that and you have a list of “services” you want to offer through Alexa Voice Services. What do you have to do now? Easy. Get your Developer and your Interaction Designer together, make them read and understand the post about the Interaction Model (they probably know much more than I do, so maybe they can skip that part!), and make them agree on the “contract” between the Functionality and the Interaction Model: the Intent Schema. Don’t let them part ways until they do!

Technical Implementation

Now your Developer can start work. It’s all about creating a Web service that exposes the functionality that you wish to serve via Alexa. Remember this diagram? It’s the bit at the bottom right.

AVS Overview
AVS Overview – “The Functionality” depicted on the bottom right part of the diagram.

Your Web Service must comply with the following (extracted from here; my comments are between [square brackets]):

  1. The service must be Internet-accessible. [Pretty obvious, eh! But not easy to achieve in some big organizations.]
  2. The service must adhere to the Alexa Skills Kit interface. [More on it later]
  3. The service must support HTTP over SSL/TLS, leveraging an Amazon-trusted certificate.
    • For testing, Amazon accepts different methods for providing a certificate. [i.e. you don’t have to shell out money buying a certificate when you’re just testing]. For details, see the “About the SSL Options” section of Registering and Managing Custom Skills in the Developer Portal.
    • For publishing to end users, Amazon only trusts certificates that have been signed by an Amazon-approved certificate authority. [You work with Amazon, you leverage their services, you accept their rules. Certificates are a matter of trust anyways and you should use the ones they trust!]
  4. The service must accept requests on port 443.
  5. The service must present a certificate with a subject alternate name that matches the domain name of the endpoint.
  6. The service must validate that incoming requests are coming from Alexa. [This last point is actually trying to protect you from DoS attacks]
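To give a flavour of point 6, here is a partial sketch in Python of two of the documented checks. The full procedure also downloads the certificate chain and verifies the request signature against it; the function names here are my own:

```python
from datetime import datetime, timezone
from urllib.parse import urlparse

def cert_url_looks_valid(url):
    """Partial check of the SignatureCertChainUrl header: must be HTTPS on
    s3.amazonaws.com, default or explicit port 443, under /echo.api/."""
    p = urlparse(url)
    return (p.scheme == "https"
            and p.hostname == "s3.amazonaws.com"
            and p.port in (None, 443)
            and p.path.startswith("/echo.api/"))

def timestamp_is_fresh(request_timestamp, tolerance_seconds=150):
    """Reject replayed requests: the request timestamp must be recent."""
    ts = datetime.strptime(request_timestamp, "%Y-%m-%dT%H:%M:%SZ")
    ts = ts.replace(tzinfo=timezone.utc)
    age = abs((datetime.now(timezone.utc) - ts).total_seconds())
    return age <= tolerance_seconds

print(cert_url_looks_valid("https://s3.amazonaws.com/echo.api/echo-api-cert.pem"))  # True
```

If either check fails, your service should reject the request before doing any work.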

So, the secret of the sauce is in complying with the Alexa Skills Kit interface. And believe me, this will be tricky unless you understand how a custom skill works, what you need to do to react to Intents, how to handle slots, and so on. You need to understand the interface specification very well, but perhaps more importantly, you need a broad picture of how everything clicks together. To do that, I recommend two things:

This will be time well spent; it will pay off handsomely later.

Good luck!

How does a Skill work? How can I build one?

In a previous post we explained that Alexa can integrate both with hardware and with software; that is, you can create a voice user interface for any application (existing or new!). We barely scratched the surface of how hardware integration works, and we also mentioned a hackster.io initiative to get makers excited about it. We will now cover the integration with applications.

The first step to leverage Alexa Voice Services for your application is to understand what a Skill is, how it works, how it is created, and (very important) its lifecycle.

What is a Skill?

Imagine that you write down all the things you’re capable of doing. It will be a pretty long list! You can sleep, eat, recite Shakespeare sonnets by heart (maybe!), calculate square roots (but if you’re like me, you’ve forgotten how to do it, even though you know it’s somewhere in the back of your mind…). What would you call each item on that list? Maybe “stuff I can do”? What about calling them “my skills”?

Well, the good people of Amazon have done precisely that for Alexa. They have crafted a list of all the things that Alexa can do. A Skill in this context is “something that Alexa can do”. The list of Alexa skills can be found if you log on to alexa.amazon.com and select the Skills section, or if you have already installed the Alexa app on your Smartphone, you can also check it out. There are Skills that were created by Amazon as “basic functionality” (tell the time, set alarms, manage a to-do list, tell jokes!), and there are Skills created by everyone else. Some of them are from prominent companies (Fitbit, Uber, …), some of them, the most fun actually (!), are the ones created by independent devs like yours truly.

How does a Skill work? (as a user)

One of my favorite skills is called Big Sky. It is a weather forecast tool that uses my location (based on my Echo’s IP address), as opposed to Alexa’s “basic functionality” Skill that assumes that like Alexa, I, too, live in Seattle.

This diagram represents how I interact with Big Sky via my Echo device:

How to use a skill
How to use a skill

Wake word: your Echo device is always listening but doesn’t really care about the noise it picks up unless it recognizes the Wake word. I like to think of the Wake word as the magic spell that brings the Alexa spirit to the otherwise inert black cylinder (yes, she has a distinct personality and I think she’s as alive as my 14-year-old cat Sofía: their personalities are actually similar!!!). Originally there was only one Wake word (yep, that’s right: Alexa), but given that Alexa is a not-so-uncommon name and it could be very, very confusing to use an Echo in a household with someone called that, Amazon has expanded the list of possible wake words to: Alexa, Amazon, and Echo. To change it, go to the Alexa Web site or app, go to your Echo device, then Settings, then Wake word, and select your choice from the pick list.

Invocation name: that’s how Alexa knows what she has to do, besides understanding your English! You might think the invocation name is just the skill name, but that’s not true 100% of the time. When you create the Interaction Model of your Skill (more on that a bit later!) you specify two things: the skill name (what appears in the Skill list on the app or on the Web site) and the invocation name (if your skill name is too long or difficult to pronounce, you can choose something simpler, but if you’re happy with your skill name, you can just pick the same one).

Slot: if you’re a programmer, that’s a variable, and your skill may use none, one, or as many as needed. If you’re not a programmer, a slot is the “placeholder” for the extra information that you give to the Skill so that it does exactly what you want. Either way: if your skill helps people check flight schedules, you may need three slots: arrival, departure, or both; flight number; and flight date. In the example above I chose to specify the location for the weather forecast, but this was not compulsory, because the Skill is clever enough to pick up the location from the IP address of my Amazon Echo.

It takes a little while to get the hang of interacting with Alexa (she even appears to get a bit frustrated when she doesn’t understand you, but so do you!), but the golden rule is to always say the Wake word first, then use the Invocation name very clearly. Don’t worry about grammar, and you don’t have to be polite with her (pleases and thank-yous are ignored).

How does a Skill work? (for real)

A Skill has two distinct parts: the Interaction Model and what I call “the functionality”.

The Interaction Model is everything related to speech. It’s where you specify the Invocation name, the slots that your Skill can understand, and, very importantly, examples of whole sentences that your Skill can process. These sentences are called “Sample Utterances” and you will spend many hours perfecting them. There’s also something called the “Intent Schema”, and it’s very, very important, because it defines the different tasks that Alexa will be asking of “the functionality”, based on what the user has asked Alexa to do. It’s where you define the hooks between the two parts of the Skill.
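To make the Intent Schema concrete, here is an illustrative one for the flight-schedule example above, written as a Python dict for readability (the console expects JSON). The intent, slot, and custom-type names are invented for this sketch; AMAZON.NUMBER and AMAZON.DATE are real built-in slot types:

```python
import json

# Illustrative Intent Schema: one intent with the three flight slots
# discussed earlier. LEG_TYPE stands in for a custom slot type you would
# define yourself (e.g. with values "arrival", "departure", "both").
intent_schema = {
    "intents": [
        {
            "intent": "GetFlightStatus",
            "slots": [
                {"name": "Leg", "type": "LEG_TYPE"},
                {"name": "FlightNumber", "type": "AMAZON.NUMBER"},
                {"name": "FlightDate", "type": "AMAZON.DATE"},
            ],
        }
    ]
}

print(json.dumps(intent_schema, indent=2))
```

Each intent name is the hook: “the functionality” receives it and decides what to do, with the slot values as its parameters.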

Interaction Model Sections for my Wine Expert Skill.

You work on the Interaction Model through the Amazon Developer Console. When you create a Skill, you get five sections to fill out plus some testing tools (you can see those in the picture above). One of these sections is for the Interaction Model. In another post we’ll describe it in depth.

A very important section has a cryptic name: “Configuration”. Among other important stuff, that’s where you define where “the functionality” actually lives. So, remember: with the Intent Schema in the Interaction Model you describe “the hooks” between the Interaction Model and “the functionality”, but it is here that you say where to find that functionality.

“The functionality” can be an application that already exists (Fitbit, Uber, etc. were happy systems with millions of users before Alexa was invented), or one you make now to be used specifically with Alexa. In the first case, the developers for that existing system will have to develop an interface that uses the AVS API. Well, actually, a product manager will have to identify the functionality that will be used via Alexa, then the developers will encapsulate that functionality in a way that can be exposed to the AVS API so that Alexa can use it. In some cases the developer and the product manager are the same person!

If you’re creating a Skill from scratch, then Amazon recommends that you build and host “the functionality” with Amazon Web Services and they suggest you do it as a Lambda Function. We’ll speak a lot about this soon, stay tuned!

So, once you build the Interaction model and “the functionality” and you hook them together, you’re ready to roll. And now you’re ready to learn about the lifecycle of a Skill. You’ll need it.

Lifecycle of a Skill

A picture is worth a thousand words. So:

Skill Lifecycle Diagram
Skill Lifecycle Diagram

First, you create the Skill. In the Amazon Developer Console you will see that you have one Skill in your list of Skills, with a status of Development.

You work on it as we’ve described before, and when you think it’s ready to be published, you submit it for certification. From this moment on, you can chill and take a break, because the Skill is frozen, meaning that you can’t work on it; you can’t even test it! It will take the Amazon folks a couple of days to get back to you with good news (Skill accepted!) or “good news” (there’s opportunity for improvement! Okay, not such good news: it means your skill has failed the certification process). When the Skill fails the certification process, it goes back to Development status and you can work on it again.

If you think you made a mistake while you’re waiting for certification feedback, you can withdraw the certification request. You can work on it again and re-submit for certification when you’re ready.

Once you get it right and your Skill gets certified, interesting stuff happens. First, your Skill becomes two Skills: one with Production status, frozen, impossible to modify, and another one with Development status. If you wish to add more functionality to your Skill, you work on the Skill under Development and follow this process again. When you submit this new version of the Development Skill for certification, once it’s approved, the “original” Production Skill will disappear and be replaced with the new one, and you’ll again get a Skill under Development, just in case you wish to add more functionality later.
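The transitions above can be sketched as a tiny state machine (the event names are my own shorthand, not Amazon terminology):

```python
# States and transitions of the Skill lifecycle described above.
TRANSITIONS = {
    ("Development", "submit"): "Certification",
    ("Certification", "withdraw"): "Development",
    ("Certification", "rejected"): "Development",
    # Approval also spawns a fresh Development copy alongside Production.
    ("Certification", "approved"): "Production",
}

def next_state(state, event):
    """Look up where a (state, event) pair leads."""
    return TRANSITIONS[(state, event)]

print(next_state("Development", "submit"))  # Certification
```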

Understanding the Skill lifecycle is very important. Typically you learn it through practice (I haven’t seen a diagram like mine anywhere!), not without a good deal of uncertainty (where is my Development Skill? It has disappeared! How do I create new versions? Is there life on Mars? Et cetera). So I hope you find this explanation useful.