Smart Assistants are not very smart
Firstly, this post is not meant to single out any particular app, phone or smart assistant. It is about sharing how I think smart assistants should work. I do not own a smart speaker like Amazon Alexa or Google Home and do not know their full capabilities. However, I do have an Android smartphone, and I see no reason why I could not use it in the same way, if those devices even met my demands, which from what I have read they do not. The smart features I want are ones I would typically use on the go or while driving.
Some of the features I request may already be possible but require third-party applications, some of which I have tried; they did not meet my requirements in any way.
So I will just go ahead and describe the perfect smart assistant that I need. I will also describe specific scenarios and how I envision them working, and mention some shortcomings I have come across in the assistants I have tried so far.
Mobile App
It must be a mobile app or be integrated into the phone's operating system, preferably preinstalled on new phones with the option to enable or disable it and configure it to your liking. I do not want another hardware device. My phone already has a speaker and a microphone, and the assistant must be able to use them. It also has Bluetooth to connect to input/output devices like my car or the speakers at home. The phone also has an internet connection (Wi-Fi and 3G/4G mobile data), which the assistant may need. And the processors in modern smartphones are powerful enough to do voice processing.
Wake-up command
The wake-up command: what's up with that? To me this is a massive barrier to seamless use. I should not have to wake my phone up and then wait for a sound that says it is ready before I can issue commands. If I am driving and I receive a message, the phone knows I received a message. Why can I not just say "Read the message"? Or, if it really needs a wake-up command, why can I not use the wake-up command in the same sentence as the instruction, such as "Google, read the message"?
I also want to define my own wake name rather than having to use "Ok Google". "Ok Google, read the message" is currently not supported anyway, but if it could read my messages, I would just want to talk to my phone in the same way as I talk to the person next to me in the car. If my phone beeped with an incoming message and I just said "Read the message", the person next to me would understand and read me the message.
Why can smart assistants not do this simple task? Because in reality, they are not very smart.
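To make the idea concrete, here is a minimal sketch of how the two behaviours I am asking for could be decided once speech has already been turned into text: a wake name in the same sentence as the instruction, and no wake name at all if a notification has just arrived. Everything in it is my own assumption for illustration (the `extract_command` function, the `Command` type and the 15-second follow-up window); it is not how any existing assistant works.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed value: how long after a notification a bare command is accepted.
FOLLOW_UP_WINDOW_S = 15.0


@dataclass
class Command:
    text: str    # the instruction with the wake name stripped
    reason: str  # "wake_word" or "recent_notification"


def extract_command(utterance: str, wake_name: str, now: float,
                    last_notification_at: Optional[float] = None) -> Optional[Command]:
    """Return the instruction to act on, or None if the speech should be ignored."""
    # Case 1: the user-defined wake name starts the sentence.
    pattern = re.compile(rf"^\s*{re.escape(wake_name)}\b[\s,]*", re.IGNORECASE)
    match = pattern.match(utterance)
    if match:
        rest = utterance[match.end():].strip()
        return Command(rest, "wake_word") if rest else None

    # Case 2: no wake name, but the phone beeped a moment ago.
    if last_notification_at is not None:
        if 0 <= now - last_notification_at <= FOLLOW_UP_WINDOW_S:
            text = utterance.strip()
            return Command(text, "recent_notification") if text else None

    return None
```

With this, "Google read the message" works at any time, "Read the message" works for a short while after the beep, and the wake name can be whatever I choose.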
Notifications
The phone must be able to read any notification on demand, not just SMS text messages. This includes WhatsApp messages, other text messaging apps, and system or app notifications such as incoming mail. If the phone beeps, I just want to be able to say something like "What's that?" or anything natural, not just very specific memorised phrases like "Read the message". I might not know what the notification is. So it might reply "You received a Gmail message from John" or "Your phone needs to be cleaned up" or, if it is a text message from someone in any messaging app, "John says 'Are we going for lunch?'". I must then be able to reply in any relevant way without repeating the context, such as "Reply: No, I have a meeting". It must then reply using the same communication method the message arrived on.
For system notifications like the cleanup notification, I should just be able to say "Fine, run it" or "No, not now". It must know what to do based on what we were talking about. Most existing apps will probably reply with "I did not understand that", or ask five questions before sending a message back, such as which number to use, and show me a list of options on a screen I am probably not even able to see. Just do it. Don't ask me for confirmations unless absolutely necessary.
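The key to all of this is that the assistant remembers the last thing it announced. A rough sketch of that idea, assuming speech has already been converted to text, could look like the following. The `Notification` and `Conversation` classes and the action dictionaries are hypothetical names I made up, and the keyword matching is deliberately crude; a real assistant would need proper language understanding.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Notification:
    app: str               # e.g. "WhatsApp", "Gmail", "System"
    sender: Optional[str]  # None for system notifications
    body: str
    is_system: bool = False


class Conversation:
    """Keeps the last announced notification so follow-ups need no context."""

    def __init__(self) -> None:
        self.last: Optional[Notification] = None

    def announce(self, n: Notification) -> str:
        self.last = n
        return n.body if n.is_system else f"{n.sender} says '{n.body}'"

    def handle(self, utterance: str) -> dict:
        words = utterance.lower().split()
        if self.last is None or not words:
            return {"action": "clarify"}
        first = words[0].strip(",.!:")
        if self.last.is_system:
            if first in {"no", "not"}:
                return {"action": "dismiss", "app": self.last.app}
            if first in {"fine", "yes", "ok", "okay", "run"}:
                return {"action": "run", "app": self.last.app}
            return {"action": "clarify"}
        if first == "reply":
            message = utterance.strip()[len("reply"):].strip(" ,:")
            # Answer on the same channel, to the same person.
            return {"action": "send", "app": self.last.app,
                    "to": self.last.sender, "message": message}
        return {"action": "clarify"}
```

Note that the reply never names John or WhatsApp; both come from the remembered notification.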
Reduced interaction
I must be able to send a message to anyone in a single sentence. Say I am driving and need to tell my boss I am going to be late for the meeting. Why can I not just say "Google, send Peter a WhatsApp and tell him I am stuck in traffic and I am going to be late for the meeting"? If I were instructing the person next to me to send that message on my behalf, they would understand perfectly. The assistant should tell me what it understood, saying something like "Send Pete a message: I am stuck in traffic and I am going to be late for the meeting". I must then be able to just say "Not Pete, Peter", and it would say "Send Peter a message: I am stuck in traffic and I am going to be late for the meeting". I would say "Yes" and it would say "Done". Or, if it thought I said "Send Peter a message: I am stuck in traffic and I am going to be early for the meeting", I would reply "Not early, late", and it would know what to correct and repeat what it now thinks is correct.
The point is that if there is a misunderstanding, it must be smart and know what we are talking about, rather than make me repeat the entire command all over again or rephrase things until it understands. A person next to me would know the context of what we are talking about, and a "smart" assistant should too.
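As an illustration, a correction like "Not Pete, Peter" can be applied to the pending message instead of starting over. The sketch below is only that, a sketch: `Draft`, `apply_correction` and the "not X, Y" pattern are my own assumptions, it works on text that has already been recognised, and it only handles this one simple phrasing.

```python
import re
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Draft:
    recipient: str
    body: str

    def readback(self) -> str:
        return f"Send {self.recipient} a message: {self.body}"


# Matches "not WRONG, RIGHT" (or "not WRONG RIGHT" for two single words).
NOT_X_Y = re.compile(
    r"^\s*not\s+(?P<wrong>[^,]+?)\s*[, ]\s*(?P<right>[^,]+?)\s*$",
    re.IGNORECASE,
)


def apply_correction(draft: Draft, utterance: str) -> Optional[Draft]:
    """Return a corrected draft, or None if the utterance is not a usable correction."""
    m = NOT_X_Y.match(utterance)
    if not m:
        return None
    wrong, right = m["wrong"], m["right"]
    # First guess: the recipient was misheard.
    if draft.recipient.lower() == wrong.lower():
        return replace(draft, recipient=right)
    # Otherwise look for the misheard word in the message body.
    word = re.compile(rf"\b{re.escape(wrong)}\b", re.IGNORECASE)
    if word.search(draft.body):
        return replace(draft, body=word.sub(lambda _: right, draft.body, count=1))
    return None
```

After each correction the assistant would read the draft back with `readback()` and wait for a "yes", which keeps the exchange to a few words per turn.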
Application API
Apps should have an API they can hook into to execute commands. For example, a food ordering/delivery app would understand placing food orders. I would just say "Google, order 2 piece chicken and chips from KFC". It would respond with "Ordering 2 piece chicken and chips, anything else?". I would say "Yes, add a Coke". Google: "Added a Pepsi". Me: "Coke, not Pepsi". Google: "Changed Pepsi to Coke, anything else?" Me: "No". Google: "That will cost $10, go ahead?" Me: "Yes". Google: "Order placed, delivery in 30 minutes". Note that this assumes you have set up defaults in the app, such as a default payment option, a default delivery address and maybe default sizes for chips and drinks. However, I can easily override those by being more specific in my instructions, such as "Google, order 2 piece chicken and large chips from KFC and deliver to work, pay using my XYZ bank credit card".
Similarly, a music application could hook into the API and allow complete control over your playlist. I won't give any examples, as I think you get the idea.
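For the developers reading, this is roughly the shape of API I have in mind. It is a minimal sketch with invented names (`AssistantAPI`, `register`, `dispatch`, the `order_food` intent and the `DEFAULTS` table are all hypothetical), and it assumes the assistant has already worked out the intent and its details from my sentence.

```python
from typing import Callable, Dict, List


class AssistantAPI:
    """Hypothetical registry that apps use to expose voice commands."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[..., str]] = {}

    def register(self, intent: str):
        def decorator(fn: Callable[..., str]) -> Callable[..., str]:
            self._handlers[intent] = fn
            return fn
        return decorator

    def dispatch(self, intent: str, **slots) -> str:
        handler = self._handlers.get(intent)
        if handler is None:
            return f"No app can handle '{intent}'"
        return handler(**slots)


assistant = AssistantAPI()

# Defaults the user configured in the food app (assumed values).
DEFAULTS = {"deliver_to": "home", "pay_with": "default card"}


@assistant.register("order_food")
def order_food(items: List[str], vendor: str, **overrides: str) -> str:
    # Anything said explicitly overrides the configured defaults.
    options = {**DEFAULTS, **overrides}
    return (f"Ordering {', '.join(items)} from {vendor}, "
            f"deliver to {options['deliver_to']}, pay with {options['pay_with']}")
```

A plain order would use the defaults, while `assistant.dispatch("order_food", items=[...], vendor="KFC", deliver_to="work")` overrides only the delivery address. A music app would register its own intents in the same way.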
System functions
I am in my car, the phone is in my pocket and I get a message. I ask it to read the message. I can't hear it, since the phone is in my pocket and it is playing over the phone speaker because my Bluetooth is off, which for some reason switches itself off after a reboot (which it should not do, but that's a different issue). I must just be able to say "Repeat over Bluetooth". It must then turn on Bluetooth and repeat what it said over Bluetooth. I must also be able to give arbitrary commands like "Google, turn on Bluetooth", "Google, turn up the volume 20%", "Google, enable WiFi", "Google, disable data" or whatever I want, really.
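Commands like these are simple enough that even basic pattern matching on the recognised text gets quite far. Here is a small sketch; the `parse_system_command` function, the tuple it returns and the 10% default volume step are assumptions of mine, and actually changing the setting would be up to the operating system.

```python
import re
from typing import Optional, Tuple

TOGGLE = re.compile(
    r"\b(?P<verb>turn on|turn off|enable|disable)\s+"
    r"(?P<target>bluetooth|wifi|wi-fi|data)\b",
    re.IGNORECASE,
)
VOLUME = re.compile(
    r"\bturn (?P<dir>up|down) the volume(?:\s+(?:by\s+)?(?P<pct>\d{1,3})\s*%)?",
    re.IGNORECASE,
)

DEFAULT_VOLUME_STEP = 10  # assumed step when no percentage is given


def parse_system_command(text: str) -> Optional[Tuple[str, str, int]]:
    """Return (setting, action, amount) or None if nothing matched."""
    m = TOGGLE.search(text)
    if m:
        turn_on = m["verb"].lower() in ("turn on", "enable")
        target = m["target"].lower().replace("-", "")
        return (target, "on" if turn_on else "off", 0)
    m = VOLUME.search(text)
    if m:
        pct = int(m["pct"]) if m["pct"] else DEFAULT_VOLUME_STEP
        return ("volume", m["dir"].lower(), min(pct, 100))
    return None
```

Because the patterns search anywhere in the sentence, the wake name can stay at the front and it does not matter how I capitalise things.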
GPS/Maps control
Maybe I am not set up correctly, but somehow using Maps is still a pain. I have to turn on my phone and tap a voice button to ask it to navigate somewhere. Then, if it is unsure, it gives screen prompts. If I am interacting with you using voice, let me continue to do so. I don't want to have to turn on my phone to ask it to "Navigate home". Just let me say "Google, navigate home": the phone turns on, the maps application launches and it navigates to my chosen destination. If there is uncertainty, it can prompt me by voice, and I must be able to say "The first option" or "No, not there, I said Home, not The Dome".
Making calls
Pretty much the same applies here as for messaging. However, I sometimes have multiple numbers set up for a contact, or different methods of calling, such as WhatsApp. It must understand that. Just use default numbers unless I am specific and say "Google, call John's work number". Don't ask me. If there is more than one John, start learning who I call most often. Initially ask which one; later default to the most common. If it makes a mistake, that's fine, I will correct it. For example, it may say "Calling John Doe" and I will interrupt and say "No, John Peters". Some mistakes will cost me a small amount of money. I am fine with that, as I prefer the convenience of a smart assistant to a dumb one with 20 questions and confirmations.
Make the level of confirmations configurable if you have to. Learn from my previous actions, learn the way I speak to you and learn what I mean when I phrase things a certain way. If you see that I always call John on his home phone after hours, use that as the default after hours. During the day you might learn that I use a different contact number or method. Also allow me to teach you the pronunciation of the names in my contacts, training each contact individually if I want to. I have names that are not very English, but I can teach you if you allow me to.
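Learning a default does not need anything fancy to start with. The sketch below simply counts which number I picked for a name during and after working hours, and suggests the most frequent one. The `CallPreferences` class and the 8-to-17 working hours are assumptions for illustration; a real assistant would want something smarter than counting.

```python
from collections import Counter, defaultdict
from typing import Dict, Optional, Tuple


class CallPreferences:
    """Counts past choices per spoken name and time of day."""

    def __init__(self, work_start: int = 8, work_end: int = 17) -> None:
        self.work_start = work_start
        self.work_end = work_end
        self._counts: Dict[Tuple[str, str], Counter] = defaultdict(Counter)

    def _period(self, hour: int) -> str:
        in_hours = self.work_start <= hour < self.work_end
        return "work_hours" if in_hours else "after_hours"

    def record(self, spoken_name: str, target: str, hour: int) -> None:
        """Remember which contact/number was actually called."""
        self._counts[(spoken_name.lower(), self._period(hour))][target] += 1

    def default_for(self, spoken_name: str, hour: int) -> Optional[str]:
        """Most common past choice, or None if the assistant still has to ask."""
        counts = self._counts.get((spoken_name.lower(), self._period(hour)))
        if not counts:
            return None
        return counts.most_common(1)[0][0]
```

The first time I say "Call John" it returns `None` and the assistant asks; after that it defaults to whoever I usually call at that time of day, and a correction from me is just another `record`.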
Security
I know you will have been wondering about the security implications of anyone being able to issue commands to your phone. Voice recognition is possible, especially for the key phrase that starts a new conversation, such as "Google" (I hate the "OK Google" part). Also, many people don't care too much about 100% privacy. In general, people who try to control your phone will be branded arseholes, and the joke will quickly become stale.
Typically I won't have my phone speak to me in public anyway. But say I am at work or on the subway and I have an incoming message: yes, someone may try to be funny and say "Read the message". If there is voice recognition, at least there will be some security. If that fails, you could have a stop phrase, such as "Stop" or "No", and the voice stops immediately. People may also try to control your phone when you are not there, for example when you leave it around the house or on your office desk.
There are security measures one could put in place; however, that is beyond the scope of this article. This is more about convenience, and those who require Fort Knox security should probably not use this feature. For things like payments, maybe you could have a voice password. Yes, others may hear your password, but this is for small payments of convenience, such as ordering a meal while driving, where no one is likely to hear your password anyway.
Performance
Performance should be seamless. There should be no pauses, and voice should be interpreted on the device without the need to upload it. Yes, you will likely need a high-end phone to use this feature and it will probably eat battery. But it's not about building thinner, lighter phones. Add batteries with more capacity. We are adding a next-generation feature that truly changes how you use the phone. I expect that phones meeting the increased hardware requirements will cost more and be bigger and heavier than current phones that do not support such a feature.
News, weather and internet
Many devices already try to cater for this. They try to seem smart because you can ask arbitrary questions, such as the population of New York. But that is just a lookup on the internet. There is nothing smart about that. Not that they should stop doing so; I just won't go into much more detail about it. Being smart is more about understanding me and talking back to me like you understand what we are talking about, even if the conversation between us is 10 sentences long. If we were discussing garlic in the first sentence, you must be smart enough to know whether we are still on the same topic 10 sentences later or whether we have moved on.
Accessibility
While this feature is intended mostly for the convenience of able-bodied people, it would be a massive upgrade for people with specific accessibility needs. I have always believed that accessibility features on most electronic devices are rather neglected and that there is massive room for improvement. This would help such people a lot.
Conclusion
I am not all that impressed with the ability of so-called smartphones to be used as a hands-free device or assistant. Existing apps have always been hard to use or do not work correctly, and they do not have half the features I really require. I don't think it's because the technology is not there yet. It is. I just cannot understand why no one has made a real effort. Instead, they would rather sell me a secondary device that implements some of these features, but few of the ones I really want. What I want is a hands-free smart device in my pocket.