Shut Up, Siri
There will be no monoculture of human-computer interaction.
Every day I see a new thinkpiece on “the post-screen future” or “UI-less design” or “the end of the click.” I even used to write things like that. But that’s because I had less experience with human-computer interaction than I have now. You see, there’s this contagion of belief that new technologies not only open new doors, but definitively close old ones. But that’s rarely true. The internet didn’t end radio. The iPhone didn’t end laptops or even desktop computers. And voice interfaces won’t end screens and manual interactions. There will be no monoculture of human-computer interaction.
We may have the technology to make the click an unnecessary interaction convention; I doubt we have the desire. That is a good thing. Sure, we’ll talk to our machines, just not all the time.
The definition of the click as “a mechanical act requiring decision, precision, and a split-second negotiation between choice and commitment” is a good one, because it details all the reasons why the click is so useful and effective. However, some might imagine that a sophisticated enough machine would obviate the need for any direct, physical interaction. After all, didn’t the characters in Star Trek walk around the ship, constantly invoking the ship’s Computer for answers by simply speaking “Computer…” into the room and waiting for its response? They did! But they also had many screens and panels, and did a lot of tapping and pressing that might as well have been clicking. Sure, Star Trek was made long before we had a good sense of what advanced computing might actually be capable of, or what it might be like to use. But it may also be that the creators of Star Trek held some insight into human-computer interaction that shaped their world-building.
Consider how your brain processes information. The eye-brain connection is one of the most sophisticated and efficient systems in human biology. You can scan a list of options and make comparisons in fractions of a second — far faster than listening to those same options read aloud. Suppose we found ourselves ordering dinner at a restaurant in a purely voice-command future. I imagine it would be a lot like the moment when your server reads off the evening’s specials — what was that first one again? — but stretched across the entire meal, for everyone at the table. It would take too long, and it would be very annoying.
That’s the thing about how our senses interact with the brain — they don’t all work in the same way. You can view more than one thing at a time, identify them, react to them, and process them virtually simultaneously, but you cannot come close to that kind of performance with sound. Imagine sitting across from two friends who both show you a picture at the same time. You’ll likely be able to identify both right away. Now imagine those two friends telling you something important at the same time. You’re almost certain to ask them to repeat themselves, one at a time.
The screen, by the way, tends to get the blame for all the negative things that have come with our increasingly digital lives — the distractions, intrusions, manipulations, and so on — but the screen itself isn’t to blame. In fact, the screen exists because of how incredibly useful it is as a memory surrogate. The screen is a surface for information and interaction, much like a whiteboard, a chalkboard, a canvas, a scroll, or a patch of dirt once was long ago. The function it serves is to hold information for us — so that we don’t have to commit it to memory. That’s why screens are useful, and that’s why — I think — they were still present on an imagined starship three centuries from now, alongside a conversant AI.
“Clicking” — which is really just shorthand for some direct selection method — is incredibly efficient, and it becomes more so as the number of options grows. Imagine a list of three items, which is probably the simplest scenario. Speaking a selection command like, “the third one, please” is just as efficient as manually selecting the third item in the list. And this is probably true up to somewhere around six or seven items — there’s an old principle (George Miller’s “magical number seven, plus or minus two”) about how few discrete pieces of information we can hold in working memory at once. Beyond that number, it gets more difficult without being able to simply point. Saying you want the ninth item in a list, for example, requires knowing that it’s the ninth one, which might take you a moment to figure out — certainly longer than just pointing at it.
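If you want to put rough numbers on that intuition, here is a back-of-the-envelope sketch. It leans on Hick’s law, a standard HCI model in which choice time grows with the logarithm of the number of options, and contrasts it with a spoken menu, where time grows linearly because the options are read out one by one. The coefficients are illustrative assumptions of mine, not measurements from any study.

```typescript
// Back-of-the-envelope comparison: visual selection vs. a spoken menu.
// Hick's law models choice-reaction time as T = a + b * log2(n + 1);
// the coefficients below are assumed for illustration, not empirical.

const HICK_INTERCEPT_MS = 200;   // baseline reaction time (assumed)
const HICK_SLOPE_MS = 150;       // per-bit decision cost (assumed)
const SPEECH_PER_ITEM_MS = 2000; // ~2s to hear each option read aloud (assumed)

function visualChoiceMs(n: number): number {
  // Scanning a visible list: time grows with the log of the option count.
  return HICK_INTERCEPT_MS + HICK_SLOPE_MS * Math.log2(n + 1);
}

function spokenMenuMs(n: number): number {
  // A voice menu is serial: on average you listen to half the options.
  return SPEECH_PER_ITEM_MS * (n / 2);
}

for (const n of [3, 7, 12, 36]) {
  console.log(
    `${n} options: visual ≈ ${Math.round(visualChoiceMs(n))}ms, ` +
    `spoken ≈ ${Math.round(spokenMenuMs(n))}ms`
  );
}
```

Even with generous assumptions for speech, the gap widens fast: the visual scan scales logarithmically while the spoken menu scales linearly, which is the whole point of being able to see and point.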
Perhaps each item in the list is clearly different. In that case, you might just be able to speak aloud which one you want. But what if they’re not that different? What if you aren’t sure what each item is? Perhaps these items aren’t even words, in which case you now have to describe them in a way that the machine can disambiguate. What if there are three dozen in a grid? At that level of density, tracking with your eye and some kind of pointer helps you move more rapidly through the information, to say nothing of making a final selection.
Aside from human efficiency, consider also the computational efficiency of manual interaction. A click or tap requires minimal processing power — it’s a simple input with precise coordinates. Voice commands, on the other hand, require constant audio processing, speech recognition, and AI-driven interpretation. In a world increasingly concerned with energy consumption and computational resources, the efficiency of direct selection becomes even more relevant.
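To make that asymmetry concrete, here is a minimal browser sketch (mine, not the essay’s). The click handler is the complete story for direct selection; the voice path, shown with the standard Web Speech API, still needs audio capture, a recognition pass, and a mapping from transcript to an on-screen target. The name SpeechRecognitionImpl is local to this sketch.

```typescript
// A click arrives as a complete, unambiguous input: precise coordinates,
// no further interpretation required.
document.addEventListener("click", (e: MouseEvent) => {
  console.log(`Selected at (${e.clientX}, ${e.clientY})`);
});

// A voice command needs continuous audio capture, a recognition pass,
// and then interpretation of the transcript before anything is selected.
// (Web Speech API; vendor-prefixed in some browsers, hence the `any` cast.)
const SpeechRecognitionImpl =
  (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;

const recognizer = new SpeechRecognitionImpl();
recognizer.onresult = (event: any) => {
  const transcript: string = event.results[0][0].transcript;
  // "the third one, please" still has to be resolved to an on-screen item.
  console.log(`Heard: "${transcript}" (now map it to a target)`);
};
recognizer.start();
```

And recognition is the cheap part; the AI-driven interpretation that maps “the third one, please” onto whatever is currently on screen is where the ongoing computational cost lives.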
It’s also worth noting that different interface modes serve different accessibility needs. While voice interfaces can be crucial for users with certain physical limitations, visual interfaces with direct selection are essential for users with hearing impairments or speech difficulties. The future isn’t about replacing one mode with another — it’s about ensuring that multiple modes of interaction are available to serve diverse needs.
Instead of imagining a wholesale replacement of visual interfaces, we should be thinking about how to better integrate different modes of interaction. How can voice and AI augment visual interfaces rather than replace them? How can we preserve the efficiency of visual processing while adding the convenience of voice commands?
The click isn’t just a technological artifact — it’s a reflection of how humans process and interact with information. As long as we have eyes and spatial reasoning, we’ll need interfaces that leverage these capabilities. The future isn’t clickless; it’s multi-modal.