Voice/Hand Motion Interface
March 29, 2002
After I experienced some loss of hand functionality several years ago, I spent some time thinking about a kind of interface that would allow me to work without doing an excessive amount of typing. I came up with an idea that combines voice and hand motion. I've read many ideas involving the use of one or the other, but seldom the two in combination. Perhaps people are already working on this exact idea, but if not, here it is.
Even hands that cannot type well (or long) can still point. The "index" finger is the finger that points. To "indicate" something is to point at it. With a pair of cheap video cameras, one's fingertip, with a polished nail, perhaps, could easily become an "air mouse". The software getting signals from the video cameras could determine that the "bright green spot" (the fingernail or the painted tip of a wand) is within some rectangle, thus activating the "air mouse" and mapping its position to screen coordinates.
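The mapping step can be sketched in a few lines of Python. This is only an illustration, assuming the camera software has already located the bright spot and reports it as an (x, y) pair in camera pixels. The names Rect and air_mouse_position are hypothetical, invented for this example.

```python
from typing import NamedTuple, Optional, Tuple


class Rect(NamedTuple):
    """Active region in camera coordinates (hypothetical helper type)."""
    left: float
    top: float
    width: float
    height: float


def air_mouse_position(
    spot: Tuple[float, float],
    active: Rect,
    screen_w: int,
    screen_h: int,
) -> Optional[Tuple[int, int]]:
    """Map a tracked spot to screen pixels, or None if the spot is
    outside the active rectangle (air mouse inactive)."""
    x, y = spot
    # Normalise the spot to the 0..1 range within the active rectangle.
    u = (x - active.left) / active.width
    v = (y - active.top) / active.height
    if not (0.0 <= u <= 1.0 and 0.0 <= v <= 1.0):
        return None
    # Scale to the last valid pixel index on each axis.
    return (round(u * (screen_w - 1)), round(v * (screen_h - 1)))
```

A spot outside the rectangle returns None, which corresponds to the air mouse being switched off: the hand can rest or reach for a coffee cup without dragging the pointer along.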
I'm sure such things already exist, but I suspect they are almost always put to use within the confining paradigm of the GUI. How do you "click" an air mouse? You make a clicky gesture, I guess. Or a double-clicky gesture, or a Shift-Ctrl-double-clicky gesture, etc. You move the AMP (air mouse pointer) over the "Spray Can" paint tool, do a click gesture, then move the Spray Can icon over the region you wish to spray, then you make a "start spraying" gesture, then you spray and then, somehow, you signal "end of spray" (with a gesture? a key? a pause?) and then you continue your "select a tool, use it" actions until your painting is complete.
There are, I believe, complementary problems with voice interfaces. One could command "Spray Can!" or "Pencil!" or "Rectangle!" but then specifying screen locations could be tricky.
The solution would be to use the command power of the voice in connection with the pointing ability of the hand. The hand points to the nouns, the voice speaks the verbs. For example, to begin a paint session, the user moves the AMP over various document objects on the screen. When it passes over an object, the object is highlighted with a "nimbus of indication" (meaning simply, "this thing is what the heck you are pointing to") and at the same time, a menu of possible commands is displayed elsewhere on the screen, or possibly on another screen. With the AMP over bozo.gif, the menu would have items like "select, open, move, delete" and so on. ("Selecting" would allow the document to be grouped with other selected documents.) The user would speak one of the allowable commands, e.g., "Open!" Once within the document, the user might say "Pixel!" meaning "When I point, I'm just pointing to a pixel, not to a higher-level 'object' such as a section or a layer (or whatever)" and then put the AMP on a pixel, then say "Line!" then move the AMP to the end of the line and then say "End!"
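To make the "hand points to the nouns, voice speaks the verbs" idea concrete, here is a rough Python sketch of the session just described. It assumes the pointing and speech recognition are already done elsewhere and arrive as simple calls. The class VhmiSession, its methods, and the "canvas" menu contents are all hypothetical; only the "select, open, move, delete" menu and the "Pixel!", "Line!", "End!" commands come from the text.

```python
class VhmiSession:
    """Toy model: point() supplies the noun, say() supplies the verb."""

    # Commands offered for each kind of pointed-at thing (illustrative).
    MENUS = {
        "document": ("select", "open", "move", "delete"),
        "canvas": ("pixel", "line", "end"),
    }

    def __init__(self):
        self.pointed = None       # (kind, value) currently under the AMP
        self.granularity = None   # e.g. "pixel" once the user says "Pixel!"
        self.line_start = None    # first endpoint of a line in progress
        self.log = []             # completed actions, in order

    def point(self, kind, value):
        """Move the AMP over something; return the menu to display."""
        self.pointed = (kind, value)
        return self.MENUS[kind]

    def say(self, word):
        """Apply a spoken verb to the pointed-at noun.
        Returns False if the verb is not on the current menu."""
        verb = word.strip("!").lower()
        if self.pointed is None:
            return False
        kind, value = self.pointed
        if verb not in self.MENUS[kind]:
            return False
        if verb == "pixel":
            self.granularity = "pixel"
        elif verb == "line":
            self.line_start = value
        elif verb == "end":
            if self.line_start is None:
                return False
            self.log.append(("line", self.line_start, value))
            self.line_start = None
        else:
            self.log.append((verb, value))
        return True
```

A session following the example above would run: point("document", "bozo.gif"), say("Open!"), point("canvas", (0, 0)), say("Pixel!"), point("canvas", (10, 10)), say("Line!"), point("canvas", (40, 25)), say("End!"). The spoken vocabulary at any moment is limited to the displayed menu, which is what keeps the recognition problem small.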
That is the general idea. Coming up with really good uses of this VHMI concept ("voice/hand motion interface," pronounced "vimmie"; maybe VAMP, for "voice/air-mouse pointer," would be better) would take much work, much thought and collaboration. But two things must be made plain from the start.
First, the VHMI is not meant to be "intuitive." VHMIs should be easy to teach (depending on what underlying work they are intended to accomplish), but there is no reason at all why their workings should be congruent with the typical guesses that a total novice might make. Secondly, the vocal command syntax should not be "natural." Just think, for a moment, how far mathematics would have progressed if all knowledge of it had to be expressed in ordinary speech. The best vocal command style would resemble the "Robot -- Attack!!" computer talk of 1950s Saturday morning Sci-Fi shows, which was correctly seen even back then as a reasonable way of verbally communicating with a machine. The VHMI is intended for power user applications such as CAD, which necessarily involve complex interactions of pointing and commanding.
In some applications, such as dictation, the AMP could be the "command channel" and the voice would be the source of data.
One could use both hands at once in order to have two AMPs, so that commands like "Line!" could take effect immediately. The "rectangles" in which the hands are moved could be rectangular solids for a 3-D effect. Two hands could operate in one space or in separate two- or three-dimensional spaces. For example, two hands could specify two points in different three-dimensional spaces in order to adjust input levels from six different audio sources. This might offer a user a great deal more power than, say, a collection of one-dimensional slider controls operated by a traditional mouse.
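The six-source audio example might look something like the following Python sketch. It assumes each hand's position has already been normalised so that every axis of its box runs from 0.0 to 1.0. The function name audio_levels and the source names are made up for illustration.

```python
def clamp01(value):
    """Keep a level within the 0.0..1.0 range."""
    return max(0.0, min(1.0, value))


def audio_levels(left_hand, right_hand,
                 sources=("src1", "src2", "src3", "src4", "src5", "src6")):
    """Turn two 3-D hand positions into six input levels.

    Each hand is an (x, y, z) tuple in its own box; the left hand
    drives the first three sources and the right hand the last three.
    """
    coords = (*left_hand, *right_hand)
    return {name: clamp01(c) for name, c in zip(sources, coords)}
```

For instance, audio_levels((0.2, 0.5, 1.3), (-0.1, 0.0, 0.75)) sets the third source to full level and the fourth to zero, since positions outside a box are clamped to its edges. One hand movement changes three levels at once, which a single slider cannot do.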
I don't see the VHMI idea replacing GUIs anytime soon in consumer or standard office applications. But for high-end power user systems in some application areas, it could be a breakthrough.
Copyright © 2002. Permission is granted to visitors to this site to save copies of this essay on their hard drives for personal use and also to print copies of this essay for personal use.
If you wish to link to this article, try copying and pasting:
<a href="http://m3peeps.org/vhmi.htm">Voice/Hand Motion Interface</a>