Thursday, August 30, 2007

Specifying Gestures by Example

"Specifying Gestures by Example," Dean Rubine, 1991

Summary



Rubine developed an application that allows collections of highly accurate (97% accuracy) sketch recognizers to be built and trained quickly and easily. More accurately, Rubine’s methods do not apply to sketches as one might commonly think of them, but are limited to single-stroke gestures, drawn without lifting the “pen” from the “paper.” This limitation prevents the use of complex strokes, forcing some unnatural methods for drawing certain gestures (e.g., drawing the letter “x” without lifting the pen) and introducing a bit of a learning curve, but it simplifies the computations needed to recognize gestures immensely.

Built on top of his GRANDMA system, Rubine’s gesture-based drawing program (GDP) allows one to define gesture classes. A gesture class gives meaning to the gesture: for example, a gesture determined to belong to the “draw rectangle” gesture class will perform the operations defined by that class. After defining a new gesture class, the user inputs several example strokes for that gesture, making sure the training examples cover whatever variability in size and orientation should be tolerated. Semantic meanings can be defined for the gestures so operations can be performed once they are recognized.

Rubine defines 13 features that are used to describe a stroke, including data such as the initial angle, the size and dimensions of the bounding box around the stroke, the amount of rotation in the figure, and so on, each computed from the sample points that make up the “line” drawn by the stroke (each sample point contains an x and y coordinate along with a timestamp).
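To make that concrete, here is a minimal Python sketch of how a few of those features could be computed from (x, y, timestamp) samples. The function name and the particular subset of features shown are my own choices; the paper defines all thirteen precisely.

```python
import math

def rubine_features(points):
    """Compute a subset of Rubine-style stroke features from (x, y, t) samples.

    Only a handful of the paper's 13 features are sketched here: initial
    angle (as cosine/sine), bounding-box diagonal length and angle,
    start-to-end distance, total stroke length, and total rotation.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]

    # f1, f2: cosine and sine of the initial angle (using a point a couple of
    # samples in, to smooth over digitizer noise).
    p2 = points[min(2, len(points) - 1)]
    dx0, dy0 = p2[0] - points[0][0], p2[1] - points[0][1]
    d0 = math.hypot(dx0, dy0) or 1.0
    f1, f2 = dx0 / d0, dy0 / d0

    # f3, f4: length and angle of the bounding-box diagonal.
    w, h = max(xs) - min(xs), max(ys) - min(ys)
    f3 = math.hypot(w, h)
    f4 = math.atan2(h, w)

    # f5: distance between the first and last sample points.
    f5 = math.hypot(points[-1][0] - points[0][0], points[-1][1] - points[0][1])

    # f8: total stroke length; f9: total angle traversed (signed rotation).
    f8, f9 = 0.0, 0.0
    prev_angle = None
    for a, b in zip(points, points[1:]):
        dx, dy = b[0] - a[0], b[1] - a[1]
        f8 += math.hypot(dx, dy)
        angle = math.atan2(dy, dx)
        if prev_angle is not None:
            turn = angle - prev_angle
            # Wrap the turn into (-pi, pi] so rotation accumulates sensibly.
            turn = math.atan2(math.sin(turn), math.cos(turn))
            f9 += turn
        prev_angle = angle

    return [f1, f2, f3, f4, f5, f8, f9]
```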

The features are used in a linear classifier to decide what gesture the features “look like.” Given a gesture class c, a weight is computed for each feature in that class. The feature values for the stroke are multiplied by the weights for the class and summed, together with a bias weight for the class in general. The class that gives the highest sum is the class the stroke most looks like, and the stroke is classified as that gesture. The weights are computed using the inverse of a covariance matrix of the features estimated from the training examples. Inverse covariance is high when covariance is low, meaning independent features are weighted more heavily than highly correlated features. This makes sense because you don’t want highly correlated features to reinforce one another; you want clear-cut decisions. In the case that a stroke ambiguously looks like it could belong to more than one class, statistical methods can be employed to reject classifications that fall below a certain probability of being correct.
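Here is a rough sketch of how that training and scoring could look in Python with NumPy, assuming feature vectors like the ones from the sketch above. The pooled-covariance estimate and the pseudo-inverse safeguard are my own simplifications, not code from the paper.

```python
import numpy as np

def train_linear_classifier(examples_by_class):
    """Train per-class weights in the style of Rubine's linear classifier.

    examples_by_class: dict mapping class name -> list of feature vectors.
    Returns (weights, biases) so that a stroke with features f scores
        v_c = biases[c] + weights[c] . f
    and is classified to the class with the highest score.
    """
    classes = list(examples_by_class)
    means = {c: np.mean(examples_by_class[c], axis=0) for c in classes}

    # Pooled covariance: sum of per-class scatter, normalized across classes.
    dim = len(next(iter(means.values())))
    pooled = np.zeros((dim, dim))
    total = 0
    for c in classes:
        X = np.asarray(examples_by_class[c], dtype=float)
        diffs = X - means[c]
        pooled += diffs.T @ diffs
        total += len(X) - 1
    pooled /= max(total, 1)

    inv = np.linalg.pinv(pooled)  # pseudo-inverse guards against a singular matrix
    weights, biases = {}, {}
    for c in classes:
        w = inv @ means[c]
        weights[c] = w
        biases[c] = -0.5 * float(w @ means[c])
    return weights, biases


def classify(features, weights, biases):
    """Return the class whose linear score is largest."""
    f = np.asarray(features, dtype=float)
    return max(weights, key=lambda c: biases[c] + float(weights[c] @ f))
```

Using a single covariance matrix shared across all classes is what keeps the decision boundaries linear; giving each class its own covariance would turn this into a quadratic classifier instead.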

To improve on his method, Rubine suggests eager recognition (resolving a stroke’s class as soon as it becomes unambiguous) and multi-finger recognition (systems that recognize two strokes drawn at the same time), the latter as a possible adaptation to multi-stroke gestures.

Discussion



The gesture recognition system is incredibly simple on many fronts. It’s simple in that one can set up new gesture classes with a few mouse clicks, simple in that only a couple dozen training examples are needed per class, simple in its computation of the various features, and extremely simple in its use of a linear classifier to do its work. However, it is also simple in the types of gestures it can interpret (only single-stroke, unsegmented gestures) and the number of gesture classes it can handle. In spite of its simplicity, or perhaps because of it, GDP seems to fill a nice niche. For simple sets of classes that can be drawn decently with one stroke (for example, the Latin alphabet), this method is extremely accurate. Obviously it can’t be used for more complex things (differentiating between a single-stroke armchair and a single-stroke couch in a system for interior designers might be a bit ridiculous), but its domain is not complex strokes.

One thing I was wondering about was the eager recognition technique. Rubine mentions that a stroke is classified to its gesture class as soon as the classification becomes unambiguous. Without digging through the Internet to locate the referenced papers, I was trying to think of different ways he might do this. It seems like the easiest way would be to use the same probabilistic approach used to reject ambiguous classifications: keep track of a probability for each class as the stroke is drawn, and as soon as one class gets above some threshold (and the rest stay below it), make the decision right then. It seems like you could do this fairly cheaply in terms of complexity (just constant-time updates to things such as total rotation or total stroke length, and a simple recomputation of the distance from the start point to the current end point).
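For what it’s worth, here is a quick sketch of that threshold idea, reusing the linear scores and a softmax-style probability estimate over them. This is purely my speculation, not the eager recognition algorithm Rubine actually references.

```python
import numpy as np

def eager_classify(points_so_far, weights, biases, threshold=0.95):
    """Speculative eager recognition: re-score the partial stroke after each
    new point and commit as soon as one class's estimated probability clears
    a threshold. Returns the committed class, or None while still ambiguous.
    """
    f = rubine_features(points_so_far)  # feature sketch from earlier
    scores = {c: biases[c] + float(np.dot(weights[c], f)) for c in weights}

    # Softmax over the linear scores as a rough probability estimate.
    m = max(scores.values())
    exps = {c: np.exp(v - m) for c, v in scores.items()}  # numerically stable
    z = sum(exps.values())
    probs = {c: e / z for c, e in exps.items()}

    best = max(probs, key=probs.get)
    return best if probs[best] >= threshold else None
```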

Also, to get free training examples, I wonder if Rubine adds successfully classified strokes into the training set (requiring a fresh computation of the mean feature values, covariance matrices, and weights). This would give free data that the user hasn’t “undone” and might help the system learn a particular user’s style of making the strokes (over time the user’s subtle nuances would dominate the feature means and allow the recognition to evolve).
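Something along these lines is what I have in mind; the helper name is made up and nothing like this appears in the paper.

```python
def absorb_example(examples_by_class, cls, features):
    """Speculative 'free training data' step: once a stroke has been
    classified and the user has not undone the resulting action, fold its
    features back into that class's examples and retrain, so the means,
    covariance, and weights slowly drift toward this user's style.
    """
    examples_by_class.setdefault(cls, []).append(list(features))
    # Full retraining on every stroke is the simple (and expensive) option;
    # batching updates or retraining offline would be cheaper.
    return train_linear_classifier(examples_by_class)
```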

And in final contemplation… Different strokes for different folks (kill me now, but I couldn’t resist the temptation).

4 comments:

Paul Taele said...

"Different strokes for different folks." Oh gawd, haha.

Concerning your comment about eager recognition, I incorrectly assumed it was simply a trivial matter. I guess it really isn't, heh. I suppose it was because I originally figured the system would keep comparing what was currently drawn to the example gestures, and then match the gesture when all but one gesture was rejected. Thinking back on that, it would be a very expensive procedure. The idea of a threshold you brought up would be a lot more logical and inexpensive. It goes to show (to me) that eager recognition may be trivial in one domain, but not so much in another.

rg said...

I think the most interesting thing about eager recognition is being able to move straight into another interaction. It allows for interesting things to be done to streamline the experience of interactions involving sketching.

Grandmaster Mash said...

Using drawn gestures as training data is both useful and dangerous. Assuming we know that the gesture the user drew was classified correctly, we want to ensure that the system becomes customized for a user without overfitting to their data. If we want to fit only to that user's style, then overfitting might not matter. But if it's a general gesture system that multiple people might use, I'd be cautious about keeping any drawn gestures in the training set.

Paul said...

Another thing to consider when attempting to add correctly classified strokes to the training set is the computational and storage cost associated with recomputing the covariance and weights. While we may be using fast computers with plenty of memory, most sketching devices have been hand-held devices with limited processing power and memory, and would be unable to update on the fly. However, if space were available to log strokes, the user might be able to specify a convenient time for an update, or do so during synchronization, using the host computer's greater power for the computations.