Summary
Rubine developed a system that allows collections of highly accurate (97% accuracy) sketch recognizers to be built and trained quickly and easily. Strictly speaking, Rubine's methods do not apply to sketches as one might commonly think of them, but are limited to single-stroke gestures (drawn without lifting the "pen" from the "paper"). This limitation rules out complex strokes, forces some unnatural ways of drawing certain gestures (e.g., drawing the letter "x" without lifting the pen), and introduces a bit of a learning curve, but it simplifies the computations needed to recognize gestures immensely.
Built on top of his GRANDMA system, Rubine's gesture-based drawing program (GDP) lets one define gesture classes. A gesture class gives a gesture its meaning: for example, a stroke recognized as belonging to the "draw rectangle" gesture class triggers the operations that class defines. After defining a new gesture class, the user inputs several example strokes for that gesture, covering whatever variability in size and orientation the training examples should tolerate. Semantic meanings can be attached to the gestures so that operations are performed once they are recognized.
Rubine defines 13 features used to describe a stroke, including the initial angle, the size and dimensions of the bounding box around the stroke, the amount of rotation in the figure, and so on. Each feature is computed from the sample points that make up the "line" drawn by the stroke (each sample point contains an x and y coordinate along with a timestamp).
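A few of these features are easy to sketch from the sample points. The snippet below is a hedged illustration, not Rubine's exact definitions or numbering: it computes the cosine/sine of the initial angle, the bounding-box diagonal, the total stroke length, and the total signed rotation.

```python
import math

def features(points):
    """Compute a handful of Rubine-style features from a stroke's
    sample points.  Each point is (x, y, t).  The selection and
    numbering here are illustrative, not Rubine's exact 13."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    # Cosine/sine of the initial angle, taken between the first point
    # and a point slightly further along to smooth out jitter.
    dx0, dy0 = xs[2] - xs[0], ys[2] - ys[0]
    hyp = math.hypot(dx0, dy0) or 1.0
    f1, f2 = dx0 / hyp, dy0 / hyp
    # Length of the bounding-box diagonal -- a size measure.
    f3 = math.hypot(max(xs) - min(xs), max(ys) - min(ys))
    # Total stroke length: sum of distances between successive points.
    f4 = sum(math.hypot(xs[i + 1] - xs[i], ys[i + 1] - ys[i])
             for i in range(len(points) - 1))
    # Total rotation: sum of signed angle changes along the stroke.
    f5 = 0.0
    for i in range(1, len(points) - 1):
        ax, ay = xs[i] - xs[i - 1], ys[i] - ys[i - 1]
        bx, by = xs[i + 1] - xs[i], ys[i + 1] - ys[i]
        f5 += math.atan2(ax * by - ay * bx, ax * bx + ay * by)
    return [f1, f2, f3, f4, f5]
```

For a perfectly straight horizontal stroke, the initial-angle cosine comes out 1, the sine 0, and the total rotation 0, which matches intuition.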
The features feed a linear classifier that decides which gesture the features "look like." For each gesture class c, a weight is computed per feature, along with a constant weight for the class as a whole. A stroke's feature values are multiplied by the class's per-feature weights and summed together with the class's constant weight; the class yielding the highest score is the one the stroke most resembles, and the stroke is classified to that gesture. The weights are computed using the inverse of the covariance of the features within a class. Inverse covariance is high when covariance is low, so independent features are weighted more heavily than highly correlated ones. This makes sense: you don't want highly correlated features reinforcing one another; you want clear-cut decisions. When a stroke ambiguously resembles more than one class, statistical methods can be employed to reject classifications that fall below a certain probability of being correct.
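The scoring step above is simple enough to sketch directly. This is a minimal illustration of the evaluation side only (the trained weights are assumed given; the toy weights in the usage note are made up, not derived from any covariance computation):

```python
def classify(feature_vec, weights):
    """Rubine-style linear classification: for each class c, compute
    the score v_c = w_c0 + sum_i(w_ci * f_i), then pick the class
    with the highest score.  `weights` maps a class name to a pair
    (w_c0, [w_c1, ..., w_cF]) of constant weight and per-feature
    weights."""
    best_class, best_score = None, float("-inf")
    for name, (w0, w) in weights.items():
        score = w0 + sum(wi * fi for wi, fi in zip(w, feature_vec))
        if score > best_score:
            best_class, best_score = name, score
    return best_class, best_score
```

For example, with hypothetical weights `{"line": (0.0, [1.0, 0.0]), "circle": (0.0, [0.0, 1.0])}`, the feature vector `[0.9, 0.1]` scores highest for `"line"`.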
To improve on his method, Rubine suggests eager recognition (resolving a stroke's class as soon as it becomes unambiguous) and multi-finger recognition (systems that recognize two strokes at the same time), the latter as a possible adaptation to multi-stroke gestures.
Discussion
The gesture recognition system is incredibly simple on many fronts. It's simple in that one can set up new gesture classes with a few mouse clicks, simple in that only a couple dozen training examples are needed per class, simple in its computation of the various features, and extremely simple in its use of a linear classifier to do its work. However, it is also simple in the types of gestures it can interpret (only single-stroke, no segmented gestures) and the number of gesture classes it can handle. In spite of its simplicity, or perhaps because of it, GDP seems to fill a nice niche. For small sets of classes that can be drawn decently with one stroke (for example, the Latin alphabet), the method is extremely accurate. Obviously it can't be used for more complex things (differentiating between a single-stroke armchair and a single-stroke couch in a system for interior designers might be a bit ridiculous), but complex strokes are not its domain.
One thing I was wondering about was the eager recognition technique. Rubine mentions that a stroke is classified to its gesture class as soon as the classification becomes unambiguous. Without digging through the Internet to locate the referenced papers, I tried to think of ways he might do this. The easiest way would seem to be reusing the same probabilistic approach used to reject ambiguous classifications: keep track of a probability for each class, and when one class rises above some threshold (and the rest fall below it), make the decision right then. This seems fairly cheap in terms of complexity (just constant-time updates to quantities such as total rotation or total distance, and simple recomputations of the length from the start to the current end point).
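The thresholding idea above can be sketched concretely. This is my own guess at a mechanism, not Rubine's published eager-recognition algorithm: it converts the per-class linear-classifier scores into probabilities using the same softmax-style estimate Rubine uses for rejection, P(c) = 1 / Σ_j exp(v_j − v_c), and commits as soon as one class clears the threshold.

```python
import math

def eager_decision(scores, threshold=0.95):
    """Hypothetical eager-recognition step.  `scores` maps a class
    name to its current linear-classifier score v_c.  Returns the
    winning class if its estimated probability clears `threshold`,
    or None to signal "keep collecting points"."""
    probs = {c: 1.0 / sum(math.exp(vj - vc) for vj in scores.values())
             for c, vc in scores.items()}
    best = max(probs, key=probs.get)
    return best if probs[best] >= threshold else None
```

A wide score gap (say 5.0 vs. 0.0) commits immediately, while a narrow one (1.0 vs. 0.0) returns `None` and waits for more of the stroke.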
Also, to get free training examples, I wonder whether Rubine adds successfully classified strokes to the training set (requiring a fresh computation of the mean feature values, covariance matrices, and weights). This would give free data that the user hasn't "undone" and might help the system learn a particular user's style of drawing the strokes (over time the user's subtle nuances would dominate the feature means and allow the recognition to evolve).
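As a sketch of that idea (again hypothetical, not something the paper describes doing), folding a classified stroke into a class's running feature mean is a constant-time update; it is the covariance matrices and weights that would then need recomputing, which is why one might batch such updates.

```python
def fold_in_example(mean, count, new_features):
    """Fold a successfully classified stroke's feature vector into a
    class's running mean using the standard incremental-mean update:
    m' = m + (f - m) / n.  Returns the updated mean and count.  The
    class's covariance matrix and classifier weights would still
    need to be recomputed afterwards."""
    n = count + 1
    new_mean = [m + (f - m) / n for m, f in zip(mean, new_features)]
    return new_mean, n
```

For instance, folding the feature vector `[3.0, 4.0]` into a single-example mean of `[1.0, 2.0]` yields `[2.0, 3.0]`, the average of the two.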
And in final contemplation…Different strokes for different folks (kill me now, but I couldn’t resist the temptation).