Tuesday, September 4, 2007

Visual Similarity of Pen Gestures

A. Chris Long, Jr., et al., Visual Similarity of Pen Gestures, 2000
A. Chris Long, Jr., et al., “Those Look Similar!” Issues in Automating Gesture Design Advice, 2001

Summary



Long et al. set out to determine what makes gestures similar or dissimilar, based on human perception. Two sets of experiments were conducted to determine this. In the first experiment, 20 people were asked to look at 14 gestures in all possible groups of 3, picking the gesture from each triad that was the most dissimilar. The gestures covered a wide variety of shapes and orientations. Multi-dimensional scaling (MDS) was applied to the resulting judgments, and geometric properties of the gestures were used to describe what made different gestures similar or dissimilar. Five MDS dimensions were retained, which were then mapped, using linear regression, to different features (i.e. Rubine features) of the gestures. The derived model correlated 0.74 with the experimental user perceptions of similarity.
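
To make that pipeline concrete, here is a minimal sketch (my own, not code from either paper) of the triads-to-model flow: build a dissimilarity matrix from the triad judgments, embed it with MDS, and regress each resulting dimension onto gesture features. The data, the feature count, and the scikit-learn calls are stand-ins, not the authors’ actual setup.

# Hypothetical sketch: triad judgments -> dissimilarity matrix -> MDS -> regression.
import numpy as np
from sklearn.manifold import MDS
from sklearn.linear_model import LinearRegression

n_gestures = 14
rng = np.random.default_rng(0)

# Placeholder data: in the real experiment, dissim[i, j] would reflect how often
# gestures i and j were split apart in the triad judgments.
dissim = rng.random((n_gestures, n_gestures))
dissim = (dissim + dissim.T) / 2        # make it symmetric
np.fill_diagonal(dissim, 0.0)

# Placeholder geometric features (stand-ins for the Rubine-style features).
features = rng.random((n_gestures, 11))

# Embed the perceived dissimilarities in a five-dimensional space.
mds = MDS(n_components=5, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dissim)

# Fit each perceptual dimension as a linear combination of the features.
for dim in range(coords.shape[1]):
    model = LinearRegression().fit(features, coords[:, dim])
    print(f"dimension {dim}: R^2 = {model.score(features, coords[:, dim]):.2f}")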

The second trial was set up to test the predictive model of the first experiment, as well as to test how varying assorted features affected similarity (total absolute angle and aspect, length and area, and rotation and its related features). Experiments were performed as in the first trial: test subjects were shown triads and asked to choose the gesture that was most dissimilar from the other two. When the three sets of varying features were fit to a predictive model, the angle of the bounding box and the gesture’s alignment with the coordinate axes turned out to play the most significant roles in determining similarity. It also turned out that the model generated in the first trial outperformed the predictive power of the model built in this trial.

Some of the more interesting results were that most of the similarities between gestures could be explained by a handful of features (which accounted for three of the MDS dimensions). The remaining MDS dimensions required many more features to describe. The authors attributed the difficulty of determining gesture similarity to the complexity of the models, the limited amount of training data, and the fact that judgments of similarity vary subjectively from one person to another.

In the shorter paper, Long et al. use the similarity prediction models (trained on even more data) to create a tool that assists creators of gesture sets in designing gestures that are deemed dissimilar by people (so that gestures are easier to remember; the models perform poorly at this) and dissimilar by the computer (so that gestures can be accurately identified). The advice, as it is called, on gestures that are too similar to the computer is given after the user trains the new gesture class, as opposed to as soon as the first gesture example is drawn, because future strokes may alter the average features of the gesture class enough to make it sufficiently dissimilar to the computer. If, on the other hand, the analysis deems the gesture indistinguishable by humans, the advice is given immediately, because more examples won’t change its basic shape.
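
As a rough illustration of the kind of check such a tool could run once a class is trained, here is a hypothetical sketch; the weighted-feature-difference similarity form, the weights, and the threshold are my assumptions, not the authors’ actual model.

# Hypothetical advice check: compare a newly trained class's mean feature vector
# against the existing classes and warn when predicted perceptual similarity is high.
import numpy as np

def predict_similarity(f_a: np.ndarray, f_b: np.ndarray, weights: np.ndarray) -> float:
    """Assumed linear form: negated weighted absolute feature differences,
    so larger values mean 'more similar'."""
    return -float(weights @ np.abs(f_a - f_b))

def advise_on_new_class(new_examples: np.ndarray,
                        existing_means: dict[str, np.ndarray],
                        weights: np.ndarray,
                        threshold: float = -1.0) -> list[str]:
    """Return names of existing classes the new class might be confused with."""
    new_mean = new_examples.mean(axis=0)   # average the features over all examples
    return [name for name, mean in existing_means.items()
            if predict_similarity(new_mean, mean, weights) > threshold]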

Discussion



First, one of the equations in the long Long paper is wrong. The formula for the Minkowski distance metric should be (in LaTeX-ese):

d_{ij} = \left( \sum_{a=1}^{r} \left| x_{ia} - x_{ja} \right|^{p} \right)^{1/p}

The paper is missing the (...)^{1/p} around the summation. Otherwise, Euclidean distance would be missing its square root.
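
As a quick numerical sanity check of the corrected formula (my own snippet, not from the paper): at p = 2 the outer 1/p exponent supplies exactly the square root that makes it Euclidean distance.

# Minkowski distance; p = 2 gives Euclidean, p = 1 gives Manhattan.
import numpy as np

def minkowski(x: np.ndarray, y: np.ndarray, p: float) -> float:
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

x = np.array([0.0, 3.0])
y = np.array([4.0, 0.0])
print(minkowski(x, y, 2))   # 5.0, the familiar 3-4-5 Euclidean distance
print(minkowski(x, y, 1))   # 7.0, Manhattan distance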

I thought it was interesting that while the log of the aspect is monotonically related to the actual aspect, the log of the aspect outperforms the plain aspect in determining similarity. This actually boggled me, as I’m used to treating things like log likelihoods and likelihoods like they’re basically equivalent (i.e. maximizing one obviously maximizes the other). So why don’t similarities in aspect land translate into similarities in log-aspect land?
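
A small worked example (in LaTeX-ese again) of why distances on the two scales need not agree:

| \log 2 - \log 1 | = | \log 20 - \log 10 | = \log 2, \quad \text{but} \quad | 2 - 1 | = 1 \neq 10 = | 20 - 10 |

A monotone map like the log preserves the ordering of individual aspect values, but it rescales the pairwise differences that the similarity model is actually fit to, so the two versions of the feature can weight gesture pairs very differently.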

It annoys me when people whine about not having enough training data to model all the connections between similarities, etc. Isn’t that the point? If you had an infinite amount of training data, all these fancy algorithms you use would be pretty worthless. All you’d have to do is look at all your data and pick the right answer. It’s like having an Oracle to solve A_TM (the Turing machine acceptance problem). If you have an Oracle, the problem just seems to disappear. The point of this exercise is that you don’t have enough data. You will never have enough data. Even if you had enough data, you’d need some magical way to handle it all in an efficient manner to extract any information out of it in a useful amount of time. Your methods and algorithms should deal with this shortcoming in a way that is appropriate to the domain.

Other than that rant, it was neat to see how the authors used MDS to tie the psychological and very subjective portion of the experiment—determining gesture similarity—into something mathematically well defined and robust—a linear model fit with regression. I’m surprised that the models were able to do better than random at predicting which gestures were similar, given the complexity of the domain and the nuances in human judgment (where by nuances I mean silly fickleness). It was even more interesting to see that the majority of differences from one gesture to another could be accounted for by modeling only a small subset of the features.

And lastly, for my pithy quip:
One of these things is not like the others...One of these things does not belong...

2 comments:

Paul Taele said...

When my classmates and I were doing our projects for undergrad neural networks, we always had to deal with training data. Whenever our networks had low success rates, my classmates would usually complain to the prof that our networks were performing so badly (no one did better than 70%), because we didn't have enough training data. The following day, my prof did the assignment himself and showed the class how his network achieved a 95% success rate. From that point on, my class never complained about the lack of training data.

Your Oracle analogy in the discussion comments reminded me of that time. The difference is that back then I just dogmatically accepted that training data isn't everything. Your post actually explained why that was the case. It hit on exactly why those complex algorithms exist in the first place. Had my class known that before, we wouldn't have had to learn the hard way through humiliation. :P

Your Oracle analogy may have been a rant to you, but I believe this would be an invaluable concept to future novice AI students. :P I shall end my own rant...now.

Miqe said...

I agree whole-heartedly. There will never be enough data, so we need to find the best way to deal with what we've got, and we should learn to expect the worst when dealing with Sketch Recognition, which would be little to no training data (though I don't have a clue how "none" would be possible). There's no point in expecting the training data to be perfect or fulfilling, because it will never happen in real life.