Tuesday, October 30, 2007

Oltmans PhD Thesis

Oltmans, Michael. "Envisioning Sketch Recognition: A Local Feature Based Approach to Recognizing Informal Sketches." MIT, May 2007. PhD Thesis.

Summary



Oltmans seeks a method of recognizing hand-drawn shapes using an approach that harnesses visual recognition rather than stroke-feature-based approaches (like Rubine's or Sezgin's). He analyzes individually drawn, isolated shapes, and also automatically extracts and labels shapes from the domain of hand-drawn circuit diagrams.

Stroke preprocessing includes rescaling the shape and re-sampling the points in the stroke, interpolating new ones if needed. He then marches along the stroke and places a radial histogram ("bullseye") every five pixels, each shaped like a dart board with a certain number of wedges and log-spaced rings. Bullseyes are rotated with the trajectory of the stroke to make them rotationally invariant. The bullseyes have a third dimension, stroke direction through each bin, binned from 0 to 180 degrees in 45-degree increments. Each bullseye counts the amount of ink falling in each bin (location relative to the center plus direction), producing a frequency vector.
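The descriptor can be sketched roughly like this (a minimal reconstruction, not the thesis's code; the 32-pixel radius, bin counts, and interpolation scheme are my assumptions):

```python
import numpy as np

def resample(points, spacing=1.0):
    """Resample a stroke to evenly spaced points by linear interpolation."""
    pts = np.asarray(points, dtype=float)
    seg = np.linalg.norm(np.diff(pts, axis=0), axis=1)
    dist = np.concatenate([[0.0], np.cumsum(seg)])
    targets = np.arange(0.0, dist[-1], spacing)
    return np.column_stack([np.interp(targets, dist, pts[:, i]) for i in range(2)])

def bullseye(points, center_idx, radius=32.0, n_wedges=8, n_rings=4, n_dirs=4):
    """Histogram resampled ink points into (wedge, log-ring, direction) bins,
    with the whole bullseye rotated to the local stroke trajectory."""
    pts = np.asarray(points, dtype=float)
    c = pts[center_idx]
    # Local trajectory direction at the center, for rotation invariance.
    nxt, prv = pts[min(center_idx + 1, len(pts) - 1)], pts[max(center_idx - 1, 0)]
    traj = np.arctan2(nxt[1] - prv[1], nxt[0] - prv[0])
    # Direction of each stroke segment.
    dirs = np.arctan2(*np.diff(pts, axis=0)[:, ::-1].T)
    hist = np.zeros((n_wedges, n_rings, n_dirs))
    for i, p in enumerate(pts[:-1]):
        r = np.linalg.norm(p - c)
        if r == 0 or r > radius:
            continue
        # Angular wedge, measured relative to the trajectory.
        ang = (np.arctan2(p[1] - c[1], p[0] - c[0]) - traj) % (2 * np.pi)
        wedge = int(ang / (2 * np.pi) * n_wedges)
        # Log-spaced rings: inner rings are narrower than outer ones.
        ring = min(int(np.log1p(r) / np.log1p(radius) * n_rings), n_rings - 1)
        # Stroke direction folded into 0-180 degrees, in 45-degree increments.
        d = (dirs[i] - traj) % np.pi
        dbin = min(int(d / np.pi * n_dirs), n_dirs - 1)
        hist[wedge, ring, dbin] += 1
    return hist.ravel()
```

Placing one such bullseye every five resampled pixels yields the bag of part vectors for a stroke.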

The frequency vectors are compared against a codebook of representative parts built by training on pre-labeled shapes. The labeled shapes' parts are clustered (using QT clustering), and the mean vector of each cluster is put into the codebook. A shape to be classified has each of its constituent bullseye vectors (its parts) compared to every entry in the codebook, giving a measure of distance (normalized sum of squared differences between vector elements). The minimum distance from the parts to each codebook entry gives a match vector. The match vectors are classified with a support vector machine (one-vs-one strategy) to assign each example its correct label.
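The match-vector step reduces a variable number of parts to one fixed-length feature vector, one entry per codebook word. A hedged sketch (the length normalization is my reading of "normalized sum of squared differences"):

```python
import numpy as np

def match_vector(part_vectors, codebook):
    """For each codebook entry, the minimum normalized squared distance
    from any of the example's parts: one feature per codebook word."""
    parts = np.asarray(part_vectors, dtype=float)
    book = np.asarray(codebook, dtype=float)
    # Pairwise squared distances (parts x codebook), normalized by vector length.
    d = ((parts[:, None, :] - book[None, :, :]) ** 2).sum(axis=2) / book.shape[1]
    # Best match of any part to each codebook word.
    return d.min(axis=0)
```

Because every example maps to a vector of the same length regardless of how many parts it has, the result can be handed to any off-the-shelf multiclass classifier, e.g. a one-vs-one SVM.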

To extract individual shapes from a full diagram, Oltmans marches along the strokes and creates windows of multiple, fixed sizes (to account for different scaling of the shapes). The ink in each window is classified, and if it is close enough to a given shape, it is put in the list of candidate shape regions. The candidate regions are clustered using EM, with large clusters being split into two regions and clustered again. The cluster representatives form the list of final candidate shapes, which are sent back through the classifier to obtain a shape label.
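The candidate-generation stage might look something like the following (the window sizes and step are invented for illustration; the thesis's actual scales will differ):

```python
import numpy as np

def candidate_windows(points, sizes=(40, 80, 120), step=5):
    """Slide square windows of several fixed sizes along the stroke points
    (to account for different shape scales) and yield the ink inside each."""
    pts = np.asarray(points, dtype=float)
    for size in sizes:
        half = size / 2.0
        for center in pts[::step]:
            inside = np.all(np.abs(pts - center) <= half, axis=1)
            yield center, size, pts[inside]
```

Each yielded window would then be classified; windows scoring close enough to some shape class become candidate regions, which are deduplicated by EM clustering as described above.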

To evaluate his method, Oltmans had 10 users draw circuit diagrams with a given set of shapes to use, but no instructions on how to draw them or lay the circuit out. The 109 resulting drawings were hand-labeled. Individual shapes extracted from the drawings by hand were classified with 89.5% accuracy. He also evaluated a separate isolated-shape dataset, HHreco, which had simpler shapes with little variance, and obtained 94.4% accuracy. When his method extracted shapes from the full circuit diagrams itself, 92.3% of the regions were extracted.

Discussion



There are a lot of things I have to ask since it's a large body of work (there's much more to pick apart in 90 pages than in 6). So I'll be brief and bullet-point things.


  1. Were any dimensionality reduction techniques employed, especially on the codebook, to try to figure out which parts were the most indicative of shape class? With so many clusterings and parts (every 5 pixels), it seems this might have been advantageous. It might also mean you could originally construct a full codebook (not limited to 1000 parts/100 clusters) and whittle it down from there.

  2. You seem to claim complete visual recognition and avoidance of style-based features, yet you use direction bins as features in the bullseyes. Does this really buy you anything, since a user can draw however she wishes? You're already orienting to the trajectory.

  3. When extracting features from circuit diagrams, you have a special WIRE class to catch your leftovers. How hard would this be to generalize? Would you just throw out anything that's not labeled with enough probability?

  4. In your results, you give confusion matrices with the recall for each shape class. It seems like the classes with more training examples had better recall. Did you find this to be strictly true (obviously it's true to some extent), or is it simply that some shapes are harder than others?

  5. Support vector machines are complex and very expensive, especially when you need O(n(n-1)/2) = O(n^2) of them (in the number of classes). Were any other classifiers tried? Something a little simpler?

  6. Did you experiment with the granularity of the bins, or was that the job of Belongie et al.? Did you try different spacings of the bullseyes (rather than every 5 pixels)?
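To make the count in point 5 concrete, a one-vs-one scheme trains one binary classifier per unordered pair of classes:

```python
def n_pairwise_classifiers(n_classes):
    """Number of binary SVMs a one-vs-one multiclass strategy trains:
    one per unordered pair of classes, n(n-1)/2."""
    return n_classes * (n_classes - 1) // 2
```

So even a modest seven-class circuit vocabulary already needs 21 separate SVMs.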
