Monday, September 10, 2007

MARQS

Brandon Paulson and Tracy Hammond, MARQS: Retrieving Sketches Using Domain- and Style-Independent Features Learned from a Single Example Using a Dual-Classifier

Summary



Paulson and Hammond implement a system that can search a database of sketches and rank matches with a high degree of accuracy. The matching is performed with a new dual-classifier algorithm and new sketch features that allow for unconstrained sketches (ones that may differ in number of strokes, stroke ordering, scale, or orientation). The system can be trained on a single example (the first example of a sketch class inserted into the database) and can learn over time from successful queries.

The four features used by the authors are the aspect ratio of the bounding box, the pixel density within the bounding box, the average curvature of the strokes in the sketch, and the number of perceived corners in the sketch. Before the features are computed, each sketch is rotated so that its major axis (defined as the line between the pair of points farthest from each other) lies horizontal.
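To make the features concrete, here is a rough sketch of how they might be computed, assuming a sketch is given as a flat list of (x, y) points. The function names, the point-count stand-in for pixel density, and the angle-threshold corner heuristic are my own simplifications, not the authors' exact formulations (the paper, for instance, averages curvature per stroke rather than over a flattened point list).

    import math

    def rotate_to_major_axis(points):
        """Rotate points so the major axis (the line between the two points
        farthest apart) lies horizontal, as described above."""
        # Brute-force search for the farthest pair (fine for sketch-sized point sets).
        best_dist, p_best, q_best = 0.0, points[0], points[0]
        for i, p in enumerate(points):
            for q in points[i + 1:]:
                d = math.hypot(q[0] - p[0], q[1] - p[1])
                if d > best_dist:
                    best_dist, p_best, q_best = d, p, q
        angle = math.atan2(q_best[1] - p_best[1], q_best[0] - p_best[0])
        cos_a, sin_a = math.cos(-angle), math.sin(-angle)
        return [(x * cos_a - y * sin_a, x * sin_a + y * cos_a) for x, y in points]

    def marqs_style_features(points, corner_threshold=math.radians(45)):
        """Return the four features for a sketch given as a flat list of (x, y) points."""
        pts = rotate_to_major_axis(points)
        xs, ys = [p[0] for p in pts], [p[1] for p in pts]
        width = (max(xs) - min(xs)) or 1e-6
        height = (max(ys) - min(ys)) or 1e-6

        aspect_ratio = height / width          # 1. bounding-box aspect ratio
        density = len(pts) / (width * height)  # 2. point count as a crude pixel-density stand-in

        # 3. Average curvature: mean absolute turning angle between consecutive segments.
        turns = []
        for a, b, c in zip(pts, pts[1:], pts[2:]):
            t1 = math.atan2(b[1] - a[1], b[0] - a[0])
            t2 = math.atan2(c[1] - b[1], c[0] - b[0])
            turns.append(abs(math.atan2(math.sin(t2 - t1), math.cos(t2 - t1))))
        avg_curvature = sum(turns) / len(turns) if turns else 0.0

        # 4. Perceived corners: turning angles above a threshold (a crude stand-in
        #    for a real corner finder).
        corners = sum(1 for t in turns if t > corner_threshold)

        return [aspect_ratio, density, avg_curvature, corners]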

When only one example of a given gesture class exists in the database, classification uses a 1-nearest-neighbor approach: the features of the query sketch are compared to the features of every sketch in the database, and the database sketch with the smallest difference in feature values is selected as the match for that query. Once more sketches have been added for that gesture class, the algorithm switches to a linear classifier (similar to Rubine's method) for greater accuracy and speed.
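A toy version of the dual-classifier switch might look like the following. The class and method names are my own, and a nearest-mean comparison stands in for the Rubine-style linear classifier (the real classifier learns weights rather than simply averaging features).

    class DualClassifier:
        """Toy illustration of the dual-classifier switch: 1-NN while a class has
        a single stored example, a per-class model once it has more.  A nearest-mean
        comparison stands in for the Rubine-style linear classifier."""

        def __init__(self, min_examples=2):
            self.min_examples = min_examples
            self.examples = {}                  # label -> list of feature vectors

        def add(self, label, features):
            self.examples.setdefault(label, []).append(features)

        def rank(self, query):
            """Return class labels ranked from best to worst match for the query."""
            scores = []
            for label, examples in self.examples.items():
                if len(examples) < self.min_examples:
                    # Only one example stored: compare against it directly (1-NN).
                    score = min(self._distance(query, f) for f in examples)
                else:
                    # Several examples: compare against the class mean instead of
                    # every stored example (linear-classifier stand-in).
                    mean = [sum(col) / len(col) for col in zip(*examples)]
                    score = self._distance(query, mean)
                scores.append((score, label))
            return [label for _, label in sorted(scores)]

        @staticmethod
        def _distance(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

The nearest-mean stand-in is only meant to show the control flow: one example per class means a direct comparison against that example, while more examples mean a cheaper per-class model is used instead.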

In an experiment, the authors had 10 users each create 10 examples of a sketch of their own design. The first example is used as the initial key inserted into the database; the remaining 9 are used as queries against it. Additionally, the authors added 10 examples each of 5 sketches of their own to test the effects of orientation and scale more precisely. The experiment was run 10 times, with the ordering of the gestures used as insertions/queries randomized. The average rank of the correct gesture for a query was 1.51: 70% of the time the right answer was the top choice, 87.6% of the time it was in the top two, 95.5% in the top three, and 98% in the top four.

Discussion



The authors mentioned that using the single classifier to check the first query of a gesture class takes a long time: O(nf), where n is the number of items in the database and f is the number of features. They mentioned using some sort of generalization of the features across all gestures in a particular class to speed things up, instead of comparing against every example within the class, but at the possible expense of accuracy. I wonder if it would be possible to use some sort of database indexing scheme based on the various features to get a quick lookup of matches. One could even use the generalized features to find the class of sketches that looks most like the query, and then classify against the sketches within that class for a more specific match.
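As a rough illustration of that two-stage idea, something like the following could use per-class summary features to prune candidates before the expensive per-example comparison. This is purely my own speculation, not part of MARQS, and the function and parameter names are hypothetical.

    def two_stage_query(query, class_means, examples_by_class, top_k=3):
        """Hypothetical two-stage lookup: use generalized (mean) class features
        to keep only the top_k most promising classes, then run the per-example
        comparison inside those classes.  `class_means` maps label -> mean
        feature vector; `examples_by_class` maps label -> list of feature vectors."""
        def distance(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

        # Stage 1: coarse filter on class-level summaries.
        candidates = sorted(class_means,
                            key=lambda c: distance(query, class_means[c]))[:top_k]

        # Stage 2: exact ranking restricted to the surviving classes.
        results = []
        for label in candidates:
            for feats in examples_by_class[label]:
                results.append((distance(query, feats), label))
        return [label for _, label in sorted(results)]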

It seems like orientation may be important for distinguishing between many gestures; I might not want a sad face to be classified the same as a happy face. The authors mentioned transitioning to a grid-based approach if this were the case. I wonder if there is some sort of thresholding that could be applied so that large changes in orientation count as differences while small changes are ignored. This would allow for a bit of wobble in the user's drawing but still treat large differences, most likely made on purpose, separately. Of course, this seems impossible given the current method of rotating the image to make its major axis horizontal, unless the orientation feature were captured before the rotation.
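One way to express that thresholding idea, purely as speculation on my part: record the major-axis angle before the normalizing rotation and add a penalty term only when the angular difference is large. The function below is a hypothetical sketch, not anything from the paper.

    import math

    def orientation_penalty(query_angle, stored_angle,
                            threshold=math.radians(30), weight=1.0):
        """Speculative tweak: keep the pre-rotation major-axis angle as an extra
        feature and only penalize orientation differences larger than `threshold`,
        so a little wobble is ignored but a deliberate flip still counts."""
        diff = query_angle - stored_angle
        diff = abs(math.atan2(math.sin(diff), math.cos(diff)))  # wrap to [0, pi]
        return weight * diff if diff > threshold else 0.0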

It also seems like a very difficult problem to extract images from an electronic journal, separating them from the other objects (other sketches, text, ...) that surround them on the digital page. In the future work this was referred to as perceptual grouping, and it seems like a very difficult task. I can imagine scenarios where I draw a particular molecule in my chemistry e-notes and then annotate it with information like bond energies. When I search for it later, I don't include the annotations, just the shape of the molecule. Is it possible to extract such information? Perhaps by using some sort of layering mechanism where the strokes are stored as layers on top of one another, so that the text annotations can be separated from the drawing of the molecule. Very difficult, indeed...

2 comments:

Brian David Eoff said...

Searching this type of heterogeneous data is difficult, but it depends on what you want search to give you. If you want it to take you to a place in the e-journal, then your annotation example doesn't really hold (once you followed the search result you would find the annotations). The question becomes: how do you parse the various items on the page? How are the elements on the page clustered? What is the index built upon?

Grandmaster Mash said...

Orientation is tricky because there are many scenarios in which I would want orientation to be a feature and others in which I would not. For instance, in sketched molecules I would not want orientation to be a factor; yet with the smiley-face issue, subtle variances should be recognized. I think that overall a sketch will have enough small variances in orientation that it shouldn't be used as a feature. If the correct sketch appears in the top few results of the query, that's close enough.