Summary
Mahoney and Fromherz identify three requirements that they fell any sketch recognition technology should possess. First, it should cope with ambiguity in the sketch. This includes variance due to sloppy drawings, difficulties in finding correct articulation in the drawings (corner finding), and noisy drawings (stuff in the background, etc). Second, the recognition process should be efficient and provide quick results. This means we must have methods to bound the normally exponential state-space into something that we can search in a tractable method. Third, the method must be extensible to new domains. It can't rely heavily on domain-specific knowledge.
Their method is to construct a graph of the segments in the image. Using image processing techniques (not stroke inputs) they detect segment endpoints and create nodes. The line the endpoints belong to is called a bond edge. Links between coincident endpoints are called links. The graph is put through a series of transformations that provide some variability in the possible interpretations of the structure. These include linking nodes that are close to each other (gap closure), linking around small segments if they are close to each other (removing corners, spurious segment jumping), adding nodes and links where lines intersect (adding corners to junctions), and combining corners where the segments are smooth (continuity tracing). Each of these methods has a modifiable parameter.
Subgraph matching is used to determine the best fits for portions of the graph. Shapes that should appear in the graph, and that are looked for by the algorithm, are described in a constraint language what defines the different shapes and how they are linked together. It also provides a criterion that the algorithm should try to minimize (sets of constraints, like LADDER's method of minimizing the ideal constraints). For instance, they say that a head should be half as long as a stick figure's torso, etc. They limit some of the search space using a few a priori assumptions and hints, which they call salience, anchoring, grouping, and model bias.
Discussion
They state it themselves, but hand-tuning the parameters for the perceptual organization methods for elaborating/rectifying the graph is bad. This completely violates their call for recognitions processes that are applicable across domains. They undermine their approach, set up on this idealistic claim based on their "three concerns", when they completely trash one of those concerns. To their benefit they do state the need to learn these or estimate them automatically.
In the worst case subgraph matching requires exponential time. They say this is not usually the case for "most practical problems." Defined how? Is something practical only if it conforms to the simple case of stick figures? What about something more general? Even if this isn't the case most of the time, if something does take exponential time (which you cannot guarantee that it won't), you've blown another tenet of your argument (the part about interactive recognition). There are so many a priori things they use to reduce the running time that I doubt the ability of this method to generalize to a multi-purpose recognizer. It seems like it has promise, but only at the risk of exponential running time.
Again, no recognition results. How obnoxious. I guess you can just toss tenet 1 out the door too, since you can't prove that your method does any good at handling ambiguity.
This seems like a good method to use with Sezgin for corner finding/reconciliation. I wonder if their results would have improved if they had stroke data available to them instead of simple image processing.
No comments:
Post a Comment