Tuesday, October 30, 2007

Oltmans PhD Thesis

Oltmans, Michael. "Envisioning Sketch Recognition: A Local Feature Based Approach to Recognizing Informal Sketches." PhD thesis, MIT, May 2007.

Summary



Oltmans seeks a method of recognizing hand-drawn shapes using an approach that harnesses a type of visual recognition rather than feature-based approaches (like Rubine or Sezgin). He analyzes both individually drawn, isolated shapes and also automatically extracts and labels shapes from the domain of hand-drawn circuit diagrams.

Stroke preprocessing includes rescaling the shape and re-sampling the points in the stroke, interpolating new ones if needed. He then marches along the stroke and places a radial histogram ("bullseye") every five pixels, each shaped like a dart board with a certain number of wedges and log-spaced rings. Bullseyes are rotated with the trajectory of the stroke to make them rotationally invariant. The bullseyes have a third dimension, which is the stroke direction through each bin, split into 0-180 degrees in 45-degree increments. Each bullseye counts the number of ink pixels in each bin (location relative to the center plus direction), producing a frequency vector.
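To make the descriptor concrete, here is a rough sketch of how such a bullseye histogram might be computed; the wedge count, ring count, radius, and normalization below are my own guesses rather than Oltmans' actual parameters:

    import numpy as np

    def bullseye_descriptor(ink_points, ink_directions, center, tangent_angle,
                            n_wedges=8, n_rings=4, max_radius=50.0):
        """Count ink points in each (wedge, ring, direction) bin of a
        bullseye centered at `center` and rotated to `tangent_angle`.
        Returns a flattened, normalized frequency vector."""
        n_dir_bins = 4   # 0-180 degrees in 45-degree increments
        hist = np.zeros((n_wedges, n_rings, n_dir_bins))
        ring_edges = np.logspace(0, np.log10(max_radius), n_rings + 1)

        offsets = np.asarray(ink_points, dtype=float) - center
        radii = np.hypot(offsets[:, 0], offsets[:, 1])
        # Rotate by the local stroke tangent -> rotation invariance
        angles = (np.arctan2(offsets[:, 1], offsets[:, 0]) - tangent_angle) % (2 * np.pi)

        for r, a, d in zip(radii, angles, ink_directions):
            if r == 0 or r > max_radius:
                continue
            wedge = int(a / (2 * np.pi) * n_wedges) % n_wedges
            ring = min(max(np.searchsorted(ring_edges, r) - 1, 0), n_rings - 1)
            dir_bin = int((d % np.pi) / (np.pi / n_dir_bins)) % n_dir_bins
            hist[wedge, ring, dir_bin] += 1

        vec = hist.ravel()
        return vec / vec.sum() if vec.sum() > 0 else vec

A stroke yields one such vector per bullseye, since the centers are dropped every five pixels along the resampled ink.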

The frequency vectors are compared against a codebook of shapes created by training on pre-labeled shapes. The labeled shapes are clustered (using QT clustering), with the mean vector of each cluster being put into the codebook. An example shape to be classified has its constituent bullseye vectors (its parts) compared to every entry in the codebook, giving a measure of distance (normalized sum of squared differences between vector elements). The minimum distance from the parts to each codebook entry gives a match vector. The match vectors are classified using a support vector machine (one-vs-one strategy) to match an example to its correct label.
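A sketch of the match-vector and classification step as I read it, using scikit-learn's SVC (which happens to use a one-vs-one strategy internally); the distance normalization and all names are my assumptions, not Oltmans' code:

    import numpy as np
    from sklearn.svm import SVC

    def match_vector(part_vectors, codebook):
        """For each codebook entry, the distance to the closest part of
        the example.  part_vectors: (P, D), codebook: (C, D) -> (C,)."""
        diffs = part_vectors[:, None, :] - codebook[None, :, :]
        dists = (diffs ** 2).sum(axis=2) / part_vectors.shape[1]  # normalized SSD
        return dists.min(axis=0)

    # Training: one match vector per labeled example, then a multiclass SVM.
    # X_train = np.vstack([match_vector(p, codebook) for p in training_parts])
    # clf = SVC(kernel="rbf").fit(X_train, y_train)
    # label = clf.predict(match_vector(new_parts, codebook)[None, :])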

To extract individual shapes from a full diagram, Oltmans marches along the strokes and creates windows of multiple, fixed sizes (to account for different scaling of the shapes). The ink in each window is classified, and if it is close enough to a given shape, it is put in the list of candidate shape regions. The candidate regions are clustered using EM, with large clusters being split into two regions and clustered again. The cluster representatives form the list of final candidate shapes, which are sent back through the classifier to obtain a shape label.
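If it helps to make the extraction step concrete, here's a rough sketch of the marching-window idea under my own assumptions (the window sizes, confidence threshold, and use of scikit-learn's GaussianMixture as the EM step are all stand-ins, and the splitting of large clusters is omitted):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    WINDOW_SIZES = (40, 80, 120)   # hypothetical fixed window sizes, in pixels
    STEP = 5                       # march along the ink every 5 pixels

    def candidate_regions(ink_points, classify):
        """Slide multi-scale windows along the ink; keep the ones the
        classifier thinks look enough like a known shape."""
        candidates = []
        for center in ink_points[::STEP]:
            for size in WINDOW_SIZES:
                box = (center[0] - size / 2, center[1] - size / 2,
                       center[0] + size / 2, center[1] + size / 2)
                label, confidence = classify(box)          # user-supplied classifier
                if label != "WIRE" and confidence > 0.5:   # threshold is a guess
                    candidates.append(box)
        return np.array(candidates)

    def representative_regions(candidates, n_clusters):
        """Group overlapping candidate boxes with EM (a Gaussian mixture here)
        and return one representative region per cluster."""
        gmm = GaussianMixture(n_components=n_clusters).fit(candidates)
        return gmm.means_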

To evaluate his method, Oltmans had 10 users draw circuit diagrams with a given set of shapes to use, but no instructions on how to draw them or how to lay the circuit out. The 109 resulting drawings were hand-labeled. Individual shapes extracted from the drawings by hand were classified with 89.5% accuracy. He also evaluated a separate isolated-shape dataset, HHreco, which had simpler shapes with little variance, and obtained 94.4% accuracy. When his method extracted shapes from the full circuit diagrams, it recovered 92.3% of the labeled regions.

Discussion



There are a lot of things I could ask since it's a large body of work (a 90-page thesis gives you far more to pick apart than a 6-page paper). So I'll be brief and bullet-point things.


  1. Were any dimensionality reduction techniques employed, especially on the codebook, to try to figure out which parts were the most indicative of shape class? With so many clusterings and parts (one every 5 pixels), it seems this might have been advantageous. It might also mean you could originally construct a full codebook (not limited to 1000 parts/100 clusters) and whittle it down from there.

  2. You seem to claim complete visual recognition and avoidance of style-based features, yet you use direction bins as features in the bullseyes. Does this really buy you anything, since a user can draw however she wishes? You're already orienting to the trajectory.

  3. When extracting features from circuit diagrams, you have a special WIRE class to catch your leftovers. How hard would this be to generalize? Would you just throw out anything that's not labeled with enough probability?

  4. In your results, you give confusion matrices with the recall for each shape class. It seems like the classes with more training examples had better recall. Did you find this to be strictly true (obviously it's true to some extent), or is it simply that some shapes are harder than others?

  5. Support vector machines are complex and very expensive, especially when you need O(n(n-1)/2) = O(n^2) of them (in the number of classes). Were any other classifiers tried? Something a little simpler?

  6. Did you experiment with the granularity of the bins, or was that the job of Belongie et al.? Did you try different spacings of the bullseyes (rather than every 5 pixels)?

Thursday, October 25, 2007

Oltmans -- ASSISTANCE

Oltmans, Michael, and Randall Davis. "Naturally Conveyed Explanations of Device Behavior." PUI 2001.

Summary



Oltmans created a system called ASSISTANCE that can take a recognized sketch (recognized by the ASSIST program) and speech input (via IBM's ViaVoice program) and construct causal links for bodies in a mechanical engineering sketch. The example given in the paper is one of a Rube Goldberg machine that involves several bodies, springs, a pulley, and levers. The system is given force arrows that indicate direction and spoken relationships that give more cause/effect information.

The point of the system is that most current design systems, like CAD, require the user to put far too much thought and detail into early phases of the design. For instance, when trying to create a Rube Goldberg machine to crack an egg (the example above) that involves a spring, you don't want to have to specify things like the spring coefficient, the elasticity of the various bodies, etc. Instead, you just want to get a rough idea to see if things work. Using ASSISTANCE, you first sketch a diagram, which is interpreted in an off-line manner by another program, ASSIST. You then annotate the diagram with various multi-modal explanations of the behavior (not specific parameters) of the system, using drawings (arrows), speech, and pointing.

Given the explanation annotations, ASSISTANCE adds them to a rule-based system. The system can draw consistent conclusions from the set of behaviors, yielding a series of causal structures. These structures describe how things happen in the system (body 1 hits body 2 and causes it to move, etc.). Searching for various causal links can be time consuming, but the authors find it to be very quick most of the time.
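I don't know what ASSISTANCE's rule engine actually looks like, but the flavor of forward-chaining over cause/effect facts is roughly this (a toy stand-in with made-up predicates, not the paper's system):

    # A toy stand-in for rule-based causal reasoning -- not ASSISTANCE's
    # actual engine.  Facts are tuples; rules map one predicate to a new fact.
    def derive_causal_links(facts, rules):
        changed = True
        while changed:
            changed = False
            for premise, conclude in rules:
                for fact in list(facts):
                    if fact[0] == premise:
                        new_fact = conclude(fact)
                        if new_fact not in facts:
                            facts.add(new_fact)
                            changed = True
        return facts

    # "body 1 hits body 2" implies "body 2 moves"
    rules = [("hits", lambda f: ("moves", f[2]))]
    facts = {("hits", "ball", "lever")}
    print(derive_causal_links(facts, rules))  # adds ("moves", "lever")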

No experimental results were given as to the system's efficacy besides their own experiences.

Discussion



Speaking seems like a nice way to get more information to the design program. This is the first paper I've read (not necessarily the first written) where sketching and speaking are combined in a multi-modal interface. It seems like some of the issues that plague both fields (context, intention, etc.) could be resolved, with one mode of input helping to clarify the other. Generally, the more information we have (a user speaking about the sketch rather than just the static sketch), the better equipped we are to do a good job. However, the obvious problem is that both speech and sketch recognition are Hard Problems (TM) and can be very daunting on their own, let alone when combined. Luckily for Michael, he used existing software.

The authors say that their system tends not to require exponential time to search the rule-based system for a consistent set of causal links. However, the worst-case running time is indeed exponential. It seems like they're just getting lucky because their domain is very small and limited. How much would adding a richer vocabulary and more domains increase the complexity of the system? Obviously exponentially. Would additions mean that the truth maintenance system more often encountered exponential running times? Rule-based / knowledge-based deductive systems are neat, but they are extremely complex and in general are working to solve problems that are NP-hard, at least. Expecting to get good performance out of them, using anything other than an approximation, is foolhardy.

I wish there were results, at least a usability study where people rated their system. But alas, there were not. I really don't want to hear the authors' opinions about their own system, since they want to get published and aren't going to say anything like "Our system didn't work at all and was completely counter-intuitive." Not to say that it is, or that's how I'd feel if I tried it out. It's hyperbole.

Gross and Do -- the Napkin

Gross, Mark, and Ellen Yi-Luen Do. "Ambiguous Intentions: a Paper-like Interface for Creative Design." ACM UIST 1996.

Summary



Gross and Do seek to implement a system where the designer can sketch plans with a great deal of ambiguity and flexibility. This prevents the initial phase of the design from being too cumbersome and restrictive and saves possible interpretations and constraints for application later in the drawing process. This also relieves the system of some of its responsibilities: it is not forced to make accurate predictions for all drawn objects right away; some can be deferred until more contextual clues are available.

Their system recognizes drawn primitives (called glyphs: things like boxes, circles, lines, and points) using a simple templating approach. A stroke is analyzed and compared to stored example strokes, and it is classified as a certain type of glyph if it matches one of the templates closely enough. Matches are ranked against thresholds, with things like context helping to break close ties.
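My reading of the glyph matcher, reduced to a toy sketch: normalize the stroke to its bounding box, record the sequence of grid cells the pen passes through, and look that sequence up in a template table. The 3x3 grid, exact-match lookup, and names are my simplifications; the real system ranks fuzzy matches:

    def cell_sequence(points, grid=3):
        """Normalize a stroke to its bounding box and record the sequence
        of grid cells the pen passes through (consecutive repeats dropped)."""
        xs = [p[0] for p in points]
        ys = [p[1] for p in points]
        w = (max(xs) - min(xs)) or 1
        h = (max(ys) - min(ys)) or 1
        seq = []
        for x, y in points:
            col = min(int((x - min(xs)) / w * grid), grid - 1)
            row = min(int((y - min(ys)) / h * grid), grid - 1)
            cell = row * grid + col
            if not seq or seq[-1] != cell:
                seq.append(cell)
        return tuple(seq)

    def classify_glyph(points, templates):
        """templates: dict mapping a cell sequence to a glyph label."""
        return templates.get(cell_sequence(points), "unknown")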

Context is established by the user giving the system examples. For instance, to say that four little squares sitting around a large square means a dining room table, the user would draw the boxes and edit the constraints on the various shapes. However, it might be the case that in certain contexts, those 5 boxes were a table (in a room plan for a house, for instance), or it might mean they are houses situated around a pond (in a landscaping drawing). Thus, the users can define different contexts that the system can identify by looking for symbols specific to one context or another. Contexts are kept in chains from specific to general. So for instance, both the context chains for room plans and landscaping would contain the symbols of 5 boxes, with different meanings in each chain.

Discussion



The definition of the constraints/contextual clues requires user input. It also seems like these constraints might be very rigid. Especially when the user has to draw examples of the constraints and context clues, this seems like a very time consuming process. Coupled with the difficulty of the system generating constraint lists from the drawings (which we saw in LADDER), this seems somewhat inefficient. However, I suppose that's the difficulty of describing constraints with a fixed language. It's hard enough for one person to describe geometrical layouts to another person using any vocabulary of their choosing. Intent is very difficult to capture, especially with a limited grammar.

Their system for matching glyphs seems very brute-force and hacked together, in my opinion. I think it would have been nice to see some results on their recognition rates. One problem that I can see is their system of splitting the stroke's bounding box into a grid. What happens if you need more detail? Your system won't be able to add new shapes that need finer granularity in their paths. Or, if you do add more granularity, permuting the possible paths based on rotation and reflectivity becomes more difficult. Also, templating requires a lot of overhead. You have to save all the templates and make O(n) comparisons, which is not the case for other classification methods (like linear classifiers). I can see the strength in templating (can help avoid user- and domain-dependent features like Rubine and Long) but it feels like their version is hacked together. Maybe because they're architects.

Tuesday, October 16, 2007

Herot - Graphical Input Through Machine Recognition of Sketches

Herot, Christopher. "Graphical Input Through Machine Recognition of Sketches." SIGGRAPH 1976: 97-102.

Summary



Herot's paper, from the 1970s, describes a system that seeks to find a way of recognizing sketching apart from the semantics of context and domain (i.e. recognizing low-level primitives and combining them hierarchically), a way of using domain context to construct higher-order shapes, and finally a way to allow the user direct involvement in the recognition process to tune the system's capabilities to that user's style.

Herot's system, ancient by modern standards, consisted of a microcomputer, several types of tablets for sketching input, and even a storage tube! Input to the system was via a passive tablet, over which a large piece of paper was taped and drawn on using a modified graphite pencil. The pencil's eraser allowed drawing a "negative line" to remove points from the system. The pen's position was sampled at a constant rate determined by a user-programmable variable. The points were fed to the HUNCH system, which performed corner detection using speed and curvature data. HUNCH fit lines, and a separate routine, CURVIT, fit B-splines to strokes. Herot found, however, that setting the thresholds for recognizing lines and curves was difficult and tended to be user-dependent.
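For flavor, a minimal sketch of speed-plus-curvature corner finding in the spirit of HUNCH; the thresholds are placeholders, which is exactly the problem Herot ran into:

    import numpy as np

    def corner_indices(points, times, speed_frac=0.25, curv_frac=0.7):
        """Flag points where pen speed dips and curvature spikes.  Both
        threshold fractions are placeholders -- Herot found them user-dependent."""
        pts = np.asarray(points, dtype=float)
        t = np.asarray(times, dtype=float)
        d = np.gradient(pts, axis=0)                 # per-point displacement
        dt = np.maximum(np.gradient(t), 1e-6)
        speed = np.hypot(d[:, 0], d[:, 1]) / dt
        heading = np.unwrap(np.arctan2(d[:, 1], d[:, 0]))
        curvature = np.abs(np.gradient(heading))
        slow = speed < speed_frac * speed.mean()
        sharp = curvature > curv_frac * curvature.max()
        return np.where(slow & sharp)[0]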

Herot's system also performed endpoint combination, called latching. In the first iteration, this used a fixed radius and joined any points within that radius. This gave error-prone results when the drawing was at a smaller scale than the radius. He tried to take into consideration the speed of the strokes to measure the amount of user intention in the endpoints, but still had problems especially in incorrectly latching 3D figures drawn on the 2D tablet. He handles overtracing by trying to turn several lines into one.
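The fixed-radius latching he describes amounts to something like this sketch (the radius and names are mine), which makes it obvious why small-scale drawings get over-merged:

    # Fixed-radius endpoint latching (the first-iteration behavior); the radius
    # is arbitrary, which is why drawings smaller than it get mangled.
    def latch_endpoints(endpoints, radius=10.0):
        pts = [list(p) for p in endpoints]
        for i in range(len(pts)):
            for j in range(i + 1, len(pts)):
                dx, dy = pts[i][0] - pts[j][0], pts[i][1] - pts[j][1]
                if (dx * dx + dy * dy) ** 0.5 <= radius:
                    mid = [(pts[i][0] + pts[j][0]) / 2,
                           (pts[i][1] + pts[j][1]) / 2]
                    pts[i][:] = mid
                    pts[j][:] = mid
        return pts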

The second part of the system incorporates the context of the architectural domain to give more understanding to the sketch system. Basically, he seeks to build a bottom-up hierarchical recognizer that uses context to put things together. He also looks at top-down approaches, such as finding a "room" by looking for its "walls". He notes the need for some sort of ranking system, or something similar, in order to allay the effects of "erroneous or premature matches."

Finally, Herot constructs an interactive system where the user can help guide the recognition decisions and modify the system's operating parameters through feedback so it adjusts to the user's personal drawing style.

Discussion



Wow! Curvature and speed data used for corner detection in the 1970s. So I guess Sezgin's work wasn't all that groundbreaking then? Sure, he did it on modern computers and did a much better job of it, but is that an artifact of improved technologies or his mighty powers as a researcher? Seeing this reference to curvature data decades before Sezgin's paper, I'm tending to lean toward simple advances in computing power and the level of knowledge in the field. Not to dismiss Sezgin's mightiness, on any account. He did things I could never do. But even he stands on the shoulders of his predecessors.

A nugget of brilliance:
A human observer resolves these ambiguities through the application of personal knowledge and years of learning how to look at pictures.... Before a machine can effectively communicate with a human user about a sketch [sic] it must possess a similar body of knowledge and experience.

Exactly! People keep crying out and gnashing their teeth for systems that capture human intention. They want systems that can perform better than humans! For simple systems and domains this is not a problem. In the future, I'm sure we can get into harder domains as the horsepower of the computers and intelligence of the algorithms increases. But humans grow up and learn these things over 20+ years with constant reinforcement-learning, and they automatically incorporate things like context and prior-knowledge that a computer can never know (that's a lot of knowledge to program and sift through). That's not something you can simply program and execute as a method call.

The construction of the hierarchical system was interesting to read about, seeing as how everything he called for now exists in some form or fashion. However, Herot was pessimistic, and ultimately incorrect, in claiming that hierarchical recognition would require context even "at the lowest levels" (primitives can be learned automatically--Paulson's recognizer) and that successful approaches would require knowledge-based systems (i.e. artificial intelligence, which is no longer the case).

Lastly, it would appear Herot's tablet is passive since it just uses a pen, probably some sort of pressure sensitivity on a narrow tip. Does this prevent me from resting my arms/hand on it while I draw?

Thursday, October 11, 2007

Veselova - Shape Descriptions

Veselova, Olya and Randall Davis. "Perceptually Based Learning of Shape Descriptions for Sketch Recognition." AAAI 2004.

Summary



Veselova and Davis want to build a language for describing sketches arranged into iconic shapes and a system that uses said language to recognize instances of the shapes. Some of the difficulty with this problem lies in the subjective nature of perception, and what different people mean by "different"/"same" or features that are "significant" in making these judgment calls. The psychology of perception is studied and the authors propose a system of extracting the significant features and weighting them in order to make the decision on whether or not two shapes should be regarded as being the same.

Based on perceptual studies, the authors group the description vocabulary that is used to describe constraints between primitives in the shapes into two categories. The first group is the singularities. These are the properties of a shape where "small variations in them make a qualitative difference" in the perception of a shape. Examples of singularities are the notions of vertical, horizontal, parallel, straight, etc. The vocabulary was constructed and each constraint was examined on a set of shapes to determine subjectively which constraints made the most difference in determining similarity. Accordingly, each constraint is assigned a rank.

The importance of each constraint is not defined only by its rank relative to its peers, but is also adjusted using heuristics that take into account obstruction, grouping, and tension lines. Obstruction is a measure of how many primitives lie between two objects a and b. If the obstruction is high, the user is not as likely to constrain a and b with respect to each other; if the obstruction is low, users can easily relate a and b. Tension lines describe how the endpoints and midpoints of lines align with each other. The human brain tends to like things well aligned, so if things in the shape are aligned, the relevant constraints are boosted. Shapes that are grouped are also treated as wholes rather than individuals, so constraints between things in the same group are boosted, while things in different groups are not constrained as much.
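The weighting scheme sounds like it boils down to something along these lines; the coefficients below are entirely made up, since the paper's actual values aren't reproduced here:

    def adjusted_weight(base_rank, obstruction, on_tension_line, same_group,
                        obstruction_penalty=0.2, tension_bonus=1.5, group_bonus=1.3):
        """Scale a constraint's base rank using the three heuristics from the
        paper; the coefficients here are guesses, not the authors' numbers."""
        w = base_rank
        w *= max(0.0, 1.0 - obstruction_penalty * obstruction)  # crowded pairs matter less
        if on_tension_line:
            w *= tension_bonus      # aligned end/midpoints are more salient
        if same_group:
            w *= group_bonus        # constraints within a perceptual group matter more
        else:
            w /= group_bonus        # across groups they matter less
        return w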

To evaluate their work, the authors had 33 users look at 20 variations ("near-miss" examples similar to Hammond and Davis' work) of 9 shapes. The variations were chosen so that half still met the constraints and half did not. The users answered "yes" or "no" for each variation depending on whether they felt it still counted as an example of that shape. On shapes where 90% of the users agreed among themselves on the right answer, the system matched the majority 95% of the time (the same as a user selected at random). On shapes where 80% of the people agreed, the system got 83% while a random user would get 91%. Thus, the system was able to, for the most part, give an answer for shape similarity that matched the majority of the human answers.

Future work for this project is to get a shape recognizer built that can use this system, be able to use more than just lines and arcs, build in a way of expressing extremes and "must not" constraints, and handle more than just pairwise constraints. Also, they want the system to be able to handle objects that have an arbitrary number of primitives (like a dashed line--how many dashes?).

Discussion



It sounds like the authors used their own opinions to determine the ranking values for the constraints listed in the table in Section 3.2. I wish they would have performed a user study. I can see where, on the one hand, they would be better suited to assigning these ranking values because they can understand the details of the problem. They have the expert knowledge of knowing what to look for and what might make the most difference. However, that same knowledge is what makes their opinion less credible. If you build a real-world system, it's not going to be used by experts 24/7. A bunch of naive and ignorant (to the domain and problem of sketch recognition, not in general) users are going to be making their own decisions as to what constitutes similarity and what things are important in making those judgment calls. In short, I would have liked to have seen more user input in the vocabulary ranking stage.

I wonder how the scalar coefficients were selected for the penalties on obstruction, tension lines, and grouping. How much do these values affect the final outcome of the prediction? I would like to see a graph of results vs. values for these variables. It just seems like these were sort of picked as a best-guess measure. Surely they weren't, but the paper gives me no other reason to believe otherwise. In their experiments section, the authors state that "people seemed to pay less attention to individual detail (aspect ratio, precise position, etc.) of the composing shapes in the symbol than the system biases accounted for." I wonder, then, if these biases could be adjusted by experimenting with varying the constraint ranking system and the coefficients listed above.

Also, they said they don't have a shape recognition engine built yet. So I am confused as to how they performed the experiments. Is it that they don't actually recognize any shapes, but they generate the 20 examples by hand and manually violate constraints? If so, this seems like a Bad Thing (TM) that is very prone to error and more subjectivity. Maybe they can borrow Hammond's near-miss generator.

But, after saying all that, I liked this paper. The authors seem to have found a way to say which constraints are more important than others. And, to their credit, they seem to get decent results agreeing with the majority of users. However, without having a sketch recognition engine to do these things automatically, I wonder how biased their results are.

Tuesday, October 9, 2007

Hammond and Davis -- Constraints and Near-Miss Examples

Hammond, Tracy and Randall Davis. "Interactive Learning of Structural Shape Descriptions from Automatically Generated Near-Miss Examples." IUI 2006.

Summary



Hammond and Davis are concerned with shape description languages and the way they are used by both human users and computer programs that automatically generate descriptions given a shape. The problem is that descriptions tend to be either over-constrained, meaning the description is too specific to one particular example and not general enough, or under-constrained, meaning the description does not capture all of the defining characteristics of the shape. The goal, then, is to create a system that can help the user identify both unnecessary and missing constraints using automatically generated near-miss examples, with the user telling the learning algorithm whether each example is a positive instance of the shape or not.


The system starts with a hand-drawn shape and a set of constraints that provide a positive match for that shape. If the set of constraints is automatically generated from the shape, the match is guaranteed to be positive but lacks the flavor of capturing user intention. If the description is hand-written, user intention is captured, but the constraint list might not accurately reflect what the user intends. In either case, the set of constraints will probably need to be modified.

The authors start by removing constraints that are too specific (meaning the description is over-constrained). Constraints are tested by negating one constraint at a time and generating example shapes with that negated constraint. If the generated shape is said to still be a positive example of what the user intended, the negated constraint is deemed unnecessary and is discarded. Constraints that are needed will turn the generated shape into a negative example of the user's intention and will be kept. Shapes are tested for scaling and rotational invariance here, since requiring certain lines to be horizontal or a certain length may over-constrain the description.

Second, the description is tested for being under-constrained by generating a list of possible constraints (using some heuristics and filters to keep the list to a reasonable size) that are not included in the shape description. Again, one at a time, the negation of each of these constraints is added and examples are generated. If the example is judged positive, the constraint is truly not needed (the shape survives having it violated). If the example is judged negative, the constraint is needed and is added to the description.

Shapes are generated from a set of algebraic equations that describe the constraints. The unknowns are solved using third-party software (Mathematica). Once the system of equations is solved, the resulting values can be used to draw the shape (if a solution exists).
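The generate-a-near-miss step can be illustrated with a constraint solver; the authors used Mathematica, but sympy works as a stand-in (the shape, constraints, and symbol names below are my own toy example, not from the paper):

    import sympy as sp

    # A toy two-line shape: line1 from (x0, y0) to (x1, y1), line2 from
    # (x1, y1) to (x2, y2).  All symbols and constraints are made up.
    x0, y0, x1, y1, x2, y2 = sp.symbols("x0 y0 x1 y1 x2 y2")

    description = [
        sp.Eq(y1, y0),               # line1 is horizontal
        sp.Eq(x1 - x0, 100),         # line1 has length 100
        sp.Eq(x2, x1),               # line2 is vertical
        sp.Eq(y2 - y1, x1 - x0),     # the two lines have equal length
        sp.Eq(x0, 0), sp.Eq(y0, 0),  # anchor the shape at the origin
    ]

    # Build a near-miss by negating one constraint: make line1 non-horizontal.
    near_miss = [sp.Eq(y1, y0 + 30)] + description[1:]

    solution = sp.solve(near_miss, [x0, y0, x1, y1, x2, y2], dict=True)
    print(solution)  # coordinates to render and show to the user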

Discussion



This is a pretty solid paper and is easy to understand. It doesn't feel like the authors have left out too much. One thing I am wondering, however, is the running time and storage requirements for the algorithm. The over-constrained case doesn't seem to be a problem because you're starting with a fixed set of constraints and pruning it. However, when adding constraints, even with the heuristic method, you're going to have a large number of constraints to add. Admittedly this is a heuristic, but is it a good one? Do the authors have any way to show that this does better than some other way? Even limiting the number to n^2, that's still n^2 systems of equations to solve, n^2 shapes to show to the user, and n^2 times the user has to say 'yes' or 'no'. Is that simply the price one has to pay to get a "perfect" description?

Overall I would like less "this is how we do it" and more "this actually works" out of the paper. How does the system do at generating final descriptions that work? How long does it take? How many constraints does it remove/add on average? It sounds like a good idea, but is it worth it to me in the real world, or is it just something that's nice to talk about?

Thursday, October 4, 2007

Hammond and Davis -- LADDER

Hammond, Tracy, and Randall Davis. "LADDER, a Sketching Language for User Interface Developers." Computers & Graphics 29 (2005): 518-532.

Summary



Hammond and Davis want to create a system that enables the rapid development of systems using sketch recognition methods. They want the flexibility to let the user define shape descriptions and how to display those shapes, without burdening the user with the knowledge of how exactly to recognize those shapes from pen input.

The authors propose the LADDER language, which is composed of predefined shapes, constraints, editing behaviors, and display methods. The predefined shapes are simple primitives such as lines, arcs, circles, etc. Constraints (either hard--must be met--or soft--optional) can be placed on sets of these primitives to define higher-level shapes. Once a shape has been recognized, the user can specify how it is to be displayed, as well as which editing actions may be performed on it. To generate the syntax for the shapes, the authors performed a user study in which they asked 30 students to verbally describe shapes using increasingly limited vocabularies. Their syntax is shape-based, which is more intuitive than a feature-based syntax and is independent of drawing style. However, the language is limited: the grammar is fixed and restricted to very iconic, regular, and simple shapes, it supports only a set of primitive constraints, and it is bad at handling curves.

Shapes are defined by a set of component shapes (either primitives or other defined shapes), constraints, aliases to simplify naming, and editing and display behaviors. Shapes may be abstract for hierarchical behaviors, and may be grouped into units for chaining behaviors. Vectors of components can be used to define a variable number of sub-shapes. Shapes can be edited after drawing by specifying editing rules.
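This isn't LADDER syntax, but the structure of a shape definition, as I understand it, looks roughly like the following when rendered as a Python dictionary (the constraint and behavior names are approximations, not quotes from the paper):

    # Not LADDER syntax -- just the same structure (components, constraints,
    # aliases, display and editing behaviors) rendered as a Python dict.
    arrow = {
        "components": {
            "shaft": "Line",
            "head1": "Line",
            "head2": "Line",
        },
        "constraints": [
            ("coincident", "shaft.p2", "head1.p1"),
            ("coincident", "shaft.p2", "head2.p1"),
            ("equalLength", "head1", "head2"),
        ],
        "aliases": {"tip": "shaft.p2"},
        "display": "draw idealized strokes",
        "editing": ["drag tip to rotate"],
    }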

Strokes are recognized as soon as they are drawn. The primitives are placed into a Jess knowledge-based system that is searched for any completed shapes. LADDER automatically generates the Jess code.

Discussion



The authors skirt around the issue of representing curves. They say it is hard to do and their system is not good at it, but then later say they have the primitives Bezier curve, curve, and arc. I wonder how the representations are different for each curve? How are they able to generalize these structures while still maintaining user-intended semantics? Is it good enough to just say "there is a curve here" or do the control points have to be strictly defined? Overall, I wish more information on curve-handling would have been provided.

I liked the use of isRotatable so that I can be very strict in defining relations between components, but then allow those components to be rotated along any axis. I wonder how much computation this adds to the recognition process?

Jess seems like a very complicated route to get shape recognition. I wonder if some sort of hierarchical data structure could have been used for recognition... Will have to think about this more.