Friday, December 14, 2007

Handwriting in LADDER

FLV video/demo

Presentation

All stuff Copyright 14 Dec. 2007, Joshua Johnston.

Thursday, December 13, 2007

Alvarado and Lazzareschi -- Properties of Diagrams

Alvarado, Christine, and Michael Lazzareschi. "Properties of Real-World Digital Logic Diagrams." PLT 2007.

Summary



Alvarado and Lazzareschi examine unconstrained sketches of digital logic diagrams created by students in a real-world setting: the classroom. Thirteen students in a digital logic class were given Tablet PCs, and all notes, lab work, etc., were completed by sketching on the tablet. The authors took the first portion of the class's notes (the portion dealing with low-level diagramming) and extracted 98 individual diagrams. The diagrams were then labeled by hand to identify symbols. Analysis was performed on the labeled diagrams to determine the impact of stroke order, stroke timing, and the number of strokes students use to draw real sketches. This analysis sheds light on many of the underlying assumptions that sketch recognition algorithms make to boost their recognition rates.

The authors examined whether stroke ordering could be used to help delineate different objects in the diagrams by checking whether students finished a shape completely before moving on to stroke in other locations. This assumption turned out to be false almost 20% of the time: of all the symbols in the diagrams, 20% were drawn with nonconsecutive strokes. While many of these strokes appeared to be overtracing or touch-up, this is still a significant amount.

The authors next looked at the timing of strokes from one object to another, to see whether different shapes in the diagram could be delineated using a threshold on pause time. They did find that, on average, users made shorter pauses between strokes in the same shape and longer pauses between shapes, and the difference is significant under the Wilcoxon rank-sum test. However, there is a great deal of overlap in the pause times. The authors determined, for each user, the "optimal" pause threshold for delineating strokes, and computed the recognition error rates that would result if timing were the only basis for deciding whether a stroke starts a new symbol. At minimum 10% error would occur, at maximum 35%, and on average 20%. This is a significant amount of error.
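A minimal sketch of that per-user threshold analysis, assuming you already have labeled (pause duration, started-new-symbol) pairs for one user; the data here is fabricated just to show the overlap problem:

    # Toy version of the pause-threshold analysis: for one user, find the pause
    # cutoff that best separates "same symbol" from "new symbol" transitions.
    # pauses: list of (pause_seconds, started_new_symbol) from a labeled sketch.

    def best_threshold(pauses):
        candidates = sorted(p for p, _ in pauses)
        best = (None, 1.0)  # (threshold, error_rate)
        for t in candidates:
            errors = sum(
                1 for pause, new_symbol in pauses
                if (pause >= t) != new_symbol   # a long pause predicts a new symbol
            )
            rate = errors / len(pauses)
            if rate < best[1]:
                best = (t, rate)
        return best

    if __name__ == "__main__":
        fake_data = [(0.2, False), (0.4, False), (1.5, True),
                     (0.3, False), (0.9, True), (1.1, False)]
        # Even the "optimal" cutoff misclassifies the 1.1 s intra-shape pause.
        print(best_threshold(fake_data))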

Last, the authors looked at the number of strokes used to draw each symbol. They found that not only did the number of strokes per symbol vary from student to student and from symbol to symbol, but variations also occurred within the same user drawing the same symbol. Moreover, some users were more consistent, while others showed more variation. However, the authors did conclude that for this domain (circuit logic diagrams), the overwhelming majority of strokes were used to draw only a single symbol. This lends some support to the assumption that a single stroke does not span more than one symbol. However, the authors believe this is domain specific.

Discussion



I really liked this paper, as well as the other HMU paper about interface design issues. They have both shed a lot of light on what must be done for sketch recognition to be a viable player in the education community. I limit this to the education community because I wonder if some of the same problems would be found in a professional setting.

For instance, the authors talk about students returning to symbols for touch-up or to overtrace, possibly while thinking about what to draw next. It seems like a professional would not have these tendencies; they would be more inclined to "get it right the first time." Not to bash on students, but they are, after all, students... and learning these things for the first time. Of course they aren't going to process things as quickly as a professional.

Also, I don't think some of the other findings, such as a single stroke being used to draw multiple symbols, or the number of strokes used to create symbols, would hold with professionals. First, I think pros would have a very set and familiar way of drawing symbols, so there would be little variation across instances of their own symbols. Second, I think that with expertise comes a more standard way of drawing things even across pros, or at least everyone gets more comfortable and the number of strokes per symbol tends to decrease. Also, the more experienced and comfortable you are with laying your diagram out "by the seat of your pants," with only a mental image in your head, the more you would start to draw from one shape to the next in single strokes, only lifting your pen when you had to.

I also think the timings for intra- and inter-shape strokes would be closer together, as a pro does not have as much need to pause from one shape to the other. They know these things and can do them on the fly with little error.

Wais et al. -- Perception of Interface Elements

Wais, Paul, Aaron Wolin, and Christine Alvarado. "Designing a Sketch Recognition Front-End: User Perception of Interface Elements." Eurographics 2007.

Summary



Wais, Wolin, and Alvarado perform a Wizard of Oz study to examine several interface aspects of a sketch-based circuit diagram application. They wanted to examine three aspects of sketch recognition systems:

  1. When to perform recognition and how to provide feedback about the results

  2. How the user can indicate which parts of the sketch she wants recognized

  3. How recognition errors impact usability



The authors experimented with nine subjects, asking them to draw circuit diagrams with a pen-based input device (subjects had experience with both the domain and the devices). Three sets of experiments tested the above aspects. For the first aspect, the system used either a button press, a check-tap gesture, or a 4-second delay as the trigger for starting stroke recognition. Users preferred the button because it gave them reliable control (the gesture was difficult to recognize, working only about one time in six), and they preferred to use it to recognize large chunks all at once. When recognition occurred, most users preferred the stroke lines to change color based on recognition status (recognized or not) rather than having text labels appear, as the labels cluttered the drawing. However, most users still hovered over the "recognized" portions (no actual recognition took place) to check whether the label was correct.

The system also introduced different types of random errors into the sketch. Errors themselves didn't seem to bother users too much, as long as they were predictable and lent themselves to learning and correction. Random errors did not provide this and were frustrating to users. Also, users want separate spaces, or separate colors, to indicate what should be recognized in a sketch and what should be left alone.

Discussion



This was a neat paper. It was nice to see an interface paper saying what people actually liked and what works in a sketch recognition application. I'm not an interfaces person, so this was really the first time I had seen this addressed.

Regarding their pause trigger for recognition: it seems like this is something that could be learned. Say you take the mean amount of time between the user's last pen stroke and when they press the 'recognize' button; that captures how long that user typically waits before wanting recognition. Of course, I think using a pause is a bad idea in general. Users expressed the desire for control and predictability. Give it to them rather than having a magical timeout that they neither see nor control.
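A hypothetical sketch of that adaptive-timeout idea (the class and method names are mine, not from the paper):

    # Estimate a per-user recognition delay from how long they historically
    # waited between their last stroke and pressing the 'recognize' button.

    class AdaptiveTrigger:
        def __init__(self, default_delay=4.0):
            self.waits = []
            self.default_delay = default_delay

        def record(self, last_stroke_time, button_press_time):
            self.waits.append(button_press_time - last_stroke_time)

        def delay(self):
            # Fall back to the fixed 4-second delay until we have observations.
            if not self.waits:
                return self.default_delay
            return sum(self.waits) / len(self.waits)

    trigger = AdaptiveTrigger()
    trigger.record(10.0, 12.5)
    trigger.record(30.0, 33.5)
    print(trigger.delay())  # 3.0 seconds for this user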

I would have liked to have seen some interface issues concerning ways of correcting errors addressed. Their method is just for the user to erase the problematic stroke and draw it again. Obviously they're just using a dummy application for their Wizard of Oz approach, but I think it would be nice to have a drop-down with an n-best list, one of the options being "None of the Above, Plan 9." This magical option would return the strokes to their original form and allow the user to group, with a lasso or something, the things that should be recognized. This seems like a burden on the user, especially when the push is for things that are fully automatic and uber-accurate. Well, maybe that's not realistic, at least not yet. Even if a system is uber-accurate, it will still make mistakes. If the user can help it fix those mistakes, it can possibly learn, or at least make it easier for the user to clean up the program's boo-boo.

Also, I've nearly decided that gestures are horrible. We already have an application with imperfect sketch recognition, and we're throwing in mega-imperfect gestures! I wonder if anyone knows if there is a good gesture toolkit out there, and I'm talking like 99.99%, none of this 1/6 (ONE CORRECT TO FIVE INCORRECT, 16%) laaaaaame product. But this isn't their fault, it's the gesture toolkit's fault.

Wednesday, December 12, 2007

Alvarado -- SketchREAD

Alvarado, Christine, and Randall Davis. "SketchREAD: A Multi-Domain Sketch Recognition Engine." UIST 2004.

Summary



Alvarado presents a new engine for sketch recognition that uses dynamically constructed Bayesian networks. Incoming strokes are parsed, as they are drawn, into sets of low-level primitives using Sezgin and Stahovich's toolkit. These primitives are fed into Bayesian networks that give probabilities for the strokes forming certain shapes. Templates of predefined network fragments are instantiated and connected together to match the primitives and link them in a logical fashion. The fragments are interconnected to form networks that yield probabilities for higher-level shapes and what the authors call domain shapes and domain patterns.

The networks are used for bottom-up recognition, where the primitives are pieced together into shapes, and also for top-down recognition. The top-down portion of the program allows the engine to compensate for shapes that have not yet been finished. It also allows the system to revisit primitives and shapes and determine their accuracy. For instance, if a stroke is split into a 5-line polyline, but the domain has no 5-line shapes, the top-down recognizer can prune the 5-line hypothesis and generate others to test, such as 2-, 3-, and 4-line hypotheses. Hypotheses are generated at each stage of recognition to see which possible interpretation of the strokes and shapes is most likely. The system has a pruning mechanism to discard hypotheses without enough support in order to keep the state space from exploding, but this sometimes prunes correct hypotheses that just have not had enough time to develop (e.g., the user is still drawing finishing strokes). The top-down recognizer is expected to re-add these pruned-but-correct hypotheses later.
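A loose sketch of the prune-and-revisit idea (this is an illustration, not SketchREAD's actual algorithm; the threshold and shape names are made up):

    # Keep a pool of shape hypotheses, prune weak ones, and let a top-down pass
    # re-add a pruned hypothesis when later context expects it.

    PRUNE_THRESHOLD = 0.15

    def prune(hypotheses):
        """hypotheses: dict mapping interpretation -> probability."""
        kept = {h: p for h, p in hypotheses.items() if p >= PRUNE_THRESHOLD}
        pruned = {h: p for h, p in hypotheses.items() if p < PRUNE_THRESHOLD}
        return kept, pruned

    def top_down_revisit(kept, pruned, expected_shapes):
        # Re-introduce pruned hypotheses that the domain context now expects.
        for h, p in pruned.items():
            if h in expected_shapes:
                kept[h] = p
        return kept

    active = {"and-gate": 0.6, "or-gate": 0.1, "wire": 0.05}
    active, shelved = prune(active)
    active = top_down_revisit(active, shelved, expected_shapes={"or-gate"})
    print(active)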

The engine was tested on two sets of sketches, 10 family trees that were relatively clean and simple, and 80 circuit diagrams, which were far more complex. On average, SketchREAD got 77% of the family tree shapes right, and 62% of the circuit shapes right.

One of the reasons SketchREAD performed "poorly" is that it makes several limiting assumptions to help avoid state-space explosion when searching for hypotheses. The first is that all the lines in a polyline are considered to be one shape, instead of trying each line separately. Additionally, the stroke-matching templates were often not matched correctly, and the engine was not allowed to try all possible matches.

Discussion



This was a pretty neat approach. I like the way low-level errors were corrected later on. This seems like a must-have approach for any system that hopes to be widely successful. Of course, if you make better low-level recognizers, like Paulson's, this doesn't become too much of an issue. But still, nothing is perfect, as Alvarado mentions.

There were several limitations in the system that I wish Alvarado could get around. I understand they're there to limit the size of the hypothesis space, but they ended up making things hard for her to get right. One of them was that all the lines in a polyline were considered to be part of one shape. This killed them in the circuit domain, where a user may just go on and on drawing wires and resistors without ever lifting the pen. Trying to break a polyline into its n lines and process each of them separately might be too time consuming, but if a user had drawn all the lines separately, recognition rates would have been higher. Sure, the system would have slowed down, but it's going to do that anyway the more you draw. Again, another tradeoff between accuracy and complexity.

Also, in Figure 8, they note the bottom-left ground is too messy for their top-down recognizer to correct. Well, honestly, it looks fine to me. I would like a way to incorporate vision-based approaches for exactly this kind of error. It might be too messy when you crunch the numbers, but it /looks/ like a ground. Perhaps combining in a vision-based likelihood could boost it up?

Sezgin -- Temporal Patterns

Sezgin, Tevfik Metin, and Randall Davis. "Sketch Interpretation Using Multiscale Models of Temporal Patterns." IEEE Computer Graphics and Applications (Sketch-Based Interaction special issue), Jan/Feb 2007: 28-37.

Summary



Sezgin seeks to model sketches by analyzing the order in which strokes are made. He constructs two types of models. The first is a stroke-level model trained on how strokes are drawn in relation to one another: it looks at the order in which certain strokes are drawn, under the assumption that objects are drawn in generally the same stroke order each time.

The second level of recognition occurs at the object level. This model combines the stroke-level models in a manner that describes in what order objects tend to occur in relation to one another. For instance, when drawing a resistor, users might usually draw a wire, then the resistor, then another wire.

Both models are represented as dynamic Bayesian networks (DBNs). DBNs capture the temporal dependencies of the strokes and objects, acting like first order Markov models. They are trained on a set of sketches. Each sketch is preprocessed and broken into a set of primitives (lines, arcs, etc.), for each of which an observation vector is created. The observation vectors are fed into the DBNs and the probabilities of state transitions, etc., are trained. For classification, a sketch is broken into primitives/observation vectors in the same way and fed into the DBNs. The representation with the highest likelihood is chosen as the correct classification.
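A much-simplified stand-in for the DBN idea, assuming each class is modeled as a first-order Markov chain over primitive labels; the classes, probabilities, and observation sequence below are invented:

    import math

    # Score a sequence of primitive observations (e.g. 'line', 'arc') under a
    # first-order Markov model per class and pick the most likely class.

    MODELS = {
        "resistor": {
            "initial": {"line": 0.9, "arc": 0.1},
            "transition": {"line": {"line": 0.8, "arc": 0.2},
                           "arc": {"line": 0.5, "arc": 0.5}},
        },
        "capacitor": {
            "initial": {"line": 0.5, "arc": 0.5},
            "transition": {"line": {"line": 0.3, "arc": 0.7},
                           "arc": {"line": 0.7, "arc": 0.3}},
        },
    }

    def log_likelihood(model, sequence):
        score = math.log(model["initial"].get(sequence[0], 1e-9))
        for prev, cur in zip(sequence, sequence[1:]):
            score += math.log(model["transition"][prev].get(cur, 1e-9))
        return score

    observed = ["line", "line", "line", "arc", "line"]
    print(max(MODELS, key=lambda name: log_likelihood(MODELS[name], observed)))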

Results were collected on eight users who drew 10 sketches of electrical diagrams each. They got about 80-90% accuracy over the entire sketch. One of the biggest weaknesses in their approach is that it is time-based, so any strokes drawn out of the expected order can throw a wrench in the works.

Discussion



This was a pretty neat approach to sketching, something that we've not seen before. I like the way that temporal patterns are looked at, because I do feel a lot of information can be extracted from stroke/object ordering. However, as shown in the paper, it can also be a hindrance. Strokes that are drawn out of expected order can cause recognition errors. This might be correctable with more training data, where outliers become more common and the system can account for them.

They might also be able to account for out of order strokes by considering adding a bit of ergodic behavior to the models. Right now everything moves left to right, from start to completion of each object, and then on to the next object. Adding ergodic behavior would allow back transitions, self transitions, and possible jumps from the middle of one object to another.

Sezgin mentioned that his PhD work dealt with creating a system that could "suspend" one object's model and start another if strokes occurred out of order. Of course, what happens then if you start a third? Is there a way to generalize this approach so you can have any number of objects started at once?

The recognition results weren't great, but this is a groundbreaking paper since nothing like this has been done, really. The recognition rates will get better as the models improve. I would like to see something like this combined with a vision based approach. A system where strokes and objects are identified not just in the order in which they appear, but also based on spatial arrangements and such. This might help eliminate some of the temporal confusion encountered by Sezgin, if something "far away" was not considered, even if it occurred next.

Tuesday, December 11, 2007

Wobbrock et al. -- $1 Recognizer

Jacob Wobbrock, Andrew Wilson, and Yang Li. "Gestures without Libraries, Toolkits, or Training: A $1 Recognizer for User Interface Prototypes." UIST 2007.

Summary



Wobbrock et al. present a very simple single-stroke recognizer that uses a template-based approach, named because it's cheap and easy to create. Their method works in four basic steps.

First, the stroke is resampled. Given a parameter N, the number of points in the resampled stroke, and the stroke length L, $1 interpolates N points (possibly new) along the original stroke path, spaced L/(N-1) apart. This ensures a uniform representation so fast and slow strokes can be compared to each other (fast strokes have sparser points than slower strokes of the same length).

Second, the resampled stroke is rotated so that comparisons are more or less rotationally invariant. The rotation angle is the angle of the line segment connecting the stroke's centroid to its first point, and the stroke is rotated so that this segment lies along the horizontal axis at 0 degrees.

Third, the resampled and rotated points are scaled so that the bounding box matches a reference square of a fixed size, disregarding the original aspect ratio. The points are then translated so that the centroid (mean along x and y) is at the origin (0,0). These steps make comparisons roughly scale and translation invariant.

Fourth, the stroke is compared to each template using a series of small rotations to find the template that matches best. The rotations are searched between a minimum and maximum angle using a golden-section search. The score rating the match between stroke and template is 1 minus the average distance between corresponding points of the stroke and template, divided by half the length of the reference square's diagonal (an upper bound on how far corresponding points can be apart), so that higher scores are better. The templates are ranked according to score and returned.
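A condensed sketch of this pipeline, assuming strokes are lists of (x, y) tuples. It follows the resample/rotate/scale/score steps described above but skips the golden-section rotation search for brevity; N and the reference square size are arbitrary choices here:

    import math

    N, SQUARE = 64, 250.0

    def path_length(pts):
        return sum(math.dist(a, b) for a, b in zip(pts, pts[1:]))

    def resample(pts, n=N):
        # N points spaced path_length/(N-1) apart along the stroke.
        interval, d, out = path_length(pts) / (n - 1), 0.0, [pts[0]]
        for a, b in zip(pts, pts[1:]):
            seg = math.dist(a, b)
            while seg > 0 and d + seg >= interval:
                t = (interval - d) / seg
                a = (a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1]))
                out.append(a)
                seg, d = math.dist(a, b), 0.0
            d += seg
        while len(out) < n:
            out.append(pts[-1])
        return out[:n]

    def centroid(pts):
        return (sum(p[0] for p in pts) / len(pts), sum(p[1] for p in pts) / len(pts))

    def normalize(pts):
        pts = resample(pts)
        cx, cy = centroid(pts)
        angle = math.atan2(pts[0][1] - cy, pts[0][0] - cx)   # indicative angle
        rot = [((p[0] - cx) * math.cos(-angle) - (p[1] - cy) * math.sin(-angle),
                (p[0] - cx) * math.sin(-angle) + (p[1] - cy) * math.cos(-angle))
               for p in pts]
        xs, ys = [p[0] for p in rot], [p[1] for p in rot]
        w, h = (max(xs) - min(xs)) or 1.0, (max(ys) - min(ys)) or 1.0
        return [(p[0] * SQUARE / w, p[1] * SQUARE / h) for p in rot]   # centroid stays at (0, 0)

    def score(candidate, template):
        d = sum(math.dist(a, b) for a, b in zip(candidate, template)) / len(candidate)
        return 1.0 - d / (0.5 * math.hypot(SQUARE, SQUARE))   # higher is better

    # Usage: best = max(templates, key=lambda tpl: score(normalize(stroke), normalize(tpl)))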

The authors compared their recognizer against variations of Rubine's classifier and a dynamic time warping (DTW) algorithm on a set of 16 gesture types entered by 10 subjects. DTW and $1 perform almost indistinguishably, but $1 is much simpler and runs much faster.

Discussion



First my beef with the title: no training or libraries? What do you call the templates, one per possible type of shape you want to recognize? I understand their hyperbole, but it seems a little ridiculous, even for UIST.

It's important to remember the purpose of this recognizer. This is not meant to be the next groundbreaking method that pushes 99.99% accuracy in every possible domain. Their own title admits as much: "for User Interface Prototypes." This, I believe, is meant to be something that you can slap together an hour before a customer or investor comes over and still get decent recognition results. The great part about this algorithm is that the entire source code is listed in the appendix, taking most of the mythos and shrouded mystery out of sketch recognition (no hidden Markov models or Bayes nets to code).

Thursday, November 8, 2007

Adler and Davis -- Speech and Sketching

Adler, A., and R. Davis. "Speech and sketching: An empirical study of multimodal interaction." EUROGRAPHICS, 2007.

Summary



Adler and Davis set up a user study to analyze the ways in which speech and sketching interact in a multimodal interface. They created software that runs on two Tablet PCs and mirrors in real time on one tablet what is drawn on the other, offering a multitude of pen colors, highlighting colors, the ability to erase, and pressure-based line thickness. This allows a participant (one of 18 students) and an experimenter to interact with each other. The participants completed four drawings, including electronics schematics, a floor plan, and a project. The schematics were available for study before the drawing began but not during it, making all four tasks as free-form as possible.

The authors recorded the drawings, video, and audio and synchronized them all. They then labeled all the data for analysis. Some of the main things they found were:

  • Creation/writing strokes accounted for the super-majority of the strokes (90%) and ink (80%) in the sketch
  • Color changes were used to denote different groupings of strokes for emphasis
  • Most of the speech was broken and repetitive. This would make it hard for a full recognizer to analyze, but the repetitions provided clues about the user's intent.
  • Speech occurred at the same time the referenced objects were being drawn
  • The ordering of speech and drawing was the same
  • The open-ended nature of the sketching and speech interaction meant that even simple questions from the experimenter evoked a lot of speech and clarification from the participant
  • Parts of the speech that did not relate to the actual drawing itself gave hints as to the user's intentions


Analyzing these results, the authors found that for the most part, speech began before the sketching did when considering phrase groups broken up by pauses in the speech. However, when looking at word groups (saying "diode" when drawing one), the sketch usually comes first. Additionally, the amount of time difference between speech and sketch is statistically significant.

So, overall, giving the user colors to play with and using speech information gives a system a lot more information to use to do a better job.

Discussion



I found it interesting that the participants gave up more information than was needed or asked for when answering the experimenter's questions, as if wanting to make sure the experimenter understood completely. I wonder if the user would give up as much information if there were not a human present? I hypothesize that a person would just expect a computer to magically understand what was meant and not talk to it. Even in a Wizard of Oz experiment, with a human really posing as the computer, I think the participant would speak much less frequently.

I don't think the seemingly contradictory information in tables 1 and 2 is surprising. If I'm giving speech that is supposed to be informative, I'm going to include extra information in my speech that isn't directly related to the actual shapes I'm drawing. But, as I draw things, I will put words in my speech to let the observer know what I just drew.

I wonder how often pauses in the user's speech accurately reflected boundaries between phrase groups and drawings. How often did users just flow from one thing to the next? My guess is that pauses are a pretty darn good indicator, like speed corners, but I'm just wondering. Also, what is the reason for the differences in the speech/sketch alignments when comparing word groups and phrase groups?

The authors say that complete unrestricted speech recognition (narrative) is unduly difficult. Well, wasn't unrestricted sketch recognition the same way a decade ago? Things are hard. That's why PhDs have jobs. If things were easy, you wouldn't have to study for 10 years just to get good at them. Unlimited speech recognition is unduly hard right now, and wasn't a tractable option for this paper, but it will come. Understanding "enough" of the input is good for now, but what about the other stuff, like the "junk" DNA? I bet it's important too.

Tuesday, November 6, 2007

Mahoney and Fromherz -- Three Main Concerns

Mahoney, James V., and Markus P. J. Fromherz. "Three main concerns in sketch recognition and an approach to addressing them."

Summary



Mahoney and Fromherz identify three requirements that they feel any sketch recognition technology should possess. First, it should cope with ambiguity in the sketch. This includes variance due to sloppy drawings, difficulty in finding the correct articulation of the drawing (corner finding), and noisy drawings (stuff in the background, etc.). Second, the recognition process should be efficient and provide quick results. This means we must have methods to bound the normally exponential state space into something we can search in a tractable manner. Third, the method must be extensible to new domains. It can't rely heavily on domain-specific knowledge.

Their method is to construct a graph of the segments in the image. Using image processing techniques (not stroke input), they detect segment endpoints and create nodes. The edge between a segment's own endpoints is called a bond; edges between coincident endpoints of different segments are called links. The graph is put through a series of transformations that allow some variability in the possible interpretations of the structure. These include linking nodes that are close to each other (gap closure), linking around small segments that sit between nearby endpoints (removing corners, spurious segment jumping), adding nodes and links where lines intersect (adding corners at junctions), and merging nodes where segments continue smoothly (continuity tracing). Each of these methods has a modifiable parameter.
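A toy version of the graph construction and gap closure, assuming segments arrive as endpoint pairs; the radius is exactly the kind of hand-tuned parameter complained about in the discussion below:

    import math
    from itertools import combinations

    # Segments contribute endpoint nodes and a "bond" edge between their own
    # endpoints; a gap-closure pass then links any two endpoints from different
    # segments that fall within a radius.

    GAP_RADIUS = 5.0

    def build_graph(segments):
        """segments: list of ((x1, y1), (x2, y2)) endpoint pairs."""
        nodes = [p for seg in segments for p in seg]
        edges = [("bond", seg[0], seg[1]) for seg in segments]
        for (i, p), (j, q) in combinations(enumerate(nodes), 2):
            if i // 2 != j // 2 and math.dist(p, q) <= GAP_RADIUS:
                edges.append(("link", p, q))
        return nodes, edges

    nodes, edges = build_graph([((0, 0), (10, 0)), ((12, 1), (12, 10))])
    print(edges)  # bond edges plus one gap-closure link between (10, 0) and (12, 1)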

Subgraph matching is used to determine the best fits for portions of the graph. Shapes that should appear in the graph, and that the algorithm looks for, are described in a constraint language that defines the different shapes and how they are linked together. It also provides a criterion the algorithm should try to minimize (sets of constraints, like LADDER's method of minimizing deviation from the ideal constraints). For instance, they say that a head should be half as long as a stick figure's torso, etc. They limit some of the search space using a few a priori assumptions and hints, which they call salience, anchoring, grouping, and model bias.

Discussion



They state it themselves, but hand-tuning the parameters for the perceptual organization methods for elaborating/rectifying the graph is bad. This completely violates their call for recognitions processes that are applicable across domains. They undermine their approach, set up on this idealistic claim based on their "three concerns", when they completely trash one of those concerns. To their benefit they do state the need to learn these or estimate them automatically.

In the worst case subgraph matching requires exponential time. They say this is not usually the case for "most practical problems." Defined how? Is something practical only if it conforms to the simple case of stick figures? What about something more general? Even if this isn't the case most of the time, if something does take exponential time (which you cannot guarantee that it won't), you've blown another tenet of your argument (the part about interactive recognition). There are so many a priori things they use to reduce the running time that I doubt the ability of this method to generalize to a multi-purpose recognizer. It seems like it has promise, but only at the risk of exponential running time.

Again, no recognition results. How obnoxious. I guess you can just toss tenet 1 out the door too, since you can't prove that your method does any good at handling ambiguity.

This seems like a good method to use with Sezgin for corner finding/reconciliation. I wonder if their results would have improved if they had stroke data available to them instead of simple image processing.

Sharon and van de Panne -- Constellation Models

Sharon, D., and M. van de Panne. "Constellation Models for Sketch Recognition." EUROGRAPHICS Workshop on Sketch-Based Interfaces and Modeling, 2006.

Summary



Sharon and van de Panne's approach to sketch recognition uses a set of shapes drawn with single strokes. These shapes are assigned a label based on maximum likelihood fits computed from spherical Gaussian distributions calculated using training examples. The label that results in the highest likelihood (given the Gaussian parameters for that label) is assigned to that shape. Pairs of shapes are compared against each other, and Gaussians are trained for all pairings (pruned down to be pairings between mandatory shapes, as opposed to shapes in a sketch deemed optional). Labels are chosen that also maintain a high likelihood between pairings of shapes.

Features of strokes include the center of the bounding box (x, y coordinates), the length of the bounding box diagonal, and the cosine of the angle of said diagonal. Features between pairs of strokes include the differences between the center points (delta x, delta y) and the distance between an endpoint of one stroke and the closest point on the other stroke (plus the same feature computed the other way around). These feature vectors are computed, and four-dimensional Gaussians with means and diagonal covariances are estimated from the training data. The Gaussian models are then used to assign likelihoods to the feature vectors of test examples.
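A toy version of that scoring step, assuming a diagonal-covariance Gaussian per label; all numbers here are invented:

    import math

    def diag_gaussian_loglik(x, mean, var):
        # Log-likelihood of a feature vector under a diagonal-covariance Gaussian.
        return sum(
            -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
            for xi, m, v in zip(x, mean, var)
        )

    eye_model = {"mean": [0.3, 0.6, 0.12, 0.95], "var": [0.01, 0.01, 0.005, 0.02]}
    features = [0.28, 0.58, 0.10, 0.97]   # (center x, center y, diagonal length, cos of diagonal angle)
    print(diag_gaussian_loglik(features, eye_model["mean"], eye_model["var"]))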

Label assignments are searched for using a branch-and-bound method. To avoid a brute-force search that is exponential (number of labels raised to the number of strokes), different bounding heuristics are used to reduce the search space. One method is to assign all possible mandatory labels and trim branches with smaller likelihoods; this is still exponential, just in the number of mandatory labels rather than all labels. Another approach is a multi-pass thresholding algorithm: start at some high threshold, cut off any developing branch with a likelihood less than the threshold, and if no labeling is found, iterate with increasingly low thresholds until a full labeling is assigned. The third approach is to set a minimum likelihood for any single assignment in the search tree, terminating a branch immediately if a label falls below the threshold. This is the type of multi-pass thresholding implemented (as opposed to thresholding the likelihood of the labeling-so-far). The authors also discuss using hard constraints, such as forcing mouth parts to appear below nose parts.
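A rough sketch of that per-assignment, multi-pass pruning idea (not the authors' implementation; the score function and thresholds here are placeholders):

    # Depth-first assignment of labels to strokes, cutting any branch whose
    # newest assignment scores below the current threshold, and retrying with a
    # lower threshold if nothing passes.

    def search(strokes, labels, score, threshold, partial=None):
        partial = partial or {}
        if len(partial) == len(strokes):
            return dict(partial)
        stroke = strokes[len(partial)]
        for label in labels:
            if score(stroke, label, partial) < threshold:
                continue                      # prune this branch immediately
            partial[stroke] = label
            result = search(strokes, labels, score, threshold, partial)
            if result is not None:
                return result
            del partial[stroke]
        return None

    def multi_pass(strokes, labels, score, thresholds=(0.9, 0.5, 0.1, 0.0)):
        for t in thresholds:                  # relax the cutoff until a labeling is found
            result = search(strokes, labels, score, t)
            if result is not None:
                return result

    labeling = multi_pass(
        ["s1", "s2"], ["eye", "mouth"],
        score=lambda s, l, partial: 0.8 if (s, l) in {("s1", "eye"), ("s2", "mouth")} else 0.2)
    print(labeling)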

The authors don't give any recognition results, only the amount of speed-up achieved by the multi-pass algorithm. I guess it doesn't do so hot.

Discussion




  • The features they use don't tell anything about the contents of the bounding box. It seems like using something like Paulson's MARQS recognizer would give better context-free recognition that's not dependent solely on the size/aspect of the stroke. This would probably help recognition and decrease the amount of time spent labeling the parts. Especially if employed in a bottom-up fashion.
  • On computing the likelihood of the model's likelihood:

    1. They can use the log-likelihood and change the products to sums, to save computation time. Maximizing log(x) also maximizes x.
    2. They assume independence between the probabilities (otherwise you can't multiply them together). This is usually a safe assumption and greatly eases computation. But, for instance, does assigning the label "nose" to this shape have any effect on the label "mouth" for that other shape? Possibly, but learning joint probabilities requires a great deal of training data. I advocate the independence assumption, but wanted to point this out since they don't state it explicitly.
    3. They assume uniform priors. I don't agree with this assumption at all. They say it's because of a lack of training data. Well, you're training Gaussians with a lack of training data, so you might as well estimate priors as well. This would be helpful for things like optional shapes, where it might not be the case that an eyebrow appears. So if we have something that we think might be an eyebrow (especially because their individual shape recognition only uses bounding box information), we might want to weight that with the prior.

  • They have 20-60 training examples per shape. That's a lot! Plus they don't give recognition results, so we can only assume their method did horribly. I guess having that much information didn't help. Lack of training data is always an issue, so don't complain about it. If you can't get a good method for training priors, or hard constraints, using the limited data you have, you're doing it wrong.
  • Different classifiers work better for different amounts of training data. They don't have enough to fully describe a set of multivariate Gaussians and priors for use in a Bayes classifier (ML or MAP). I think something like k-nearest neighbors would work well, using the feature vectors as points in some high-dimensional space; a quick sketch of that idea follows below.
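A minimal sketch of that k-NN alternative (the feature vectors and labels here are invented):

    import math
    from collections import Counter

    # Store every training feature vector with its label, and classify a new
    # shape by majority vote of its k closest training vectors. No Gaussians or
    # priors to estimate from scarce data.

    def knn_classify(query, training, k=5):
        """training: list of (feature_vector, label) pairs."""
        nearest = sorted(training, key=lambda ex: math.dist(query, ex[0]))[:k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]

    training = [([0.3, 0.6, 0.12], "eye"), ([0.7, 0.6, 0.11], "eye"), ([0.5, 0.2, 0.30], "mouth")]
    print(knn_classify([0.32, 0.61, 0.13], training, k=1))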

Tuesday, October 30, 2007

Oltmans PhD Thesis

Oltmans, Michael. "Envisioning Sketch Recognition: A Local Feature Based Approach to Recognizing Informal Sketches." MIT, May 2007. PhD Thesis.

Summary



Oltmans seeks a method of recognizing hand-drawn shapes using an approach that harnesses a type of visual recognition rather than feature-based approaches (like Rubine's or Sezgin's). He analyzes both individually drawn, isolated shapes and shapes automatically extracted and labeled from hand-drawn circuit diagrams.

Stroke preprocessing includes rescaling the shape and re-sampling the points in the stroke, interpolating new ones if needed. He then marches along the stroke and places a radial histogram ("bullseye") every five pixels, each shaped like a dart board with a certain number of wedges and log-spaced rings. Bullseyes are rotated to align with the trajectory of the stroke to make them rotationally invariant. The bullseyes have a third dimension, which is the stroke direction through each bin, split from 0-180 degrees into 45-degree increments. The bullseyes count the number of ink points falling in each bin (location relative to the center plus direction) and create a frequency vector.
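A toy version of the "bullseye" descriptor, assuming ink is given as (x, y) points; the direction dimension and the trajectory-based rotation are left out to keep it short, and the ring/wedge counts are arbitrary:

    import math

    # Count how many ink points fall into each (ring, wedge) bin around a center.

    N_WEDGES, RING_EDGES = 8, [4.0, 16.0, 64.0]   # log-spaced ring boundaries (pixels)

    def bullseye(center, ink_points):
        counts = [0] * (N_WEDGES * len(RING_EDGES))
        cx, cy = center
        for x, y in ink_points:
            r = math.hypot(x - cx, y - cy)
            ring = next((i for i, edge in enumerate(RING_EDGES) if r <= edge), None)
            if ring is None or r == 0:
                continue                                  # outside the bullseye (or the center itself)
            wedge = int(((math.atan2(y - cy, x - cx) + math.pi) / (2 * math.pi)) * N_WEDGES) % N_WEDGES
            counts[ring * N_WEDGES + wedge] += 1
        return counts

    print(bullseye((0, 0), [(1, 1), (10, 0), (-3, 5), (200, 200)]))  # last point lands outside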

The frequency vectors are compared against a codebook of parts created by training on pre-labeled shapes. The labeled shapes' parts are clustered (using QT clustering), with the mean vector of each cluster being put into the codebook. An example shape to be classified has its constituent bullseye vectors (its parts) compared to every entry in the codebook, giving a measure of distance (normalized sum of squared differences between vector elements). The minimum distances between the example's parts and the codebook entries form a match vector. The match vectors are classified using support vector machines (one-vs-one strategy) to assign each example its label.

To extract individual shapes from a full diagram, Oltmans marches along the strokes and creates windows of multiple, fixed sizes (to account for different scaling of the shapes). The ink in each window is classified, and if it is close enough to a given shape, it is put in the list of candidate shape regions. The candidate regions are clustered using EM, with large clusters being split into two regions and clustered again. The cluster representatives form the list of final candidate shapes, which are sent back through the classifier to obtain a shape label.

To evaluate his method, Oltmans had 10 users draw circuit diagrams with a given set of shapes to use, but no instructions on how to draw them or lay the circuit out. The 109 resulting drawings were hand-labeled. Individual shapes were extracted from the drawings by hand and classified, giving 89.5% accuracy. He also evaluated a separate isolated-shape dataset, HHreco, which has simpler shapes with little variance, and obtained 94.4% accuracy. When looking at how his method extracted shapes from the full circuit diagrams, 92.3% of the regions were extracted.

Discussion



There are a lot of things I have to ask, since it's a large body of work (there's not much to pick apart in 6 pages compared to 90). So I'll be brief and bullet-point things.


  1. Were any dimensionality reduction techniques employed, especially on the codebook, to try and figure out which parts were the most indicative of shape class? With so many clusterings and parts (every 5 pixels), it seems this might have been advantageous. It might also mean you can originally construct a full codebook (not limit to 1000 parts/100 clusters) and whittle it down from there.

  2. You seem to claim complete visual recognition and avoidance of style-based features, yet you use direction bins as features in the bullseyes. Does this really buy you anything, since a user can draw however she wishes? You're already orienting to the trajectory.

  3. When extracting features from circuit diagrams, you have a special WIRE class to catch your leftovers. How hard would this be to generalize? Would you just throw out anything that's not labeled with enough probability?

  4. In your results, you give confusion matrices with the recall for each class shape. It seems like the classes with more training examples had better recall. Did you find this to be strictly true (obviously it's true to some extent), or is it simply that some shapes are harder than others?

  5. Support vector machines are complex and very expensive, especially when you need O(n(n-1)/2) = O(n^2) of them (in the number of classes). Were any other classifiers tried? Something a little simpler?

  6. Did you experiment with the granularity of the bins, or was that the job of Belongie et al.? Did you try different spacings of the bullseyes (rather than every 5 pixels)?

Thursday, October 25, 2007

Oltmans -- ASSISTANCE

Oltmans, Michael, and Randall Davis. "Naturally Conveyed Explanations of Device Behavior." PUI 2001.

Summary



Oltmans created a system called ASSISTANCE that can take a recognized sketch (recognized by the ASSIST program) and speech input (via IBM's ViaVoice program) and construct causal links for bodies in a mechanical engineering sketch. The example given in the paper is a Rube Goldberg machine that involves several bodies, springs, a pulley, and levers. The system is given force arrows that indicate direction and spoken relationships that give more cause/effect information.

The point of the system is that most current design systems, like CAD, require the user to put far too much thought and detail into the early phases of design. For instance, when trying to create a Rube Goldberg machine to crack an egg (the example above) that involves a spring, you don't want to have to specify things like the spring coefficient or the elasticity of the various bodies. Instead, you just want to rough out the idea to see if things work. Using ASSISTANCE, you first sketch a diagram, which is interpreted offline by another program, ASSIST. You then annotate the diagram with multimodal explanations of the behavior (not specific parameters) of the system, using drawings such as arrows, speech, and pointing.

Given the explanation annotations, ASSISTANCE adds them to a rule-based system. The system can draw consistent conclusions from the set of behaviors, yielding a series of causal structures. These structures describe how things happen in the system (body 1 hits body 2 and causes it to move, etc.). Searching for various causal links can be time consuming, but the authors find it to be very quick most of the time.
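A very loose illustration of chaining annotated cause/effect statements into a causal structure; this is nothing like ASSISTANCE's actual rule and truth-maintenance machinery, and the events below are made up:

    # Follow (premise, conclusion) pairs distilled from sketch + speech
    # annotations into a single causal chain.

    RULES = [
        ("ball released", "ball hits lever"),
        ("ball hits lever", "lever rotates"),
        ("lever rotates", "spring compresses"),
        ("spring compresses", "egg cracks"),
    ]

    def causal_chain(start_event):
        chain, current = [start_event], start_event
        while True:
            nxt = next((c for p, c in RULES if p == current), None)
            if nxt is None:
                return chain
            chain.append(nxt)
            current = nxt

    print(" -> ".join(causal_chain("ball released")))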

No experimental results were given as to the system's efficacy besides their own experiences.

Discussion



Speaking seems like a nice way to get more information to the design program. This is the first paper I've seen (not necessarily the first written) where sketching and speaking are combined in a multimodal interface. It seems like some of the issues that plague both fields (context, intention, etc.) could be resolved, with one mode of input helping to clarify the other. Generally, the more information we have (a user speaking about the sketch rather than just the static sketch), the better equipped we are to do a good job. However, the obvious problem is that both speech and sketch recognition are Hard Problems (TM) and can be very daunting on their own, let alone when combined. Luckily for Michael, he used existing software.

The authors say that their system tends to not require exponential time to search the rule-based system for a consistent set of causal links. However, the worst-case running time is indeed exponential. It seems like they're just getting lucky because their domain is very small and limited. How much would adding a richer vocabulary and more domains increase the complexity of the system? Obviously exponentially. Would additions mean that the truth-management system more often encountered exponential running times? Rule-based / knowledge-based deductive systems are neat, but they are extremely complex and in general are working to solve problems that are NP-hard, at least. Expecting to get good performance out of them, using anything other than an approximation, is foolhardy.

I wish there were results, at least a usability study where people rated their system. But alas, there were not. I really don't want to hear the authors' opinions about their own system, since they want to get published and aren't going to say anything like "Our system didn't work at all and was completely counter-intuitive." Not to say that it is, or that's how I'd feel if I tried it out; that's hyperbole.

Gross and Do -- the Napkin

Gross, Mark, and Ellen Yi-Luen Do. "Ambiguous Intentions: a Paper-like Interface for Creative Design." ACM UIST 1996.

Summary



Gross and Do seek to implement a system where the designer can sketch plans for a system with a great deal of ambiguity and flexibility. This prevents the initial phase of the design from being too cumbersome and restrictive and saves possible interpretations and constraints for application later in the drawing process. This also relieves the system of some of its responsibilities as it is not forced to make accurate predictions for all drawn objects right away, some can be moved off to a later time when more contextual clues are available.

Their system recognizes drawn primitives (called glyphs: things like boxes, circles, etc., and also lines and points) using a simple templating approach. A stroke is analyzed and compared to stored example strokes, and is classified as a certain type of glyph if it matches one of the templates with enough certainty. Matches are ranked according to certain thresholds, with things like context helping to break close ties.
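A rough guess at what a grid-path template matcher in the spirit of their glyph recognizer might look like; the 3x3 grid and the cell-transition encoding are my assumptions, not necessarily their exact scheme:

    # Encode a stroke as the sequence of 3x3 bounding-box cells it passes
    # through, and compare that code against stored example codes.

    def grid_code(points, grid=3):
        xs, ys = [p[0] for p in points], [p[1] for p in points]
        w = (max(xs) - min(xs)) or 1.0
        h = (max(ys) - min(ys)) or 1.0
        code = []
        for x, y in points:
            col = min(int((x - min(xs)) / w * grid), grid - 1)
            row = min(int((y - min(ys)) / h * grid), grid - 1)
            cell = row * grid + col
            if not code or code[-1] != cell:      # record only cell transitions
                code.append(cell)
        return tuple(code)

    TEMPLATES = {grid_code([(0, 0), (10, 0), (10, 10), (0, 10), (0, 0)]): "box"}

    def classify(points):
        return TEMPLATES.get(grid_code(points), "unknown")

    print(classify([(0, 1), (9, 0), (10, 9), (1, 10), (0, 2)]))  # a sloppy box still matches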

Context is established by the user giving the system examples. For instance, to say that four little squares sitting around a large square means a dining room table, the user would draw the boxes and edit the constraints on the various shapes. However, it might be the case that in certain contexts, those 5 boxes were a table (in a room plan for a house, for instance), or it might mean they are houses situated around a pond (in a landscaping drawing). Thus, the users can define different contexts that the system can identify by looking for symbols specific to one context or another. Contexts are kept in chains from specific to general. So for instance, both the context chains for room plans and landscaping would contain the symbols of 5 boxes, with different meanings in each chain.

Discussion



The definition of the constraints/contextual clues requires user input. It also seems like these constraints might be very rigid. Especially when the user has to draw examples of the constraints and context clues, this seems like a very time consuming process. Coupled with the difficulty of the system generating constraint lists from the drawings (which we saw in LADDER), this seems somewhat inefficient. However, I suppose that's the difficulty of describing constraints with a fixed language. It's hard enough for one person to describe geometrical layouts to another person using any vocabulary of their choosing. Intent is very difficult to capture, especially with a limited grammar.

Their system for matching glyphs seems very brute-force and hacked together, in my opinion. I think it would have been nice to see some results on their recognition rates. One problem that I can see is their system of splitting the stroke's bounding box into a grid. What happens if you need more detail? Your system won't be able to add new shapes that need finer granularity in their paths. Or, if you do add more granularity, permuting the possible paths based on rotation and reflectivity becomes more difficult. Also, templating requires a lot of overhead. You have to save all the templates and make O(n) comparisons, which is not the case for other classification methods (like linear classifiers). I can see the strength in templating (can help avoid user- and domain-dependent features like Rubine and Long) but it feels like their version is hacked together. Maybe because they're architects.

Tuesday, October 16, 2007

Herot -- Graphical Input Through Machine Recognition of Sketches

Herot, Christopher. "Graphical Input Through Machine Recognition of Sketches." SIGGRAPH 1976: 97-102.

Summary



Herot's paper, from the 1970s, describes a system that seeks to find a way of recognizing sketching apart from the semantics of context and domain (i.e. recognizing low-level primitives and combining them hierarchically), a way of using domain context to construct higher-order shapes, and finally a way to allow the user direct involvement in the recognition process to tune the system's capabilities to that user's style.

Herot's system, ancient by modern standards, included a microcomputer, several types of tablets for sketching input, and even a storage tube! Input to the system was via the passive tablet, over which a large piece of paper was taped and drawn on using a modified graphite pencil. The pencil's eraser allowed drawing a "negative line" to remove points from the system. The pen's position was sampled at a constant rate determined by a user-programmable variable. The points were used by the HUNCH system to perform corner detection using speed and curvature data. HUNCH fit lines, and a separate routine, CURVIT, fit B-splines to strokes. Herot found, however, that setting the thresholds for recognizing lines versus curves was difficult and tended to be user-dependent.

Herot's system also performed endpoint combination, called latching. The first iteration used a fixed radius and joined any points within that radius, which gave error-prone results when the drawing was at a smaller scale than the radius. He tried to take the speed of the strokes into consideration to gauge how intentional the endpoints were, but still had problems, especially incorrectly latching 3D figures drawn on the 2D tablet. He handles overtracing by trying to turn several lines into one.
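A naive fixed-radius latcher like the first iteration described above (the radius value is arbitrary), which also makes the scale problem easy to see:

    import math

    # Any two stroke endpoints closer than LATCH_RADIUS get merged to their
    # midpoint. Shrink the drawing below the radius and everything latches to
    # everything.

    LATCH_RADIUS = 10.0

    def latch(endpoints):
        merged = list(endpoints)
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if math.dist(merged[i], merged[j]) <= LATCH_RADIUS:
                    mid = ((merged[i][0] + merged[j][0]) / 2, (merged[i][1] + merged[j][1]) / 2)
                    merged[i] = merged[j] = mid
        return merged

    print(latch([(0, 0), (3, 4), (100, 100)]))  # the first two endpoints snap together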

The second part of the system incorporates context from the architectural domain to give the sketch system more understanding. Basically, he seeks to build a bottom-up hierarchical recognizer that uses context to put things together. He also looks at top-down approaches, finding a "room" by looking for its "walls." He notes the need for some sort of ranking system, or something similar, in order to mitigate the effects of "erroneous or premature matches."

Finally, Herot constructs an interactive system where the user can help guide the recognition decisions and modify the system's operating parameters through feedback so it adjusts to the user's personal drawing style.

Discussion



Wow! Curvature and speed data used for corner detection in the 1970s. So I guess Sezgin's work wasn't all that groundbreaking then? Sure, he did it on modern computers and did a much better job of it, but is that an artifact of improved technologies or his mighty powers as a researcher? Seeing this reference to curvature data roughly 25 years before Sezgin's paper, I'm tending to lean toward simple advances in computing power and the level of knowledge in the field. Not to dismiss Sezgin's mightiness, on any account. He did things I could never do. But even he stands on the shoulders of his predecessors.

A nugget of brilliance:
A human observer resolves these ambiguities through the application of personal knowledge and years of learning how to look at pictures.... Before a machine can effectively communicate with a human user about a sketch [sic] it must possess a similar body of knowledge and experience.

Exactly! People keep crying out and gnashing their teeth for systems that capture human intention. They want systems that can perform better than humans! For simple systems and domains this is not a problem. In the future, I'm sure we can get into harder domains as the horsepower of the computers and intelligence of the algorithms increases. But humans grow up and learn these things over 20+ years with constant reinforcement-learning, and they automatically incorporate things like context and prior-knowledge that a computer can never know (that's a lot of knowledge to program and sift through). That's not something you can simply program and execute as a method call.

The construction of the hierarchical system was interesting to read about, seeing as how everything he called for now exists in some form or fashion. However, Herot was very pessimistic, and ultimately incorrect, in believing that hierarchical recognition would require context even "at the lowest levels" (primitives can be learned automatically--see Paulson's recognizer) and that successful approaches would require knowledge-based systems (i.e., artificial intelligence, which is not the case anymore).

Lastly, it would appear Herot's tablet is passive since it just uses a pen, probably some sort of pressure sensitivity on a narrow tip. Does this prevent me from resting my arms/hand on it while I draw?

Thursday, October 11, 2007

Veselova - Shape Descriptions

Veselova, Olya and Randall Davis. "Perceptually Based Learning of Shape Descriptions for Sketch Recognition." AAAI 2004.

Summary



Veselova and Davis want to build a language for describing sketches arranged into iconic shapes and a system that uses said language to recognize instances of the shapes. Some of the difficulty with this problem lies in the subjective nature of perception, and what different people mean by "different"/"same" or features that are "significant" in making these judgment calls. The psychology of perception is studied and the authors propose a system of extracting the significant features and weighting them in order to make the decision on whether or not two shapes should be regarded as being the same.

Based on perceptual studies, the authors group the description vocabulary that is used to describe constraints between primitives in the shapes into two categories. The first group is the singularities. These are the properties of a shape where "small variations in them make a qualitative difference" in the perception of a shape. Examples of singularities are the notions of vertical, horizontal, parallel, straight, etc. The vocabulary was constructed and each constraint was examined on a set of shapes to determine subjectively which constraints made the most difference in determining similarity. Accordingly, each constraint is assigned a rank.

The importance of each constraint is not defined only by its rank relative to its peers, but is also adjusted using heuristics that take into account obstruction, grouping, and tension lines. Obstruction is a measure of how many primitives lie between two objects: if the obstruction is high, the user is less likely to intend a constraint between them; if it is low, a constraint between them is more plausible. Tension lines describe how the endpoints and midpoints of lines align with each other; the human visual system likes things well aligned, so constraints between aligned parts are boosted. Grouped shapes are treated as wholes rather than as individuals, so constraints within a group are boosted, while constraints between parts in different groups are weighted less.
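A toy version of the weighting scheme as I read it; the base ranks, penalty multipliers, and cutoff are all invented, not the paper's values:

    # Each constraint carries a base rank, adjusted by multipliers for
    # obstruction, tension-line alignment, and grouping; a candidate is accepted
    # if the weighted fraction of satisfied constraints clears a cutoff.

    BASE_RANK = {"parallel": 1.0, "horizontal": 0.9, "connected": 0.8, "same-length": 0.4}

    def constraint_weight(kind, obstruction=0, aligned=False, same_group=True):
        w = BASE_RANK[kind]
        w *= 1.0 / (1.0 + 0.5 * obstruction)   # heavily obstructed pairs matter less
        if aligned:
            w *= 1.2                            # tension-line alignment boosts importance
        if not same_group:
            w *= 0.5                            # cross-group constraints matter less
        return w

    def same_shape(constraints, cutoff=0.75):
        """constraints: list of (kind, kwargs, satisfied_in_candidate)."""
        total = sum(constraint_weight(k, **kw) for k, kw, _ in constraints)
        kept = sum(constraint_weight(k, **kw) for k, kw, sat in constraints if sat)
        return kept / total >= cutoff

    print(same_shape([("parallel", {}, True), ("horizontal", {"aligned": True}, False), ("same-length", {}, True)]))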

To evaluate their work, the authors had 33 users look at 20 variations ("near-miss" examples, similar to Hammond and Davis' work) of 9 shapes. The variations were chosen so that half still met the constraints and half did not. The users answered "yes" or "no" for each variation depending on whether they felt it still counted as an example of that shape. For shapes where 90% of the users agreed among themselves on the right answer, the system matched the majority 95% of the time (the same rate as a user selected at random). For shapes where 80% of the people agreed, the system got 83% while a random user would get 91%. Thus, the system was able to, for the most part, give an answer on shape similarity that matched the majority of the human answers.

Future work for this project is to get a shape recognizer built that can use this system, be able to use more than just lines and arcs, build in a way of expressing extremes and "must not" constraints, and handle more than just pairwise constraints. Also, they want the system to be able to handle objects that have an arbitrary number of primitives (like a dashed line--how many dashes?).

Discussion



It sounds like the authors used their own opinions to determine the ranking values for the constraints listed in the table in Section 3.2. I wish they would have performed a user study. I can see where, on the one hand, they would be better suited to assigning these ranking values because they can understand the details of the problem. They have the expert knowledge of knowing what to look for and what might make the most difference. However, that same knowledge is what makes their opinion less credible. If you build a real-world system, it's not going to be used by experts 24/7. A bunch of naive and ignorant (to the domain and problem of sketch recognition, not in general) users are going to be making their own decisions as to what constitutes similarity and what things are important in making those judgment calls. In short, I would have liked to have seen more user input in the vocabulary ranking stage.

I wonder how the scalar coefficients were selected for the penalties on obstruction, tension lines, and grouping. How much do these values affect the final outcome of the prediction? I would like to see a graph of results vs. values for these variables. It just seems like these were sort of picked as a best-guess measure. Surely they weren't, but the paper gives me no other reason to believe otherwise. In their experiments section, the authors state that "people seemed to pay less attention to individual detail (aspect ratio, precise position, etc.) of the composing shapes in the symbol than the system biases accounted for." I wonder, then, if these biases could be adjusted by experimenting with varying the constraint ranking system and the coefficients listed above.

Also, they said they don't have a shape recognition engine built yet. So I am confused as to how they performed the experiments. Is it that they don't actually recognize any shapes, but they generate the 20 examples by hand and manually violate constraints? If so, this seems like a Bad Thing (TM) that is very prone to error and more subjectivity. Maybe they can borrow Hammond's near-miss generator.

But, after saying all that, I liked this paper. The authors seem to have found a way to say which constraints are more important than others. And, to their credit, they seem to get decent results agreeing with the majority of users. However, without having a sketch recognition engine to do these things automatically, I wonder how biased their results are.

Tuesday, October 9, 2007

Hammond and Davis -- Constraints and Near-Miss Examples

Hammond, Tracy and Randall Davis. "Interactive Learning of Structural Shape Descriptions from Automatically Generated Near-Miss Examples." IUI 2006.

Summary



Hammond and Davis are concerned with shape description languages and the way they are used both by human users and by computer programs that automatically generate descriptions from a shape. The problem is that descriptions tend to be either over-constrained, in that the description is too specific to one particular example and not generalized enough, or under-constrained, in that the description does not capture all of the defining characteristics of the shape. The goal, then, is to create a system that helps the user identify both unnecessary and missing constraints using automatically generated near-miss examples, with the user telling the learning algorithm whether each example is a positive instance of the shape or not.


The system starts with a hand-drawn shape and a set of constraints that provide a positive match for that shape. If the set of constraints is automatically generated from the shape, the match is guaranteed to be positive but lacks the flavor of capturing user intention. If the description is hand-written, user intention is captured, but the constraint list might not accurately reflect what the user intends. In either case, the set of constraints will probably need to be modified.

The authors start by removing constraints that are too specific (meaning the description is over-constrained). Constraints are tested by negating one constraint at a time and generating example shapes with that negated constraint. If the shape is said to still be a positive example of what the user intended, the negated constraint is deemed unnecessary and is discarded. Constraints that are needed will turn the generated shape into a negative example of the user's intention and will be kept. Shapes are tested for scaling and rotational invariance here, as requiring certain lines to be horizontal or of a certain length is possibly an over-constrained description.

Second, the description is tested for being under-constrained by generating a list of possible constraints (using some heuristics and filters to keep the list to a reasonable size) that are not included in the shape description. Again, one at a time, the negation of each of these constraints is added and examples are generated. If the example is said to be positive, the constraint is truly not needed (the negation still produced a valid shape). If the example is said to be negative, the constraint is needed (negating it broke the shape, so the original constraint must hold).

Shapes are generated using a set of algebraic equations that describe the shapes. The unknowns are solved using third party software (Mathematica). Once the system of equations is solved, the values can be used to draw the shape (if a solution exists).
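To make the generate-and-test loop concrete, here is a rough Python sketch of the over-constrained pruning step as I read it; negate(), generate_example(), and ask_user() are my own stand-ins for the constraint negation, the Mathematica solving step, and the user's label, not anything from the paper.

# Sketch of the over-constrained pruning step described above (my reading,
# not the authors' code). A "constraint" is an opaque object that can be
# negated; generate_example() stands in for solving the algebraic system
# (Mathematica in the paper) and ask_user() for the user's positive/negative
# label on the generated near-miss shape.

def negate(constraint):
    # Hypothetical negation, e.g. "horizontal(line1)" -> not horizontal(line1)
    return ("not", constraint)

def prune_overconstrained(constraints, generate_example, ask_user):
    """Keep only the constraints the user's intended shape actually needs."""
    kept = []
    for i, c in enumerate(constraints):
        # Negate one constraint at a time, leaving the rest intact.
        trial = constraints[:i] + [negate(c)] + constraints[i + 1:]
        example = generate_example(trial)      # near-miss shape, if solvable
        if example is None:
            kept.append(c)                     # no shape to show; keep it (my guess)
        elif ask_user(example):                # user: still a positive example
            continue                           # constraint was unnecessary; drop it
        else:
            kept.append(c)                     # needed to rule out the near-miss
    return kept

The under-constrained check is the mirror image: candidate constraints get negated and added one at a time, and a negative answer from the user means the candidate should be kept.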

Discussion



This is a pretty solid paper and is easy to understand. It doesn't feel like the authors have left out too much. One thing I am wondering about, however, is the running time and storage requirements of the algorithm. The over-constrained case doesn't seem to be a problem because you're starting with a fixed set of constraints and pruning it. However, when adding constraints, even with the heuristic method, you're going to have a large number of constraints to add. Admittedly this is a heuristic, but is it a good one? Do the authors have any way to show that this does better than some other approach? Even limiting the number to n^2, that's still n^2 systems of equations to solve, n^2 shapes to show to the user, and n^2 times the user has to say 'yes' or 'no'. Is that simply the price one has to pay to get a "perfect" description?

Overall I would like less "this is how we do it" and more "this actually works" out of the paper. How does the system do at generating final descriptions that work? How long does it take? How many constraints does it remove/add on average? It sounds like a good idea, but is it worth it to me in the real world, or is it just something that's nice to talk about?

Thursday, October 4, 2007

Hammond and Davis -- LADDER

Hammond, Tracy, and Randall Davis. "LADDER, A Sketching Language for User Interface Developers." Computers & Graphics 29 (2005): 518-532.

Summary



Hammond and Davis want to create a system that enables the rapid development of systems using sketch recognition methods. They want the flexibility to allow the user to define shape descriptions and how to display those shapes, but not to burden the user with the knowledge of exactly how to recognize those shapes given pen input.

The authors propose the LADDER language, which is composed of predefined shapes, constraints, editing behaviors, and display methods. The predefined shapes are simple primitives such as lines, arcs, circles, etc. Constraints (either hard--must be met--or soft--optional) can be placed on sets of these primitives to define higher-level shapes. Once a shape has been recognized, the user can specify how it is to be displayed, and can also specify which editing actions should be allowed on the shape. To generate the syntax for the shapes, the authors performed a user study in which they asked 30 students to verbally describe shapes using increasingly limited vocabularies. Their syntax is shape based, which is more intuitive than a feature-based syntax and is independent of drawing style. However, the language is limited in that the grammar is fixed and restricted to very iconic, regular, and simple shapes, is limited to a set of primitive constraints, and is bad at handling curves.

Shapes are defined by a set of component shapes (either primitives or other defined shapes), constraints, aliases to simplify naming, and editing and display behaviors. Shapes may be abstract for hierarchical behaviors, and may be grouped into units for chaining behaviors. Vectors of components can be used to define a variable number of sub-shapes. Shapes can be edited after drawing by specifying editing rules.
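To give a feel for what such a description looks like, here is a rough Python rendering of a hierarchical shape definition; the field names and the arrow example are my own illustration, not actual LADDER syntax.

# Rough illustration of a hierarchical shape description in the spirit of
# LADDER (the field names and the arrow example are mine, not LADDER syntax).

arrow = {
    "name": "arrow",
    "components": {            # primitives or previously defined shapes
        "shaft": "Line",
        "head1": "Line",
        "head2": "Line",
    },
    "constraints": [           # hard constraints unless marked soft
        ("coincident", "shaft.p2", "head1.p1"),
        ("coincident", "shaft.p2", "head2.p1"),
        ("equalLength", "head1", "head2"),
        ("shorter", "head1", "shaft"),
        ("acuteAngle", "head1", "shaft", {"soft": True}),
    ],
    "editing": ["drag the whole shape when the shaft is held"],
    "display": ["draw shaft and heads as cleaned straight lines"],
}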

Strokes are recognized as soon as they are drawn. The primitives are placed into a Jess knowledge-based system that is searched for any completed shapes. LADDER automatically generates the Jess code.

Discussion



The authors skirt around the issue of representing curves. They say it is hard to do and their system is not good at it, but then later say they have the primitives Bezier curve, curve, and arc. I wonder how the representations differ for each kind of curve? How are they able to generalize these structures while still maintaining user-intended semantics? Is it good enough to just say "there is a curve here," or do the control points have to be strictly defined? Overall, I wish more information on curve handling had been provided.

I liked the use of isRotatable so that I can be very strict in defining relations between components, but then allow those components to be rotated along any axis. I wonder how much computation this adds to the recognition process?

Jess seems like a very complicated route to get shape recognition. I wonder if some sort of hierarchical data structure could have been used for recognition... Will have to think about this more.

Thursday, September 27, 2007

Paulson and Hammond -- New Features and Ranking

Paulson, Brandon, and Tracy Hammond. "Recognizing and Beautifying Low-Level Sketch Shapes with Two New Features and Ranking Algorithm." In submission, 2007.

Summary



Paulson and Hammond construct a system for recognizing low-level primitives made with single strokes, combining the primitives together in a hierarchical fashion, and beautifying the strokes. Previous approaches to this type of work used user- and style-dependent features (e.g., Rubine, Long) that required many training examples. Corner-detection methods (e.g., Sezgin, Yu and Cai) are able to find primitives but not to construct complex hierarchies out of them for beautified strokes. The authors' improvements are two new features, plus a ranking system to choose the best of many possible fits to complex strokes.

The first stage of their system is pre-recognition computation: features, overtrace detection, and stroke closure detection. The first of the two new features is the normalized distance between direction extremes (NDDE), which measures the fraction of the stroke that lies between the direction extrema (arcs score high; polylines with many direction changes tend to score lower). The second new feature is the direction change ratio (DCR), the ratio of maximum direction change to average direction change (similar to acceleration: polylines change often and dramatically and score high, while arcs are smoother and score low). Overtracing is detected by looking for a consistent direction change spanning more than 2*pi. Stroke closure is tested by making sure the endpoints are "close enough" together.
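Here is a small Python sketch of how I picture NDDE and DCR being computed from a stroke's direction values; direction unwrapping, degenerate strokes, and the paper's exact normalizations are glossed over.

import math

def directions(points):
    """Angle of each segment of a stroke given as (x, y) tuples."""
    return [math.atan2(y2 - y1, x2 - x1)
            for (x1, y1), (x2, y2) in zip(points, points[1:])]

def stroke_length(points):
    return sum(math.hypot(x2 - x1, y2 - y1)
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

def ndde(points):
    """Normalized distance between direction extremes: fraction of the stroke's
    length lying between the points of minimum and maximum direction."""
    d = directions(points)
    i_min, i_max = d.index(min(d)), d.index(max(d))
    lo, hi = sorted((i_min, i_max))
    return stroke_length(points[lo:hi + 2]) / stroke_length(points)

def dcr(points):
    """Direction change ratio: maximum direction change over average direction change."""
    d = directions(points)
    changes = [abs(b - a) for a, b in zip(d, d[1:])]
    return max(changes) / (sum(changes) / len(changes))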

The strokes are then fed into a series of primitive recognizers to test for lines, polylines, ellipses, circles, arcs, curves (Bezier of order < 5), spirals, and helixes. If none of the primitive tests pass, a complex shape is constructed by splitting the stroke at the point of highest curvature and recursively feeding the substrokes into the primitive recognizer (similar to Yu). Once a complex fit is obtained, the substrokes are examined to see if any can be recombined. The interpretations (primitive or complex) that are deemed to fit the stroke are then ranked according to complexity. The ranking procedure tends to choose simpler models (in the number of primitives required to construct them), based on empirically set costs for certain primitives.

Their system runs very quickly compared to a modified Sezgin recognizer, taking only half the time, on average, to classify a stroke (though both are very fast). Their method also achieved an impressive 98-99% accuracy classifying test shapes. Their difficulties arose when the system had to decide between an accurate polyline fit and a simpler model of fewer primitives. Their method consistently outperformed Sezgin's algorithm, and is able to classify many more types of strokes.

Discussion



First, it's nice to see a paper that clearly defines their thresholds and constants. Second, it's nice to see a paper that gets such high accuracy levels on many types of primitives and combinations into complex shapes. I was starting to get discouraged that the problem of sketch recognition was 'too hard.'

The authors mention detecting overtraced circles. I wonder about overtraced lines, say if the user wanted to make sure there was a connection between two objects. I assume that currently these would be treated like a polyline that just curves back and forth on itself. I wonder how difficult it would be to detect overtraced lines and construct 'heavy' lines out of them.

Also, the authors have no detection methods for anything rectangular. It would be nice to see four lines of a polyline combined into a square or rectangle. Furthermore, it seems like the hierarchical fitting system could be extended to arbitrary regular polygons. Of course, the difficulty lies in deciding how many "automatic" shapes are too many to compare against. However, if the classification is hierarchical, perhaps these tests would only be performed if the stroke were closed, and then matched to the appropriate n-gon depending on the number of sub-strokes (lines in the polyline).

I thought the ranking system was a unique idea. I felt this was one of the strongest points of the paper, and judging by the results in Table 1, it provides more accuracy by itself than the new features do. I wonder if there is a way to compute ranking costs other than empirical observation. Perhaps not, since judgments of accuracy are going to be empirical anyway (does the user think it looks right?).

Abam - Streaming Algorithms for Line Simplification

Abam, Mohammad Ali, et al. "Streaming Algorithms for Line Simplification." SCG, 2007.

Summary



Abam et al. present a few algorithms for simplifying the set of points in a line into a simpler representation. The focus of the algorithms seems to be tracking an object through space. After a certain amount of time, the finite storage of the system doing the tracking is exhausted. To compact the path representation, line simplification algorithms are used to modify the stored path and reduce storage requirements.

They consider two distance metrics to measure the error of the simplified path relative to the real path. The Hausdorff error between two point sets is the largest distance from a point in one set to its nearest point in the other set (not the maximum over all pairs, which would be the complete-linkage distance used in agglomerative hierarchical clustering). The other metric is the Fréchet distance, which can be thought of as the minimum length of a leash needed so that a man can walk along one line and his dog along the other, where either may stand still but neither may walk backward.
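For reference, here is a brute-force Python version of the Hausdorff distance between two point sets; the whole point of the paper is that the streaming setting cannot afford this kind of computation over all stored points, so this is just the target quantity, not their algorithm.

import math

def _dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def directed_hausdorff(A, B):
    """Largest distance from a point of A to its nearest point of B."""
    return max(min(_dist(a, b) for b in B) for a in A)

def hausdorff(A, B):
    """Symmetric Hausdorff distance between point sets A and B.
    Brute force, O(|A||B|) -- fine offline, hopeless in a streaming setting."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))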

The algorithms generally work by maintaining a certain number of points in the line simplification, called the set Q. As the algorithm progresses, the point in Q whose removal introduces the minimum error (distance from the true line) is removed from the simplification, so the simplification gets a little worse, but not by much.

Discussion



This was, simply put, a hard paper to read.

I can see the applications of this paper. For path simplification in a situation where the entire, detailed history of an object's motion is either not necessary or too large to fit in memory, these algorithms work great. However, I'm not sure about the application of these algorithms to sketch recognition. It seems like we would lose far too much detail in the simplification for it to be of any use to recognizers later.

I think stroke compression for sketch recognition could be done better in two steps. First, resampling can be used to reduce the number of points while still maintaining whatever granularity of detail is desired. Second, a sketch is inherently made up of certain primitives that can be reduced to their cleaned versions to reduce the amount of space needed to represent them. For example, a line or polyline could be represented by just its corners/endpoints, and an arc by its endpoints and radius.

Thursday, September 20, 2007

Kim and Kim - Curvature Estimation for Segmentation

Kim, Dae Hyun, and Myoung-Jun Kim. "A Curvature Estimation for Pen Input Segmentation in Sketch-Based Modeling." 2006.

Summary



Kim and Kim look at the issue of using curvature to determine segmentation points for pen-drawn strokes. This is a recurring theme in the literature, including its use by Sezgin and by Yu. Kim and Kim's contribution is two new methods of computing curvature. They make the assumption that every point drawn by the user is intentional and there is no noise in a gesture.

The authors perform a sort of smoothing on the dataset, resampling the stroke points to produce a set of data where the points are approximately a certain distance apart. The specifics of how this operation is carried out are not discussed, but they do say they use a distance of 5 pixels. This resampled set of points makes up a stroke. Changes in direction are computed per point as usual.

Curvature is defined as the change in direction divided by the length over which that change occurs. However, because resampling has occurred and the points are now approximately equally spaced, curvature simplifies to just the change in direction at each point.

The authors then propose two methods for adjusting the curvatures based on support neighborhoods around each point. The neighborhoods are initially of size k, so the support for point p_i is points p_(i-k) through p_(i+k). The authors look at the convexity of the curvature changes moving away from p_i, adding the curvature of point p_j (for i-k <= j < i, and i < j <= i+k, moving away from p_i in either direction) only if the curvature at p_j has the same sign as the curvature at p_i. In other words, any curvature that is occurring in the same direction (clockwise or counter-clockwise) is added together to bolster the curvature of p_i (within the window of k points on either side of p_i).

This is a problem when there are several consecutive points (say, on an arc) that all have the same sign. In this case, the curvature values of those points would all be equal, since they all get the sum of the curvatures in their neighborhoods. To fix this problem, the authors add an additional constraint: not only must the curvature of neighboring points have the same sign, but the curvature values must also be monotonically decreasing in order to be added. This has the effect of weighting certain points with a higher curvature.
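Here is my Python reading of that support-neighborhood weighting, assuming the stroke has already been resampled to roughly equal spacing; I use non-increasing magnitude for the monotonicity check, and the variable names and boundary handling are mine.

def weighted_curvature(curv, i, k):
    """Sum the curvature at point i with same-sign, monotonically decreasing
    curvatures in its k-neighborhood (my reading of the method described above)."""
    total = curv[i]
    sign = 1 if curv[i] >= 0 else -1
    # Walk outward in each direction, adding while the sign matches and
    # the magnitudes keep decreasing.
    for step in (-1, +1):
        prev = abs(curv[i])
        j = i + step
        while 0 <= j < len(curv) and abs(i - j) <= k:
            if (curv[j] >= 0) != (sign >= 0):
                break                       # convexity (sign) check failed
            if abs(curv[j]) > prev:
                break                       # monotonicity check failed
            total += curv[j]
            prev = abs(curv[j])
            j += step
    return total

def weighted_curvatures(curv, k=3):
    return [weighted_curvature(curv, i, k) for i in range(len(curv))]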

The authors compare their curvature estimation methods with the circumscribing-circle and curve-bending-function methods, finding theirs to be comparable and arguing that their approach is justified. They then present an evaluation experiment, showing the results of gesture segmentation using support neighborhoods alone, using convexity information alone, and using both the convexity and monotonicity measures, compared against another method. They claim 95% success in detecting the correct segmentation points.

Discussion



It seems strange to me that the authors assume that all the input points are intentional and that there is no noise. But then, they resample the data! If you assume there is no noise in the data and that all the points are intentional, why do you throw away features by resampling? They show how it makes the curvature calculations simpler, but is computing the curvature really that hard to begin with? Even if you have to divide by the length of the segment to calculate the initial curvature at each point, you can still use the convexity and monotonicity checks to weight each curvature. I also have a problem with the assumption that there is no noise to begin with. I understand that you either assume noise or you don't, and that either way has its pros and cons. But until input devices become 100% accurate, and until the striated muscle fibers in my arm, hand, and fingers quit tensing and relaxing multiple times per second (hold your hand out in front of you and try to hold it perfectly still--you can't unless you're a robot and you assume outside factors don't affect your motion), there will ALWAYS be noise. Now, the hard part is determining what is true noise and what is an intentional deviation. And guess what, this is also impossible, because what might look statistically like noise might really be me wanting to make intentional, small serrations in my otherwise-straight line. There is absolutely no way to guess user intention, in a general sense, with 100% accuracy (not even a human can do that 100% of the time, much less a computer).

</rant>

However, aside from that soap-box issue, I love the way that curvatures are weighted in the paper. It's nice to see curvature weighting methods, especially when I'm facing the problem in my own segmentation implementations of figuring out which of the gobs of curvature-based segmentation options to use. I also like how they use speed information to change the threshold for determining whether a curvature change is intentional, assuming that lower speeds mean more deliberate drawing and giving those curvatures more weight. I do wish, however, that this had been discussed more than in passing.

Tuesday, September 18, 2007

Yu - Corners and Shape Recognition

Yu, Bo, and Shijie Cai. "A Domain-Independent System for Sketch Recognition." 2003.

Summary



The authors present work built directly off Sezgin's basic corner detection algorithms. Their system recognizes primitives (line segments, arcs, circles...) and puts them together via basic object recognition. The deficiencies in current systems they wish to address are strict restrictions on drawing style and the inability to decompose curves (smooth or hybrid) into their constituent primitives. To this end, they wish to allow users to draw naturally (multiple strokes are fine), provide consistent recognition results, group primitives hierarchically, and beautify noisy strokes into higher-level shapes.

First, the system performs stroke approximation immediately after the stroke is drawn. The points of the stroke, consisting of (x, y, timestamp) tuples, are examined, and direction and curvature graphs are computed, very similarly to Sezgin's method (the curvature definition here is slightly different). Vertices are declared to be the peaks in curvature. To approximate a stroke with primitives, the system first tries to fit the entire stroke with a single primitive. If this fails, the stroke is split into substrokes at the point of highest curvature, and approximation recurses on the substrokes until every portion of the stroke is approximated by a primitive.
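The control flow is easy to sketch; here is a hedged Python outline of that recursive loop, with fit_primitive() and max_curvature_index() as hypothetical stand-ins for the paper's fitting and curvature code.

def approximate(stroke, fit_primitive, max_curvature_index):
    """Recursively approximate a stroke (a list of points) with primitives.
    This is my outline of the approach described above; the actual fitting
    details live inside fit_primitive()."""
    primitive = fit_primitive(stroke)       # try line, then arc/circle/ellipse
    if primitive is not None:
        return [primitive]
    split = max_curvature_index(stroke)     # point of highest curvature
    if split <= 0 or split >= len(stroke) - 1:
        return [("polyline", stroke)]       # nothing left to split; fall back
    return (approximate(stroke[:split + 1], fit_primitive, max_curvature_index) +
            approximate(stroke[split:], fit_primitive, max_curvature_index))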

Line approximation occurs first. A stroke is checked for being a line by examining the direction graph (which should be horizontal--no changes in direction) and the actual points in the stroke (which should fit a straight line). If either of these criteria fits a least-squares best-fit line with few "deviations," a line is fit between the stroke endpoints. Valid fits are those that minimize the ratio of feature area to standard area.

If a stroke is not a line, it is fed to the curve approximation routine. The direction graph for a smooth curve should be linear (constant change in direction). Their system can fit circles, ellipses, and arcs. It can even fit overtraced curves (where the length of the direction graph exceeds 2*pi) by splitting the direction graph into lengths of 2*pi and fitting each length; if the resulting circles are similar, they are averaged, and if their radii steadily decrease, the stroke is turned into a helix. Like lines, circles and arcs are fit by computing the ratio of feature area to standard area, seeking to minimize this ratio.

Post processing includes removing noisy points (that occur at the beginning or end of a stroke), merging similar strokes, and inferring constraints such as connectivity, parallelism, and perpendicularity. Sets of heuristics can be used to combine primitives (such as four line segments) into higher level objects (such as rectangles).

Experimentally, their system is able to identify primitive shapes and polylines with 98% accuracy, and arcs alone with 94% accuracy (crude arcs are incorrectly recognized as a set of lines and arcs). Higher-level shapes combining primitives are only recognized at about 70%, with problems encountered at smooth connections (since their method only uses curvature).

Discussion



It seems like their definition of curvature is different from Sezgin, normalizing the change in direction over a neighborhood by the stroke length. I wonder how this compares to Sezgin as far as vertex prediction. I was also surprised they did not use speed data. I think Sezgin made a good point for using both speed and curvature.

This paper is neat in that it uses primitive approximations to drive vertex detection, rather than finding vertices and then trying to fit primitives to the spaces between them. I think speed data could and should have been used in some sort of hybrid method like Sezgin, where instead of Sezgin's "just add until we get below a certain error", we pick the vertex with the highest certainty metric and use this to divide the stroke, trying to fit primitives to the substrokes. Basically, a combination of the two methods.

Again, we have a lot of talk about thresholds and limits without describing what these are, how they were derived, and why they are valid. I found it amusing that they say "here [looking at curvature data] we avoid any empirical threshold or statistical calculation for judging vertices." Well sure, because they push all that hand-waving down into the primitive approximation processes. Which, in the end, still equates to judging vertices. Examples of thresholds not well defined:

  • Line approx - number of points "deviating" from best-fit line
    • What does it mean to deviate?
    • What's the "relatively loose" threshold?

  • Cleanup - "very short" line segments for deletion/merging, how short?

Thursday, September 13, 2007

Early Processing for Sketch Understanding

Sezgin, Tevfik Metin, et al. "Sketch Based Interfaces: Early Processing for Sketch Understanding." PUI 2001.

Summary



Sezgin et al. seek to develop a system that allows users to directly draw sketches without restrictions (number of strokes, direction of strokes), and to have those sketches "understood" (approximation phase) in terms of their component lines and arcs. The system can then combine these into low-level primitives (rectangles, squares, circles, etc.) as appropriate, beautify the drawing (second phase), and perform higher-level understanding on the primitives (basic recognition phase).

Their approximation subsystem, which divides strokes into line segments and arcs, works by detecting 'corners', or where two line segments connect. Corners are detected by examining the speed (intuitively, users tend to slow down when they draw meaningful corners) and curvature (a corner is, obviously, a change in direction) of the sketch. Average-based filtering is used to detect speed minima (low speeds denote corners), F_s, and curvature maxima (a high change in direction denotes a corner), F_d. The set of vertices that denote corners in the drawing is compiled from the sets F_d and F_s using an iterative technique that adds the most "certain" vertex from each set (certainty is computed differently for speed and curvature, and tells approximately how 'significant' the values of speed and curvature are at those points), picking between the most certain vertices using least-squared-error fitting. The iterations continue, adding points by both speed and curvature, until the error of the fit drops below a certain threshold.
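Here is a rough Python sketch of how I picture the average-based filtering step; the threshold multipliers are my guesses, not values from the paper.

def candidate_corners(speed, curvature, speed_frac=0.9, curv_frac=1.0):
    """Candidate corner indices F_s (speed minima) and F_d (curvature maxima).
    A sketch of average-based filtering; the multipliers are assumptions."""
    mean_speed = sum(speed) / len(speed)
    mean_curv = sum(abs(c) for c in curvature) / len(curvature)
    # F_s: within each run of points slower than a fraction of the mean speed,
    # keep the slowest point.
    f_s = _extrema_in_runs(speed, lambda v: v < speed_frac * mean_speed, min)
    # F_d: within each run of points with curvature above the mean, keep the
    # point of highest curvature.
    f_d = _extrema_in_runs([abs(c) for c in curvature],
                           lambda v: v > curv_frac * mean_curv, max)
    return f_s, f_d

def _extrema_in_runs(values, in_run, pick):
    """Index of the pick()-extreme value inside each maximal run where in_run holds."""
    result, run = [], []
    for i, v in enumerate(values):
        if in_run(v):
            run.append(i)
        elif run:
            result.append(pick(run, key=lambda j: values[j]))
            run = []
    if run:
        result.append(pick(run, key=lambda j: values[j]))
    return result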

Arcs are fit with Bezier curves (consisting of two endpoints and two control points). First, arcs are detected by computing the ratio of the total arc distance (the sum of the distances between all consecutive points lying between two detected vertices) to the straight-line distance between those vertices. Straight segments will have a ratio close to 1, while arced segments will have a ratio much higher than 1 (since the stroke between the vertices is curved, unlike the straight-line distance between them). Control points are then calculated from the tangent vectors to the arc.
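The arc test itself reduces to a one-line ratio; here is a hedged Python version (the "significantly higher than 1" threshold is a placeholder, and exactly the kind of unstated constant I complain about below).

import math

def _d(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def is_arc_segment(points, threshold=1.1):
    """True if the stroke between two detected vertices bows away from a
    straight line: path length / chord length is well above 1.
    The threshold value is my placeholder; the paper does not state one."""
    path = sum(_d(a, b) for a, b in zip(points, points[1:]))
    chord = _d(points[0], points[-1])
    return chord > 0 and path / chord > threshold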

Once lines and arcs have been identified and represented using vertices and Bezier control points, the sketch can be beautified, not only by drawing lines straighter, but also by rotating lines that are supposed to be parallel until they actually are. Beautified sketches can be compared to templates to perform basic object recognition.

They evaluated their system subjectively, using human subjects to say whether they liked the system or not (9/10 said they did). The authors also claim a 96% success rate in determining true vertices.

Their work is an improvement over other work because the vertex detection is completely automatic and it allows for more natural sketching (the user does not have to lift the pen, stop drawing, or press buttons to denote vertices). In the future, the authors hope to use machine learning techniques (like scale space theory) to reduce the need for hard-coded thresholds. They also want to explore intentional changes in speed by the user as an indication of how precise they wish their sketch to be (requiring less beautification, presumably, for a careful sketch).

Discussion



There are so many details that are omitted, it's crazy! I understand the need to be concise and fit things into a specified number of pages, but these seem like major operating details.


  1. What is the error threshold to stop adding points to the final set of vertices, H_f? Why is this value hard-coded and not some sort of statistical test?

  2. In the average-based filtering, why do we hard-code the mean threshold? They do cite this as motivation for scale space theory, and they want to keep things simple, so it's understandable.
  3. They talk about fitting an arc to a segment if the ratio of stroke length to Euclidean distance is significantly higher than one. Significantly how, and by how much?
  4. When they rotate line segments to make things parallel/perpendicular, what happens to vertices that no longer touch (line segments don't overlap)? Or do they change connected lines at the same time?
  5. Why not beautify curves? Or do they? I don't think they were clear about whether they did or not.

Monday, September 10, 2007

MARQS

Paulson, Brandon, and Tracy Hammond. "MARQS: Retrieving Sketches Using Domain- and Style-Independent Features Learned from a Single Example Using a Dual-Classifier."

Summary



Paulson and Hammond implement a system that can be used to search a database of sketches and rank matches from the database with a high degree of accuracy. The matching is performed with a new dual-classifier algorithm and new sketch features that allow for unconstrained sketches (ones that can exist with a different number of strokes, ordering of strokes, scale, or orientation), can be trained on a single example (the first example of a stroke in the database that we need to query against), and can learn over time with successful queries.

The four features used by the authors are the aspect ratio of the bounding box, the pixel density of the bounding box, the average curvature of the strokes in the sketch, and the number of perceived corners in the sketch. Before the features are computed, the images are rotated so that their major axis (defined as the line between the pair of points farthest from each other) is horizontal.

When only one gesture of a given class exists in the database, it is classified using a 1-nearest-neighbor approach. That is, the features of the query sketch are compared to the features of all the sketches in the database, and the database sketch with the smallest difference in feature values is selected as the match for that query. When more sketches are added to the database for that same gesture class, the algorithm begins to use a linear classifier (similar to Rubine's method) for greater accuracy and speed.
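Here is a small Python sketch of that dual-classifier dispatch as I read it; the feature extraction is assumed to happen elsewhere, and I stub the linear classifier with a comparison against the class's mean feature vector, which is not the authors' actual classifier.

def rank_matches(query, database):
    """Rank gesture classes by similarity to the query's feature vector.
    database maps a class label to the list of stored feature vectors.
    Classes with a single stored example are scored by 1-NN feature distance;
    classes with several examples would use a trained linear classifier, which
    is stubbed here as the class's mean feature vector (my simplification)."""
    scores = []
    for label, examples in database.items():
        if len(examples) == 1:
            ref = examples[0]
        else:
            # Stand-in for the linear classifier: the per-feature mean.
            ref = [sum(col) / len(col) for col in zip(*examples)]
        dist = sum(abs(q - r) for q, r in zip(query, ref))
        scores.append((dist, label))
    return [label for _, label in sorted(scores)]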

In an experiment, the authors had 10 users create 10 examples of their own sketch. The first is used as the initial key insertion into the database. The remaining 9 are used as queries into the database. Additionally, 10 examples each of 5 strokes were added by the authors to test the effects of orientation and scale more precisely. The experiment was run 10 times, with the ordering of the gestures used as insertions/queries to the database randomized. The average rank of the correct gesture given the query sketch was 1.51. 70% of the time, the right answer was the top choice, 87.6% in the top two, 95.5% in the top three, and 98% in the top four.

Discussion



The authors mentioned that using the single classifier to check against the first query of a gesture class takes time proportional to nf, where n is the number of items in the database and f is the number of features. They mentioned using some sort of generalization of the features across all gestures in a particular class to speed things up, instead of comparing to every example within the class, but at the possible expense of accuracy. I wonder if it would be possible to use some sort of database indexing scheme based on the various features to get a quick lookup of matches. One might even use the generalized features to find the class of sketches that looks most like the query, and then classify the sketches within that class for a more specific match against the query input.

It seems like orientation may be important for distinguishing between many gestures. I might not want a sad face to be classified the same as a happy face. The authors mentioned transitioning to a grid-based approach if this was the case. I wonder if there is some sort of thresholding that can take place so that large changes in orientation count as differences, but small changes in orientation are ignored. This might allow for a bit of wobble in the user's drawing, but allow for large differences (most likely made on purpose) to be counted differently. Of course, this would be impossible, it seems, given the current method of rotating the image to make its major axis horizontal. Unless one could take the orientation features before the rotation.

It also seems like a very difficult problem to extract images from an electronic journal, separating them from the other objects (other sketches, text, ...) that surround them on the digital page. In the future work this was mentioned as perceptual grouping and seems like a very difficult task. I can imagine scenarios where I draw a particular molecule for the chemistry e-notes. I then annotate the molecule with information like bond energies, etc. When I search for it later, I don't include the annotations, just the shape of the molecule. Is it possible to extract such information? Perhaps by using some sort of layering mechanism where the strokes are stored as layers on top of one another, so that the text annotations can be separated from the drawing of the molecule. Very difficult, indeed...

Tuesday, September 4, 2007

Visual Similarity of Pen Gestures

Long, A. Chris, Jr., et al. "Visual Similarity of Pen Gestures." 2000.
Long, A. Chris, Jr., et al. "'Those Look Similar!' Issues in Automating Gesture Design Advice." 2001.

Summary



Long et al. set out to determine what makes gestures similar or dissimilar, based on human perception. Two sets of experiments were conducted to determine this. In the first experiment, 20 people were asked to look at 14 gestures in all possible groups of 3, picking the gesture from each triad that was the most dissimilar. The gestures covered a wide variety of types and orientations. Geometric properties of the gestures were taken into account and, using multi-dimensional scaling (MDS), were used to describe what made different gestures similar or dissimilar. Five dimensions of geometric attributes were selected, which were then mapped, using linear regression, to different features (i.e., Rubine features) of the gestures. The derived model correlated 0.74 with the experimental user perceptions of similarity.

The second trial was set up to test the predictive model of the first experiment, as well as test how varying assorted features affected similarity (total absolute angle and aspect, length and area, and rotation and its related features). Experiments were performed as in the first trial, where test subjects were shown triads and asked to choose the object that was the most dissimilar from the other two. When the three sets of varying features were fit to a predictive model, it was discovered that the angle of the bounding box and the gesture’s alignment with the coordinate axes played the most significant roles in determining similarity. It was also discovered that the model generated in the first trial outperformed the predictive power of the model in this trial.

Some of the more interesting results were that most of the similarities between gestures could be explained by a handful of features (that accounted for three of the dimensions in the MDS). The remaining MDS dimensions required many more features to describe. The authors attributed the difficulties of determining object similarity to the complexity of the models, the limitations on the amount of training data, and the subjectivity of a determination of similarity from one person to another.

In the shorter paper, Long et al. use the similarity prediction models (trained on even more data) to create a tool to assist creators of gesture sets in designing gestures that are deemed dissimilar by people (so that gestures are easier to remember; the models perform poorly at this task) and dissimilar by the computer (so that gestures can be accurately identified). The advice, as it is called, on gestures that are too similar is given after the user trains the new gesture class, as opposed to as soon as the first gesture example is drawn, because future examples may alter the average features of the gesture class enough to make it significantly dissimilar to the computer. Or, if the analysis deems the gesture indistinguishable by humans, the advice is given immediately, because more examples won’t change the basic construct.

Discussion



First, one of the equations in the long Long paper is wrong. The formula for the Minkowski distance metric should be (in LaTeX-ese):

d_{ij}^p = \left( \sum_{a=1}^{r} | x_{ia} - x_{ja} |^p \right)^{1/p}

The paper is missing the (...)^{1/p} around the summation. Otherwise, Euclidean distance would be missing its square root.
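For reference, the corrected metric in code (a straightforward translation, nothing specific to the paper):

def minkowski(x_i, x_j, p):
    """Minkowski distance between feature vectors x_i and x_j; p=2 gives
    the Euclidean distance, p=1 the city-block distance."""
    return sum(abs(a - b) ** p for a, b in zip(x_i, x_j)) ** (1.0 / p)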

I thought it was interesting that while the log of the aspect is proportional to the actual aspect, the log of the aspect outperforms the plain aspect in determining similarity. This actually boggled me, as I’m used to treating things like log likelihoods and likelihoods like they’re basically equivalent (i.e. maximizing one obviously maximizes the other). So why don’t similarities in aspect land translate into similarities in log-aspect land?

It annoys me when people whine about not having enough training data to model all the connections between similarities, etc. Isn’t that the point? If you had an infinite amount of training data, all these fancy algorithms you use would be pretty worthless. All you’d have to do is look at all your data and pick the right answer. It’s like having an Oracle to solve the ATM (accepting Turing machine) problem. If you have an Oracle, the problem just seems to disappear. The point of this exercise is that you don’t have enough data. You will never have enough data. Even if you had enough data, you’d need some magical way to handle it all in an efficient manner to extract any information out of it in a useful amount of time. Your methods and algorithms should deal with this shortcoming in a way that is appropriate to the domain.

Other than that rant, it was neat to see how the authors used MDS to tie the psychological and very subjective portion of the experiment—determining gesture similarity—into something mathematically well defined and robust—a linear model fit with regression. I’m surprised that the models were able to do better than random at predicting which gestures were similar, given the complexity of the domain and the nuances in human judgment (where by nuances I mean silly fickleness). It was even more interesting to see that the majority of differences from one gesture to another could be accounted for by modeling only a small subset of the features.

And lastly, for my pithy quip:
One of these things is not like the others...One of these things does not belong...

Thursday, August 30, 2007

Specifying Gestures by Example

"Specifying Gestures by Example," Dean Rubine, 1991

Summary



Rubine developed an application that allows collections of highly accurate (97% accuracy) gesture recognizers to be built and trained quickly and easily. More accurately, Rubine’s methods do not apply to sketches, as one might commonly think of them, but are limited to single-stroke (without lifting the “pen” from the “paper”) gestures. This limitation prevents the use of complex strokes, forcing some unnatural methods of drawing certain gestures (i.e., drawing the letter “x” without lifting the pen) and introducing a bit of a learning curve, but it simplifies the computations needed to recognize gestures immensely.

Built on top of his GRANDMA system, Rubine’s gesture-based drawing program (GDP) allows one to define gesture classes. A gesture class gives meaning to the gesture. For example, a gesture determined to belong to the “draw rectangle” gesture class will perform operations defined by this class. After defining a new gesture class, the user will input several examples of strokes used to denote that gesture, accounting for any variability in size/orientation of the stroke in the training examples. Semantic meanings can be defined for the gestures so operations can be performed once they are recognized.

Rubine defines 13 features that are used to describe a stroke, including things like the initial angle, the size and dimensions of the bounding box around the stroke, the amount of rotation in the figure, etc.; each is computed from the sample points that make up the “line” drawn by the stroke (each sample point contains an x and y coordinate along with a timestamp).

The features are used in a linear classifier to decide what gesture the features “look like.” Given a gesture class c, weights are computed for each feature in this class. The feature values for this stroke are multiplied by the weights for this class and summed together with a weight for this class in general. The class that gives the highest “sum of weights” is the class that the stroke most looks like, and the stroke is classified as that gesture. The weights are computed using the inverse covariance between features within a class. Inverse covariance is high when covariance is low, meaning independent features are weighted higher than highly correlated features. This makes sense because you don’t want highly correlated features to affect one another; you want clear-cut decisions. In the case that a stroke ambiguously looks like it can belong to more than one class, statistical methods can be employed to reject ambiguities that fall below a certain probability of being correct.
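Here is a compact Python sketch of just the classification step as described; the weights are assumed to be already trained, and the training via the covariance matrices is not shown.

def classify(features, classes):
    """classes maps a gesture-class name to (w0, weights): the class's bias
    term and per-feature weights, assumed to be already trained as described
    above. Returns the class with the highest linear score."""
    def score(w0, weights):
        return w0 + sum(w * f for w, f in zip(weights, features))
    return max(classes, key=lambda c: score(*classes[c]))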

To improve on his method, Rubine suggests eager recognition—resolving a stroke’s class as soon as it becomes unambiguous, and multi-finger recognition—using systems that recognize two strokes at the same time—for possible adaptation to multi-stroke gestures.

Discussion



The gesture recognition system is incredibly simple on many fronts. It’s simple in that one can set up new gesture classes with a few mouse clicks, simple in that only a couple dozen training examples are needed per class, simple in its computation of various features, and extremely simple in its use of a linear classifier to do its work. However, it is also simple in the types of gestures that it can interpret (only single stroke, no segmented gestures) and the number of gesture classes that it can handle. In spite of its simplicity, or perhaps because of it, GDP seems to fill a nice niche. For simple sets of classes that can be drawn decently with one stroke (for example, the Latin alphabet), this method is extremely accurate. Obviously it can’t be used for more complex things (differentiating between a single stroke arm chair versus a single stroke couch in a system for interior designers might be a bit ridiculous), but its domain is not complex strokes.

One thing I was wondering about was the eager recognition technique. Rubine mentions that a stroke is classified to its gesture class as soon as the classification becomes unambiguous. Without digging through the Internet to locate the referenced papers, I was trying to think of different ways he might do this. It seems like the easiest way would be to use the same probabilistic approach used to reject ambiguous classifications: simply keep track of a probability for each class, and when one class gets above some threshold (and the rest fall below it), make the decision right then. It seems like you could do this fairly cheaply in terms of complexity (just constant-time updates to things such as total rotation or total distance, and a simple recomputation of the length from the start point to the current end point).

Also, to get free training examples, I wonder if Rubine adds successfully classified strokes back into the training set (requiring a fresh computation of the mean feature values, covariance matrices, and weights). This would give free data that the user hasn’t “undone” and might help learn a particular user’s style of creating the strokes (over time the user’s subtle nuances would dominate the feature means and allow the recognition to evolve).

And in final contemplation…Different strokes for different folks (kill me now, but I couldn’t resist the temptation).