Thursday, September 27, 2007

Paulson and Hammond -- New Features and Ranking

Paulson, Brandon, and Tracy Hammond. "Recognizing and Beautifying Low-Level Sketch Shapes with Two New Features and Ranking Algorithm." In submission, 2007.

Summary



Paulson and Hammond construct a system for recognizing low-level primitives made with single strokes, combining the primitives together in a hierarchical fashion and beautifying the strokes. Previous approaches to this kind of work used user- and style-dependent features (e.g., Rubine, Long) that required many training examples. Corner-detection methods (e.g., Sezgin, Yu and Cai) are able to find primitives but not construct complex hierarchies out of them for beautified strokes. The authors' improvements are two new features, plus a ranking system to choose the best of many possible fits to complex strokes.

The first stage of their system is pre-recognition computation: features, overtrace detection, and stroke closure detection. The first of the two new features introduced is normalized distance between direction extremes (NDDE), which measures the ratio of the stroke that lies between direction extrema (arcs are high, polylines with many direction changes tend to be lower). The second new feature is direction change ratio (DCR), the ratio of maximum direction change to average direction change (similar to acceleration: polylines change often and dramatically and are high, arcs are smoother and are low). Overtracing is tested by looking for a consistent direction of rotation spanning more than 2*PI. Stroke closure is tested by making sure the endpoints are "close enough" together.
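To make the two features concrete, here is a minimal sketch of how NDDE and DCR might be computed from a stroke's (x, y) points. The function names, the lack of angle unwrapping, and the handling of degenerate strokes are my own choices, not the paper's.

```python
import math

def directions(points):
    """Direction (angle) of each segment between consecutive (x, y) points."""
    return [math.atan2(y2 - y1, x2 - x1)
            for (x1, y1), (x2, y2) in zip(points, points[1:])]

def ndde(points):
    """Normalized distance between direction extremes: fraction of the stroke's
    length lying between the points of maximum and minimum direction."""
    dirs = directions(points)
    seg_len = [math.dist(p, q) for p, q in zip(points, points[1:])]
    total = sum(seg_len)
    i_max = max(range(len(dirs)), key=lambda i: dirs[i])
    i_min = min(range(len(dirs)), key=lambda i: dirs[i])
    lo, hi = sorted((i_max, i_min))
    return sum(seg_len[lo:hi + 1]) / total if total else 0.0

def dcr(points):
    """Direction change ratio: maximum direction change divided by the
    average direction change along the stroke."""
    dirs = directions(points)
    changes = [abs(b - a) for a, b in zip(dirs, dirs[1:])]
    mean = sum(changes) / len(changes) if changes else 0.0
    return max(changes) / mean if mean else 0.0
```

Arcs should come out with NDDE near 1 and a low DCR; polylines with sharp corners should show a lower NDDE and a much higher DCR.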

The strokes are then fed into a series of primitive recognizers to test for lines, polylines, ellipses, circles, arcs, curves (Bezier of order < 5), spirals, and helixes. If none of the primitive tests pass, a complex shape is constructed by splitting the stroke at the point of highest curvature and recursively feeding the substrokes into the primitive recognizer (similar to Yu). Once a complex fit is obtained, the substrokes are examined to see if any can be recombined. The interpretations (primitive or complex) that are deemed to fit the stroke are then ranked according to complexity. The ranking procedure tends to choose simpler models (in terms of the number of primitives required to construct them), based on empirically-set costs for certain primitives.
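A toy illustration of the ranking idea: each primitive gets a complexity cost, and the interpretation with the lowest total cost wins. The costs below are placeholders I invented; the paper's values were set empirically.

```python
# Hypothetical per-primitive complexity costs (placeholders, not the paper's values).
PRIMITIVE_COST = {
    "line": 1, "arc": 2, "circle": 2, "ellipse": 3,
    "curve": 4, "spiral": 4, "helix": 4, "polyline": 5,
}

def rank_interpretations(interpretations):
    """Given a list of candidate interpretations (each a list of primitive
    names that passed their fit tests), order them from simplest to most
    complex by total cost."""
    return sorted(interpretations,
                  key=lambda prims: sum(PRIMITIVE_COST[p] for p in prims))

# A stroke that passes both an arc test and a five-line polyline test:
best = rank_interpretations([["arc"], ["line"] * 5])[0]   # -> ["arc"]
```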

Their system runs very quickly compared to a modified Sezgin recognizer, taking only half the time, on average, to classify a stroke (though both are very fast). Their method also achieved an impressive 98-99% accuracy classifying test shapes. Their difficulties arose when the system had to decide between an accurate polyline fit and a simpler model of fewer primitives. Their method consistently outperformed Sezgin's algorithm, and is able to classify many more types of strokes.

Discussion



First, it's nice to see a paper that clearly defines their thresholds and constants. Second, it's nice to see a paper that gets such high accuracy levels on many types of primitives and combinations into complex shapes. I was starting to get discouraged that the problem of sketch recognition was 'too hard.'

The authors mention detecting overtraced circles. I wonder about overtraced lines, say if the user wanted to make sure there was a connection between two objects. I assume that currently these would be treated like a polyline that just curves back and forth on itself. I wonder how difficult it would be to detect overtraced lines and construct 'heavy' lines out of them.

Also, the authors have no detection methods for anything rectangular. It would be nice to see a four-line polyline turned into a square or rectangle. Furthermore, it seems like the hierarchical fitting system could be extended to arbitrary regular polygons. Of course, the difficulty arises in deciding how many "automatic" shapes are too many to compare against. However, if the classification is hierarchical, perhaps these tests would only be performed if the stroke was closed, and then matched to the appropriate n-gon depending on the number of sub-strokes (lines in the polyline).
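As a rough sketch of what I mean, assuming the polyline fit gives us a list of segment lengths and a closure flag (all names and the tolerance here are hypothetical, my own speculation rather than anything from the paper):

```python
def regular_ngon_guess(segment_lengths, closed, length_tol=0.15):
    """If a closed polyline's segments are all roughly the same length, guess
    that the user meant a regular n-gon with n = number of segments."""
    if not closed or len(segment_lengths) < 3:
        return None
    mean = sum(segment_lengths) / len(segment_lengths)
    if all(abs(l - mean) / mean <= length_tol for l in segment_lengths):
        return len(segment_lengths)   # e.g. 4 -> square/rhombus candidate
    return None
```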

I thought the ranking system was a unique contribution. I felt this was one of the best points of the paper, and judging by the results in Table 1, it provides more accuracy by itself than just using the new features. I wonder if there is a different way to compute ranking costs other than empirical observation. Perhaps not, since judgments of accuracy are going to be empirical anyway (does the user think it looks right?).

Abam - Streaming Algorithms for Line Simplification

Abam, Mohammad Ali, et al. "Streaming Algorithms for Line Simplification." SCG, 2007.

Summary



Abam et al. present a few algorithms for simplifying the set of points in a line into a simpler representation. The focus of the algorithms seems to be tracking an object through space. After a certain amount of time, the finite storage space of the system doing the tracking is exhausted. To compact the path representation, line simplification algorithms are used to modify the stored path and reduce storage requirements.

They consider two distance metrics to measure the error of their simplified path against the real path. The Hausdorff distance between two point sets is the largest distance from a point in one set to its closest point in the other set (taken in both directions); it is reminiscent of the complete-linkage distance used in agglomerative hierarchical clustering, though complete linkage takes the maximum over all pairs. The other metric is the Frechet distance, which can be thought of as the minimum length of a leash needed so that a man can walk along one line and his dog along the other, where either may stand still but neither may walk backward.
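For intuition, here is the standard discrete approximation of the Frechet distance (a dynamic program over the two point sequences). This is the textbook version, not the algorithm from the paper, and it is only practical for short paths.

```python
import math
from functools import lru_cache

def discrete_frechet(path_p, path_q):
    """Discrete Frechet distance between two polygonal paths given as lists of
    (x, y) points: the shortest 'leash' that works when the man and the dog
    each advance monotonically along their own path."""
    @lru_cache(maxsize=None)
    def c(i, j):
        d = math.dist(path_p[i], path_q[j])
        if i == 0 and j == 0:
            return d
        if i == 0:
            return max(c(0, j - 1), d)
        if j == 0:
            return max(c(i - 1, 0), d)
        return max(min(c(i - 1, j), c(i - 1, j - 1), c(i, j - 1)), d)
    return c(len(path_p) - 1, len(path_q) - 1)
```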

The algorithms generally work by maintaining a bounded set of points, called Q, as the line simplification. As the algorithm progresses, the point in Q whose removal introduces the least additional error (distance from the true line) is dropped from the simplification, so the simplification gets a little worse at each step, but not by much.
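My loose paraphrase of the scheme, with a naive error recomputation in place of the paper's priority-queue bookkeeping (the budget, helper names, and error measure are my simplifications):

```python
import math

def point_segment_dist(p, a, b):
    """Distance from point p to the segment ab."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.dist(p, a)
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    return math.dist(p, (ax + t * dx, ay + t * dy))

def simplify_stream(points, budget):
    """Keep at most `budget` points: whenever the buffer overflows, drop the
    interior point whose removal adds the least error (its distance to the
    segment joining its two neighbors)."""
    kept = []
    for pt in points:
        kept.append(pt)
        if len(kept) > budget:
            cheapest = min(range(1, len(kept) - 1),
                           key=lambda i: point_segment_dist(kept[i], kept[i - 1], kept[i + 1]))
            del kept[cheapest]
    return kept
```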

Discussion



This was, simply put, a hard paper to read.

I can see the applications of this paper. For path simplification in a situation where the entire, detailed history of an object's motion is either not necessary or too large to fit in memory, these algorithms work great. However, I'm not sure of the application of these algorithms to sketch recognition. It seems like we would lose far too much detail in the simplification for it to be of any use to recognizers later.

I think stroke compression for sketch recognition could be better performed in two steps. First, resampling can be used to reduce the number of points while maintaining whatever granularity of detail is desired. Second, a sketch is inherently made up of certain primitives that can be reduced to their cleaned versions to reduce the amount of space needed to represent them. For example, a line/polyline could be represented by just the corners/endpoints, and an arc by its endpoints and radius.

Thursday, September 20, 2007

Kim and Kim - Curvature Estimation for Segmentation

Kim, Dae Hyun, and Myoung-Jun Kim. "A Curvature Estimation for Pen Input Segmentation in Sketch-Based Modeling." 2006.

Summary



Kim and Kim look at the issue of using curvature to determine segmentation points for pen-drawn strokes. This is a recurring theme in the literature, including in Sezgin and Yu. Kim and Kim's contribution is to look at two new methods of computing curvature. They make the assumption that every point drawn by the user is intentional and that there is no noise in a gesture.

The authors perform a sort of smoothing on the data, resampling the stroke points to produce a set of points that are approximately a fixed distance apart. The specifics of how this operation is carried out are not discussed, but they do say they use a distance of 5 pixels. This re-sampled set of points makes up a stroke. Changes in direction are computed per point as usual.
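Since the resampling step isn't specified, here is one plausible way to do it: walk along the stroke and emit a new point every `spacing` pixels of arc length (the 5-pixel default matches the paper; the rest is my guess).

```python
import math

def resample(points, spacing=5.0):
    """Resample a stroke (list of (x, y) points) so that consecutive points
    are approximately `spacing` pixels apart along the stroke."""
    out = [points[0]]
    carry = 0.0                      # arc length accumulated since last output
    for a, b in zip(points, points[1:]):
        seg = math.dist(a, b)
        while carry + seg >= spacing:
            t = (spacing - carry) / seg
            a = (a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1]))
            out.append(a)
            seg = math.dist(a, b)
            carry = 0.0
        carry += seg
    return out
```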

Curvature is defined to be the change in direction / the length this change occurs over. However, because resampling has occurred and each point is now approximately the same distance apart, curvature simplifies to just being the change in direction at each point.

The authors then propose two methods for adjusting the curvatures based on support neighborhoods around each point. The neighborhoods are initially of size k, so the support for point p_i would be points p_(i-k) through p_(i+k). The authors look at the convexity of the curvature changes moving away from p_i, adding the curvature of point p_j (for i-k <= j < i and i < j <= i+k, moving away from p_i in either direction) only if the curvature at p_j has the same sign as the curvature at p_i. In other words, any curvature that is occurring in the same direction (clockwise or counter-clockwise) is added together to bolster the curvature of p_i (within the window of k points on either side of p_i).

This is a problem when there are several consecutive points (say on an arc) that all have the same sign. In this case, the curvature values of each point would all be equal, since they all get the sum of the curvatures in the neighborhoods. To fix this problem, the authors add an additional constraint. Not only must the curvature of neighboring points have the same sign, but the curvature values must be monotonically decreasing in order to be added. This has the effect of weighting certain points with a higher curvature.
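Here is how I read the combined rule, as a sketch (edge handling and the treatment of ties are my own choices): walking away from p_i in each direction, add a neighbor's curvature only while the sign matches and the magnitude keeps shrinking.

```python
def weighted_curvature(curv, i, k):
    """Weight the curvature at index i by summing neighboring curvatures
    (within k points on either side) that share its sign and whose magnitudes
    are non-increasing as we move away from i."""
    total = curv[i]
    positive = curv[i] >= 0
    for step in (-1, +1):                      # walk left, then right
        prev_mag = abs(curv[i])
        for offset in range(1, k + 1):
            j = i + step * offset
            if j < 0 or j >= len(curv):
                break
            c = curv[j]
            if (c >= 0) != positive or abs(c) > prev_mag:
                break                          # sign flip or magnitude grows
            total += c
            prev_mag = abs(c)
    return total
```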

The authors compare their curvature estimation methods with the circumscribing-circle and curve-bending-function methods, finding theirs comparable and arguing that their approach is justified. They then present an evaluation experiment, showing the results of gesture segmentation using support neighborhoods alone, using convexity information alone, and using both convexity and monotonicity, compared against another method. They claim 95% success in detecting the correct segmentation points.

Discussion



It seems strange to me that the authors assume that all the input points are intentional and that there is no noise. But then, they resample the data! If you assume no noise is in the data and that all the points are intentional, why do you throw away features by resampling? They show how it makes the curvature calculations simpler, but really, is computing the curvature that hard to begin with? Even if you have to do a division by the length of the segment to calculate the initial curvature at each point, you can still use the convexity and monotonicity checks to weight each curvature. I also have a problem with the assumption that there is no noise to begin with. I understand that you either assume noise or you don't, and that either way has its pros and cons. But until input devices become 100% accurate, and until the striated muscle fibers in my arm, hand, and fingers quit tensing and relaxing multiple times per second (hold your hand out in front of you and try to hold it perfectly still--you can't unless you're a robot and you assume outside factors don't affect your motion), there will ALWAYS be noise. Now, the hard part is in determining what is true noise, and what are intentional deviations. And guess what, this is also impossible. Because what might look statistically like noise might really be me wanting to make intentional, small serrations in my otherwise-straight line. There is absolutely no way to guess user intention, in a general sense, with 100% accuracy (not even a human can do that 100% of the time, much less a computer).

</rant>

However, aside from that soap-box issue, I love the way that curvatures are weighted in the paper. It's nice to see curvature weighting methods, especially when I'm facing the problem in my own segmentation implementations of figuring out which of the gobs of curvature-based segmentation options to use. I also like how they use speed information to change the threshold for deciding whether a curvature change is intentional, assuming that lower speeds mean more intentional movement and giving those curvatures more weight. I do wish, however, that this had been discussed more than in passing.

Tuesday, September 18, 2007

Yu - Corners and Shape Recognition

Yu, Bo, and Shijie Cai. "A Domain-Independent System for Sketch Recognition." 2003.

Summary



The authors present work built directly off of Sezgin's basic corner detection algorithms. Their system recognizes primitives (line segments, arcs, circles...) and puts them together via basic object recognition. The deficiencies in current systems they wish to address are strict restrictions on drawing style and the inability to decompose curves (smooth or hybrid) into constituent primitives. To this end, they wish to allow users to draw naturally (multiple strokes are fine), provide consistent recognition results, group primitives hierarchically, and beautify noisy strokes into higher level shapes.

First, the system performs stroke approximation immediately after the stroke is drawn. The points of the stroke, consisting of (x, y, timestamp) tuples, are examined, and direction and curvature graphs are computed, very similarly to Sezgin's methods (though curvature here is defined slightly differently). Vertices are declared to be the peaks in curvature. To approximate a stroke with primitives, the system first tries to fit the entire stroke with a single primitive. If this fails, the stroke is split into substrokes at the point of highest curvature, and approximation recurses on the substrokes until every portion of the stroke is approximated with a primitive.
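The fit-or-split loop, schematically. `fit_primitive` and `max_curvature_index` are stand-ins for the paper's actual primitive tests and curvature computation; they are not defined here.

```python
def approximate(stroke):
    """Return a list of primitives approximating the stroke (a list of points).
    Try the whole stroke first; if no primitive fits, split at the curvature
    peak and recurse on the two substrokes."""
    if len(stroke) <= 3:                       # too short to split further
        return [("line", stroke)]
    primitive = fit_primitive(stroke)          # hypothetical: None if nothing fits
    if primitive is not None:
        return [primitive]
    split = max_curvature_index(stroke)        # hypothetical: index of curvature peak
    return approximate(stroke[:split + 1]) + approximate(stroke[split:])
```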

Line approximation occurs first. A stroke is checked to see if it is a line by examining the direction graph (which should be horizontal, i.e., no changes in direction) and the actual points in the stroke (which should fit a straight line). If either criterion is satisfied, with few "deviations" from a least-squares best-fit line, a line is fit between the stroke endpoints. Valid fits are those that minimize the ratio of feature area to standard area.

If a stroke is not a line, it is fed to the curve approximation routine. The direction graph for a smooth curve should be linear (constant change in direction). Their system can fit circles, ellipses, and arcs. It can even fit overtraced curves (where the direction graph spans more than 2*pi) by splitting the direction graph into lengths of 2*pi and fitting each length: if the resulting circles are similar, they are averaged; if they have decreasing radii, the stroke becomes a helix. Like lines, circles and arcs are fit by computing the ratio of feature area to standard area, seeking to minimize this ratio.
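A rough sketch of the overtrace test: unwrap the direction graph and count how many full revolutions it covers (the unwrapping and the return value are my own framing of it).

```python
import math

def revolutions(points):
    """Unwrap the stroke's direction graph and return how many full 2*pi turns
    it makes; a value well above 1 suggests an overtraced circle or a helix."""
    dirs = [math.atan2(y2 - y1, x2 - x1)
            for (x1, y1), (x2, y2) in zip(points, points[1:])]
    total = 0.0
    for a, b in zip(dirs, dirs[1:]):
        delta = b - a
        while delta > math.pi:                 # unwrap jumps across the +/- pi seam
            delta -= 2 * math.pi
        while delta < -math.pi:
            delta += 2 * math.pi
        total += delta
    return abs(total) / (2 * math.pi)
```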

Post processing includes removing noisy points (that occur at the beginning or end of a stroke), merging similar strokes, and inferring constraints such as connectivity, parallelism, and perpendicularity. Sets of heuristics can be used to combine primitives (such as four line segments) into higher level objects (such as rectangles).

Experimentally, their system is able to identify primitive shapes and polylines with 98% accuracy, and arcs alone with 94% accuracy (crude arcs are incorrectly recognized as a set of lines and arcs). Higher level shapes combining primitives are only recognized at about 70%, with problems encountered at smooth connections (since their method /only/ uses curvature).

Discussion



It seems like their definition of curvature is different from Sezgin, normalizing the change in direction over a neighborhood by the stroke length. I wonder how this compares to Sezgin as far as vertex prediction. I was also surprised they did not use speed data. I think Sezgin made a good point for using both speed and curvature.

This paper is neat in that it uses primitive approximations to drive vertex detection, rather than finding vertices and then trying to fit primitives to the spaces between them. I think speed data could and should have been used in some sort of hybrid method like Sezgin, where instead of Sezgin's "just add until we get below a certain error", we pick the vertex with the highest certainty metric and use this to divide the stroke, trying to fit primitives to the substrokes. Basically, a combination of the two methods.

Again, we have a lot of talk about thresholds and limits without describing what these are, how they were derived, and why they are valid. I found it amusing that they say "here [looking at curvature data] we avoid any empirical threshold or statistical calculation for judging vertices." Well sure, because they push all that hand-waving down into the primitive approximation processes. Which, in the end, still equates to judging vertices. Examples of thresholds not well defined:

  • Line approx - number of points "deviating" from best-fit line
    • What does it mean to deviate?
    • What's the "relatively loose" threshold?

  • Cleanup - "very short" line segments for deletion/merging, how short?

Thursday, September 13, 2007

Early Processing for Sketch Understanding

Sezgin, Tevfik Metin, et al. "Sketch Based Interfaces: Early Processing for Sketch Understanding." PUI 2001.

Summary



Sezgin et al. seek to develop a system that allows users to draw sketches directly and without restrictions (on the number or direction of strokes), and to have those sketches "understood" (approximation phase) as their component lines and arcs. The system can then combine these into low-level primitives (rectangles, squares, circles, etc.) as appropriate, beautify the drawing (second phase), and perform higher level understanding on the primitives (basic recognition phase).

Their approximation subsystem, which divides strokes into line segments and arcs, works by detecting 'corners', or where two line segments connect. Corners are detected by examining the speed (intuitively, users tend to slow down when they draw meaningful corners) and curvature (a corner is, obviously, a change in direction) of the sketch. Average based filtering is used to detect speed minima (low speeds denote corners), F_s, and curvature maxima (high change in direction denotes corners), F_d. The set of vertices that denote corners in the drawing is compiled from the sets F_d and F_s using an iterative technique that adds the most "certain" vertex from each set (certainty is computed differently for speed and curvature, and tells approximately how 'significant' the values of speed and curvature are at those points), picking between the most certain vertices using least squared error fitting. The iterations continue, adding points by both speed and curvature, until the error of the fit drops below a certain threshold.
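To illustrate just the average-based filtering step (not the certainty metrics or the iterative hybrid fit), here is a sketch; the `scale` factor is a placeholder, not the paper's threshold.

```python
def average_based_filter(values, find_minima=False, scale=0.9):
    """Threshold the speed or curvature data at a multiple of its mean, then
    keep one extremum per contiguous region that crosses the threshold.
    With find_minima=True this yields speed-minimum candidates (F_s);
    with find_minima=False, curvature-maximum candidates (F_d)."""
    mean = sum(values) / len(values)
    threshold = scale * mean
    inside = [(v < threshold) if find_minima else (v > threshold) for v in values]
    pick = min if find_minima else max
    candidates, region = [], []
    for i, flag in enumerate(inside):
        if flag:
            region.append(i)
        elif region:
            candidates.append(pick(region, key=lambda j: values[j]))
            region = []
    if region:
        candidates.append(pick(region, key=lambda j: values[j]))
    return candidates   # indices of candidate corners
```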

Arcs are fit with Bezier curves (consisting of two endpoints and two control points). First, arcs are detected by computing the ratio of the stroke length between consecutive vertices (from corner detection), i.e. the sum of the distances between all the points between the vertices, to the straight-line distance between those vertices. Straight segments will have a ratio close to 1, while arced segments will have a ratio noticeably higher than one (since the stroke between the vertices is longer than the straight-line distance between them). Control points are then calculated from the tangent vectors to the arc.
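The arc test boils down to a simple ratio, something like this (the names are mine):

```python
import math

def arc_ratio(points, v1, v2):
    """Ratio of the stroke length between vertices v1 and v2 to the
    straight-line distance between them: close to 1 for a line segment,
    noticeably larger for an arc."""
    path = sum(math.dist(points[i], points[i + 1]) for i in range(v1, v2))
    chord = math.dist(points[v1], points[v2])
    return path / chord if chord else float("inf")
```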

Once lines and arcs have been identified and represented using vertices and Bezier control points, the sketch can be beautified not only by drawing lines straighter, but also by rotating lines that are supposed to be parallel until they actually are. Beautified sketches can be compared to templates to perform basic object recognition.

They evaluated their system subjectively, using human subjects to say whether they liked the system or not (9/10 said they did). The authors also claim a 96% success rate in determining true vertices.

Their work is an improvement on other work because the vertex detection is completely automatic and it allows for more natural sketching (the user does not have to lift the pen, stop drawing, or press buttons to denote vertices). In the future, the authors hope to use machine learning techniques (like scale space theory) to reduce the need for hard-coded thresholds. They also want to explore intentional changes in speed by the user as an indication of how precise they wish their sketch to be (requiring less beautification, presumably, for a careful sketch).

Discussion



There are so many details that are omitted, it's crazy! I understand the need to be concise and fit things into a specified number of pages, but these seem like major operating details.


  1. What is the error threshold to stop adding points to the final set of vertices, H_f? Why is this value hard-coded and not some sort of statistical test?

  2. In the average based filtering, why do we hard-code the mean threshold? They do cite this as a need for scale space theory, and they want to be simple, so it's understood.
  3. They talk about fitting an arc to a segment if the ratio of stroke length / euclidean distance is significantly higher than one. Significantly how and how much?
  4. When they rotate line segments to make things parallel/perpendicular, what happens to vertices that no longer touch (line segments don't overlap)? Or do they change connected lines at the same time?
  5. Why not beautify curves? Or do they? They weren't clear if they did or not, I don't think.

Monday, September 10, 2007

MARQS

Paulson, Brandon, and Tracy Hammond. "MARQS: Retrieving Sketches Using Domain- and Style-Independent Features Learned from a Single Example Using a Dual-Classifier."

Summary



Paulson and Hammond implement a system that can be used to search a database of sketches and rank matches from the database with a high degree of accuracy. The matching is performed with a new dual-classifier algorithm and new sketch features; the system allows for unconstrained sketches (ones that can be drawn with a different number of strokes, stroke ordering, scale, or orientation), can be trained on a single example (the first example of a sketch class in the database that we need to query against), and can learn over time from successful queries.

The four features used by the authors are the aspect ratio of the bounding box, the pixel density of the bounding box, the average curvature of the strokes in the sketch, and the number of perceived corners in the sketch. Before the features are computed, the images are rotated so that their major axis (defined as the line between the pair of points farthest from each other) is horizontal.
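A rough sketch of the four global features, assuming the sketch has already been rotated so its major axis is horizontal. The corner count is taken as given (from some corner detector, not shown), and the "pixel density" here is a crude ink-length-per-area stand-in for however the paper actually measures it.

```python
import math

def marqs_features(strokes, corner_count):
    """Compute [aspect ratio, density, average curvature, corner count] for a
    sketch given as a list of strokes, each a list of (x, y) points."""
    pts = [p for s in strokes for p in s]
    xs, ys = [p[0] for p in pts], [p[1] for p in pts]
    w = (max(xs) - min(xs)) or 1.0
    h = (max(ys) - min(ys)) or 1.0
    aspect = h / w                                   # bounding-box aspect ratio
    ink = sum(math.dist(a, b) for s in strokes for a, b in zip(s, s[1:]))
    density = ink / (w * h)                          # stand-in for pixel density
    turns = []
    for s in strokes:
        d = [math.atan2(b[1] - a[1], b[0] - a[0]) for a, b in zip(s, s[1:])]
        turns += [abs(t2 - t1) for t1, t2 in zip(d, d[1:])]
    avg_curvature = sum(turns) / len(turns) if turns else 0.0
    return [aspect, density, avg_curvature, corner_count]
```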

When only one gesture of a given class exists in the database, it is classified using a 1-nearest-neighbor approach. That is, the features of the query sketch are compared to the features of all the sketches in the database, and the database sketch with the smallest difference in feature values is selected as the match for that query. When more sketches are added to the database for that same gesture class, the algorithm begins to use a linear classifier (similar to Rubine's method) for greater accuracy and speed.
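Schematically, the dual-classifier switch might look like the sketch below; the real system trains a linear classifier once a class has enough examples, which I've replaced here with a nearest-example placeholder.

```python
def classify(query, database):
    """`database` maps class label -> list of feature vectors. Classes with a
    single example are matched by 1-NN on that example; classes with more
    examples would go through a trained linear classifier (placeholder here)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    best_label, best_score = None, float("inf")
    for label, examples in database.items():
        if len(examples) == 1:
            score = dist(query, examples[0])               # 1-NN branch
        else:
            score = min(dist(query, e) for e in examples)  # stand-in for linear classifier
        if score < best_score:
            best_label, best_score = label, score
    return best_label
```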

In an experiment, the authors had 10 users create 10 examples of their own sketch. The first is used as the initial key insertion into the database. The remaining 9 are used as queries into the database. Additionally, 10 examples each of 5 strokes were added by the authors to test the effects of orientation and scale more precisely. The experiment was run 10 times, with the ordering of the gestures used as insertions/queries to the database randomized. The average rank of the correct gesture given the query sketch was 1.51. 70% of the time, the right answer was the top choice, 87.6% in the top two, 95.5% in the top three, and 98% in the top four.

Discussion



The authors mentioned that using the single classifier to check against the first query of a gesture class takes a long time, O(nf), where n is the number of items in the database and f is the number of features. They mentioned using some sort of generalization of the features across all gestures in a particular class to speed things up, instead of comparing to every example within the class, but at the possible cost of accuracy. I wonder if it would be possible to use some sort of database indexing scheme based on the various features to get a quick lookup of matches. You might even use the generalized features to find a class of sketches that looks more like the query than the others, and then classify the sketches within that class for a more specific match against the query input.

It seems like orientation may be important for distinguishing between many gestures. I might not want a sad face to be classified the same as a happy face. The authors mentioned transitioning to a grid-based approach if this was the case. I wonder if there is some sort of thresholding that can take place so that large changes in orientation count as differences, but small changes in orientation are ignored. This might allow for a bit of wobble in the user's drawing, but allow for large differences (most likely made on purpose) to be counted differently. Of course, this would be impossible, it seems, given the current method of rotating the image to make its major axis horizontal. Unless one could take the orientation features before the rotation.

It also seems like a very difficult problem to extract images from an electronic journal, separating them from the other objects (other sketches, text, ...) that surround them on the digital page. In the future work this was mentioned as perceptual grouping and seems like a very difficult task. I can imagine scenarios where I draw a particular molecule for the chemistry e-notes. I then annotate the molecule with information like bond energies, etc. When I search for it later, I don't include the annotations, just the shape of the molecule. Is it possible to extract such information? Perhaps by using some sort of layering mechanism where the strokes are stored as layers on top of one another, so that the text annotations can be separated from the drawing of the molecule. Very difficult, indeed...

Tuesday, September 4, 2007

Visual Similarity of Pen Gestures

Long, A. Chris, Jr., et al. "Visual Similarity of Pen Gestures." 2000.
Long, A. Chris, Jr., et al. "'Those Look Similar!' Issues in Automating Gesture Design Advice." 2001.

Summary



Long et al. set out to determine what makes gestures similar or dissimilar, based on human perception. Two sets of experiments were conducted to determine this. In the first experiment, 20 people were asked to look at 14 gestures in all possible groups of 3, picking the gesture from each triad that was the most dissimilar. The different gestures were of a wide variety of types and orientation. Geometrical properties of the gestures were taken into account and, using multi-dimensional scaling, were used to describe what made different gestures similar/dissimilar. Five dimensions of geometric attributes were selected, which were then mapped, using linear regression, to different features (i.e. Rubine features) of the gestures. The derived model correlated 0.74 with the experimental user perceptions of similarity.

The second trial was set up to test the predictive model of the first experiment, as well as test how varying assorted features affected similarity (total absolute angle and aspect, length and area, and rotation and its related features). Experiments were performed as in the first trial, where test subjects were shown triads and asked to choose the object that was the most dissimilar from the other two. When the three sets of varying features were fit to a predictive model, it was discovered that the angle of the bounding box and the gesture’s alignment with the coordinate axes played the most significant roles in determining similarity. It was also discovered that the model generated in the first trial outperformed the predictive power of the model in this trial.

Some of the more interesting results were that most of the similarities between gestures could be explained by a handful of features (that accounted for three of the dimensions in the MDS). The remaining MDS dimensions required many more features to describe. The authors attributed the difficulties of determining object similarity to the complexity of the models, the limitations on the amount of training data, and the subjectivity of a determination of similarity from one person to another.

In the shorter paper, Long et al. use the similarity prediction models (trained on even more data) to create a tool that assists creators of gesture sets in designing gestures that are deemed dissimilar by people (so that gestures are easier to remember; this prediction task performs horribly) and dissimilar by the computer (so that gestures can be accurately identified). The advice, as it is called, on gestures that are too similar is given after the user trains the new gesture class, as opposed to as soon as the first gesture example is drawn, because future strokes may alter the average features of the gesture class enough to make it sufficiently dissimilar to the computer. Or, if the analysis deems the gesture not distinguishable by humans, the advice is given immediately, because more examples won't change the basic construct.

Discussion



First, one of the equations in the long Long paper is wrong. The formula for the Minkowski distance metric should be (in LaTeX-ese):

d_{ij}^p = \left( \sum_{a=1}^{r} | x_{ia} - x_{ja} |^p \right)^{1/p}

The paper is missing the (...)^{1/p} around the summation. Otherwise, Euclidean distance would be missing its square root.
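In code, with the root restored (p = 2 gives Euclidean distance, p = 1 Manhattan):

```python
def minkowski(x_i, x_j, p):
    """Minkowski distance between two feature vectors, including the (...)^(1/p)."""
    return sum(abs(a - b) ** p for a, b in zip(x_i, x_j)) ** (1.0 / p)
```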

I thought it was interesting that while the log of the aspect is proportional to the actual aspect, the log of the aspect outperforms the plain aspect in determining similarity. This actually boggled me, as I’m used to treating things like log likelihoods and likelihoods like they’re basically equivalent (i.e. maximizing one obviously maximizes the other). So why don’t similarities in aspect land translate into similarities in log-aspect land?

It annoys me when people whine about not having enough training data to model all the connections between similarities, etc. Isn’t that the point? If you had an infinite amount of training data, all these fancy algorithms you use would be pretty worthless. All you’d have to do is look at all your data and pick the right answer. It’s like having an Oracle to solve the ATM (accepting Turing machine) problem. If you have an Oracle, the problem just seems to disappear. The point of this exercise is that you don’t have enough data. You will never have enough data. Even if you had enough data, you’d need some magical way to handle it all in an efficient manner to extract any information out of it in a useful amount of time. Your methods and algorithms should deal with this shortcoming in a way that is appropriate to the domain.

Other than that rant, it was neat to see how the authors used MDS to tie the psychological and very subjective portion of the experiment—determining gesture similarity—into something mathematically well defined and robust—a linear model fit with regression. I’m surprised that the models were able to do better than random at predicting which gestures were similar, given the complexity of the domain and the nuances in human judgment (where by nuances I mean silly fickleness). It was even more interesting to see that the majority of differences from one gesture to another could be accounted for by modeling only a small subset of the features.

And lastly, for my pithy quip:
One of these things is not like the others...One of these things does not belong...