Thursday, November 8, 2007

Adler and Davis -- Speech and Sketching

Adler, A., and R. Davis. "Speech and sketching: An empirical study of multimodal interaction." EUROGRAPHICS Workshop on Sketch-Based Interfaces and Modeling, 2007.

Summary



Adler and Davis set up a user study to analyze the ways in which speech and sketching interact in a multimodal interface. They built software that runs on two Tablet PCs and mirrors in real time on each machine what is drawn on the other, offering a multitude of pen colors, highlighting colors, an eraser, and pressure-based line thickness. This let each participant (one of 18 students) interact with an experimenter. The participants completed four drawings, including electronics schematics, a floor plan, and a design project. The schematics were available for study before the drawing began but not during it, keeping all four tasks as free-form as possible.

The authors recorded the drawings, video, and audio and synchronized them all. They then labeled all the data for analysis. Some of the main things they found were:

  • Creation/writing strokes accounted for the super-majority of the strokes (90%) and ink (80%) in the sketch
  • Color changes were used to denote different groupings of strokes for emphasis
  • Most of the speech was broken and repetitive. This would make it hard for a full recognizer to analyze, but the repetitions provided clues about the user's intent.
  • Speech occurred at the same time the referenced objects were being drawn
  • Objects were referred to in speech in the same order in which they were drawn
  • Because the sketching and speech interaction between the experimenter and participant was so open-ended, even simple questions from the experimenter evoked a lot of explanatory speech and clarification from the participant
  • Parts of the speech that did not relate to the actual drawing itself gave hints as to the user's intentions


Analyzing these results, the authors found that, for the most part, speech began before the sketching did when considering phrase groups (runs of speech broken up by pauses). However, when looking at individual word groups (e.g., saying "diode" while drawing one), the sketch usually came first. Additionally, the time difference between the start of speech and the start of sketching is statistically significant.
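To make the alignment concrete, here is a minimal sketch of how one might measure these speech/sketch offsets, assuming timestamped units whose labels tie a piece of speech to the object it refers to (the data structures and numbers are hypothetical, not the authors' format):

```python
from dataclasses import dataclass

@dataclass
class TimedUnit:
    label: str    # object referred to / drawn, e.g. "diode"
    start: float  # seconds from the beginning of the session

def speech_sketch_offsets(speech_units, strokes):
    """For every speech unit that refers to a drawn object, return
    (speech start - stroke start). Negative values mean speech led the pen."""
    stroke_start = {s.label: s.start for s in strokes}
    return [u.start - stroke_start[u.label]
            for u in speech_units if u.label in stroke_start]

# Phrase groups tended to lead the sketch (negative offsets), while single
# referring words tended to trail it (positive offsets).
phrases = [TimedUnit("diode", 11.8), TimedUnit("resistor", 15.0)]
strokes = [TimedUnit("diode", 12.4), TimedUnit("resistor", 14.1)]
print(speech_sketch_offsets(phrases, strokes))  # approx. [-0.6, 0.9]
```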

So, overall, giving the user colors to play with and capturing speech alongside the sketch gives a system a lot more information to work with, and a better shot at doing a good job.

Discussion



I found it interesting that the participants gave up more information than was needed or asked for when answering the experimenter's questions, as if they wanted to make sure the experimenter understood completely. I wonder if users would give up as much information if there were not a human present. I hypothesize that a person would just expect a computer to magically understand what was meant and not talk to it. Even in a Wizard-of-Oz experiment, with a human secretly posing as the computer, I think the participant would speak much less frequently.

I don't think the seemingly contradictory information in tables 1 and 2 is surprising. If I'm giving speech that is supposed to be informative, I'm going to include extra information in my speech that isn't directly related to the actual shapes I'm drawing. But, as I draw things, I will put words in my speech to let the observer know what I just drew.

I wonder how often pauses in the user's speech accurately reflected boundaries between phrase groups and drawings. How often did users just flow from one thing to the next? My guess is that pauses are a pretty darn good indication, much like speed minima are for corners, but I'm just wondering. Also, what is the reason for the differences in the speech/sketch alignments when comparing word groups and phrase groups?
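As a thought experiment, here is a minimal sketch of pause-based phrase grouping, assuming a list of (word, start, end) timestamps; the one-second pause threshold is a made-up parameter, not a value from the paper:

```python
def phrase_groups(words, pause_threshold=1.0):
    """Split a timestamped word stream into phrase groups wherever the gap
    between one word's end and the next word's start exceeds the threshold.
    `words` is a list of (text, start_sec, end_sec) tuples."""
    groups, current = [], []
    for word in words:
        if current and word[1] - current[-1][2] > pause_threshold:
            groups.append(current)
            current = []
        current.append(word)
    if current:
        groups.append(current)
    return groups

words = [("this", 0.0, 0.2), ("is", 0.3, 0.4), ("a", 0.5, 0.6),
         ("diode", 0.7, 1.1), ("and", 2.8, 2.9), ("a", 3.0, 3.1),
         ("resistor", 3.2, 3.8)]
print([[w[0] for w in g] for g in phrase_groups(words)])
# [['this', 'is', 'a', 'diode'], ['and', 'a', 'resistor']]
```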

The authors say that complete unrestricted speech recognition (narrative) is unduly difficult. Well, wasn't unrestricted sketch recognition the same way a decade ago? Things are hard. That's why PhDs have jobs. If things were easy, you wouldn't have to study for 10 years just to get good at them. Unlimited speech recognition is unduly hard right now, and wasn't a tractable option for this paper, but it will come. Understanding "enough" of the input is good for now, but what about the other stuff, like "junk" DNA? I bet it's important too.

Tuesday, November 6, 2007

Mahoney and Fromherz--Three main concepts

Mahoney, James V., and Markus P. J. Fromherz. "Three main concerns in sketch recognition and an approach to addressing them."

Summary



Mahoney and Fromherz identify three requirements that they feel any sketch recognition technology should possess. First, it should cope with ambiguity in the sketch. This includes variance due to sloppy drawings, difficulty in finding the correct articulation of the drawing (corner finding), and noisy drawings (stuff in the background, etc.). Second, the recognition process should be efficient and provide quick results. This means we must have methods that bound the normally exponential state space into something we can search in a tractable amount of time. Third, the method must be extensible to new domains; it can't rely heavily on domain-specific knowledge.

Their method is to construct a graph of the segments in the image. Using image processing techniques (not stroke input), they detect segment endpoints and create nodes. The edge joining the two endpoints of a segment is called a bond; edges between coincident endpoints of different segments are called links. The graph is put through a series of transformations that introduce some variability into the possible interpretations of the structure. These include linking nodes that are close to each other (gap closure), linking around small segments that lie between nearby endpoints (spurious segment jumping), adding nodes and links where lines intersect (adding corners at junctions), and merging corners where the segments meet smoothly (continuity tracing). Each of these methods has a modifiable parameter.
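Here is a rough sketch of that kind of graph construction with a gap-closure step; the class, names, and distance threshold are my own for illustration, and the image-derived segments would come from a vectorization step not shown here:

```python
import math
from itertools import combinations

class SegmentGraph:
    """Endpoints are nodes; each segment contributes a 'bond' edge between its
    two endpoints, and 'link' edges join endpoints that are nearly coincident."""
    def __init__(self, segments):
        # segments: list of ((x1, y1), (x2, y2)) endpoint pairs
        self.nodes = [p for seg in segments for p in seg]
        self.bonds = list(segments)
        self.links = []

    def close_gaps(self, max_gap=5.0):
        """Perceptual-organization step: link endpoints that fall within
        max_gap pixels of each other. max_gap is exactly the kind of
        hand-tuned parameter complained about in the discussion."""
        for a, b in combinations(self.nodes, 2):
            if 0 < math.dist(a, b) <= max_gap:
                self.links.append((a, b))

g = SegmentGraph([((0, 0), (10, 0)), ((12, 1), (12, 20))])
g.close_gaps()
print(g.links)  # [((10, 0), (12, 1))]
```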

Subgraph matching is used to determine the best fits for portions of the graph. The shapes that should appear in the graph, and that are looked for by the algorithm, are described in a constraint language that defines the different shapes and how they are linked together. It also provides a criterion that the algorithm should try to minimize (a set of soft constraints, reminiscent of LADDER's method of minimizing deviation from its ideal constraints). For instance, they say that a head should be half as long as a stick figure's torso, and so on. They limit some of the search space using a few a priori assumptions and hints, which they call salience, anchoring, grouping, and model bias.
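A toy illustration of that constraint-language idea (my own made-up syntax, not the authors'): a model lists soft constraints over candidate graph parts, and the matcher scores an assignment by how badly the constraints are violated, preferring assignments that minimize the total penalty.

```python
class Seg:
    """Stand-in for a matched chain of graph segments."""
    def __init__(self, length):
        self.length = length

# Each soft constraint maps a candidate assignment of graph parts to model
# parts (e.g., {"head": ..., "torso": ...}) to a non-negative penalty.
def head_half_of_torso(parts):
    """Penalty grows as the head's length drifts from half the torso's."""
    return abs(parts["head"].length - 0.5 * parts["torso"].length)

def score(assignment, constraints):
    """Lower is better; the matcher searches for assignments minimizing this."""
    return sum(c(assignment) for c in constraints)

candidate = {"head": Seg(4.0), "torso": Seg(10.0)}
print(score(candidate, [head_half_of_torso]))  # 1.0
```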

Discussion



They state it themselves, but hand-tuning the parameters of the perceptual-organization methods for elaborating/rectifying the graph is bad. It completely violates their call for recognition processes that are applicable across domains. They undermine their approach, built on the idealistic claim of the "three concerns," by trashing one of those concerns outright. To their credit, they do state the need to learn or estimate these parameters automatically.

In the worst case subgraph matching requires exponential time. They say this is not usually the case for "most practical problems." Defined how? Is something practical only if it conforms to the simple case of stick figures? What about something more general? Even if this isn't the case most of the time, if something does take exponential time (which you cannot guarantee that it won't), you've blown another tenet of your argument (the part about interactive recognition). There are so many a priori things they use to reduce the running time that I doubt the ability of this method to generalize to a multi-purpose recognizer. It seems like it has promise, but only at the risk of exponential running time.

Again, no recognition results. How obnoxious. I guess you can just toss tenet 1 out the door too, since you can't prove that your method does any good at handling ambiguity.

This seems like a good method to use with Sezgin for corner finding/reconciliation. I wonder if their results would have improved if they had stroke data available to them instead of simple image processing.

Sharon and van de Panne -- Constellation Models

Sharon, D., and M. van de Panne. "Constellation Models for Sketch Recognition." EUROGRAPHICS Workshop on Sketch-Based Interfaces and Modeling, 2006.

Summary



Sharon and van de Panne's approach to sketch recognition uses a set of shapes drawn with single strokes. These shapes are assigned labels based on maximum-likelihood fits under spherical Gaussian distributions estimated from training examples. The label that results in the highest likelihood (given the Gaussian parameters for that label) is assigned to that shape. Pairs of shapes are also compared against each other, and Gaussians are trained for the pairings (pruned down to pairings between mandatory shapes, as opposed to shapes deemed optional in a sketch). Labels are chosen so that the pairings of shapes also maintain a high likelihood.

The per-stroke features include the center of the bounding box (x, y coordinates), the length of the bounding box's diagonal, and the cosine of that diagonal's angle. The features between pairs of strokes include the differences between the center points (delta x, delta y) and the distance between an endpoint of one stroke and the closest point on the other stroke (plus the same feature the other way around). These feature vectors are computed, and four-dimensional Gaussians are trained, with means and diagonal covariance matrices estimated from the training data. The trained Gaussians are used to assign likelihoods to the test examples' feature vectors.
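A minimal sketch of the per-stroke features and a diagonal-Gaussian log-likelihood, assuming strokes are lists of (x, y) points; the feature definitions are my paraphrase of the summary above, and the trained means/variances below are invented. (Working in log-likelihoods, as noted in the discussion, turns products into sums.)

```python
import math

def stroke_features(points):
    """Per-stroke features: bounding-box center (x, y), diagonal length,
    and the cosine of the diagonal's angle."""
    xs, ys = [p[0] for p in points], [p[1] for p in points]
    w, h = max(xs) - min(xs), max(ys) - min(ys)
    diag = math.hypot(w, h)
    return [(min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2,
            diag, (w / diag) if diag else 1.0]

def log_likelihood(x, mean, var):
    """Log-density of feature vector x under a diagonal-covariance Gaussian
    (a sum of per-dimension terms instead of a product of densities)."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

# Pick the label whose trained Gaussian gives the highest log-likelihood.
models = {"eye":   ([3.0, 2.0, 7.0, 0.80], [1.0, 1.0, 2.0, 0.05]),
          "mouth": ([3.0, 9.0, 15.0, 0.95], [1.0, 2.0, 4.0, 0.02])}
feats = stroke_features([(0, 0), (3, 1), (6, 4)])
print(max(models, key=lambda k: log_likelihood(feats, *models[k])))  # eye
```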

Label assignments are searched for using a branch-and-bound method. To avoid a brute-force search that is exponential (number of labels ^ number of strokes), different bounding heuristics are used to reduce the search space. One method is to assign all possible mandatory labels and trim branches with smaller likelihoods; this is still bad, since it remains exponential (number of mandatory labels ^ number of strokes). Another approach is a multi-pass thresholding algorithm: start at some high threshold, cut off any developing branch whose likelihood falls below it, and if no labeling is found, iterate with progressively lower thresholds until a full labeling is assigned. The third approach is to set a minimum likelihood for any individual label assignment in the search tree; any assignment with a likelihood below this threshold terminates that branch immediately. This per-assignment variant is the type of multi-pass thresholding actually implemented (as opposed to thresholding the cumulative likelihood of the labeling-so-far). The authors also discuss using hard constraints, such as forcing mouth parts to appear below nose parts.
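Here is a rough sketch of that per-assignment, multi-pass thresholding idea; the scores and threshold schedule are made up, and for brevity this version stops at the first complete labeling that survives rather than the most likely one, so it is not the authors' implementation.

```python
def search(strokes, labels, loglik, threshold):
    """Depth-first assignment of labels to strokes; prune any branch where a
    single assignment's log-likelihood falls below the threshold."""
    def recurse(i, assignment):
        if i == len(strokes):
            return assignment
        for label in labels:
            if loglik(strokes[i], label) >= threshold:
                result = recurse(i + 1, assignment + [label])
                if result is not None:
                    return result
        return None
    return recurse(0, [])

def multi_pass(strokes, labels, loglik, thresholds=(-2.0, -5.0, -10.0)):
    """Try progressively more permissive thresholds until a labeling is found."""
    for t in thresholds:
        labeling = search(strokes, labels, loglik, t)
        if labeling is not None:
            return labeling
    return None

# Toy per-assignment scores standing in for the trained Gaussians.
scores = {("s1", "eye"): -1.0, ("s1", "mouth"): -8.0,
          ("s2", "eye"): -6.0, ("s2", "mouth"): -3.0}
print(multi_pass(["s1", "s2"], ["eye", "mouth"],
                 lambda s, l: scores[(s, l)]))  # ['eye', 'mouth']
```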

The authors don't give any recognition results, only the amount of speed-up gained by the multi-pass algorithm. I guess it doesn't do so hot.

Discussion




  • The features they use don't tell anything about the contents of the bounding box. It seems like using something like Paulson's MARQS recognizer would give better context-free recognition that's not dependent solely on the size/aspect of the stroke. This would probably help recognition and decrease the amount of time spent labeling the parts. Especially if employed in a bottom-up fashion.
  • On computing the model's likelihood:

    1. They can use the log-likelihood and change the products to sums, to save computation time. Maximizing log(x) also maximizes x.
    2. They assume independence between the probabilities (otherwise you can't multiply them together). This is usually a safe assumption and greatly eases computation. But, for instance, does assigning the label "nose" to this shape have any effect on the label of "mouth" for that other shape? Possibly, but learning joint probabilities requires a great deal of training examples. I advocate the independence assumption, but wanted to point it out since they don't state it explicitly.
    3. They assume uniform priors. I don't agree with this assumption at all. They say it's because of a lack of training data. Well, you're training Gaussians with a lack of training data, so you might as well estimate priors too. This would be helpful for things like optional shapes, where it might not be the case that an eyebrow appears. So if we have something that we think might be an eyebrow (especially since their individual shape recognition only uses bounding-box information), we might want to weight that with the prior.

  • They have 20-60 training examples per shape. That's a lot! Plus they don't give recognition results, so we can only assume their method did horribly. I guess having that much information didn't help. Lack of training data is always an issue, so don't complain about it. If you can't get a good method for training priors, or hard constraints, using the limited data you have, you're doing it wrong.
  • Different classifiers work better for different amounts of training data. They don't have enough to fully describe a set of multivariate Gaussians and priors for use in a Bayes classifier (ML or MAP). I think something like k-nearest neighbors would work well, using the feature vectors as points in some high-dimensional space; a quick sketch of that idea follows below.
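As a concrete example of that suggestion, here is a minimal k-nearest-neighbors classifier over the same four-dimensional stroke features; the training vectors are invented for illustration, not taken from their data.

```python
import math
from collections import Counter

def knn_label(query, training, k=3):
    """training: list of (feature_vector, label) pairs. Classify by majority
    vote among the k closest training vectors (Euclidean distance)."""
    neighbors = sorted(training, key=lambda ex: math.dist(query, ex[0]))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Invented feature vectors: (center_x, center_y, diag_len, cos_angle)
training = [((3.0, 2.0, 7.0, 0.80), "eye"),
            ((3.2, 2.1, 6.5, 0.85), "eye"),
            ((3.1, 8.9, 15.0, 0.95), "mouth"),
            ((2.9, 9.2, 14.5, 0.97), "mouth"),
            ((3.0, 1.8, 7.3, 0.78), "eye")]
print(knn_label((3.0, 2.0, 7.2, 0.83), training))  # eye
```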