Thursday, August 30, 2007

Specifying Gestures by Example

"Specifying Gestures by Example," Dean Rubine, 1991

Summary



Rubine developed an application that allows collections of highly accurate (97% accuracy) gesture recognizers to be built and trained quickly and easily. More precisely, Rubine's methods do not apply to sketches as one might commonly think of them, but are limited to single-stroke gestures (drawn without lifting the "pen" from the "paper"). This limitation prevents the use of complex strokes, forcing some unnatural methods for drawing certain gestures (e.g., drawing the letter "x" without lifting the pen) and introducing a bit of a learning curve, but it simplifies the computations needed to recognize gestures immensely.

Built on top of his GRANDMA system, Rubine's gesture-based drawing program (GDP) allows one to define gesture classes. A gesture class gives meaning to a gesture: for example, a stroke determined to belong to the "draw rectangle" class triggers the operations defined by that class. After defining a new gesture class, the user inputs several example strokes for that gesture, covering any variability in size or orientation expected in practice. Semantic meanings can then be attached to the gestures so that operations are performed once they are recognized.

Rubine defines 13 features used to describe a stroke, including the initial angle, the size and dimensions of the stroke's bounding box, the total rotation in the figure, and so on. Each feature is computed from the sample points that make up the "line" drawn by the stroke (each sample point contains an x and y coordinate along with a timestamp).
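As a rough illustration, here is a minimal sketch of how a few such features might be computed from the raw samples. It covers only an illustrative subset (not Rubine's full 13), and the function name and exact formulas are my own assumptions rather than the paper's definitions.

    import math

    def stroke_features(points):
        """Compute a few Rubine-style features from a stroke.

        points: list of (x, y, t) samples in drawing order (assumes >= 3 samples).
        Returns a small feature vector; an illustrative subset only.
        """
        xs = [p[0] for p in points]
        ys = [p[1] for p in points]

        # Initial angle, taken between the first and third sample to smooth noise.
        dx0 = points[2][0] - points[0][0]
        dy0 = points[2][1] - points[0][1]
        d0 = math.hypot(dx0, dy0) or 1.0
        cos_init, sin_init = dx0 / d0, dy0 / d0

        # Length of the bounding box diagonal.
        bbox_diag = math.hypot(max(xs) - min(xs), max(ys) - min(ys))

        # Total path length and total rotation (sum of signed turn angles).
        length, rotation = 0.0, 0.0
        prev_angle = None
        for (x0, y0, _), (x1, y1, _) in zip(points, points[1:]):
            dx, dy = x1 - x0, y1 - y0
            length += math.hypot(dx, dy)
            angle = math.atan2(dy, dx)
            if prev_angle is not None:
                rotation += math.atan2(math.sin(angle - prev_angle),
                                       math.cos(angle - prev_angle))
            prev_angle = angle

        return [cos_init, sin_init, bbox_diag, length, rotation]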

The features are fed into a linear classifier to decide which gesture the stroke "looks like." For each gesture class c, a weight is computed for each feature, along with a per-class constant term. The stroke's feature values are multiplied by that class's weights and summed together with the constant; the class that gives the highest weighted sum is the class the stroke most looks like, and the stroke is classified as that gesture. The weights are computed using the inverse of the covariance matrix of the features, so, roughly speaking, independent features end up weighted more heavily than highly correlated ones. This makes sense because you don't want highly correlated features reinforcing one another; you want clear-cut decisions. In the case that a stroke ambiguously looks like it could belong to more than one class, statistical methods can be employed to reject classifications whose estimated probability of being correct falls below a threshold.
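To make the decision rule concrete, here is a minimal sketch of the classification step, assuming the per-class weights have already been trained; the function and parameter names are mine, not the paper's.

    def classify(features, weights):
        """Pick the gesture class with the highest linear evaluation.

        features: list of feature values f_1..f_F computed for the stroke.
        weights:  dict mapping class name -> (w0, [w_1..w_F]), where w0 is the
                  per-class constant term and w_1..w_F weight each feature.
        Returns (best_class, scores) so callers can also inspect runners-up.
        """
        scores = {}
        for cls, (w0, w) in weights.items():
            scores[cls] = w0 + sum(wi * fi for wi, fi in zip(w, features))
        best = max(scores, key=scores.get)
        return best, scores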

To improve on his method, Rubine suggests eager recognition (resolving a stroke's class as soon as the classification becomes unambiguous) and multi-finger recognition (systems that recognize two strokes at the same time), which might be adapted to multi-stroke gestures.

Discussion



The gesture recognition system is incredibly simple on many fronts. It's simple in that one can set up new gesture classes with a few mouse clicks, simple in that only a couple dozen training examples are needed per class, simple in its computation of the various features, and extremely simple in its use of a linear classifier to do its work. However, it is also simple in the types of gestures it can interpret (only single-stroke, no segmented gestures) and in the number of gesture classes it can handle. In spite of its simplicity, or perhaps because of it, GDP seems to fill a nice niche. For simple sets of classes that can be drawn decently with one stroke (for example, the Latin alphabet), this method is extremely accurate. Obviously it can't be used for more complex things (differentiating between a single-stroke armchair and a single-stroke couch in a system for interior designers might be a bit ridiculous), but complex strokes are not its domain.

One thing I was wondering about was the eager recognition technique. Rubine mentions that a stroke is classified to its gesture class as soon as the classification becomes unambiguous. Without digging through the Internet to locate the referenced papers, I was trying to think of different ways he might do this. It seems like the easiest way would be to use the same probabilistic approach used to reject ambiguous classifications: keep track of a probability estimate for each class as the stroke is drawn, and when one class rises above some threshold (and the rest fall below it), make the decision right then. It seems like this could be done fairly cheaply in terms of complexity (just constant-time updates to running quantities such as total rotation or total path length, and a simple recomputation of the distance from the start point to the current end point).
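Here is a rough sketch of that guess, reusing the classify sketch from above. To be clear, this is my own speculation about how early commitment might work, not the eager recognizer Rubine actually cites; the threshold value and the softmax-style probability estimate are assumptions.

    import math

    def eager_decision(features_so_far, weights, threshold=0.95):
        """Commit to a gesture class early if one class clearly dominates.

        Converts the linear scores from classify() into rough probabilities
        and commits once the best class exceeds `threshold`.
        Returns the class name, or None to keep waiting for more of the stroke.
        """
        best, scores = classify(features_so_far, weights)
        m = max(scores.values())
        exp_scores = {c: math.exp(s - m) for c, s in scores.items()}  # stable softmax
        prob_best = exp_scores[best] / sum(exp_scores.values())
        return best if prob_best >= threshold else None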

Also, to get free training examples, I wonder if Rubine adds successfully classified strokes back into the training set (requiring a fresh computation of the mean feature values, the covariance matrices, and the weights). This would give free data that the user hasn't "undone" and might help the system learn a particular user's style of making the strokes (over time the user's subtle nuances would dominate the feature means and allow the recognition to evolve).
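For concreteness, here is a sketch of what that retraining might look like. It roughly follows the paper's description (per-class mean feature vectors and a covariance matrix pooled across classes, inverted to get the weights), but the function names, the use of a pseudo-inverse, and the NumPy dependency are my own choices.

    import numpy as np

    def train(examples):
        """Recompute Rubine-style weights from scratch.

        examples: dict mapping class name -> list of feature vectors.
        Returns a dict mapping class name -> (w0, weight_vector), i.e. the
        format the classify() sketch above expects.
        """
        means = {c: np.mean(np.asarray(v, float), axis=0) for c, v in examples.items()}
        dim = len(next(iter(means.values())))
        pooled = np.zeros((dim, dim))
        dof = 0
        for c, vecs in examples.items():
            diffs = np.asarray(vecs, float) - means[c]
            pooled += diffs.T @ diffs
            dof += len(vecs) - 1
        pooled /= max(dof, 1)
        inv = np.linalg.pinv(pooled)  # pseudo-inverse guards against a singular matrix

        weights = {}
        for c, mean in means.items():
            w = inv @ mean
            weights[c] = (-0.5 * float(w @ mean), w)
        return weights

    # "Free" examples: if the user does not undo a recognized stroke, fold it back in.
    # examples["rectangle"].append(stroke_features(points))
    # weights = train(examples)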

And in final contemplation…Different strokes for different folks (kill me now, but I couldn’t resist the temptation).

Sketchpad

“Sketchpad: A Man-Machine Graphical Communication System,” Ivan Sutherland, 1963

Summary



The Sketchpad system was a 1960s-era program (running on MIT Lincoln Laboratory's TX-2 computer) that allowed for dynamic and precise graphical input using a light pen and a bank of buttons, knobs, and toggle switches. The extremely modal operation did not allow for context-driven interaction; each action required the pressing of a specific button or sequence of buttons, possibly in addition to actions with the light pen (the primary source of input).

Sketchpad was fairly revolutionary in that it was one of the first systems to pave the way for the modern concepts of object-oriented programming and design. Sketchpad allowed drawings to be saved as symbols. These symbols could be opened in subsequent drawings as instances, or sub-pictures, and manipulated individually (scaled, rotated, etc., but not fundamentally changed in their definition). Constraints could also be placed on drawings to force relationships between objects, helping to beautify and control the precision of a drawing (making lines parallel, the same length, etc.). Changes to the base symbol resulted in identical changes to every instance of that symbol used as a sub-picture.

Components of similar type (lines, points, constraints, etc.) were stored in circular doubly-linked lists, called rings. Additionally, all objects sharing a common trait (such as all line segments terminating at a common point) were linked together. These lists allowed for easy insertion of new objects, deletion of old objects, and operations such as the merging of different objects (merging two points would modify the line segments that terminated at those points). The generic blocks stored in the linked lists made the code easy to manage and extend, a precursor to the ideas of polymorphism and inheritance. Constraints are also represented as objects in rings, connected to the symbols and variables they control.
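As a small illustration of the data structure, here is a minimal sketch of such a ring (a circular doubly-linked list). The class and method names are my own, and this ignores the register-level layout the paper actually describes.

    class RingNode:
        """One element in a Sketchpad-style ring (circular doubly-linked list)."""

        def __init__(self, payload):
            self.payload = payload
            self.next = self
            self.prev = self

        def insert_after(self, node):
            """Splice `node` into the ring immediately after this element."""
            node.prev, node.next = self, self.next
            self.next.prev = node
            self.next = node

        def remove(self):
            """Unlink this element from its ring in constant time."""
            self.prev.next = self.next
            self.next.prev = self.prev
            self.next = self.prev = self

        def elements(self):
            """Walk the ring once, starting from this element."""
            node = self
            while True:
                yield node.payload
                node = node.next
                if node is self:
                    break

    # e.g. a ring of all line segments terminating at a shared point:
    # point_ring = RingNode("point P")
    # point_ring.insert_after(RingNode("segment A"))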

Discussion



The obvious strengths of this paper, things that were groundbreaking and fairly revolutionary for their time, were the ideas that later came to be known formally as object-oriented programming, polymorphism, and inheritance. It was interesting to read that one could manipulate a symbol's definition, for example changing hexagons into fish scales, and the change would be reflected immediately in all instances of that symbol. The generic object structure that made the system extensible was not only ahead of its time but also anticipated the ideas of inheritance and polymorphism.

One of the weaknesses is the horrible complexity of the Sketchpad system. Its modal input system, with its incredible number of buttons, knobs, and toggle switches, limited the use of Sketchpad to only a few people beyond its creator. However, I don't think it's fair to fault Sketchpad too much for this. At the time of its creation, 1963, the computers that existed didn't have complex operating systems or SDKs to allow for the creation of very flexible programs. Hardware was still very tightly tied to the software, the transistor was pretty much brand new (field-effect transistors were barely out of the lab), and computer systems were fairly special-purpose. And, despite the complexity of its interface, Sketchpad was able to do some fairly amazing graphics manipulation for its time.

Introduction to Sketch Recognition

“Introduction to Sketch Recognition,” Tracy Hammond and Kenneth Mock

Summary



Pen-based interfaces can be extremely useful, intuitive, and accurate in many domains, much more so than the current use of a mouse. Systems that use pen input include Tablet PCs, interactive white-/blackboards for classroom use, and personal digital assistants (PDAs). These devices can be either passive, where a stylus is used on a touch screen and a "tap" is equivalent to a mouse click, or active, where a special pen uses electromagnetic signals to relay its position to the display, moving the cursor without having to touch the pen to the surface.

One domain where pen-input systems excel, whether using digital ink (saving a sketch as raw strokes without performing any recognition to deduce meaning) or sketch recognition (trying to deduce meaning from the drawings), is the classroom. In conjunction with other software/hardware combinations such as digital projectors and screen capture/recording programs (including audio), pen-input systems can be used to add dynamic content to static lectures, provide a medium for distributing, receiving, and grading electronic versions of homework assignments, and provide immediate feedback within the classroom setting. Notes jotted on lecture slides can be saved as digital ink, whereas supplementary materials like chemical formulae, mathematical equations, and physics demonstrations can be drawn or written and passed into a sketch recognition program to build 3D molecule models, plot function graphs, and animate physical interactions.

The FLUID framework (based on the LADDER and GUILD sketch recognition systems) allows users to define new domains and build systems that recognize new sets of sketches. Given a method for obtaining pen-input data, the framework assigns meaning to the drawings based on the user's specifications and can then optionally pass this data on to another program (such as a CAD system for manipulating physical models of real-world objects).

Discussion



I was skeptical about the authors' claim that mice "do not have the natural feel of a pen, nor [do they] provide a pen's accuracy." Being a long-time computer user who owns a high-grade optical mouse, I am quite satisfied with my level of accuracy and comfort with a mouse. However, I realize I am the exception: everyone is familiar with a pen, and anyone who can write on paper can write on a pen-input device with very little learning curve. Getting that accuracy out of a mouse takes quite a bit more use and practice. Additionally, after using both the Wacom monitors and the Tablet PCs in the lab, I realize just how easy it is to use a pen-input device. Yes, it is much more intuitive than a mouse.

One of the ideas I like most about this paper is the use of digital ink for white-/blackboard presentations and lecture annotations. While in class, I almost constantly wish for some easy method to capture the content of the blackboard. I can transcribe what the professor writes into my own notes, but again I’m left without the flexibility of editing, copying, pasting, and “time lapse” that digital ink could provide. The paper left me very excited about the possibilities of pen-input and sketch recognition systems.

Wednesday, August 29, 2007

About Me

My name is Joshua Johnston.

I am a first-year PhD student at Texas A&M University with an MS in Computer Science from Baylor University.

My email address is myfirstname.mylastname at NEO-TAMU (or jbjohns AT CSDL, which just forwards to my NEO account)



My wife of 5 years was the Study Abroad Coordinator for the Computer Science Department at Baylor University for their trip to Shanghai, China. I got to tag along. This photograph was taken in the garden at the base of the Oriental Pearl Tower, Asia's highest and the world's third-highest tower.

My academic interests lie generally in the area of machine learning and artificial intelligence. My Master's thesis was specifically about unsupervised learning--applying mixture models of the exponential Dirichlet compound multinomial to cluster basic block vector data for the SimPoint application. Though still interested generally in ML and AI, I'd like to use the resources presented by a larger university and larger Computer Science department to see about applying ML and AI techniques to different areas, and to explore other areas not related to clustering to see if I might enjoy research in them.

As stated, I already have my MS in Computer Science, with my thesis on unsupervised learning, and also a BS in Computer Science. I have a great deal of programming experience in Java (language of personal choice, with many hours of use in both academia and industry) and MATLAB. I can code in C++ and am fairly fluent, but prefer Java (on a strictly I-use-it-more basis).

I am taking this class because there are a lot of opportunities to increase my ML/AI experience and apply the two fields to new domains (new to me). Plus, it's neat to see how the "magic" of a handwriting recognition system (for example) really works. One of the first applications I coded in my machine learning class was a classifier for handwritten zip-code digits. It's a hard problem, to be sure, and I'd like to see more information on what the state of the art is.

From this class, I hope to gain a new insight into how the difficult problems associated with assigning meaning to user input are solved in this domain. I've seen firsthand how difficult this problem can be elsewhere. In a domain with so much variability and flexibility as sketch recognition, how does one cope with these difficulties?

In 5 years, I will hopefully have finished my PhD and be earning a lot of money as a professor at some prestigious university. If I can't do that, I'll settle for making even more money in industry. :) Either way, I want to research and push the field and myself farther. I always want to learn, and I feel research is one of the best ways to do that (as opposed to slinging code from 8 to 5).

In 10 years, I hope to be tenured, well published, and really in my stride as a contributing professional in my research area (whatever that turns out to be). If I end up in industry rather than academia, I hope the only difference is that I don't have to worry about tenure.

When I'm not doing academic related things, I like spending time with my wife. We enjoy hiking, camping, playing racquetball, and watching movies together. Since I'm such a computer nerd, I also enjoy playing video games. My wife...not so much. :)

Fun Story: While in Shanghai, we had the unique experience of visiting a Chinese Wal-Mart. It was just that, an experience. For example, in their fresh seafood department, you could net your own catfish, eels, turtles, and frogs. In their fresh meat department, if you wanted a rack of spare ribs, you just picked one up out of the bin and chunked it in your buggy; plastic baggies or gloves were purely optional. Same thing with the bins of whole, defeathered chickens and chicken feet. It made me thankful for the USA, plastic wrap, and Styrofoam.