Wednesday, February 27, 2008

Sagawa - Recognizing Sequence Japanese Sign Lang. Words

Sagawa, H. and Takeuchi, M. 2000. "A Method for Recognizing a Sequence of Sign Language Words Represented in a Japanese Sign Language Sentence." In Proceedings of the Fourth IEEE international Conference on Automatic Face and Gesture Recognition 2000 (March 26 - 30, 2000). FG. IEEE Computer Society, Washington, DC, 434.

Summary



The authors present a method for segmenting gestures in Japanese sign language. Using a set of 200 JSL sentences (100 for training and 100 for testing), they train a set of parameter thresholds. The thresholds are used to determine the borders of signed words, to decide whether a word is one- or two-handed, and to distinguish transitions from actual words.

They segment gestures using "hand velocity," which is the average change in position of all the hand parts from one point to the next. Minimal hand velocity (when all the parts are staying relatively still) is flagged as a possible segmentation point (i.e., Sezgin's speed points). Another candidate test is a cosine metric, which compares the inner product of a hand's elements at the current point against a window of +-n points; if the change in angle is above a threshold, the point is flagged as a candidate (i.e., Sezgin's curvature points). Erroneous velocity candidates are thrown out if the velocity change from (t-n to t) or (t to t+n) is not great enough.
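To make the two candidate tests concrete, here is a minimal numpy sketch of how I read them. The threshold values, the window size n, and the (time, hand part, xyz) array layout are my own assumptions, not the paper's trained parameters.

import numpy as np

def segmentation_candidates(frames, n=3, v_thresh=0.05, angle_thresh=0.5):
    """frames: (T, J, 3) array of hand-part positions over T time steps.
    Returns indices flagged as possible word borders. Thresholds are
    illustrative placeholders, not the trained values from the paper."""
    # "Hand velocity": average displacement of all hand parts between frames.
    disp = np.linalg.norm(np.diff(frames, axis=0), axis=2)   # (T-1, J)
    velocity = disp.mean(axis=1)                             # (T-1,)

    candidates = []
    for t in range(n, len(velocity) - n):
        # Velocity candidate: a local minimum that is preceded/followed by
        # a sufficiently large change in velocity (drops spurious minima).
        if velocity[t] <= velocity[t - 1] and velocity[t] <= velocity[t + 1]:
            if (velocity[t - n] - velocity[t] > v_thresh or
                    velocity[t + n] - velocity[t] > v_thresh):
                candidates.append(t)
                continue
        # Cosine candidate: large change in movement direction over +/- n frames.
        before = (frames[t] - frames[t - n]).ravel()
        after = (frames[t + n] - frames[t]).ravel()
        denom = np.linalg.norm(before) * np.linalg.norm(after)
        if denom > 1e-9:
            angle = np.arccos(np.clip(np.dot(before, after) / denom, -1.0, 1.0))
            if angle > angle_thresh:
                candidates.append(t)
    return candidates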

Determination of which hands are used (both vs. one hand, right vs. left hand) is done by comparing the hand velocities of the two hands, both on which hand's maximum velocity is greater (Eq. 3) and on whether the average squared difference in velocity exceeds 1 (Eq. 4). Thresholds are trained to recognize these values.

Using their method, they segment words correctly 80% of the time and misclassify transitions as words 11% of the time. They report that this improves classification accuracy of words (78% to 87%) and sentences (56% to 58%).

Discussion



So basically they're using Sezgin's methods. I don't like all the thresholds. They should have done something more statistically valid and robust, since this approach requires extensive training and is very training-set dependent. Furthermore, different signs and gestures seem like they will need different thresholds, so thresholds trained on the whole set will always segment some signs wrong. I guess this is why their accuracy is less than stellar.

Basically, they just look at which hand is moving more, or if both hands are moving about the same, to tell one/two-handed and right/left-handed. Meh. Not that impressed.

BibTeX



@inproceedings{796189,
author = {Hirohiko Sagawa and Masaru Takeuchi},
title = {A Method for Recognizing a Sequence of Sign Language Words Represented in a Japanese Sign Language Sentence},
booktitle = {FG '00: Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition 2000},
year = {2000},
isbn = {0-7695-0580-5},
pages = {434},
publisher = {IEEE Computer Society},
address = {Washington, DC, USA},
}

Storring - Computer Vision-Based Gesture Rec for Augmented Reality

M. Störring, T.B. Moeslund, Y. Liu, and E. Granum. "Computer Vision-based Gesture Recognition for an Augmented Reality Interface." In Proceedings of 4th IASTED International Conference on Visualization, Imaging, and Image Processing. Marbella, Spain, Sep 2004: 766-71.

Summary



The authors present a vision-based system for simple posture recognition--a hand with 1-5 digits extended. They do skin color matching and segmentation to get a blob of the hand with normalized RGB values. These are then modeled as 2D Gaussians (chromaticity in the r and g dimensions) and clustered. They choose the cluster corresponding to skin color, defined as having a certain number of pixels (min and max thresholds). Filtering is used to make the hand blobs continuous. They locate the center of the hand blob and, using concentric circles with expanding radii, count the number of extended digits. This is their classification.
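Here is a rough sketch of the pipeline as I read it: convert to normalized rg chromaticity, keep pixels close to a 2D Gaussian skin model, then count the skin runs crossed by concentric circles around the hand center. The Gaussian parameters, the Mahalanobis threshold, and the run-counting heuristic are stand-ins of mine, not the authors' actual implementation.

import numpy as np

def rg_chromaticity(img):
    """Convert an RGB image (H, W, 3) to normalized r, g chromaticities."""
    rgb = img.astype(float)
    s = rgb.sum(axis=2) + 1e-9
    return rgb[..., 0] / s, rgb[..., 1] / s

def skin_mask(img, mean, cov, mahal_thresh=3.0):
    """Label pixels as skin if their (r, g) value is close to a 2D Gaussian
    skin model (mean, cov); mean/cov would come from the clustering step."""
    r, g = rg_chromaticity(img)
    x = np.stack([r - mean[0], g - mean[1]], axis=-1)
    inv = np.linalg.inv(cov)
    d2 = np.einsum('...i,ij,...j->...', x, inv, x)   # squared Mahalanobis distance
    return d2 < mahal_thresh ** 2

def count_fingers(mask, center, radii):
    """Crude digit counter: walk concentric circles around the hand center
    and count connected runs of skin pixels crossed by each circle."""
    counts = []
    for rad in radii:
        theta = np.linspace(0, 2 * np.pi, 360, endpoint=False)
        ys = np.clip((center[0] + rad * np.sin(theta)).astype(int), 0, mask.shape[0] - 1)
        xs = np.clip((center[1] + rad * np.cos(theta)).astype(int), 0, mask.shape[1] - 1)
        ring = mask[ys, xs].astype(int)
        # Number of 0 -> 1 transitions around the ring = number of skin runs crossed.
        counts.append(int(np.sum((np.roll(ring, 1) == 0) & (ring == 1))))
    return max(counts) if counts else 0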

Discussion



If you have a tiny or huge hand, or the camera is zoomed in or out, your hand pixels may be too numerous or too sparse to fall within their limits, so the hand won't get picked up correctly in the skin detection/hand tracking part of the algorithm.

Fingers must be spread enough for the concentric circles to say there are two fingers and not just one.

I'd like details on how they find the center of the hand for their circles. I'd also like details on how they identify different fingers. For example, for their "click" gesture, do they just assume that a digit at 90 degrees to the hand that's seen/unseen/seen is a thumb moving? How do they sample the frames to get those three states?

First sentence: "less obtrusive." Figure 1: Crazy 50 pound head sucker thing. Give me my keyboard and mouse back.

BibTeX



@proceedings{storring2004visionGestureRecAR
,author="M. Störring and T.B. Moeslund and Y. Liu and E. Granum"
,title="Computer Vision-based Gesture Recognition for an Augmented Reality Interface"
,booktitle="4th IASTED International Conference on Visualization"
,address="Marbella, Spain"
,month="Sep"
,year="2004"
,pages="776--771"
}

Monday, February 25, 2008

Westeyn - GT2k

Westeyn, Tracy, Helene Brashear, Amin Atrash, and Thad Starner. "Georgia Tech Gesture Toolkit: Supporting Experiments in Gesture Recognition." In Proceedings of the International Conference on Perceptive and Multimodal User Interfaces 2003 (ICMI), November 2003.

Summary



Westeyn et al. present a toolkit to simplify the recognition of hand gestures using hidden Markov models. Their system, dubbed GT2k, runs on top of HTK, the HMM toolkit used for speech recognition. It abstracts away the complexity of HMMs and their application to speech recognition (it has been shown that speech-style models do a good job recognizing gestures, too). You provide feature vectors, a grammar defining the classification targets and how they are related, and training examples. The system trains the models and can then be used for classification.
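GT2k itself wraps HTK, which I'm not going to reproduce here, but the one-model-per-gesture workflow it automates looks roughly like this sketch using the hmmlearn library instead. This is an analogy to the workflow (train per-class HMMs on labeled feature sequences, classify by highest likelihood), not GT2k's actual API, and it skips the grammar part entirely.

import numpy as np
from hmmlearn import hmm

def train_models(examples, n_states=5):
    """examples: dict mapping class label -> list of (T_i, D) feature arrays.
    Trains one Gaussian HMM per class, the usual one-model-per-gesture setup."""
    models = {}
    for label, seqs in examples.items():
        X = np.vstack(seqs)
        lengths = [len(s) for s in seqs]
        m = hmm.GaussianHMM(n_components=n_states, covariance_type='diag', n_iter=50)
        m.fit(X, lengths)
        models[label] = m
    return models

def classify(models, seq):
    """Return the class whose HMM gives the new sequence the highest log-likelihood."""
    return max(models, key=lambda label: models[label].score(seq))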

They give several example applications that use GT2k. The first recognizes postures performed between a camera and an array of IR LEDs, and achieves 99.2% accuracy for 8 classes. They also give examples of blink-print identification, mobile sign language recognition (90%), and workshop activity recognition.

Discussion



So first off, it's neat that there is a little toolkit thing we can use to do hand gesture recognition. Being built on top of an HMM toolkit for speech recognition isn't too scary, since HMMs pretty much pwn speech rec. It also makes HMMs more available to the masses.

That being said, I don't feel like the authors really applied their toolkit to any example that is truly worthy of the power of an HMM. The driving example could be done with a simple neural network and is crazy easy even with template matching. The blink-print thing, besides being dumb, is just short/long sequence identification and template matching / nearest neighbor. Telesign... their grammar looks like you'd have to specify all possible orderings of words (UGH!). I think GT2K has promise in this area, however. Workshop activity recognition... besides the fact that it's neat the sensor data can classify activities, this application is absurd.

However, again I'd like to clarify that the GT2K is a great idea and I'd like to use it more, hopefully with more worthy applications.

BibTeX



@inproceedings{958452,
author = {Tracy Westeyn and Helene Brashear and Amin Atrash and Thad Starner},
title = {Georgia tech gesture toolkit: supporting experiments in gesture recognition},
booktitle = {ICMI '03: Proceedings of the 5th international conference on Multimodal interfaces},
year = {2003},
isbn = {1-58113-621-8},
pages = {85--92},
location = {Vancouver, British Columbia, Canada},
doi = {http://doi.acm.org/10.1145/958432.958452},
publisher = {ACM},
address = {New York, NY, USA},
}

Friday, February 22, 2008

Lichtenauer - 3D Visual recog. of NGT sign production

J.F. Lichtenauer, G.A. ten Holt, E.A. Hendriks, M.J.T. Reinders. "3D Visual Detection of Correct NGT Sign Production." Thirteenth Annual Conference of the Advanced School for Computing and Imaging, Heijen, The Netherlands, June 13-15 2007.

Summary



Lichtenauer et al. discuss a system for recognizing Dutch sign language (NGT) gestures using two cameras and an image processing approach. The system is initialized semi-automatically to capture the skin color of the face. The hands are then located in the image by finding the points that have the same color as the face. Gesture segmentation is manually enforced, with the hands at rest on the table between gestures. The gestures are turned into feature vectors of movement/angle through space (blob tracking) and time, and compared to a reference gesture per class using dynamic time warping. The features are classified independently of one another, and the results per class per feature are summed, giving one average probability per class (across the features). If the probability is above a certain threshold, the gesture is labeled as that class. They report a 95% true positive rate and a 5% false positive rate.
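A bare-bones sketch of the per-feature DTW comparison as I read it. I'm using the negative mean DTW distance as a stand-in for their summed per-feature probabilities, and the acceptance threshold is a placeholder, so treat this as the shape of the idea rather than their actual classifier.

import numpy as np

def dtw_distance(a, b):
    """Classic dynamic time warping distance between two 1D sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def classify(gesture, references, accept_thresh):
    """gesture: (T, F) feature matrix. references: dict label -> (T_ref, F) reference.
    Scores each feature independently against the reference, averages the scores,
    and accepts the best class only if it clears a threshold (values are made up)."""
    scores = {}
    for label, ref in references.items():
        per_feature = [dtw_distance(gesture[:, f], ref[:, f]) for f in range(gesture.shape[1])]
        scores[label] = -np.mean(per_feature)   # higher = more similar
    best = max(scores, key=scores.get)
    return best if scores[best] > accept_thresh else None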

Discussion



This method seems pretty hardcore on the computation, since they're doing a classifier for each of the ~5000 features. I don't know if that's how all DTW stuff works, but I think you could do something to dramatically reduce the amount of computation.

If you wear a short sleeve shirt, will the algorithm break if it starts trying to track your forearms or elbows? It's just using skin color, so I think it might.

They use the naive Bayes assumption to make all their features independent of each other. I think this is pretty safe to do, especially as it simplifies computation. They do mention that even though some features might contain correlation, they've added features to capture this correlation independently, and extract it out of the space "between" features (that's a hokey way to put it, sorry).

They don't report accuracy, but true positives. This is pretty much bogus, as far as I'm concerned, as it doesn't tell you much about how accurate their system is at recognizing gestures correctly.

BibTex



@proceedings{lightenauer2007ngtSign
,author="J.F. Lichtenauer and G.A. ten Holt and E.A. Hendriks and M.J.T. Reinders"
,title="{3D} Visual Detection of Correct {NGT} Sign Production"
,booktitle="Thirteenth Annual Conference of the Advanced School for Computing and Imaging"
,address="Heijen, The Netherlands"
,year="2007"
,month="June"
}

LaViola - Survey of Haptics

LaViola, J. J. 1999. A Survey of Hand Posture and Gesture Recognition Techniques and Technology. Technical Report CS-99-11, Brown University.


Summary



We read only chapters 3 and 4. LaViola gives a nice summary of the many methods for haptic recognition and of the many domains where it can be used.

Template matching (like a $1 for haptics) is easy and has been implemented with good accuracy for small sets of gestures. Feature-based classifiers, like Rubine's, have achieved very high accuracy rates, as well as segmentation of gestures. PCA can be used to form "eigenpostures" and possibly to simplify the data for recognition. Obviously, as we've seen many times in class, neural networks and hidden Markov models can both be used to achieve high accuracy on complex data sets, but both require extensive training and some a priori knowledge of the data set (the number of hidden layers/units and the number of hidden states for nets and HMMs, respectively). Instance-based learning, such as k-nearest neighbors, has also been briefly touched upon in the literature, but not much investigation has been performed. Other techniques, like using formal grammars to describe postures/gestures, are also discussed, but not much work has been done in these areas.
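Since instance-based learning is the option LaViola says has had the least investigation, it's worth noting how low the barrier is: a scikit-learn k-NN over posture feature vectors is a few lines. The data below is random filler, and the 18-feature, 5-class setup is made up purely to show the shape of the approach.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical data: each posture is a fixed-length feature vector
# (e.g., joint-bend values from a glove), with one label per example.
X_train = np.random.rand(200, 18)          # 200 example postures, 18 features
y_train = np.random.randint(0, 5, 200)     # 5 posture classes

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)

new_posture = np.random.rand(1, 18)
print(clf.predict(new_posture))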

The application domains for hand gesture recognition are basically all the stuff we've seen in class: sign language, virtual environments, controlling robots/computer systems, and 3D modelling.

Discussion



This was a very nice overview of the field. I'm most interested in exploring:

  • Template matching methods and feature based recognition (Sturman and Wexelblat)

  • PCA for gesture segmentation

  • Using a k-nearest neighbors approach to classification

  • Defining a constraint grammar to express a posture/gesture



All my ideas (except for the last) center on representing a posture/gesture with a vector of features. Picking good features might be hard, as it is in sketch rec, but I think that it can be done (analogous to PaleoSketch).



BibTeX


@techreport{864649,
author = {Joseph J. LaViola, Jr.},
title = {A Survey of Hand Posture and Gesture Recognition Techniques and Technology},
year = {1999},
source = {http://www.ncstrl.org:8900/ncstrl/servlet/search?formname=detail\&id=oai%3Ancstrlh%3Abrowncs%3ABrownCS%2F%2FCS-99-11},
publisher = {Brown University},
address = {Providence, RI, USA},
}

Komura - Real-Time Locomotion w/ Data Gloves

Komura, T. and Lam, W. 2006. Real-time locomotion control by sensing gloves: Research Articles. Comput. Animat. Virtual Worlds 17, 5 (Dec. 2006), 513-525. DOI= http://dx.doi.org/10.1002/cav.v17:5

Summary



Komura and Lam present a method for mapping the movements of the fingers (while wearing a data glove) and the hand onto the control of a character in a 3D game: walking, running, hopping, and turning. They achieve this in two steps. The first is to give the user an on-screen example of a character walking and have the user mimic the action. During this stage, the data from the glove is calibrated to determine the periodicity of movements, etc., and a mapping from the finger movements to the movements of the character on the screen is made. The movement of each finger is compared to the movement of each end-point of the figure (legs, chest, etc.), and the finger feature vector that has the smallest angle to a given body part's feature vector (velocities and directions) is mapped to that body part. The mapping of function values (amount of finger movement to amount of body movement) is made by fitting a B-spline regression. The user can then move his hands around and make the character walk/run/jump with the mapped values.

They perform a user study by making people run a character through a narrow set of passages, and find that users get through just as fast with the keyboard as with the glove, but tend to be more accurate and have fewer collisions with walls using the glove. They attribute this to the intuitive interface of using one's hands to control a figure.
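The matching step is the part I found interesting, so here is a small sketch of how I read it: each body end-point gets assigned the finger whose motion feature vector makes the smallest angle with its own. The dictionary layout and the flattened feature vectors are my assumptions.

import numpy as np

def cosine_angle(u, v):
    """Angle between two feature vectors (velocities/directions flattened into 1D)."""
    denom = np.linalg.norm(u) * np.linalg.norm(v) + 1e-9
    return np.arccos(np.clip(np.dot(u, v) / denom, -1.0, 1.0))

def map_fingers_to_limbs(finger_feats, limb_feats):
    """finger_feats: dict finger -> 1D feature vector recorded during the mimicked walk.
    limb_feats: dict body part -> 1D feature vector of the on-screen character.
    Assigns each body part the finger whose feature vector makes the smallest angle with it."""
    mapping = {}
    for limb, lf in limb_feats.items():
        mapping[limb] = min(finger_feats, key=lambda f: cosine_angle(finger_feats[f], lf))
    return mapping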


Discussion



The one thing I did like about this paper was the way they mapped fingers to a set of pre-defined motions and then mapped the movements of the fingers to those of the characters. It seemed neat, but I don't think it has any research merit.

Why do you even have to learn anything? Everything they do is so rigid and pre-defined anyway, like the mapping of 2/4 legs of a dog to each finger with a set way for computing the period delay between front/rear legs. Why not just force a certain leg on the character to be a certain finger and avoid mapping altogether? Maybe you could still fit the B-spline to get a better idea of sensitivity, but the whole cosine thing is completely unnecessary.

They only use one test and it's very limited, so I don't think they can make the claim that "is is possible to conclude that the sensing glove controlling more effective when controlling the character more precisely." I also want standard deviation and significance levels for Tables 1 and 2, though for such a small sample size these might not be meaningful.

BibTeX



@article{1182569,
author = {Taku Komura and Wai-Chun Lam},
title = {Real-time locomotion control by sensing gloves: Research Articles},
journal = {Comput. Animat. Virtual Worlds},
volume = {17},
number = {5},
year = {2006},
issn = {1546-4261},
pages = {513--525},
doi = {http://dx.doi.org/10.1002/cav.v17:5},
publisher = {John Wiley and Sons Ltd.},
address = {Chichester, UK},
}

Wednesday, February 20, 2008

Freeman - Television Control

Freeman, William T. and Craig D. Weissman. "Television Control by Hand Gestures." In Proceedings of the IEEE Intl. Wkshp. on Automatic Face and Gesture Recognition, Zurich, June, 1995.

Summary



Freeman and Weissman present a system for controlling a television (channel and volume) using "hand gestures." The user holds up a hand in front of a television/computer combo. The computer recognizes an open hand using image processing techniques. When an open hand is seen, a menu appears with controls (buttons/slider bar) for the channel and volume. The user moves the hand around and hovers it over the controls to activate them. To stop, the user closes the hand or otherwise removes it from the camera's FOV. An open palm is recognized using a cosine similarity metric (normalized correlation) between a pre-defined image of a palm and every possible offset within the image.
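The recognition boils down to brute-force normalized correlation of a palm template against every offset in the image, something like the sketch below (grayscale numpy arrays assumed, and the 0.7 acceptance threshold is a placeholder). Writing it out also makes it obvious why I suspect this is slow.

import numpy as np

def find_open_hand(image, template, thresh=0.7):
    """Slide a grayscale hand template over every offset of the image and
    return the best normalized-correlation score and its location.
    Brute force: O(image_size * template_size) per frame."""
    ih, iw = image.shape
    th, tw = template.shape
    t = template - template.mean()
    t_norm = np.linalg.norm(t) + 1e-9
    best_score, best_pos = -1.0, None
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            patch = image[y:y + th, x:x + tw]
            p = patch - patch.mean()
            score = float(np.dot(p.ravel(), t.ravel()) / (np.linalg.norm(p) * t_norm + 1e-9))
            if score > best_score:
                best_score, best_pos = score, (y, x)
    # best_score > thresh would count as "open hand present" at best_pos.
    return best_score, best_pos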

Discussion



Not in the mood to write decent prose, so here's a list.

  • Is natural language really that much better? First, it contains a lot of ambiguity that mouse/keyboard don't have. Second, you'd have just as many problems defining a vocabulary of commands using language as you would gestures, especially since there are so many words/synonyms/etc.

  • Their example of a 'complicated gesture' is a goat shadow puppet. Seriously? I think this is a little exaggerated and a lot ridiculous.

  • These aren't really gestures. It's just image tracking that boils down to nothing more than a mouse. What have you saved? Just buy 10 more remotes and glue them to things so you have one in every sitting spot and they can't be lost.

  • I don't know the image rec. research area, so I can't comment too much on their algorithm. But this seems like it would be super slow (taking all possible offsets) and have issues with scaling (what if the hand template is the wrong size, esp too small for the actual hand in the camera image).



BibTeX



@proceedings{freeman1995televisionGestures
,author="William T. Freeman and Craig D. Weissman"
,title="Television Control by Hand Gestures"
,booktitle="IEEE Intl. Wkshp. on Automatic Face and Gesture Recognition"
,address="Zurich"
,year="1995"
,month="June"
}

Wednesday, February 13, 2008

Marsh - Shape your imagination

Marsh, T.; Watt, A., "Shape your imagination: iconic gestural-based interaction," Virtual Reality Annual International Symposium, 1998. Proceedings., IEEE 1998 , vol., no., pp.122-125, 18-18 1998

Summary



Marsh and Watt present findings on a user study where they examine how gestures are used to describe objects in a non-verbal fashion. They describe how iconic gestures (those that immediately and clearly stand for something) fall into two camps: substitutive (hands match the shape or form of object) and virtual (outline or trace a picture of the shape/object).

For their study, they used 12 subjects of varying backgrounds. They had 15 shapes from two categories: primitive (circle, triangle, sphere, cube, etc.) and complex (table, car, French baguette, etc.). The shapes were written on index cards and presented in the same order to each user. The users were then told to describe the shapes using non-verbal communication. Of all the gestures, 75% were virtual. For the 2D shapes, 72% of people used one hand, while the 3D objects were all described with two hands. For the complex shapes, iconic gestures were often replaced or accompanied by pantomimic gestures (how the object is used) or deictic gestures (pointing to something). Some complex shapes were too difficult for users to express at all (4 failures for chair, 1 each for football, table, and baguette). They also discovered 2D is easier than 3D.

Discussion



I really liked this paper. While it was a little short, I think it was neat that they were able to break down the kinds of gestures that people made. This reminded me a lot of Alvarado et al.'s paper on the user study about how people draw. I think it's especially useful to see that if we want to do anything useful with haptics, we have to enable the users to use /both/ hands.

Some things:

  • How did they pick their shapes, especially the complex ones? I mean, come on, French baguette? Although, this is a really good example because it's friggin hard to mime.

  • They note that most of the complex objects are too difficult to express with iconic gestures alone. That's why sign languages aren't that simple to learn. Not everything can be expressed easily with just iconic gestures. This paper was good that it pointed this out and made it clear, even though it seems obvious. It also seems to drive the need for multi-modal input for complex recognition domains.

  • They remark that 3D is harder than 2D. Besides the fact that this claim is obvious and almost a bit silly to make, it does seem that there are 2D shapes that would be very difficult to express. For example: Idaho. I wonder if their comparison between 2D and 3D here is a fair one. Obviously adding another dimension to things is going to make it exponentially more difficult, but they're comparing things like circle to things like French baguette.

  • Finally, who decides if a gesture is iconic or not? Isn't this shaped by experience and perception?



BibTeX



@ARTICLE{658465,
title={Shape your imagination: iconic gestural-based interaction},
author={Marsh, T. and Watt, A.},
journal={Virtual Reality Annual International Symposium, 1998. Proceedings., IEEE 1998},
year={18-18 1998},
volume={},
number={},
pages={122-125},
keywords={computer graphics, graphical user interfaces, 3D computer generated graphical environments, 3D spatial information, human computer interaction, iconic gestural-based interaction, iconic hand gestures, object manipulation, shape manipulation, spatial information},
doi={10.1109/VRAIS.1998.658465},
ISSN={1}, }

Kim - Gesture Rec. for Korean Sign Language

Jong-Sung Kim; Won Jang; Zeungnam Bien, "A dynamic gesture recognition system for the Korean sign language (KSL)," Systems, Man, and Cybernetics, Part B, IEEE Transactions on , vol.26, no.2, pp.354-359, Apr 1996

Summary



Kim et al. present a system for recognizing a subset of gestures in KSL. They say KSL can be expressed with 31 distinct gestures, choosing 25 of them to use in this initial study. They use a Data-Glove, which gives 2 bend values for each finger, and a Polhemus tracker (normal 6 DOF) to get information about each hand.

They recognize signs using the following recipe (a rough sketch of the first two steps follows the list):

  1. Bin the movements along each axis (bins of width = 4 inches) to filter/smooth

  2. Vector quantize the movements of each hand into one of 10 "directional" classes that describe how the hand(s) is(are) moving.

  3. Feed the glove sensor information into a fuzzy min-max neural network and classify which of 14 postures it is, using a rejection threshold just in case it's not any of the postures

  4. Use the direction and posture results to say what the sign is intended to be
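
Here is a rough sketch of the first two steps. The 4-inch bin width comes from the paper; the particular set of direction classes (eight compass directions plus up/down/still) is my guess at a plausible codebook, since I'm not reproducing the paper's exact ten classes.

import numpy as np

def quantize_direction(positions, bin_width=4.0):
    """Rough sketch of steps 1-2: bin positions (in inches) to smooth jitter, then
    map each between-frame movement onto a small set of direction classes.
    The class set here is my stand-in, not the paper's exact ten classes."""
    binned = np.round(np.asarray(positions, dtype=float) / bin_width) * bin_width
    moves = np.diff(binned, axis=0)
    classes = []
    for dx, dy, dz in moves:
        if abs(dz) > max(abs(dx), abs(dy)):
            classes.append('UP' if dz > 0 else 'DOWN')
        elif dx == 0 and dy == 0:
            classes.append('STILL')
        else:
            # Snap the planar direction to the nearest of 8 compass directions.
            angle = np.arctan2(dy, dx)
            octant = int(np.round(angle / (np.pi / 4))) % 8
            classes.append(['E', 'NE', 'N', 'NW', 'W', 'SW', 'S', 'SE'][octant])
    return classes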



Discussion



They make the remark that many of the classes are not linearly separable. This is a problem in many domains. Support vector machines can sometimes do a very good job at separating data. I wonder why no one has used them so far. Probably because they're fairly complex.

I also like the idea of thinking of gestures as a signal. I don't know why, but this analogy has escaped me so far. There is a technique for detecting "interesting anomalies" in signals using PCA. I wonder if this would work for the segmentation problem?

How do they determine the initial position for the glove coordinates? If they get it wrong, all their measurements will be off and the vector quantization of their movements will probably fail. They should probably just skip this whole initial starting point thing and use change from the last position. Maybe that's what they really mean, but it's unclear.

Also, it seems like their method for filtering/smoothing the position/movement data by binning the values is a fairly hackish technique. There are robust methods for filtering noisy data that should have been used instead.

And finally for their results. They say 85%, which doesn't seem /too/ bad for a first try. But then they try to rationalize that 85% is good enough, saying that "the deaf-mute who [sic] use gestures often misunderstand each other." Well that's a little condescending, now, isn't it? And they also blame everything else besides their own algorithm, and "find that abnormal motions in the gestures and postures, and errors of sensors are partly responsible for the observed mis-classification." So you want things to work perfectly for you? You want life to play fair? News flash: if things were perfect, you would be out of the job and there would be no meaning or reason for the paper you just wrote. Things are hard. Deal with it. Life's not perfect, my sensors won't give me perfect data, and I can't draw a perfectly straight line by hand. That's not an excuse to not excel in your algorithm and account for those imperfections.

Also, how did they combine the movement quantized data (the ten movement classes) with the posture classifications? Postures were neural nets, not the combination, right?

BibTeX



@ARTICLE{485888,
title={A dynamic gesture recognition system for the Korean sign language (KSL)},
author={Jong-Sung Kim and Won Jang and Zeungnam Bien},
journal={Systems, Man, and Cybernetics, Part B, IEEE Transactions on},
year={Apr 1996},
volume={26},
number={2},
pages={354-359},
keywords={data gloves, fuzzy neural nets, pattern recognition, Korean sign language, data-gloves, dynamic gesture recognition system, fuzzy min-max neural network, online pattern recognition},
doi={10.1109/3477.485888},
ISSN={1083-4419}, }

Monday, February 11, 2008

Gesture descriptions for Trumpet Fingerings

I was going to do ASL fingerspelling as well, but others did it so I don't feel the need to repeat what they already said about it. Mine was basically exactly the same as theirs. So, instead, here's just the trumpet stuff.

trumpetFingeringGestures.pdf

Natural Language Descriptions for 5 easy ASL signs

Natural language descriptions of 5 common ASL signs
Joshua Johnston
Haptics
11 February, 2008

HELLO

Put the right hand into the shape of a “B”--all four fingers extended and placed together vertically from the palm, with the thumb bent and crossing the palm. Raise the b-hand and put the tip of the forefinger to your right temple. Move the b-hand away from the head to the right with a quick gesture.

(as if waving)

NAME


Put both hands into the sign for the letter “U”--the fore- and middle finger of each hand straight, extended from the palm, and touching, with the remaining fingers and thumb curled together in front of the palm (like a “2” or “scissors,” with the fingers together). Bring the u-hands in front of the body, fingers pointing parallel to the ground, in the shape of an “X.” Tap the right hand's extended fingers on top of the left hand's extended fingers twice.

(as if signing your name on the “X”)

NICE

Put all the fingers (and thumb) together and extended on both hands to form a flat surface. Put the left hand palm up in front of the body and hold it still. Take the right hand and put its palm to the palm of your left hand, then with a smooth motion slide the right hand down the fingers of and off the left hand.

(this sign also means “clean” and “pure”, as if wiping the dirt off one hand with the other)

MEET

Put both hands into the sign for the letter “D”--the forefinger extended vertically with the rest of the fingers and thumb curled in front of the palm. Bring the d-hands together in front of the body, touching the curled fingers together.

(d-hands represent people coming together)

SANDWICH

Put both hands together, palms flat and fingers/thumb extended and together. Bring the tips of the fingers up and touch them to your mouth.

(hands are the bread, and you're eating it)

Cassandra - POMDPs

Anthony Cassandra. "A Survey of POMDP Applications." Presented at the AAAI Fall Symposium, 1998. http://pomdp.com/pomdp/papers/index.shtml, 11 Feb, 2008.

Summary



Not much to say about gesture recognition, which is not surprising since POMDPs are used for artificial intelligence in the area of planning. Think of a robot that has a goal and only a limited visual range (can't see behind obstructions, etc.). A POMDP might be used in this situation to evaluate different actions to take based on the current state of things.

The paper does mention machine vision and gesture recognition. The context here is that the computer uses a POMDP to focus a camera and a fovea (high resolution area for fine-grained vision) on facial expressions, hand movements, etc. The fovea is important because it is limited, and the areas outside it either have a much lower resolution (to reduce computational burden) or cannot be seen at all (outside the FOV).

Discussion



I really don't think POMDPs can be used for our purposes in gesture recognition.

However, this is a nice paper if you want examples of how POMDPs can be used in multiple domains.

That is all.

BibTeX



@UNPUBLISHED{cassandra1998pomdps
,author={Anthony Cassandra}
,title={A Survey of POMDP Applications}
,year={1998}
,note={Presented at the AAAI Fall Symposium}
}

Song - Forward Spotting Accumulative HMMs

Daehwan Kim; Daijin Kim, "An Intelligent Smart Home Control Using Body Gestures," Hybrid Information Technology, 2006. ICHIT'06. Vol 2. International Conference on , vol.2, no., pp.439-446, Nov. 2006

Summary



Song and Kim present an algorithm for segmenting a stream of gestures and recognizing the segmented gestures. They take a sliding window of postures (observations) from the stream and feed them into an HMM system that has one model per gesture class, plus one "non-gesture" HMM. They say that a gesture has started if the max probability from one of the gesture HMMs is greater than the probability of the non-gesture HMM. They call this the competitive differential observation probability (CDOP), which is the difference between the max gesture probability and the non-gesture probability (positive means gesture, negative means non-gesture, and crossing 0 means starting/ending a gesture).

Once a gesture is observed to have started, they begin classifying the gesture segments (segmenting the segmented gesture). They feed the segments into the HMMs and get classifications for each segment. Once the gesture is found to have terminated (the CDOP drops below 0, i.e., the gesture stream becomes a non-gesture), they look at the classification results for all the segments and take a majority vote to determine the class for the whole gesture.

So we have a sliding window. Within that window, we decide a gesture starts and later see that it ends. Between the start and end points, we segment the gesture stream further. Say there are 3 segments. Then we'd classify {1}, {12}, and {123}. Pretend {1} and {123} were "OPEN CURTAINS" and {12} was "CLOSE CURTAINS." The majority vote, after the end of the gesture, would rule the gesture as "OPEN CURTAINS."
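In code, the spotting test and the vote look something like the sketch below. I'm assuming each model object exposes a log-likelihood score() method (an hmmlearn HMM would, for instance); the sliding-window plumbing and the sub-segmentation are left out.

from collections import Counter

def cdop(window, gesture_models, non_gesture_model):
    """Competitive differential observation probability for one sliding window:
    (best gesture log-likelihood) - (non-gesture log-likelihood).
    Positive means "inside a gesture", negative means "non-gesture";
    a sign change marks a start or end point."""
    best = max(m.score(window) for m in gesture_models.values())
    return best - non_gesture_model.score(window)

def vote_on_segments(segments, gesture_models):
    """Classify each accumulated segment ({1}, {1,2}, {1,2,3}, ...) and take
    a majority vote once the gesture has ended, as in the curtains example."""
    votes = [max(gesture_models, key=lambda g: gesture_models[g].score(seg))
             for seg in segments]
    return Counter(votes).most_common(1)[0][0]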

They give some results, which seem to show their automatic method performs better than a manual method, but it's not clear what the manual method is. They seem to get about 95% accuracy classifying 8 gestures made with the arms to open/close curtains and turn on/off the lights.

Discussion



So basically they just use a probabilistic significance threshold to say if a gesture has started or not, as determined by the classification of an observation as a non-gesture (like Iba's use of a wait state when recognizing robot gestures). So don't call it the CDOP. Call it a "junk" class or "non-gesture" class. They made it much harder to understand than it is.

When they give their results in Figure 5 and show the curves for manual segmentation, what the heck does \theta mean? This wasn't explained and makes their figure all but useless.

So this seems like a decent method for segmenting gestures...10 years ago. Iba had almost the exact same thing in his robot gesture recognition system, and I'm sure he wasn't the first. Decent results, I think (can't really interpret their graph), but nothing really noteworthy.

The only thing they do differently is the majority vote over the sub-segments of each segmented gesture. Yeah, confusing. I'm not sure how much this improves recognition, as they did not compare with/without it. It seems to me like it would only take up more computation time for gains that aren't that significant.

BibTeX


@ARTICLE{song2006forwardSpottingAccumulativeHMM,
title={An Intelligent Smart Home Control Using Body Gestures},
author={Daehwan Kim and Daijin Kim},
journal={Hybrid Information Technology, 2006. ICHIT'06. Vol 2. International Conference on},
year={Nov. 2006},
volume={2},
number={},
pages={439-446},
doi={10.1109/ICHIT.2006.253644},
ISSN={}, }

Friday, February 8, 2008

Ip - Cyber Composer

Ip, H.H.S.; Law, K.C.K.; Kwong, B., "Cyber Composer: Hand Gesture-Driven Intelligent Music Composition and Generation," Multimedia Modelling Conference, 2005. MMM 2005. Proceedings of the 11th International , vol., no., pp. 46-52, 12-14 Jan. 2005

Summary



Ip et al. describe their Cyber Composer system. The system uses rules of music theory and gesture recognition to allow users to create dynamic music with the use of hand gestures. The system allows for the control of tempo/rhythm, pitch, dynamics/volume, and even the use of a second instrument and harmony. With the help of various theory rules, including chord progression and harmonics, they assert their system can produce "arousing" musical pieces.

Discussion



Not much is given in the way of technical details (well, nothing, actually), but this is a good proof of concept. To me, the gestures seem intuitive, even if they are a little convoluted, since the same type of gesture may do many things based on context. This would be a good class project, I think, with a little more gesture recognition and control over the final product. Maybe more like a real conductor, where different instrument groups are located in space, and you can point at them and direct them to modify group dynamics. Who knows.

BibTeX



@ARTICLE{ip2005cyberComposer,
title={Cyber Composer: Hand Gesture-Driven Intelligent Music Composition and Generation},
author={ Ip, H.H.S. and Law, K.C.K. and Kwong, B.},
journal={Multimedia Modelling Conference, 2005. MMM 2005. Proceedings of the 11th International},
year={12-14 Jan. 2005},
volume={},
number={},
pages={ 46-52},
doi={10.1109/MMMC.2005.32},
ISSN={1550-5502 }, }

Wednesday, February 6, 2008

Li - Similarity Measure (SVD angular) for Stream Segmentation and Recognition

Li, C. and Prabhakaran, B. 2005. A similarity measure for motion stream segmentation and recognition. In Proceedings of the 6th international Workshop on Multimedia Data Mining: Mining integrated Media and Complex Data (Chicago, Illinois, August 21 - 21, 2005). MDM '05. ACM, New York, NY, 89-94. DOI= http://doi.acm.org/10.1145/1133890.1133901

Summary



Li and Prabhakaran propose a new gesture classification algorithm that is easily generalizable to many input methods. They use the SVD of a motion matrix. A motion matrix has columns that are the features of the data (like the joint measurements from a CyberGlove) and rows that are steps through time. SVD is a mathematical procedure that produces a set of singular vectors and values for the matrix; here they work with the eigenvectors and eigenvalues of M = A'A (where A is the motion matrix), which is equivalent and cheaper to compute. The top k eigenvectors are used (k is a parameter, with empirical evidence supporting k=6 as enough to perform well). To compare two motion matrices, corresponding eigenvectors are compared with their dot product (the angle between the vectors), weighted by the ratios of the eigenvalues for those vectors. A value of 0 means the matrices have nothing in common (all the eigenvectors are orthogonal); a value of 1 means the matrices have collinear eigenvectors. They call this kWAS: k weighted angular similarity (for k eigenvectors and the weighted dot product/cosine metric).
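Here is my numpy reading of kWAS. The way I combine the two matrices' eigenvalue ratios into weights is an approximation of the paper's formula, so treat this as a sketch rather than a faithful reimplementation.

import numpy as np

def kwas(A, B, k=6):
    """k-weighted angular similarity between two motion matrices A and B
    (rows = time steps, columns = features). Eigenvectors of M = A'A are the
    right singular vectors of A, and its eigenvalues are the squared singular values."""
    def top_eigs(X):
        vals, vecs = np.linalg.eigh(X.T @ X)        # ascending eigenvalues
        order = np.argsort(vals)[::-1]
        return vals[order], vecs[:, order[:k]]
    va, ua = top_eigs(A)
    vb, ub = top_eigs(B)
    # Weight each eigenvector pair by how dominant it is in both matrices (my approximation).
    weights = (va[:k] / va.sum() + vb[:k] / vb.sum()) / 2.0
    sims = np.abs(np.sum(ua * ub, axis=0))          # |cosine| between matching eigenvectors
    return float(np.sum(weights * sims))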

Their algorithm works as follows. Start with a library of gesture matrices P and compute the eigenvectors/values for each. Then watch the stream of incoming data, segmenting it with minimum length l and maximum length L, stepping through the stream with step size \delta. Take each candidate chunk of the stream as a matrix Q and compare it against all the P. The (Q, P) pairing that has the highest kWAS score is selected as the answer, and classification resumes from the end of the segment with the max score.

They report that their algorithm can recognize CyberGlove gestures (not clear if it's isolated patterns or streams) with 99% accuracy with k=3, and in motion capture data with 100% accuracy with k=4. These figures aren't clear as to what they mean, however.


Discussion



So their method isn't really for segmentation. They still just look at different sliding windows of data and pick one that works. It works well without the use of holding positions or neutral states, as many other systems impose on users to delineate gestures. However, Iba et al's system can do the same thing using hidden Markov models with a built in wait state.

However, as far as a new classification approach is concerned, this is a nice approach because it seems to give decent results and is not yet another HMM.

They never say how they pick delta. I wonder how different values affect accuracy / running time of the algorithm.

Some people might be concerned with the fact that once you do the eigenvectors, you lose temporal information. I can see where this would be a concern for some things. However, most of the time you can get good classification/clustering results without the need for perfect temporal information. It can even be the case that temporal information tends to confuse the issue, making things hard to compute and compare.

BibTeX



@inproceedings{1133901,
author = {Chuanjun Li and B. Prabhakaran},
title = {A similarity measure for motion stream segmentation and recognition},
booktitle = {MDM '05: Proceedings of the 6th international workshop on Multimedia data mining},
year = {2005},
isbn = {1-59593-216-X},
pages = {89--94},
location = {Chicago, Illinois},
doi = {http://doi.acm.org/10.1145/1133890.1133901},
publisher = {ACM},
address = {New York, NY, USA},
}

Hernandez-Rebollar - Accelerometers and Decision Tree for ASL

Hernandez-Rebollar, J.L.; Lindeman, R.W.; Kyriakopoulos, N., "A multi-class pattern recognition system for practical finger spelling translation," Multimodal Interfaces, 2002. Proceedings. Fourth IEEE International Conference on , vol., no., pp. 185-190, 2002

Summary



Rebollar et al. present a new algorithm for classification of ASL finger spelling letters (J and Z, the only letters that move, are statically signed at the ending posture of the gesture). They create their own glove so they don't have to sink a lot of money into the expensive options currently available. Their glove uses 5 accelerometers, one per finger, that measure in two axes. The y axis is aligned to point at the tip of each finger, and measures flexion and pitch. The x axis gives an idea about roll, yaw, and abduction.

They take the ten measurement values (two axes per finger, five fingers) and convert them to a 3D vector. The first dimension is the sum of the x-axis values, the second is the sum of the y-axis values, and the third is the y-axis value of the index finger, which they claim is adequate for describing the bentness of the palm.

The 3D vector is fed into a decision tree. For 21 of the 26 letters, with 5 signers doing 10 reps of each letter, they get 100% accuracy. For I and Y, they get 96%. For U, V, and R, the accuracy is 90%, 78%, and 96%.
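A sketch of the feature construction plus a stock scikit-learn decision tree, on made-up data. Which accelerometer row corresponds to the index finger is an assumption on my part, and this is a generic learned tree, not necessarily how the paper builds theirs.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def to_feature_vector(accel):
    """accel: (5, 2) array, one (x, y) accelerometer reading per finger,
    with the index finger assumed to be row 1. Collapses to the paper's 3D vector:
    sum of x values, sum of y values, and the index finger's y value."""
    accel = np.asarray(accel, dtype=float)
    return np.array([accel[:, 0].sum(), accel[:, 1].sum(), accel[1, 1]])

# Illustrative training on random filler; the real system used 5 signers x 10 reps per letter.
X = np.array([to_feature_vector(np.random.rand(5, 2)) for _ in range(100)])
y = np.random.choice(list('ABC'), size=100)
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([to_feature_vector(np.random.rand(5, 2))]))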

Discussion



Again, another paper where they sum all their values to get a global picture. This is a horrible idea, as fingers will mask each other. At least sum the squares of the values, so you can see if some are really high compared to others. Or, better yet, use the 10 dimensions for the decision tree. It's really not that hard.

It was nice to see something besides an HMM, and they do get pretty good results. However, I'm ready for J and Z to move.

I also like their hardware approach. Seems simple and a lot less expensive than dropping 10-30K on a CyberGlove.

BibTeX



@ARTICLE{rebollar2002multiClassFingerSpelling,
title={A multi-class pattern recognition system for practical finger spelling translation},
author={Hernandez-Rebollar, J.L. and Lindeman, R.W. and Kyriakopoulos, N.},
journal={Multimodal Interfaces, 2002. Proceedings. Fourth IEEE International Conference on},
year={2002},
volume={},
number={},
pages={ 185-190},
doi={10.1109/ICMI.2002.1166990},
}

Harling - Hand Tension for Segmentation

Philip A. Harling and Alistair D. N. Edwards. Hand tension as a gesture segmentation cue. In Philip A. Harling and Alistair D. N. Edwards, editors, Progress in Gestural Interaction: Proceedings of Gesture Workshop '96, pages 75--87, Springer, Berlin et al., 1997.

Summary



Harling and Edwards address the problem of segmenting gestures in a stream of data from a power glove. Their assumption is that when we purposefully want our hands to convey information, they will be tense. When the hand is "limp", the user is not trying to convey information.

Tension is measured by imagining rubber bands attached to the tip of each finger, one parallel to the x-axis and the other to the y-axis. The rubber bands have certain elastic moduli, and the tension in the system can be solved with physics equations. To get an idea of overall hand tension, the values for each finger are summed.
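Here is a toy version of the rubber-band model with made-up spring constants and rest positions: two ideal springs pull on each fingertip along the x- and y-axes, and the per-finger tensions get summed into one hand value (which is exactly the collapsing I complain about below).

import numpy as np

def finger_tension(tip_x, tip_y, k_x=1.0, k_y=1.0, rest_x=0.0, rest_y=0.0):
    """Two ideal springs pull the fingertip toward rest positions along the x and y
    axes; 'tension' is the magnitude of the resulting force. Spring constants and
    rest positions are placeholders, not values from the paper."""
    fx = k_x * (tip_x - rest_x)
    fy = k_y * (tip_y - rest_y)
    return float(np.hypot(fx, fy))

def hand_tension(fingertips):
    """fingertips: list of (x, y) tip positions. The paper sums the per-finger
    tensions into a single hand-tension value."""
    return sum(finger_tension(x, y) for x, y in fingertips)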

They evaluate their idea by examining two different sayings in British sign language: "My name" and "My name me". They find dips in the 'tension graph' between each gesture, and claim an algorithm could segment at these points of low tension.

Discussion



Seems pretty nice. It's good to have an idea of what we can do to solve the segmentation issue. However, I wonder if some gestures are performed with a "limp" hand. Their idea of tension is maximized when the finger is either fully extended or fully closed, so anything where the finger is halfway will not work. Also, perhaps you naturally stand with your hand clenched in your relaxed position, so non-gestures would be tense.

I don't like that they sum the tension in each finger to get a total hand tension. I think we need information per finger, otherwise it seems like you could miss fingers moving in ways that kept the tension at the same level.

Their testing was /not/ very thorough. Another poor results section.

BibTeX



@inproceedings{harling1996handTensionSegmentation
,author = "Philip A. Harling and Alistair D. N. Edwards"
,title = "Hand Tension as a Gesture Segmentation Cue"
,booktitle = "Gesture Workshop"
,pages = "75-88"
,year = "1996"
}

Lee - Interactive Learning HMMs

Lee, Christopher, and Yangsheng Xu. "Online, Interactive Learning of Gestures for Human/Robot Interfaces."

Okay, right off the bat, this paper has nothing to do with robots. Why put it in the title?

Summary



Lee and Xu present an algorithm for classifying gestures with HMMs, evaluating the confidence of each classification, and using correct classifications to update the parameters of the HMM. The user has to wait for a bit between gestures to aid in segmentation. To simplify the data, they use fast Fourier transforms (FFTs) on a sliding window of sensor data from the glove to collapse the window. They then feed the FFT results to vector quantization (using an off-line codebook generated with LBG) to collapse the vector to a one-dimensional symbol. The series of symbols is fed into HMMs, one per class, and the class with the highest Pr(O|model) is selected as the answer. The gesture is then folded into the training set for that HMM and the parameters are updated.

They also introduce a confidence measure for analyzing their system's performance, which is the log of the sum of the ratios of each incorrect HMM's probability for a gesture to the correct HMM's probability for that gesture. If a gesture is classified correctly, the correct HMM will have a higher probability than all the incorrect HMMs, so all the ratios will be < 1 and the log of their sum will tend to be < 0. If all the probabilities are about the same, the classifier is unsure: the ratios will all be around 1, meaning the log will be around 0. They show that, starting with one training example, they achieve high and confident classification after only a few testing examples are classified and used to update the system.
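The confidence measure is easy to write down from per-class log-likelihoods; here is a sketch (the dictionary-of-scores interface is mine, not theirs).

import numpy as np

def confidence(log_likelihoods, winner):
    """log_likelihoods: dict class -> log P(O | model). winner: the chosen class.
    Returns log( sum over other classes of P_other / P_winner ).
    Strongly negative = confident (the winner clearly dominates); near zero = unsure."""
    lw = log_likelihoods[winner]
    ratios = [np.exp(l - lw) for c, l in log_likelihoods.items() if c != winner]
    return float(np.log(sum(ratios)))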

However, they're only using a set of ASL letters that are "amenable to VQ clustering."

Discussion



I do like the idea of online training and updating of the model. However, after a few users you lose the benefit, so it's just better to have a good set of training data that's used offline before any recognition takes place, simplifying your system and reducing workload.

I don't like that you have to hold your hand still for a little bit between gestures. I would have liked to have seen a system like the "wait state" HMM system discussed in Iba et al., "An architecture for gesture based control of mobile robots." I'd like to see a better handle on the segmentation problem. They do mention using acceleration.

Their training set is too small and easy, picking things that are "amenable to VQ clustering", so I don't give their system much credit.

Monday, February 4, 2008

Chen - Dynamic Gesture Interface w/ HMMs

Chen, Qing.... "A Dynamic Gesture Interface for Virtual Environments Based on Hidden Markov Models." HAVE 2005

Summary



Chen et al. use hidden Markov models (HMMs) to classify gestures (their focus is a simple domain of three gestures). The algorithm captures the standard deviation of the different bend-sensor values on a glove, with the argument that by using the standard deviation they don't have to worry about segmentation. They feed the std data into HMMs and classify like that. Their three gestures are very simple and are used to control three axes of rotation for a virtual, 3D cube.
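For what it's worth, the feature extraction they describe amounts to something like the sketch below. The window and step sizes are my guesses, which is exactly the issue I gripe about in the discussion: you still need some window to compute a standard deviation over.

import numpy as np

def std_features(bend_readings, window=30, step=10):
    """bend_readings: (T, S) array of glove bend-sensor values.
    Emits one S-dimensional standard-deviation vector per sliding window --
    which, as the discussion notes, already implies a segmentation window."""
    feats = []
    for start in range(0, len(bend_readings) - window + 1, step):
        feats.append(bend_readings[start:start + window].std(axis=0))
    return np.array(feats)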

They give no recognition results.

Discussion



I'm not sure these guys are too well versed in machine learning. This paper is pretty weak. I'll just make a laundry list instead of trying to tie all my complaints together in prose.


  • They mention other approaches (Kalman filters, dynamic time warping, FSM) that have been used, but state they have "very strict assumptions." Okay, like what? Kalman filters and hidden Markov models pretty much do the exact same thing, so why will HMMs do better than Kalman filters?

  • They say (page 2, first par.) that gestures are noisy and even if a person does it the same way, it will still be different. Duh. Too bad. Measurements and data are noisy, just like everything in machine learning. Otherwise, you'd just look it up in a hash table and save yourself a lot of trouble.

  • It's the Expectation-MAXIMIZATION algorithm, not -Modification.

  • They claim to avoid the need for segmentation. Okay, then what are you computing the standard deviation of? You have to have some sort of window of points to do the calculations on. I suppose their assumption is they just get the gesture in a window, not half of one, and things happen by magic.



Weak paper. Do not want. Would not buy from seller again.