Wednesday, April 23, 2008

Eisenstein - discourse topic and gestural form

Jacob Eisenstein, Regina Barzilay, and Randall Davis. "Discourse Topic and Gestural Form." AAAI 2008.

Summary



The authors want to examine the relationship between gestures and meaning. They are looking for a correspondence between certain gestures and topics, irrespective of the "speaker" of the gesture. If some gestures are speaker independent and depend only on topic, this could be used to improve gesture recognition accuracy.

They set up a topic-author model. Gesture features are extracted from a series of conversations about different topics, in which the speaker gestures to accompany his speech. They model gesture features with normal distributions, and topic/speaker gesture distributions with multinomials drawn from Dirichlet distributions (the Dirichlet compound multinomial, or Polya, distribution). After learning the model parameters with Bayesian inference, they use statistical significance tests to determine that about 12% of all gestures belong to specific topics. Thus, if we have prior information about the topic (i.e., from the speech), we can use that contextual information to improve gesture recognition.
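To make the setup concrete, here is a minimal generative sketch of this kind of topic-author mixture in Python/NumPy. The cluster count, feature dimensionality, and Dirichlet hyperparameter are placeholder choices of mine, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

n_clusters = 8      # hypothetical number of gesture-form clusters
n_features = 4      # hypothetical gesture feature dimensionality
alpha = 0.5         # hypothetical Dirichlet concentration

# Each topic and each speaker gets its own multinomial over gesture clusters,
# drawn from a Dirichlet prior (the Dirichlet compound multinomial setup).
topic_dist = rng.dirichlet(alpha * np.ones(n_clusters))
speaker_dist = rng.dirichlet(alpha * np.ones(n_clusters))

# Each gesture cluster has a Gaussian over the observed gesture features.
cluster_means = rng.normal(0.0, 1.0, size=(n_clusters, n_features))

def sample_gesture(use_topic: bool) -> np.ndarray:
    """Draw one gesture: pick a cluster from the topic's or the speaker's
    multinomial, then draw features from that cluster's normal."""
    dist = topic_dist if use_topic else speaker_dist
    k = rng.choice(n_clusters, p=dist)
    return rng.normal(cluster_means[k], 1.0)

# A topic-driven gesture vs. a speaker-idiosyncratic one.
print(sample_gesture(use_topic=True))
print(sample_gesture(use_topic=False))
```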

Discussion



The paper's purpose is to look for a link between gestures and topic. They find one, but this isn't too surprising given their limited dataset. Furthermore, many of their videos (from which gestures and speech were extracted) were very limited in scope. My hypothesis is that given data with a more general scope, the percentage of topic-specific gestures would drop.

It's true that roughly 10% of word occurrences (covering something like 80% of the vocabulary; those numbers are from memory) in large corpora are topic specific and are called content-carrying, since they can identify the topic of a document. However, I don't think there are nearly that many distinct gestures, and there is a great deal more reuse of gestures across topics.

Monday, April 14, 2008

Chang - Feature selection and grasp

Lillian Y. Chang, Nancy S. Pollard, Tom M. Mitchell, and Eric P. Xing. "Feature selection for grasp recognition from optical markers." Proceedings of the 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems San Diego, CA, USA, Oct 29 - Nov 2, 2007.

Summary



The hand is instrumented with 31 optical markers that give (x, y, z) positions. Stepwise forward and backward selection is used to pick a reduced subset of these markers. Five markers give good accuracy, about 86%, compared to the 91% maximum accuracy of the full marker set.
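As a rough illustration of the stepwise forward-selection idea (not the authors' exact procedure), here is a greedy sketch in Python that adds one marker at a time, scoring each candidate subset with cross-validated kNN on toy data. The data, classifier, and stopping size of 5 are placeholders.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

n_markers, n_samples = 31, 200
X = rng.normal(size=(n_samples, n_markers * 3))   # toy (x, y, z) per marker
y = rng.integers(0, 6, size=n_samples)            # toy labels for 6 grasp types

def marker_columns(marker: int) -> list[int]:
    """Column indices of the (x, y, z) coordinates for one marker."""
    return [3 * marker, 3 * marker + 1, 3 * marker + 2]

def score(markers: list[int]) -> float:
    cols = [c for m in markers for c in marker_columns(m)]
    clf = KNeighborsClassifier(n_neighbors=5)
    return cross_val_score(clf, X[:, cols], y, cv=5).mean()

selected: list[int] = []
while len(selected) < 5:                           # stop at 5 markers
    remaining = [m for m in range(n_markers) if m not in selected]
    best = max(remaining, key=lambda m: score(selected + [m]))
    selected.append(best)
    print(f"selected markers: {selected}, CV accuracy: {score(selected):.3f}")
```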

Discussion



They use 6 grasp types; how easy or hard are these compared to the 14 types in Bernardin et al.?

SFFS and SBFS are locally optimal; what about +L-R or bidirectional selection?

Fels - Glove-Talk II

S. Sidney Fels and Geoffrey E. Hinton. "Glove-TalkII—A Neural-Network Interface which Maps Gestures to Parallel Formant Speech Synthesizer Controls." IEEE Transactions on Neural Networks, vol. 9, no. 1, January 1998.

Summary



The input devices are a CyberGlove with a Polhemus 6-DOF tracker, a Contact Glove to measure contacts between the fingers and thumb, and a foot pedal. Three neural networks extract speech parameters from the input devices and feed them to a parallel formant speech synthesizer. Hand height controls pitch; the pedal controls volume.

A V/C network determines whether the user is trying to make a vowel sound or a consonant sound. The inputs are finger flex values, there are 5 sigmoid feed-forward hidden units, and the output is the probability that the user is making a vowel sound. A vowel is specified by the user keeping all fingers unbent and the hand open.
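For reference, here's a tiny NumPy sketch of what a vowel/consonant decision network with 5 sigmoid hidden units looks like at inference time. The weights are random placeholders rather than trained values, and the number of flex inputs is an assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

n_flex = 10                          # assumed number of finger-flex inputs
W1 = rng.normal(size=(5, n_flex))    # 5 sigmoid hidden units
b1 = np.zeros(5)
W2 = rng.normal(size=(1, 5))
b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_vowel(flex: np.ndarray) -> float:
    """Probability that the current hand shape is a vowel gesture."""
    hidden = sigmoid(W1 @ flex + b1)
    return float(sigmoid(W2 @ hidden + b2)[0])

# An open hand (all flex values near zero) should map to a vowel once the
# network is actually trained; with random weights the output is arbitrary.
print(p_vowel(np.zeros(n_flex)))
```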

One network determines which vowel sound the user is trying to make. Vowel sounds are laid out by XY position in space, as measured by the Polhemus. An RBF network determines where the hand is and outputs the appropriate vowel parameters for the speech synthesizer.
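A minimal sketch of the RBF idea: Gaussian basis functions centered at hypothetical vowel positions in the XY plane, blended into synthesizer parameters. The centers, width, and two-dimensional "formant" outputs are invented for illustration, not taken from the paper.

```python
import numpy as np

# Hypothetical RBF centers: one per vowel, placed in the XY hand-position plane.
centers = np.array([[0.0, 0.0],   # "ah"
                    [1.0, 0.0],   # "ee"
                    [0.0, 1.0],   # "oo"
                    [1.0, 1.0]])  # "eh"
sigma = 0.3                        # assumed RBF width

# Hypothetical formant-parameter targets for each vowel (e.g., F1, F2 in Hz).
formants = np.array([[700.0, 1200.0],
                     [300.0, 2300.0],
                     [300.0,  900.0],
                     [550.0, 1800.0]])

def vowel_params(xy: np.ndarray) -> np.ndarray:
    """Blend formant targets by normalized Gaussian RBF activations."""
    d2 = np.sum((centers - xy) ** 2, axis=1)
    act = np.exp(-d2 / (2.0 * sigma ** 2))
    act /= act.sum()
    return act @ formants

print(vowel_params(np.array([0.2, 0.1])))  # near the "ah" center
```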

The last network looks at the Contact Glove data to determine which fingers are touching the thumb. Consonant phonemes are mapped to different hand configurations and pattern-matched by the network. The input is flex values; the output is consonant speech-synth parameters.

After 100 hours of training by one poor sap, who seems to have provided around 2000 input examples, he can produce "intelligible and somewhat natural sounding" speech, with the added bonus that he "finds it difficult to speak quickly, pronounce polysyllabic words, and speak spontaneously."

Discussion



First, a caveat: this is a neat idea. I've not seen gesture recognition applied to something like this, and as far as the idea goes, I'd give it an 8.5 / 10. It's also important to remember that humans, using their mouth parts and vocal tract, take what...5 years?...to learn how to produce decent speech. So of course something like this will come with a high training cost.

Second, the execution is poor. The system is far too complicated, with all the pedals and hand-wavy motions. One obvious simplification is to remove the second glove (the Contact Glove) completely; the authors don't really say what it's used for, and it seems like it's not used for much, especially if the pedal can control stops, etc. The vowel and consonant networks are basically just performing a nearest-neighbor lookup. Why not do that and make things much simpler? Perhaps because the networks provide function fitting and interpolation, smoothly blending the speech parameters as the hand moves from one sound to another, but I think nearest neighbor would work well.
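Here's the kind of nearest-neighbor lookup I have in mind, sketched in Python: store (hand configuration, synthesizer parameters) pairs and return the parameters of the closest stored configuration, optionally averaging a few neighbors for smoothness. The example data is made up.

```python
import numpy as np

# Hypothetical training set: hand-configuration vectors and the
# synthesizer parameters recorded for each.
hand_configs = np.random.default_rng(2).normal(size=(50, 10))
synth_params = np.random.default_rng(3).normal(size=(50, 4))

def lookup(config: np.ndarray, k: int = 3) -> np.ndarray:
    """Average the synth parameters of the k nearest stored hand shapes."""
    dists = np.linalg.norm(hand_configs - config, axis=1)
    nearest = np.argsort(dists)[:k]
    return synth_params[nearest].mean(axis=0)

print(lookup(hand_configs[0]))
```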

There are also ways to compute the centers and variances for the RBFs in their networks from the training data (e.g., by clustering the inputs); there's no need for hand-picked or hard-coded values.

So if the idea gets 8.5/10, their execution gets a 3/10.

Wednesday, April 9, 2008

Kim - RFID target tracking

Kim, Myungsik, et al. "RFID-enabled target tracking and following with a mobile robot using direction finding antennas."

Summary



The authors propose a system that allows a robot to obtain the direction of a target, stationary or mobile, and follow or go to it using RFID. The target carries an RFID transponder. The robot has two antennae, mounted perpendicular to each other on a motor mount so they can rotate independently of the robot. The antennae pick up different signals from the transponder, and the system can compute direction and distance from the signal intensity and the signal-strength ratio. By rotating the antenna array separately from the robot, they avoid the problem of the robot freaking out in environments densely populated with obstacles: it can average the signals over time as the array rotates, then make a decision after the rotation.
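As a toy illustration only (not the authors' actual estimator), here is the basic amplitude-ratio idea for two perpendicular directional antennas: assuming each antenna's received amplitude falls off with the cosine of the angle from its boresight, the ratio of the two amplitudes gives a bearing estimate, and averaging over a rotation smooths the noise.

```python
import numpy as np

rng = np.random.default_rng(4)

def simulate_amplitudes(bearing_rad: float, noise: float = 0.05) -> tuple[float, float]:
    """Toy model: each antenna's amplitude follows the cosine of the angle
    from its boresight, with the two boresights 90 degrees apart."""
    a_x = abs(np.cos(bearing_rad)) + rng.normal(0, noise)
    a_y = abs(np.sin(bearing_rad)) + rng.normal(0, noise)
    return a_x, a_y

def estimate_bearing(samples: list[tuple[float, float]]) -> float:
    """Average noisy amplitude-ratio bearing estimates over a rotation."""
    estimates = [np.arctan2(a_y, a_x) for a_x, a_y in samples]
    return float(np.mean(estimates))

true_bearing = np.deg2rad(35.0)
samples = [simulate_amplitudes(true_bearing) for _ in range(20)]
print(np.rad2deg(estimate_bearing(samples)))   # roughly 35 degrees
```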

Results: it can follow stuff.

Discussion



1) What is the system latency?

2) How well does it work "in real life" with a bunch of obstacles?

3) To use this for hand tracking, we'd put an RFID transponder on the hand and let the computer track it. How accurate is it? The authors do say the signal ratio is not great for accuracy ("This makes it difficult to precisely estimate the DOA directly from the ratio") because of noise. Is it centimeter/inch accurate, or is it crappy like the P5 glove? Is the best we can hope for a "your hand is over there somewhere"?

Monday, April 7, 2008

Brashear - ASL game

Brashear, Helene, et al. "American sign language recognition in game development for deaf children." ASSETS 2006.

Summary



Two parts: 1) a Wizard of Oz game for helping deaf kids of hearing parents (who presumably can't sign) learn sign language, and 2) a recognition system for ASL words/sentences to automate the game's feedback.

The recognition system uses cameras, a colored glove, and accelerometers attached to the glove. The glove is colored to help with image segmentation and hand tracking. Data is segmented at the sentence level with "push to sign" (click the mouse to start, click again to end). Images are converted to HSV histograms, which are enhanced with filtering. Tracking is assisted by normalizing the HSV model from the new values and weighted old values (giving more mass to the area where the hand was in the last frame). The features are the x, y, z accelerometer values plus vision data: change in the (x, y) center position of the hand, lengths of the major/minor axes, eccentricity, orientation angle, and direction of the major axis as an (x, y) offset. Data is classified with HMMs using GT2K. With random 90/10 holdout splits repeated 100 times (5 kids), they achieve 86% word accuracy and 61% sentence accuracy on average for their user-independent models.
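The temporally weighted color model can be sketched compactly. Below is a NumPy-only illustration that blends each new frame's hue/saturation histogram with the running model; the bin counts and blending weight are placeholders, and the glove-pixel extraction step is assumed to exist upstream.

```python
import numpy as np

H_BINS, S_BINS = 30, 32
ALPHA = 0.7   # assumed weight on the new frame's histogram

def hsv_histogram(hsv_pixels: np.ndarray) -> np.ndarray:
    """2D hue/saturation histogram, normalized to sum to 1."""
    hist, _, _ = np.histogram2d(hsv_pixels[:, 0], hsv_pixels[:, 1],
                                bins=[H_BINS, S_BINS],
                                range=[[0, 180], [0, 256]])
    return hist / max(hist.sum(), 1.0)

def update_model(old_hist: np.ndarray, new_pixels: np.ndarray) -> np.ndarray:
    """Blend the new frame's histogram with the old model, so the region
    where the hand was last frame keeps influencing the color model."""
    return ALPHA * hsv_histogram(new_pixels) + (1.0 - ALPHA) * old_hist

# Usage sketch with random stand-in pixels for the glove region.
model = hsv_histogram(np.random.default_rng(5).uniform([0, 0], [180, 256], (500, 2)))
model = update_model(model, np.random.default_rng(6).uniform([0, 0], [180, 256], (500, 2)))
print(model.shape)   # (30, 32)
```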

Discussion



Decent word accuracy. I think their HMM sentence accuracy was hurt by the fact that they did not have much training data. With more data, and with something a little more robust than GT2K, they might be able to do better. I don't like how they tried to pass off user-dependent results, since those are pretty worthless when you have to train per user. With user-dependent models, you could probably use something akin to kNN and get close to 100% accuracy, since a user probably doesn't vary /too/ much from one instance to another.

Ogris - Ultrasonic and Manipulative Gestures

Ogris, Georg, et al. "Using ultrasonic hand tracking to augment motion analysis based recognition of manipulative gestures." ISWC 2005.

Summary



The goal is to augment a motion-analysis-based recognition system with an ultrasonic positioning system to determine what action is being performed on which tool/object in a workshop-like setting. They look at model-based classification (a sequence of data frames) with left-right HMMs, and at frame-based classification using decision trees (C4.5) and kNN. They also examine methods of using the ultrasonic data to constrain the plausible classification results: classify to get a ranked list, then pick the highest-ranked class that is plausible given the ultrasonic position. If none is plausible enough, the ultrasonic data is assumed to be bad and the most likely classification result is chosen.
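A small sketch of that plausibility filter in Python. The zone-to-action plausibility table, threshold, and ranked list are invented to show the control flow, not taken from the paper.

```python
# Hypothetical plausibility of each action given the ultrasonically
# estimated hand location (zone of the workshop).
PLAUSIBILITY = {
    "vise":      {"open_vise": 0.9, "turn_screw": 0.2, "saw_wood": 0.05},
    "workbench": {"open_vise": 0.1, "turn_screw": 0.7, "saw_wood": 0.8},
}
THRESHOLD = 0.3   # assumed cutoff below which ultrasonic data is distrusted

def fuse(ranked_actions: list[str], zone: str) -> str:
    """Pick the highest-ranked action that is plausible in this zone;
    if none passes the threshold, fall back to the classifier's top choice."""
    for action in ranked_actions:
        if PLAUSIBILITY.get(zone, {}).get(action, 0.0) >= THRESHOLD:
            return action
    return ranked_actions[0]

# The classifier ranks "open_vise" first, but the hand is at the workbench,
# so the fused decision falls through to "turn_screw".
print(fuse(["open_vise", "turn_screw", "saw_wood"], zone="workbench"))
```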

Using ultrasound alone, they get 59% accuracy with C4.5 and 60% with kNN. Classifying frames of data with kNN gives 84%. The HMMs only reach 65%, due to a lack of training data and the longer, unstructured gestures. Using the plausibility analysis, frame-based accuracy increases to 90%.

Discussion



I like that they use ultrasonics to get position data to help improve classification accuracy, but this doesn't seem like a groundbreaking addition. They just run a bunch of different classifiers and, gasp, find that contextual information (the ultrasonic data) can improve classification accuracy.

Decent, but nothing groundbreaking or surprising.

Sawada - Conducting and Tempo Prediction

Sawada, Hideyuki, and Shuji Hashimoto. "Gesture recognition using an acceleration sensor and its application to musical performance control." Electronics and Communications in Japan 80:5, 1997.

Summary



They use accelerometers and gyroscopes to get data from the moving hand and compute 2D acceleration vectors in the XY, XZ, and YZ planes. One feature is the sum of changes in acceleration, another is the rotation of the acceleration vector, and a third is the aspect ratio of the two components of each 2D acceleration vector (i.e., which component is larger). Eight more features give the distribution of acceleration over eight principal directions separated by pi/4. These 11 features are computed for each of the three planes, giving 33 features per gesture. The mean and standard deviation of each feature are computed, and a gesture is classified as the class with the lowest weighted error (the sum over features of the squared difference from the class mean, divided by the standard deviation).
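That classification rule is essentially a variance-weighted nearest-mean classifier. A short sketch, with toy data standing in for the 33-dimensional gesture features, assuming the weighted error is exactly the sum of squared deviations from the class mean divided by the class standard deviation, as described above.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy training data: 20 examples per gesture class, 33 features each.
train = {g: rng.normal(loc=g, scale=1.0, size=(20, 33)) for g in range(4)}

# Per-class mean and standard deviation of each feature.
stats = {g: (x.mean(axis=0), x.std(axis=0) + 1e-9) for g, x in train.items()}

def classify(features: np.ndarray) -> int:
    """Assign the class with the lowest weighted error."""
    def weighted_error(g: int) -> float:
        mean, std = stats[g]
        return float(np.sum((features - mean) ** 2 / std))
    return min(stats, key=weighted_error)

print(classify(rng.normal(loc=2, scale=1.0, size=33)))   # most likely class 2
```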

They look at the data for maxima in acceleration, which mark the places where a conductor changes direction, i.e., the tempo beats. To smooth the computer's performance in the face of changing or noisy tempo beats from the human, the system predicts the timing of the next beat (linear prediction). A parameter controls how much the system follows the human versus how much it smooths out noisy tempo beats.
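A simplified sketch of that follow-versus-smooth blend (not the paper's exact predictor): the predicted next beat interval is a weighted combination of the most recent observed interval and the previous prediction, with one parameter controlling how tightly the system tracks the human. The parameter value and beat times below are made up.

```python
# BETA = 1.0 follows the conductor exactly; BETA = 0.0 ignores new beats
# and keeps a steady tempo.
BETA = 0.6

def predict_intervals(beat_times: list[float]) -> list[float]:
    """Blend each observed beat interval with the running prediction."""
    intervals = [t1 - t0 for t0, t1 in zip(beat_times, beat_times[1:])]
    predicted = intervals[0]
    predictions = []
    for observed in intervals[1:]:
        predicted = BETA * observed + (1.0 - BETA) * predicted
        predictions.append(predicted)
    return predictions

# Noisy human beats around a 0.5 s period.
beats = [0.00, 0.52, 0.98, 1.55, 2.01, 2.49]
print(predict_intervals(beats))
```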

Discussion



They don't really explain their features well. Furthermore, they spend a whole section on the rotation feature and then say they don't use it. Well, big deal then; why list rotation as a feature at all?

They're not doing gesture recognition, just marking tempo beats using changes in acceleration. They don't need 33 features for this; they need 3: acceleration in X, Y, and Z. The rest are just functions of those. They can predict tempo fairly accurately, but I'm not that impressed.