Monday, March 31, 2008

Mantyjarvi - Accelerometer DVD HMMs

Mantyjarvi, Jani, Juha Kela, Panu Korpipaa, and Sanna Kallio. "Enabling fast and effortless customisation in accelerometer based gesture interaction." MUM 2004.

Summary



Take accelerometer data. Segment the gesture by holding a button down during movement. Resample the gesture to 40 frames. Vector quantize the 40 3D points with k-means against a codebook of size 8, giving a sequence of 40 codewords. Feed that sequence into an ergodic, 5-state HMM and classify. Train the HMMs until the percent difference in log likelihood is below a threshold.
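Here's a minimal sketch of that pipeline, assuming scikit-learn for k-means and hmmlearn's CategoricalHMM for the discrete-emission HMM (older hmmlearn versions call it MultinomialHMM). The train_gestures dictionary and the convergence threshold are made up for illustration.

# Minimal sketch of the Mantyjarvi-style pipeline; library choices are assumptions.
import numpy as np
from sklearn.cluster import KMeans
from hmmlearn import hmm

def resample(gesture, n_frames=40):
    """Linearly resample a (T, 3) accelerometer gesture to n_frames samples."""
    t_old = np.linspace(0.0, 1.0, len(gesture))
    t_new = np.linspace(0.0, 1.0, n_frames)
    return np.column_stack([np.interp(t_new, t_old, gesture[:, d]) for d in range(3)])

# train_gestures: dict mapping gesture label -> list of (T, 3) arrays (assumed to exist)
resampled = {g: [resample(x) for x in xs] for g, xs in train_gestures.items()}

# Codebook of size 8 learned with k-means over all 3D samples.
all_points = np.vstack([x for xs in resampled.values() for x in xs])
codebook = KMeans(n_clusters=8, n_init=10).fit(all_points)

def quantize(gesture40):
    """Turn a resampled gesture into a length-40 sequence of codeword indices."""
    return codebook.predict(gesture40).reshape(-1, 1)

# One ergodic 5-state discrete HMM per gesture class; tol stops training once
# the improvement in log likelihood drops below a threshold.
models = {}
for g, xs in resampled.items():
    seqs = [quantize(x) for x in xs]
    X = np.vstack(seqs)
    lengths = [len(s) for s in seqs]
    models[g] = hmm.CategoricalHMM(n_components=5, n_iter=100, tol=1e-2).fit(X, lengths)

def classify(gesture):
    seq = quantize(resample(gesture))
    return max(models, key=lambda g: models[g].score(seq))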

Need more training data? Augment some of the training examples you do have with noise, either uniformly or normally distributed. A signal-to-noise ratio of about 3 works best for Gaussian noise and about 5 for uniform noise, and both slightly increase accuracy when used to generate extra training examples. Accuracy increases with more training examples. They get about 98% accuracy on their easy data set.
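A rough sketch of that augmentation step, assuming SNR is meant as a power ratio; the helper name and defaults are mine.

# Add noise to a real example, scaled to hit a target signal-to-noise ratio.
import numpy as np

def augment(gesture, snr=3.0, kind="gaussian", rng=np.random.default_rng()):
    """Return a noisy copy of a (T, 3) gesture at roughly the requested SNR."""
    signal_power = np.mean(gesture ** 2)
    noise_power = signal_power / snr
    if kind == "gaussian":
        noise = rng.normal(0.0, np.sqrt(noise_power), gesture.shape)
    else:  # uniform noise with the same power: variance of U(-a, a) is a**2 / 3
        a = np.sqrt(3.0 * noise_power)
        noise = rng.uniform(-a, a, gesture.shape)
    return gesture + noise

# e.g. double the training set with noisy copies of each real example
extra = [augment(x, snr=3.0) for x in real_examples]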

Discussion



Another paper that uses a ridiculously easy gesture set with powerful hidden Markov models. I think Rubine or $1 would do just as well, without the complexity of HMMs.

I do like the idea of generating new training examples by adding artificial noise. This can be useful when you don't have a lot of training data to begin with. However, I don't like the way they did it. They should be learning the parameters of their distributions from the real data: using the real training examples, estimate what the mean and covariance should be, then sample those distributions to get new training examples, rather than adding noise to a real training example (which will make outliers even worse). Also, it's not clear there is any real advantage to Gaussian over uniformly distributed noise. In Fig 6, Gaussian seems to do better at low SNR and uniform better at high SNR. And in Fig 7, the results are all over the place. Are the differences in accuracy statistically significant?
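Here's what I mean, sketched in numpy: fit a Gaussian to the real (resampled and flattened) examples of a gesture and sample fresh examples from it. Variable names are hypothetical, and with only a few real examples you'd probably need a diagonal or shrunken covariance.

# Sample synthetic training examples from a Gaussian fit to the real ones.
import numpy as np

def synthesize(real_examples, n_new=10, rng=np.random.default_rng()):
    X = np.array([x.ravel() for x in real_examples])   # each example -> 120-d vector (40 x 3)
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])  # regularize for stability
    samples = rng.multivariate_normal(mean, cov, size=n_new)
    return [s.reshape(-1, 3) for s in samples]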

Wobbrock - $1

Wobbrock et al - $1 Recognizer

Sunday, March 30, 2008

Lieberman - TIKL

Lieberman, Jeff and Cynthia Breazeal. "TIKL: Development of a wearable vibrotactile feedback suit for improved human motor learning."

Summary



1) The teacher puts on the suit, which has VICON sensors and built-in vibrating doodads.
2) The teacher performs a gesture and the system is trained on it.
3) The suit is given to the student and adjusted, etc.
4) The student performs the gesture, and the suit vibrates to correct errors: the difference in joint angles between teacher and student, multiplied by a coefficient controlling the amount of feedback (see the sketch below). It can cue rotations, bends, etc.
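My reading of that feedback rule, as a sketch; the joint representation, gain value, and output format are invented for illustration.

# Vibration intensity proportional to the teacher/student joint-angle error.
def vibration_commands(teacher_angles, student_angles, gain=0.5):
    """Map per-joint angle errors (dicts of joint -> radians) to motor commands."""
    commands = {}
    for joint, target in teacher_angles.items():
        error = target - student_angles[joint]
        # sign tells which way to cue (rotate/bend one direction or the other)
        commands[joint] = {"direction": 1 if error > 0 else -1,
                           "intensity": min(1.0, gain * abs(error))}
    return commands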


Blah blah, graphs. Overall error and training time are both reduced when using the suit compared to training without it.

Discussion



Admitted flaws: the cost of VICON and the bulkiness/hassle of the half-body suit. The idea here is //REALLY// neat, using vibration feedback to correct gestures. I'm just not that into it because there aren't any algorithms for a machine learning person like me to dig into.

Lee - Neural Network Taiwanese Sign Language

Lee, Yung-Hui, and Cheng-Yueh Tsai. "Taiwan sign language (TSL) recognition based on 3D data and neural networks." Expert Systems with Applications (2007). doi:10.1016/j.eswa.2007.10.038

Summary



Vision-based neural network posture recognition (20 static postures). Manual segmentation is used, and the recording is rigged to capture nearly perfect postures. The hands/fingers are tracked by a VICON system (3D coordinates) and filtered. The features are the distances between different landmarks on the hand, normalized to account for varying hand sizes. The features are fed into a neural network with 2 hidden layers of 250 units each, trained for 3000 epochs or until the root mean squared error drops below 0.0001. About 95% accuracy.
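A rough sketch of that feature/classifier setup, using scikit-learn's MLPClassifier as a stand-in for whatever network toolkit the authors actually used (its tol stopping criterion is on the training loss, not RMSE). The landmark layout and the train/test variables are assumptions.

# Normalized pairwise landmark distances fed to a two-hidden-layer network.
import numpy as np
from itertools import combinations
from sklearn.neural_network import MLPClassifier

def posture_features(landmarks):
    """landmarks: (n_landmarks, 3) 3D marker positions for one static posture."""
    dists = np.array([np.linalg.norm(landmarks[i] - landmarks[j])
                      for i, j in combinations(range(len(landmarks)), 2)])
    return dists / dists.max()          # normalize to account for hand size

# X_train, y_train, X_test, y_test: feature vectors and labels (assumed to exist)
clf = MLPClassifier(hidden_layer_sizes=(250, 250), max_iter=3000, tol=1e-4)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))        # evaluate on held-out data, not the training set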

Discussion



Holy overtraining, Batman! That's a lot of hidden units! Especially for a problem this set up to be easy: static postures, practiced until they were perfect. Recognition should be close to perfect. Just do template matching. Also, don't include your training data in your test set.

Your features must really suck if you can't get closer to 100% on this problem (like > 99%). Even if you do pixel-by-pixel template matching, you should get pretty darn close to 99%. Heck, even handwritten digit recognition is close to 100%.

Patwardhan - Predictive EigenTracker

Patwardhan, K. S. and S. D. Roy. "Hand gesture modelling and recognition involving changing shapes and trajectories, using a Predictive EigenTracker." Pattern Recognition Letters 28 (2007), 329--34.

Summary



The authors seek to recognize dynamic hand gestures with changing shape as well as motion. They use principal components analysis (PCA) to get an eigenspace representation of the objects they wish to track (hands). Within the eigenspace, particle filtering is used to predict where the eigen-hands (hand images projected into eigenspace) will appear next. Skin color and motion cues are used to initialize the system automatically.

The EigenTracker is used to segment the hand motions (second paragraph of section 3) when "a drastic change in the appearance of the gesticulating hand, caused by the change in the hand shape, results in a large reconstruction error. This forces an epoch change, indicating a new shape of the gesticulating hand." The segments are used to create shape/motion pairs for the gesture. Trajectories are modeled with linear regression (least-squares linear approximation).
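A sketch of that epoch-change test: project the hand patch onto the eigenspace, reconstruct it, and start a new epoch when the reconstruction error jumps. The single fixed PCA basis, ERROR_THRESHOLD, and variable names are simplifications; the actual tracker updates its eigenspace as it goes.

# Segment a gesture into epochs via eigenspace reconstruction error.
import numpy as np
from sklearn.decomposition import PCA

pca = PCA(n_components=5).fit(hand_patches)      # hand_patches: (n, d) flattened hand images

def reconstruction_error(patch):
    coeffs = pca.transform(patch.reshape(1, -1))
    recon = pca.inverse_transform(coeffs)
    return np.linalg.norm(patch - recon.ravel())

epochs, current = [], []
for patch in gesture_patches:                    # frames of one gesture, assumed available
    if current and reconstruction_error(patch) > ERROR_THRESHOLD:
        epochs.append(current)                   # drastic appearance change -> new epoch
        current = []
    current.append(patch)
epochs.append(current)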

The tracked hand gestures are modeled as sequences of shape/movement pairs. Each class model is trained to get a mean gesture and covariance (Gaussian models), and a test gesture is given the label of the model with the smallest Mahalanobis distance.
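The classification step, sketched with numpy; the feature extraction and variable names are assumed.

# Per-class Gaussian models and nearest-Mahalanobis classification.
import numpy as np

def train_class_models(features_by_class):
    models = {}
    for label, X in features_by_class.items():            # X: (n_examples, d)
        mean = X.mean(axis=0)
        cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        models[label] = (mean, np.linalg.inv(cov))
    return models

def classify(x, models):
    def mahalanobis(mean, cov_inv):
        d = x - mean
        return np.sqrt(d @ cov_inv @ d)
    return min(models, key=lambda lbl: mahalanobis(*models[lbl]))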

5 eigenvectors are used in PCA to capture 90% of the variance. Each gesture is split into 2 epochs. Using Mahalanobis distance, they get 100% classification accuracy.


Discussion



They test with their training data, so this is crap. Also, their dataset is extremely simple, with very distinct, well-defined shape/trajectory patterns. Their background and image tracking are very clean (not a lot of noise) and too easy as well. They say their data is kept easy so they can establish an optimal upper bound on classification accuracy... which turns out to be 100%. So, um, no duh? I'm going to make something impossible to classify and prove the lower bound is 0% (or at most 1/n, a random guess), sound good?

That said, I do like the way they use PCA to simplify the data and particle filtering to both track the hand and segment epochs. It's just their data sets that leave me feeling unimpressed.

Kratz - Wiizards

Kratz, Louis, Matthew Smith, and Frank J. Lee. "Wiizards: 3D Gesture Recognition for Game Play Input." FuturePlay 2007.

Summary



So basically you have a Wii remote that takes {x, y, z} readings every so often and generates a sequence of these samples. A hidden Markov model is trained on the sequences and then used to classify a gesture (the model with the maximum Viterbi probability wins). As you increase the number of HMM states and the number of training examples, accuracy increases. Without user-specific data, you get around 50% accuracy regardless. As you increase the number of states, the system slows down.
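A sketch of that scheme using hmmlearn's GaussianHMM as a stand-in for the authors' implementation: one HMM per gesture, and the test gesture takes the label of the model with the highest Viterbi log-probability. Variable names and the state count are placeholders.

# One continuous HMM per gesture class, classification by best Viterbi score.
import numpy as np
from hmmlearn import hmm

def train_models(sequences_by_gesture, n_states=8):
    models = {}
    for name, seqs in sequences_by_gesture.items():        # seqs: list of (T, 3) arrays
        X = np.vstack(seqs)
        lengths = [len(s) for s in seqs]
        models[name] = hmm.GaussianHMM(n_components=n_states, n_iter=50).fit(X, lengths)
    return models

def classify(models, seq):
    # decode() runs Viterbi and returns (log probability of the best path, state sequence)
    return max(models, key=lambda name: models[name].decode(seq)[0])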

Discussion



The application is neat, but all their results are of the "Duh" type. The game is neater than the implementation details, since you can combine spells and stuff for different effects.

How do they do segmentation?

Wednesday, March 19, 2008

Kato - Hand Tracking w/ PCA-ICA Approach

Kato, Makoto, Yen-Wei Chen, and Gang Xu. "Articulated Hand Tracking by PCA-ICA Approach." In Proceedings of the 7th International Conference on Automatic Face and Gesture Recognition (FGR'06).

Summary



Kato et al. seek a way to represent hand data that is easier to handle. The problem they present is that hand-tracking data has too many dimensions to work with feasibly. They take motion data (bending each finger down to touch the palm) and split it into 100 time instants, with each instant containing bend data for 20 different joints in the hand. So each gesture (the whole range of motion) is a 2000-dimensional data vector (100 20-D vectors concatenated together).

They try feature extraction with both PCA and ICA. They say ICA is better because it can extract the independent movement of individual fingers, whereas with PCA the extracted components mix the movements of several fingers. Then they mention hand tracking using particle filtering, where the next position (?) of the hand is estimated from its current position.
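The two feature extractors they compare, sketched with scikit-learn; the number of components is a guess, and motions is the (n_motions, 2000) matrix described above.

# PCA vs. ICA feature extraction on concatenated joint-angle trajectories.
from sklearn.decomposition import PCA, FastICA

pca = PCA(n_components=5)
pca_features = pca.fit_transform(motions)     # directions of maximum variance

ica = FastICA(n_components=5, max_iter=1000)
ica_features = ica.fit_transform(motions)     # statistically independent components,
                                              # claimed to isolate individual finger motions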

Discussion



This paper has no clear purpose. I don't understand what the authors are trying to tell me. Because of that, I don't have much to offer that's not a rant.

PCA is not supposed to give you "feasible" hand positions. It tells you the directions of the highest variance.

Monday, March 17, 2008

Jenkins - ST-ISOMAP

Jenkins, O.C. and Mataric, M.J. "A spatio-temporal extension to Isomap nonlinear dimension reduction." ICML 2004.

Summary



Jenkins and Mataric present an extension to ISOMAP that takes temporal data into consideration when constructing manifolds. ISOMAP finds low-dimensional embeddings of data lying on a manifold in a high-dimensional space, using geodesic distances and multi-dimensional scaling. ST-ISOMAP is an extension that uses temporal information: items that are close to each other temporally have their spatial distances reduced.

The idea is that in some domains, like movement of an arm, things that are close together spatially might be quite different. For example, an arm moving one way might be very different from an arm moving the other way. The temporal differences between these gestures would be high because you'd arrive at the same spatial location via different temporal paths (sequences of arm locations). Likewise, seemingly different spatial locations might be very similar, and only 'close' to each other in terms of temporal data (arm movements in the same direction but at different heights off the ground). ST-ISOMAP tries to capture these things.
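For reference, plain ISOMAP is a one-liner with scikit-learn; ST-ISOMAP itself isn't in the library, so its temporal tweak is only described in a comment. The pose matrix and neighbor count are placeholders.

# Plain Isomap embedding of pose data (joint angles / marker positions over time).
from sklearn.manifold import Isomap

# poses: (n_frames, d) array of high-dimensional pose samples
embedding = Isomap(n_neighbors=10, n_components=3).fit_transform(poses)

# ST-Isomap would first shrink the pairwise distances between temporally adjacent
# (and temporally corresponding) frames before computing geodesic distances, so
# that points reached along similar movement paths end up close in the embedding.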


Discussion



ISOMAP is a proven algorithm, and so is this extension for finding manifolds in temporal data. I think this could be useful for clustering haptic gesture information. The high-dimensional space of finger and hand locations could be reduced with ISOMAP into a simpler space where gestures could be segmented or classified more easily.

Maybe. Seems like a neat approach, anyhow. And ISOMAP is used for a /ton/ of stuff in machine learning, so it's not like this is a cheesy hack that no one really uses.

BibTeX



@inproceedings{jenkins2004ste,
  title     = {{A spatio-temporal extension to Isomap nonlinear dimension reduction}},
  author    = {Jenkins, O.C. and Matari{\'c}, M.J.},
  booktitle = {International Conference on Machine Learning},
  year      = {2004},
  publisher = {ACM Press New York, NY, USA}
}