Saturday, May 10, 2008

Eisenstein - Device independence and extensibility

Eisenstein, J.; Ghandeharizadeh, S.; Golubchik, L.; Shahabi, C.; Donghui Yan; Zimmermann, R., "Device independence and extensibility in gesture recognition," Virtual Reality, 2003. Proceedings. IEEE , vol., no., pp. 207-214, 22-26 March 2003

Summary



They're trying to make a recognition framework that is device independent, so you can plug data into it from any device. The first layer is the raw data, where each sensor value is application dependent. The second layer is a set of predicates that combine raw values into postures. The third layer is a set of temporal predicates that describe changes in posture over time. The fourth layer is a set of gestural templates that assign temporal patterns of postures to specific gestures.

Sensor values from layer 1 are mapped to predicates by hand for each device. Templates are added once, combining predicates. Use a bunch of neural networks to combine predicates into temporal data, and a bunch more neural networks to combine temporal data into gestures. Train all these networks with a whole lot of data.
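
As a concrete (made-up) illustration of that hand-mapped layer-1-to-layer-2 step, here is roughly what the device-specific predicate mapping might look like for a hypothetical flex-sensor glove; the sensor names, thresholds, and predicate names are all my own placeholders, not the paper's.

# Hypothetical device-specific mapping from raw glove sensor values (layer 1)
# to posture predicates (layer 2). Sensor names and thresholds are invented;
# each new device would need its own hand-written mapping like this.
def posture_predicates(raw):
    """raw: dict of sensor name -> value for one frame of glove data."""
    return {
        "index_extended": raw["index_flex"] < 0.2,   # low flex = straight finger
        "fist_closed": all(raw[f] > 0.8 for f in
                           ("index_flex", "middle_flex", "ring_flex", "pinky_flex")),
        "palm_up": abs(raw["roll_deg"] - 180.0) < 20.0,
    }
# The higher layers (temporal predicates, gesture templates) would only ever
# see these predicate values, never the raw sensor readings.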

Discussion



They omit anything involving temporal data from their experiments, so they only use the ASL letters, excluding Z and J. This is pretty cheesy. Also, they are shooting for device independence, but you still have to map and train the connections between raw sensor values and predicates by hand for each device. I understand you'd have to do this for any application, but it seems to defeat their purpose. I guess their benefits come at the higher levels, where you use predicates regardless of how they were constructed.

This seems crazy complicated for /very/ little accuracy. Using neural networks to classify the static ASL letters (all but Z and J), they only get 67% accuracy. Other approaches get close to 95-99% on the same data. I guess things are a little too complex.

Schlomer - Gesture Rec with Wii Controller

Schlomer, Poppinga, Henze, Boll. Gesture Recognition with a Wii Controller (TEI 2008).

Summary



Wiimote. Acceleration data. Quantize the (x,y,z) acceleration data using k-means into a codebook of size 14. Plug the quantized data into a left-right HMM. Segment gestures by making the user press and hold the A button during a gesture. They can recognize about 90% of all gestures accurately (circle, square, tennis swing).
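
A rough sketch of that pipeline, assuming numpy/scikit-learn: the codebook size (14) is from the paper, but the toy forward-algorithm scorer and all parameter names are mine (the real system would train the left-right HMM with Baum-Welch via a toolkit).

import numpy as np
from sklearn.cluster import KMeans

# 1) Quantize raw (x, y, z) acceleration frames into a codebook of 14 symbols.
accel = np.random.randn(1000, 3)               # placeholder for recorded Wiimote frames
codebook = KMeans(n_clusters=14, n_init=10).fit(accel)
symbols = codebook.predict(accel)              # one discrete symbol per frame

# 2) Score a segmented symbol sequence under one gesture's HMM with the
#    (scaled) forward algorithm; A, B, pi would come from training.
def forward_loglik(obs, A, B, pi):
    """obs: symbol indices; A: (S,S) transitions (upper triangular for a
    left-right model); B: (S,K) emission probabilities; pi: (S,) initial."""
    alpha = pi * B[:, obs[0]]
    loglik = np.log(alpha.sum())
    alpha /= alpha.sum()
    for t in range(1, len(obs)):
        alpha = (alpha @ A) * B[:, obs[t]]
        c = alpha.sum()
        loglik += np.log(c)
        alpha /= c
    return loglik

# 3) Classify a button-press-to-release segment by evaluating forward_loglik
#    under each gesture's model and taking the argmax.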

Discussion



The point of this is.... what? Wii games, like Wii tennis and bowling, are pretty darned accurate at this already.

Murayama - Spidar G&G

Summary



We have a 6 DOF haptic device called the SPIDAR G that provides force feedback. Let's put two of them together, one for each hand. We'll make people do different things, like use one hand to manipulate a target and the other to manipulate an object they have to put into the target. If they run into something, we'll give them force feedback to say they hit something. When they use two SPIDARs instead of just one, they can usually do things faster.

Discussion



So all they do is take one machine, hook up another, and use the two of them together. And using two devices, users tend to do things a little faster. This is pretty obvious, since instead of just moving the object, I can now also move the target. It reduces the amount of work I have to do.

I wonder if the strings of the Spidar would get in your way and limit your movement. Surely they would. Rotation would also be tough because you can't hold onto something and rotate it more than about 180 degrees.

No real evaluation performed, just a little bit of speedup data.

Lapides - 3D Tractus

Lapides, P., Sharlin, E., Sousa, M. C., and Streit, L. 2006. The 3D Tractus: A Three-Dimensional Drawing Board. In Proceedings of the First IEEE international Workshop on Horizontal interactive Human-Computer Systems (January 05 - 07, 2006). TABLETOP. IEEE Computer Society, Washington, DC, 169-176. DOI= http://dx.doi.org/10.1109/TABLETOP.2006.33

Summary



I want to draw in 3D, but all I have is this tablet PC. I know, I'll put the tablet PC on a table that moves up and down, simulating the third dimension. I'll make a drawing program with a user interface that shows the 3D drawing from a few angles so users won't get too confused. We'll use visual cues like encoding depth with line width. Also, we'll use perspective projection so the user knows which things are 'below' the current plane of the tablet PC.

People used it and said it was neat.

Discussion



This drawing/3D modeling approach is a little more realistic than other things we've read (Holosketch or the superellipsoid clay thing) since all you need is a tablet and one of their funky elevator things. So I'll give it kudos there.

I'm not sure how simple or intuitive it is, however, to have to move your tablet PC up and down to draw in the third dimension. I question the accuracy, especially if you're trying to line things up one on top of another, since you don't really have a good idea of where things are along the Z axis. This is especially hard if what you're trying to draw exists above the current plane of the tablet PC, since their software only shows you what's below the current plane.

Neat idea, but a bit klunky. Hope someone doesn't get their legs cut off if the tractus goes berserk.

Krahnstoever - Activity Recognition with Vision and RFID

Krahnstoever, N.; Rittscher, J.; Tu, P.; Chean, K.; Tomlinson, T., "Activity Recognition using Visual Tracking and RFID," Application of Computer Vision, 2005. WACV/MOTIONS '05 Volume 1. Seventh IEEE Workshops on , vol.1, no., pp.494-500, 5-7 Jan. 2005

Summary



Person in an office or warehouse with cameras on them. Track their movements with a Monte Carlo model examining the image frames. Augment this with RFID tags embedded in all the objects the human can interact with. Do activity recognition by examining how the person is moving (vision) and what they are interacting with (RFID). RFID helps augment visual tracking for the purposes of activity recognition.

Discussion



So they take an existing Monte Carlo visual tracking algorithm and magically throw RFID into the jar. They say this does better. Sort of a "duh" moment. Why do we let Civil Josh pick papers?

Bernardin - Grasping HMMs

Bernardin, K., K. Ogawara, et al. (2005). "A sensor fusion approach for recognizing continuous human grasping sequences using hidden Markov models." Robotics, IEEE Transactions on [see also Robotics and Automation, IEEE Transactions on] 21(1): 47-57.

Summary



I have a robot that I want to teach to grab things. I can teach it by example. I have 14 different types of grips that I use every day. I'll put pressure sensors in a glove worn under the CyberGlove. I will grab something, and then let it go. All of this data is fed into an HMM from the HTK speech recognition toolkit. The HMM tells me which grasp I am making with up to about 90% accuracy.

Discussion



Pretty neat. If you know what you're grasping, you can do things like activity recognition and such. Especially helpful when you start using smart rooms and offices, etc. Maybe even Information Oriented Programming (IOP)!

I think the pressure sensors really helped augment the CyberGlove, especially since there were so many grasp categories.

Nishino - Object modelling with gestures

Nishino, H., Utsumiya, K., and Korida, K. 1998. 3D object modeling using spatial and pictographic gestures. In Proceedings of the ACM Symposium on Virtual Reality Software and Technology (Taipei, Taiwan, November 02 - 05, 1998). VRST '98. ACM, New York, NY, 51-58. DOI= http://doi.acm.org/10.1145/293701.293708

Summary



Put on special glasses to get a 3d stereoscopic image from a curved screen, and put glove/motion tracker on your hands to track them. Have some virtual clay modelled by a superellipsoid (it's mathematically easy to work with, relatively). Create a blob, deform it, mash it, pinch it, stretch it, put it in a pan, bake it up as fast as you can. Combine a bunch of blobs to make things like teapots, vases, and bigger blobs.

Discussion



Good for professional sculptors who might want to fashion something without wasting real clay. But since clay is easy to recycle (just add water), who cares? If you're not a sculptor, are you good enough with your hands to make your blobs of junk look like things in real life? And how accurate is the hand tracking, so a noisy spike doesn't accidentally mash your teapot into oblivion?

Pretty neat idea, just not sure of its usefulness.

Campbell - Invariant Features, Tai-Chi

Campbell L W, Becker D A, Azarbayejani A, Bobick A F, and Pentland A, Invariant Features for 3-D Gesture Recognition, Proc. of FG'96 (1996) 157-162.

Summary



Look at a series of Tai-Chi gestures captured on video. Extract a lot of features about the gestures, including plain (x,y,z) coordinates, velocities of those coordinates, polar coordinates, and polar velocities. Do each of these with and without head data (always with hand data). Plug all the different feature sets into an HMM and see which does the best. Polar velocity with no head does the best at about 95% accuracy overall. Plain (x,y,z) does the worst at about 34% overall.
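
A sketch of how the Cartesian-vs-polar feature sets could be computed from a tracked hand trajectory, assuming numpy; the body-centered origin, frame rate, and spherical-coordinate convention are placeholders, not necessarily what Campbell et al. did.

import numpy as np

def feature_sets(pos, dt=1.0 / 30.0, origin=None):
    """pos: (T, 3) tracked hand positions. Returns the four feature flavors
    compared in the paper: raw xyz, xyz velocity, polar coords, polar velocity."""
    if origin is None:
        origin = pos.mean(axis=0)                    # placeholder body-centered origin
    rel = pos - origin
    r = np.linalg.norm(rel, axis=1)
    azimuth = np.arctan2(rel[:, 1], rel[:, 0])
    elevation = np.arccos(np.clip(rel[:, 2] / np.maximum(r, 1e-9), -1.0, 1.0))
    polar = np.stack([r, azimuth, elevation], axis=1)
    return {
        "xyz": pos,
        "xyz_vel": np.gradient(pos, dt, axis=0),
        "polar": polar,
        "polar_vel": np.gradient(polar, dt, axis=0),
    }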

Discussion



Just take a bunch of features and string them all together. Perform a standard feature extraction/selection algorithm. Get a set of features that probably outperforms all your sets.

Win.

This paper isn't really interesting, as it just shotguns a bunch of features into an HMM and sees which set wins.

Fail.

Wesche - FreeDrawer

Wesche, G. and Seidel, H. 2001. FreeDrawer: a free-form sketching system on the responsive workbench. In Proceedings of the ACM Symposium on Virtual Reality Software and Technology (Baniff, Alberta, Canada, November 15 - 17, 2001). VRST '01. ACM, New York, NY, 167-174. DOI= http://doi.acm.org/10.1145/505008.505041

Summary



Electronic pen you can use to draw in 3D space. To make things simpler for their algorithm, you're restricted to spline curves. You trace out the general curve with the pen and the computer calculates the parameters of the spline. You can draw curves, modify them, connect curves together to form a network, fill in surfaces between curves. You wear wonky VR goggles to see what you're drawing.

Discussion



Tradeoff between user freedom (virtual clay) and performance--they choose performance by limiting the user's drawing style (restricted to splines). They claim this is a good choice because splines have a closed-form representation, are easily transferable (just the spline parameters, not every voxel, need to be transmitted), and are computationally cheap (storing every voxel of virtual clay is expensive).

They admit you need an artistic flair and a little bit of training to get used to using the splines. Well then why not just train on a CAD system? Isn't the point to offer an intuitive interface with no need for training or restrictions? Plus, if you use CAD, you don't have to use /just/ splines, can be precise and exact, and don't have to wear wonky 3D goggles.

Poddar - Gesture Speech, Weatherman

I. Poddar, Y. Sethi, E. Ozyildiz, R. Sharma. Toward Natural Gesture/Speech HCI: A Case Study of Weather Narration. Proc. Workshops on Perceptual User Interfaces, pages 1-6, November, 1998.

Summary



Three categories of gestures: pointing, area, and contour, each with three phases: preparation, making the gesture, and retraction. Use features that measure distances/angles between the face and hands and plug them into an HMM. Get 50-60% accuracy on four test sequences.

Now add speech to the gesture data. Compute co-occurrences of marker words with different gestures and use the data to help the HMM classify gestures. Accuracy goes up about 10%.
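
Roughly how I picture that co-occurrence step, as a hedged sketch: count how often marker words fall inside each gesture class's time window, turn the counts into per-keyword class priors, and use those priors to reweight the HMM scores. All names, the smoothing, and the combination rule are my assumptions.

from collections import Counter, defaultdict
import numpy as np

GESTURES = ["pointing", "area", "contour"]

def cooccurrence_priors(examples, smoothing=1.0):
    """examples: list of (gesture_label, [marker words spoken during it]).
    Returns keyword -> probability vector over gesture classes."""
    counts = defaultdict(Counter)
    for label, words in examples:
        for w in words:
            counts[w][label] += 1
    priors = {}
    for w, c in counts.items():
        vec = np.array([c[g] + smoothing for g in GESTURES], dtype=float)
        priors[w] = vec / vec.sum()
    return priors

def rescore(hmm_logliks, spoken_words, priors):
    """Combine per-class HMM log-likelihoods with the speech-derived priors."""
    score = np.array(hmm_logliks, dtype=float)
    for w in spoken_words:
        if w in priors:
            score += np.log(priors[w])
    return GESTURES[int(np.argmax(score))]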

Discussion



Adding speech to gesture data improves the accuracy. This is fairly obvious, and they've shown that it does a little bit. The one thing I don't like is the manual labeling of speech data.

I wish they had done more gestures, and their accuracies weren't great. But at least it was a fusion of contextual data.

Wednesday, April 23, 2008

Eisenstein - discourse topic and gestural form

Jacob Eisenstein, Regina Barzilay, and Randall Davis. "Discourse Topic and Gestural Form." AAAI 2008.

Summary



The authors want to examine the relationship between gestures and meaning. They are looking for a correspondence between certain gesture and topic, irrespective of the "speaker" of the gesture. If gestures are speaker independent and depend only on topic, this can possibly improve gesture recognition accuracies.

They set up a topic-author model. Gesture features are extracted from a series of conversations about different topics, where the speaker is making gestures to accompany his speech. They model gesture features with normals, and topic/speaker gesture distributions with multinomials drawn from Dirichlet distributions (the Dirichlet compound multinomial, or Polya distribution). After learning the parameters of their models, they use Bayesian inference and statistical significance tests to determine that 12% of all gestures belong to specific topics. Thus, if we have prior information about the topic (i.e., speech), we can use contextual information to improve gesture recognition.
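
For reference, the Dirichlet compound multinomial (Polya) log-likelihood they build on can be written down in a few lines; this is a generic sketch of that distribution (dropping the constant multinomial coefficient), not their full topic-author model.

import numpy as np
from scipy.special import gammaln

def dcm_log_likelihood(counts, alpha):
    """Log P(counts | alpha) under the Dirichlet compound multinomial (Polya)
    distribution, up to the multinomial coefficient (constant per observation).
    counts: (K,) feature counts; alpha: (K,) Dirichlet parameters."""
    counts = np.asarray(counts, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    A, N = alpha.sum(), counts.sum()
    return (gammaln(A) - gammaln(N + A)
            + np.sum(gammaln(counts + alpha) - gammaln(alpha)))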

Discussion



The paper's purpose is to look for a link between gestures and topic. They find one, but this isn't too surprising given their limited dataset. Furthermore, many of their videos (from which gestures and speech were extracted) were very limited in scope. My hypothesis is that given a more general scope of data, the percentage of topic-specific gestures would drop.

It's true that about 10% of word occurrences (about 80% of the vocabulary, numbers off the top of my head) in large corpora are topic specific and are called content-carrying, since they can identify the topic of a document. However, I don't think there are that many distinct gestures, and there is a great deal more reuse of gestures across topics.

Monday, April 14, 2008

Chang - Feature selection and grasp

Lillian Y. Chang, Nancy S. Pollard, Tom M. Mitchell, and Eric P. Xing. "Feature selection for grasp recognition from optical markers." Proceedings of the 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems San Diego, CA, USA, Oct 29 - Nov 2, 2007.

Summary



31 markers on a hand that give (x,y,z) positions. Use stepwise forward and backward selection to pick a reduced subset of these markers. 5 markers give good accuracy, about 86% compared to 91% max accuracy of full set of markers.
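
A hedged sketch of stepwise forward selection over the markers, using a generic classifier and cross-validated accuracy as the selection criterion in place of whatever Chang et al. actually used; the shapes and the choice of kNN are my assumptions.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def forward_select_markers(X, y, max_markers=5):
    """X: (N, M, 3) optical marker positions per grasp sample; y: grasp labels.
    Greedily adds the marker that most improves CV accuracy."""
    n_markers = X.shape[1]
    selected, remaining = [], list(range(n_markers))
    while remaining and len(selected) < max_markers:
        best_m, best_score = None, -np.inf
        for m in remaining:
            feats = X[:, selected + [m], :].reshape(len(X), -1)
            score = cross_val_score(KNeighborsClassifier(), feats, y, cv=5).mean()
            if score > best_score:
                best_m, best_score = m, score
        selected.append(best_m)
        remaining.remove(best_m)
        print(f"added marker {best_m}, CV accuracy {best_score:.3f}")
    return selected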

Discussion



They use 6 grasp types; how easy/hard are they compared to the 14 types in Bernardin et al.?

SFFS and SBFS are locally optimal; what about +L-R or bidirectional selection?

Fels - Glove-Talk II

S. Sidney Fels and Geoffrey E. Hinton. "Glove-TalkII—A Neural-Network Interface which Maps Gestures to Parallel Formant Speech Synthesizer Controls." IEEE Transactions on Neural Networks, Vol. 9, No. 1, January 1998.

Summary



CyberGlove with a Polhemus 6 DOF tracker, Contact Glove to measure contacts of fingers and thumbs, and a foot pedal. Three neural networks are implemented to extract speech parameters from the input devices and feed them to a speech synthesizer. Hand height controls pitch. Pedal controls volume.

A V/C network determines whether the user is trying to make a vowel sound or a consonant sound. The inputs are finger flex values, with 5 sigmoid feed-forward hidden units, and the output is the probability of making a vowel sound. A vowel is specified by the user keeping all fingers unbent and the hand open.

One network determines the vowel sound the user is trying to make. Vowel sounds are determined by XY position in space, as measured by the Polhemus. An RBF network determines what position the user is in and outputs the appropriate vowel sound parameters for the speech synth.

The last network looks at the contact glove data, determining which fingers are touching the thumb. Consonant phonemes are mapped to different hand configurations and pattern-matched with the network. The input is flex values, the output is consonant speech synth parameters.

100 hours of training by one poor sap, who seemed to have provided 2000 examples of input, and he can produce "intelligible and somewhat natural sounding" speech, with the added bonus that he "finds it difficult to speak quickly, pronounce polysyllabic words, and speak spontaneously."

Discussion



First, a caveat. This is a neat idea. I've not seen gesture rec applied to something like this. As far as the idea goes, I'd give it an 8.5 / 10. And also, it's important to remember that humans, using their mouth parts and vocal tract, take what...5 years?...to learn how to produce decent speech. So of course something like this will come with the high cost of training.

Second, the approach here is poor. The system is far too complicated, with all the pedals and hand-wavey motions. One obvious way to simplify it is to remove the second glove (the Contact Glove) completely. The authors don't really say what it's used for, and it seems like it's not used for much, especially if the pedal can control stops, etc. For the vowel and consonant networks, they're basically just performing a nearest neighbor lookup. Why not do exactly that and make things much simpler? Perhaps because the networks provide function fitting and interpolation, blending the speech parameters smoothly as you move from one sound to another. But I think nearest neighbor would work well.

There are ways to compute the centers and variances for the RBFs in their networks from the training data. No need for hand-picked or hard-coded values.
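
What I mean, concretely, is something like the following sketch (assuming numpy/scikit-learn): estimate the RBF centers with k-means on the recorded hand positions and set each width from the spacing of the centers. The number of centers and the width heuristic are placeholders, not Fels and Hinton's values.

import numpy as np
from sklearn.cluster import KMeans

def fit_rbf_layer(X, n_centers=10):
    """X: (N, 2) hand XY positions from the training data.
    Returns data-derived RBF centers and per-center widths."""
    km = KMeans(n_clusters=n_centers, n_init=10).fit(X)
    centers = km.cluster_centers_
    # width heuristic: distance from each center to its nearest neighbor center
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    widths = d.min(axis=1)
    return centers, widths

def rbf_activations(x, centers, widths):
    """Hidden-layer activations for one hand position x (shape (2,))."""
    return np.exp(-np.sum((x - centers) ** 2, axis=1) / (2.0 * widths ** 2))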

So if the idea gets 8.5/10, their execution gets a 3/10.

Wednesday, April 9, 2008

Kim - RFID target tracking

Kim, Myungsik, et al. "RFID-enabled target tracking and following with a mobile robot using direction finding antennas."

Summary



The authors propose a system that lets a robot obtain a direction to, and follow or go to, a target (stationary or mobile) using RFID. The target has an RFID transponder. The robot has two antennae, perpendicular to each other, on a motor mount so they can rotate independently of the robot. The antennae pick up different signals from the transponder, and the robot can compute direction and distance based on intensity and the signal strength ratio. By rotating the antenna array separately from the robot, they avoid the problem of the robot freaking out in environments densely populated with obstacles: it can average the signals over time as it rotates, then make a decision after the rotation.

Results: it can follow stuff.

Discussion



1) What is the system latency?

2) How well does it work "in real life" with a bunch of obstacles?

3) To use this for hand tracking, we'd put an RFID transponder on the hand and the computer could track it. How accurate is it? The authors do say the signal ratio is not that great for accuracy ("This makes it difficult to precisely estimate the DOA directly from the ratio") because of noise. Is it centimeter/inch accurate, or is it crappy like the P5 glove? Is the best we can hope for a "your hand is over there somewhere"?

Monday, April 7, 2008

Brashear - ASL game

Brashear, Helene, et al. "American sign language recognition in game development for deaf children." ASSETS 2006.

Summary



Two parts: 1) a Wizard of Oz game for helping deaf kids of hearing parents (who presumably can't sign) learn sign language; 2) a recognition system for ASL words/sentences to automate the game's feedback.

The recognition system uses cameras, a colored glove, and accelerometers attached to the glove. The glove is colored to help image segmentation and hand tracking within the image. Data is segmented at the sentence level with "push to sign" (click the mouse to start, click to end). The image is converted to HSV histograms, which are enhanced with filtering. Image tracking is assisted by HSV values that are normalized using the new values and weighted old values (giving more mass to the area where the hand was in the last frame). The features are the x, y, z accelerometer values plus vision data: change in the hand's (x,y) center position, length of the major/minor axes, eccentricity, orientation angle, and direction of the major axis as an (x,y) offset. Data is classified with HMMs using GT2k. With 90/10 random holdout splits repeated 100 times (5 kids), they achieve 86% word accuracy on average for their user-independent models, and 61% sentence accuracy.
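
A small sketch of the weighted histogram update as I read it: blend the hue histogram from the current frame's hand region with the previous one, so mass stays concentrated where the hand was last seen. The blend weight, bin count, and OpenCV-style hue range are my assumptions.

import numpy as np

def update_hand_histogram(hsv_roi, old_hist, alpha=0.3, bins=32):
    """hsv_roi: (H, W, 3) HSV pixels around the previous hand location.
    old_hist: previous (bins,) hue histogram.  alpha: weight on new evidence."""
    hue = hsv_roi[..., 0].ravel()
    new_hist, _ = np.histogram(hue, bins=bins, range=(0, 180), density=True)
    # exponential blend: keep mass near where the hand was, fold in the new frame
    return alpha * new_hist + (1.0 - alpha) * old_hist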

Discussion



Decent word accuracy. I think their HMM sentence accuracy was hurt by the fact that they did not have much training data. With more data, and with something a little more robust than GT2K, they might be able to do better. I don't like how they tried to pass off user-dependent results, since these are pretty worthless as you have to train per user. With user-dependent models, you can probably just use something akin to kNN and get close to 100% accuracy, since a user probably doesn't vary /too/ much from one instance to another.

Ogris - Ultrasonic and Manipulative Gestures

Ogris, Georg, et al. "Using ultrasonic hand tracking to augment motion analysis based recognition of manipulative gestures." ISWC 2005.

Summary



They want to augment a motion-analysis-based system with an ultrasonic positioning system to determine which action is being performed on which tool/object in a workshop or something similar. They look at model-based classification (a sequence of data frames) with left-right HMMs, and at frame-based classification using decision trees (C4.5) and kNN. They also examine methods of using the ultrasonic data to constrain the plausible classification results: classify to get a ranked list, then pick the result that is most plausible given the ultrasonic position. If none is plausible enough, the ultrasonic data is declared bad and the most likely classification result is chosen.
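
My reading of that plausibility step, as a hedged sketch: walk down the classifier's ranked list and take the first class whose associated tool/object location agrees with the ultrasonic hand position; if nothing plausible remains, fall back to the classifier's top choice. The distance threshold, score cutoff, and data layout are my own.

import numpy as np

def plausibility_filter(ranked_labels, scores, hand_pos, class_locations,
                        max_dist=0.3, min_score=0.05):
    """ranked_labels: class labels sorted by classifier confidence (descending).
    scores: matching confidences.  hand_pos: (3,) ultrasonic position estimate.
    class_locations: label -> (3,) location of that activity's tool/object."""
    for label, score in zip(ranked_labels, scores):
        if score < min_score:
            break                              # nothing plausible is confident enough
        if np.linalg.norm(hand_pos - class_locations[label]) <= max_dist:
            return label                       # plausible given where the hand is
    return ranked_labels[0]                    # treat the ultrasonic data as unreliable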

Using ultrasound alone they get 59% with C4.5 and 60% with kNN. They get 84% accuracy classifying frames of data with kNN. HMMs only reach 65% accuracy, due to a lack of training data and longer, unstructured gestures. Using the plausibility analysis, frame-based accuracy increases to 90%.

Discussion



I like that they use ultrasonics to get position data to help improve classification accuracy. But this doesn't seem like a groundbreaking addition. They just use a bunch of different classifiers and, gasp, find that contextual information (ultrasonic data) can improve classification accuracy.

Decent, but nothing groundbreaking or surprising.

Sawada - Conducting and Tempo Prediction

Sawada, Hideyuki, and Shuji Hashimoto. "Gesture recognition using an acceleration sensor and its application to musical performance control." Electronics and Communications in Japan 80:5, 1997.

Summary



Use accelerometers and gyroscopes to get data from the moving hand. Compute 2D acceleration vectors in the XY, XZ, and YZ planes. One feature is the sum of changes in acceleration, another is the rotation of the acceleration, and a third is the aspect ratio of the two components of each 2D acceleration vector (which acceleration component is larger). 8 more features give the distribution of acceleration over eight principal directions separated by pi/4. These 11 features are computed for each of the three planes, giving 33 features per gesture. The mean and standard deviation of each feature are computed, and a gesture is classified as the class with the lowest weighted error (sum of squared differences from the mean, divided by the standard deviation).
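
A guess at how the eight principal-direction features might be computed for one plane (XY shown); the magnitude weighting and normalization are my assumptions.

import numpy as np

def direction_distribution(ax, ay, n_dirs=8):
    """ax, ay: acceleration components in one plane over a gesture.
    Returns the fraction of (magnitude-weighted) acceleration falling in each
    of n_dirs angular sectors of width 2*pi/n_dirs (pi/4 when n_dirs=8)."""
    angles = np.arctan2(ay, ax)                                  # in [-pi, pi)
    mags = np.hypot(ax, ay)
    hist, _ = np.histogram(angles, bins=n_dirs, range=(-np.pi, np.pi), weights=mags)
    total = hist.sum()
    return hist / total if total > 0 else hist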

They look at the data to see where maxima in acceleration occur, representing places where a conductor changes direction and marking off tempo beats. To smooth the computer's performance against changing/noisy tempo beats made by a human, the system uses linear prediction to guess the next tempo. A parameter controls the system's reliance on the human versus its tendency to smooth out noisy tempo beats.
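
As I understand it, the reliance parameter just blends the interval the system predicts with the interval the conductor actually produced; a toy version, with the linear prediction done as a least-squares line fit over recent beats (my stand-in, not necessarily their exact predictor):

import numpy as np

def linear_prediction(past_intervals, k=4):
    """Predict the next beat interval from a line fit to the last k intervals
    (requires at least two observed intervals)."""
    t = np.arange(len(past_intervals), dtype=float)
    slope, intercept = np.polyfit(t[-k:], np.asarray(past_intervals, float)[-k:], 1)
    return slope * len(past_intervals) + intercept

def next_beat_interval(predicted, observed, reliance=0.5):
    """reliance=1.0: follow the human exactly; reliance=0.0: ignore the human
    and keep the smoothed prediction."""
    return reliance * observed + (1.0 - reliance) * predicted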

Discussion



They don't really explain their features well. Furthermore, they give this whole spiel about the rotation feature and then say they don't use it. Well, big deal, then. Why list rotation as a feature?

They're not doing gesture recognition, just marking tempo beats using changes in acceleration. They don't need 33 features for this. They need 3--acceleration in X, Y, and Z. The rest are linearly dependent on the data. They can predict tempo fairly accurately, but I'm not that impressed.

Monday, March 31, 2008

Mantyjarvi - Accelerometer DVD HMMs

Mantyjarvi, Jani, Juha Kela, Panu Korpipaa, and Sanna Kallio. "Enabling fast and effortless customisation in accelerometer based gesture interaction." MUM 2004.

Summary



Take accelerometer data. Segment the gesture by holding a button down during movement. Resample the gesture to 40 frames. Vector quantize (with k-means) the 40 3D points against a codebook of size 8. Plug the resulting 40-symbol sequences into an ergodic, 5-state HMM and classify. Train the HMMs until the percent difference in log likelihood is below a threshold.

Need more training data? Augment the training examples you do have with noise, either uniformly or normally distributed. A signal-to-noise ratio of about 3 is best for Gaussian noise, 5 for uniform, and both slightly increase accuracy when used to generate training examples. Accuracy increases with more training examples. They get about 98% accuracy on their easy data set.

Discussion



Another paper that uses a ridiculously easy gesture set for use with powerful hidden Markov models. I think Rubine or $1 would do just as good, and wouldn't require the complexity of HMMs.

I do like the idea of generating new training examples by adding artificial noise. This can be useful when you don't have a lot of training data to begin with. However, I don't like the way they did it. They should be learning the parameters of their distributions by examining the real data. For example, using the real training examples, discover what the mean and covariance values should be. Then sample those distributions to get new training examples, rather than adding noise to a real training example (which will make outliers even worse). Also, it's not clear there is any real advantage to Gaussian over uniformly distributed noise. In Fig. 6, Gaussian seems to do better for low SNR and uniform better for high SNR. And in Fig. 7, the results are all over the place. Are the differences in accuracy statistically significant?
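
Concretely, what I'm suggesting would look something like this sketch (numpy assumed, shapes following the paper's 40-frame resampled gestures): fit a per-frame Gaussian to the real examples of a class and sample whole new gestures from it, instead of perturbing individual examples.

import numpy as np

def synthesize_examples(real, n_new, rng=None):
    """real: (N, 40, 3) resampled accelerometer gestures of one class (N >= 2).
    Fits a per-frame Gaussian (mean + covariance over the 3 axes) and samples
    n_new synthetic gestures from it."""
    rng = np.random.default_rng() if rng is None else rng
    N, T, D = real.shape
    out = np.empty((n_new, T, D))
    for t in range(T):
        mu = real[:, t, :].mean(axis=0)
        cov = np.cov(real[:, t, :], rowvar=False) + 1e-6 * np.eye(D)
        out[:, t, :] = rng.multivariate_normal(mu, cov, size=n_new)
    return out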

Wobbrock - $1

Wobbrock et al - $1 Recognizer

Sunday, March 30, 2008

Lieberman - TIKL

Lieberman, Jeff and Cynthia Breazeal. "TIKL: Development of a wearable vibrotactile feedback suit for improved human motor learning."

Summary



1) Teacher puts on this suit with VICON sensors and built-in vibrating doodads.
2) Teacher does a gesture and it is trained.
3) Give suit to student, adjust it etc.
4) Student does gesture, and suit vibrates to correct errors (difference in joint angles between teacher and student, multiplied by a coefficient for the amount of feedback; see the sketch after this list). Can do cues to rotate and bend, etc.
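
The feedback law in item 4 reads to me as simple proportional control on the joint-angle error; a toy version with made-up names and gains:

def vibration_commands(teacher_angles, student_angles, gain=0.8, max_out=1.0):
    """Per-joint vibration intensity proportional to the joint-angle error,
    clipped to the actuator range. Gain and clipping values are placeholders."""
    return {joint: max(-max_out, min(max_out,
                       gain * (teacher_angles[joint] - student_angles[joint])))
            for joint in teacher_angles}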


Blah blah, graphs. Overall error and training time are both reduced using the suit compared to going without it.

Discussion



Admitted flaws: the cost of VICON and the bulkiness/hassle of the half-body suit. The idea here is //REALLY// neat, using vibration feedback to correct gestures. I'm just not that into it because there aren't any algorithms for a machine learning person like me to dig into.

Lee - Neural Network Taiwanese Sign Language

Lee, Yung-Hui, and Cheng-Yueh Tsai. "Taiwan sign language (TSL) recognition based on 3D data and neural networks." Expert Systems with Applications (2007). doi:10.1016/j.eswa.2007.10.038

Summary



Vision-based neural network posture recognition (20 static postures). Manual segmentation is used, and the recording is rigged to produce nearly perfect postures. The hands/fingers are tracked by a VICON system (3D coordinates) and filtered. The features are the distances between different landmarks on the hand, normalized to account for varying hand sizes. Features are fed into a neural network with 2 hidden layers, each with 250 hidden units, trained for 3000 epochs or until the root mean squared error is < 0.0001. About 95% accuracy.
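
A sketch of the distance features, assuming numpy; which landmark pairs they use and which reference distance they normalize by (here, a made-up 'hand size' between two reference markers) are my guesses.

import numpy as np
from itertools import combinations

def landmark_distance_features(landmarks, ref_pair=(0, 9)):
    """landmarks: (M, 3) 3D positions of hand markers for one posture frame.
    Returns all pairwise distances, normalized by a reference 'hand size'
    distance (the ref_pair indices are placeholders)."""
    hand_size = np.linalg.norm(landmarks[ref_pair[0]] - landmarks[ref_pair[1]])
    return np.array([np.linalg.norm(landmarks[i] - landmarks[j]) / hand_size
                     for i, j in combinations(range(len(landmarks)), 2)])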

Discussion



Holy overtraining, Batman! That's a lot of hidden units! Especially for a problem this set up to be easy...static postures...practiced until it was perfect...recognition should be close to perfect. Just do template matching. Also, don't include your training data in your test set.

Your features must really suck if you can't get closer to 100% on this problem (like > 99%). Even if you do pixel-by-pixel template matching, you should get pretty darn close to 99%. Heck, even handwritten digit recognition is close to 100%.

Patwardhan - Predictive EigenTracker

Patwardhan, K. S. and S. D. Roy. "Hand gesture modelling and recognition involving changing shapes and trajectories, using a Predictive EigenTracker." Pattern Recognition Letters 28 (2007), 329--34.

Summary



The authors seek to recognize dynamic hand gestures with changing shape as well as motion. They use principal components analysis (PCA) to get an eigenspace representation of the objects they wish to track (hands). Within the eigenspace, particle filtering is used to predict where the eigen-hands (the hand image projected into eigenspace) will appear next. Skin color and motion cues are used to initialize the system automatically.

The EigenTracker is used to segment the hand motions (second paragraph of section 3) when "a drastic change in the appearance of the gesticulating hand, caused by the change in the hand shape, results in a large reconstruction error. This forces an epoch change, indicating a new shape of the gesticulating hand." The segments are used to create shape/motion pairs for the gesture. Trajectories are modeled with linear regression (a least-squares linear approximation).

The tracked hand gestures are modeled as sequences of shape/movement pairs. The models are trained to get a mean gesture and covariance (Gaussian models), and the model with the smallest Mahalanobis distance to the test gesture is chosen as the classification label.

5 eigenvectors are used in PCA to capture 90% of the variance. Each gesture is split into 2 epochs. Using Mahalanobis distance, they get 100% classification accuracy.
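
A sketch of that final classification step as I understand it (numpy assumed): each class is a Gaussian over its gesture feature vector, and a test gesture goes to the class with the smallest Mahalanobis distance. The regularization term and data layout are mine.

import numpy as np

def fit_class_models(features_by_class):
    """features_by_class: label -> (N, D) feature vectors of training gestures."""
    models = {}
    for label, X in features_by_class.items():
        mu = X.mean(axis=0)
        cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])   # regularized
        models[label] = (mu, np.linalg.inv(cov))
    return models

def classify(x, models):
    """Return the label whose model has the smallest Mahalanobis distance to x."""
    def mahalanobis_sq(mu, precision):
        d = x - mu
        return float(d @ precision @ d)
    return min(models, key=lambda label: mahalanobis_sq(*models[label]))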


Discussion



They test with their training data, so this is crap. Also, their dataset is extremely simple, with very unique and defined shape/trajectory patterns. And, their background and image tracking is very clean (not a lot of noise) and too easy, as well. They say their data is easy to prove an optimal upper bound on classification accuracy...which turns out to be 100%. So, um, no duh? I'm going to make something impossible to classify and prove the lower bound is 0% (or at most 1/n, a random guess), sound good?

That said, I do like the way they use PCA to simplify the data and particle filtering to both track the hand and segment epochs. It's just their data sets that leave me feeling unimpressed.

Kratz - Wiizards

Kratz, Louis, Matthew Smith, and Frank J. Lee. "Wiizards: 3D Gesture Recognition for Game Play Input." FuturePlay 2007.

Summary



So basically you have a Wii remote that samples {x,y,z} acceleration every so often and generates a sequence of these readings. A hidden Markov model is trained on the sequences and then used to classify a gesture (the model with the max Viterbi probability wins). As you increase the number of HMM states and the number of training examples, accuracy increases. Without user-specific data, you get around 50% accuracy regardless. As you increase the number of states, the system slows down.

Discussion



The application is neat, but all their results are of the "Duh" type. The game is neater than the implementation details, since you can combine spells and stuff for different effects.

How do they do segmentation?

Wednesday, March 19, 2008

Kato - Hand Tracking w/ PCA-ICA Approach

Kato, Makoto, Yen-Wei Chen, and Gang Xu. "Articulated Hand Tracking by PCA-ICA Approach." In Proceedings of the 7th International Conference on Automatic Face and Gesture Recognition (FGR'06).

Summary



Kato et al. seek a way to represent hand data that is easier to handle. The problem they present is that hand-tracking data has too many dimensions to handle feasibly. They take motion data (bending each finger down to touch the palm) and split it into 100 time instants, with each instant containing bend data for 20 different joints in the hand. So each gesture (the whole range of motion) is a 2000-dimensional data vector (100 20-D vectors concatenated together).

They try feature extraction using both PCA and ICA. They say ICA is better because it can extract the independent movement of individual fingers, whereas with PCA the finger movements are not separated out. Then they mention hand tracking using particle filtering, where the next position (?) of the hand is estimated from its current position.
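
A sketch of the comparison they describe, assuming scikit-learn: run PCA and FastICA on the 2000-dimensional concatenated joint-angle vectors and inspect the components each one recovers. The data here is a random placeholder; only the shapes follow the summary.

import numpy as np
from sklearn.decomposition import PCA, FastICA

# motions: (n_sequences, 2000) -- each row is 100 time steps x 20 joint angles
motions = np.random.randn(50, 2000)            # placeholder for real capture data

pca = PCA(n_components=5).fit(motions)
ica = FastICA(n_components=5, max_iter=1000).fit(motions)

# Reshape each basis "motion" back to (time, joint) to see whether a component
# moves one finger on its own (ICA's claimed advantage) or mixes several
# fingers together (what PCA tends to produce here).
pca_motions = pca.components_.reshape(5, 100, 20)
ica_motions = ica.components_.reshape(5, 100, 20)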

Discussion



This paper has no clear purpose. I don't understand what the authors are trying to tell me. Because of that, I don't have much to offer that's not a rant.

PCA is not supposed to give you "feasible" hand positions. It tells you the directions of the highest variance.

Monday, March 17, 2008

Jenkins - ST-ISOMAP

Jenkins, O.C. and Mataric, M.J. "A spatio-temporal extension to Isomap nonlinear dimension reduction." ICML 2004.

Summary



Jenkins and Mataric present an extension to ISOMAP to take into consideration temporal data when constructing manifolds. ISOMAP is used to find embeddings in a high-dimensional space (manifolds) using geodesic distances and multi-dimensional scaling. ST-ISOMAP is an extension that uses temporal information. Items that are close to each other temporally have their spatial distances reduced.

The idea is that in some domains, like movement of an arm, things that are close together spatially might be quite different. For example, an arm moving one way might be very different from an arm moving the other way. The temporal differences between these gestures would be high because you arrive at the same spatial location via different temporal paths (sequences of arm locations). Likewise, seemingly different spatial locations might be very similar, and only 'close' to each other in terms of temporal data (arm movements in the same direction but at different heights off the ground). ST-ISOMAP tries to capture these things.
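
The core trick can be sketched as a tweak to the pairwise distance matrix before the usual ISOMAP steps (geodesic shortest paths plus MDS): shrink the distances between points that are temporal neighbors. The shrink factor and the 'adjacent frames only' definition of temporal neighbors below are simplifications of mine; the real ST-ISOMAP also links corresponding points across sequences.

import numpy as np
from scipy.spatial.distance import pdist, squareform

def st_distance_matrix(X, c_temporal=10.0):
    """X: (T, D) one multivariate time series (e.g., arm pose per frame).
    Returns a distance matrix in which temporally adjacent points are pulled
    closer together; ISOMAP's geodesic/MDS machinery would then run on it."""
    D = squareform(pdist(X))
    for t in range(len(X) - 1):
        D[t, t + 1] /= c_temporal
        D[t + 1, t] /= c_temporal
    return D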


Discussion



ISOMAP is a proven algorithm, and this extension finds the manifold while respecting temporal data. I think this could be useful for clustering haptic gesture information. The high-dimensional space of finger and hand locations could be reduced with ISOMAP into a simpler space where gestures could be segmented or classified more easily.

Maybe. Seems like a neat approach, anyhow. And ISOMAP is used for a /ton/ of stuff in machine learning, so it's not like this is a cheesy hack that no one really uses.

BibTeX



@inproceedings{jenkins2004ste,
title={{A spatio-temporal extension to Isomap nonlinear dimension reduction}},
author={Jenkins, O.C. and Matari{\'c}, M.J.},
booktitle={International Conference on Machine Learning},
year={2004},
publisher={ACM Press New York, NY, USA}
}

Wednesday, February 27, 2008

Sagawa - Recognizing Sequence Japanese Sign Lang. Words

Sagawa, H. and Takeuchi, M. 2000. "A Method for Recognizing a Sequence of Sign Language Words Represented in a Japanese Sign Language Sentence." In Proceedings of the Fourth IEEE international Conference on Automatic Face and Gesture Recognition 2000 (March 26 - 30, 2000). FG. IEEE Computer Society, Washington, DC, 434.

Summary



The authors present a method for segmenting gestures in Japanese sign language. Using a set of 200 JSL sentences (100 for training and 100 for testing), they train a set of parameter thresholds. The thresholds are used to determine borders of signed words, if the word is one- or two-handed, and distinguish transitions from actual words.

They segment gestures using "hand velocity," which is the average change in position of all the hand parts from one point to the next. Minimal hand velocity (when all the parts are staying relatively still) is flagged as a possible segmentation point (i.e., Sezgin's speed points). Another source of candidate segmentation points is a cosine metric, which compares a hand's elements at the current point to a window of +/- n points; if the change in angle is above a threshold, the point is flagged as a candidate (i.e., Sezgin's curvature points). Erroneous velocity candidates are thrown out if the velocity change from (t-n to t) or (t to t+n) is not great enough.
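
My reading of that segmentation pass, sketched with numpy; the window size and thresholds are placeholders, and "hand velocity" is taken literally as the mean frame-to-frame displacement of all tracked hand parts, as described above.

import numpy as np

def segmentation_candidates(parts, vel_thresh=0.01, cos_thresh=0.7, n=3):
    """parts: (T, P, 3) positions of P hand parts over T frames.
    Flags frames that are candidate word borders: low average hand velocity,
    or a large direction change over a +/- n frame window."""
    disp = np.diff(parts, axis=0)                          # (T-1, P, 3)
    vel = np.linalg.norm(disp, axis=2).mean(axis=1)        # average "hand velocity"
    mean_pos = parts.mean(axis=1)                          # (T, 3)
    candidates = []
    for t in range(n, len(vel) - n):
        if vel[t] < vel_thresh:
            candidates.append(t)                           # speed-based candidate
            continue
        before = mean_pos[t] - mean_pos[t - n]
        after = mean_pos[t + n] - mean_pos[t]
        denom = np.linalg.norm(before) * np.linalg.norm(after)
        if denom > 0 and np.dot(before, after) / denom < cos_thresh:
            candidates.append(t)                           # direction-change candidate
    return candidates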

Determination of which hands are used (both vs. one hand, right vs. left hand) is done by comparing the hand velocities of the two hands, both on "which max is greater" (Eq 3) and "avg squared difference in velocity >? 1" (Eq 4). Thresholds are trained to recognize these values.

Using their approach, they segment words correctly 80% of the time and misclassify transitions as words 11% of the time. They say they are able to improve classification accuracy of words (78% to 87%) and sentences (56% to 58%).

Discussion



So basically they're using Sezgin's methods. I don't like all the thresholds. They should have done something more statistically valid and robust, since this requires extensive training and is very training-set dependent. Furthermore, different signs and gestures seem like they will have different thresholds, so training on the whole set will make them always get segmented wrong. I guess this is why their accuracy is less than stellar.

Basically, they just look at which hand is moving more, or if both hands are moving about the same, to tell one/two-handed and right/left-handed. Meh. Not that impressed.

BibTeX



@inproceedings{796189,
author = {Hirohiko Sagawa and Masaru Takeuchi},
title = {A Method for Recognizing a Sequence of Sign Language Words Represented in a Japanese Sign Language Sentence},
booktitle = {FG '00: Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition 2000},
year = {2000},
isbn = {0-7695-0580-5},
pages = {434},
publisher = {IEEE Computer Society},
address = {Washington, DC, USA},
}

Storring - Computer Vision-Based Gesture Rec for Augmented Reality

M. Störring, T.B. Moeslund, Y. Liu, and E. Granum. "Computer Vision-based Gesture Recognition for an Augmented Reality Interface." In Proceedings of 4th IASTED International Conference on Visualization, Imaging, and Image Processing. Marbella, Spain, Sep 2004: 766-71.

Summary



The authors present a vision-based system for simple posture recognition--a hand with 1-5 digits extended. They do skin color matching and segmentation to get a blob of the hand from normalized RGB values. Skin chromaticity is modeled as a 2D Gaussian (in the normalized r and g dimensions) and the pixels are clustered. They choose the cluster corresponding to skin color that contains an appropriate number of pixels (min and max thresholds). Filtering is used to make the hand blob contiguous. They locate the center of the hand blob and, using concentric circles with expanding radii, count the number of extended digits. This is their classification.
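
A toy version of the circle-counting step: sample a circle of a given radius around the hand centre in the binary skin mask and count the runs of skin pixels it crosses. The radius, sampling density, and centre estimate are all placeholders.

import numpy as np

def count_crossings(mask, center, radius, n_samples=360):
    """mask: 2D boolean hand/skin mask; center: (row, col) of the hand blob.
    Counts contiguous runs of skin along a circle of the given radius; for a
    well-chosen radius this roughly equals extended digits plus the palm/wrist."""
    angles = np.linspace(0.0, 2.0 * np.pi, n_samples, endpoint=False)
    rows = np.clip((center[0] + radius * np.sin(angles)).astype(int), 0, mask.shape[0] - 1)
    cols = np.clip((center[1] + radius * np.cos(angles)).astype(int), 0, mask.shape[1] - 1)
    ring = mask[rows, cols].astype(int)
    return int(np.sum((ring - np.roll(ring, 1)) == 1))     # 0 -> 1 transitions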

Discussion



If you have a tiny or huge hand, or the camera is zoomed in/out, your hand pixels may be too numerous/too sparse and not fall in their limits, so not get picked up correctly in the skin detection/hand tracking part of the algorithm.

Fingers must be spread far enough apart for the concentric circle test to say there are two and not just one.

I'd like details on how they find the center of the hand for their circles. I'd also like details on how they identify different fingers. For example, for their "click" gesture, do they just assume that a digit at 90 degrees to the hand that's seen/unseen/seen is a thumb moving? How do they sample the frames to get those three states?

First sentence: "less obtrusive." Figure 1: Crazy 50 pound head sucker thing. Give me my keyboard and mouse back.

BibTeX



@proceedings{storring2004visionGestureRecAR
,author="M. Störring and T.B. Moeslund and Y. Liu and E. Granum"
,title="Computer Vision-based Gesture Recognition for an Augmented Reality Interface"
,booktitle="4th IASTED International Conference on Visualization"
,address="Marbella, Spain"
,month="Sep"
,year="2004"
,pages="776--771"
}

Monday, February 25, 2008

Westeyn - GT2k

Westeyn, Tracy, Helene Brashear, Amin Atrash, and Thad Starner. "Georgia Tech Gesture Toolkit: Supporting Experiments in Gesture Recognition." In Proceedings of the International Conference on Perceptive and Multimodal User Interfaces 2003 (ICMI), November 2003.

Summary



Westeyn et al. present a toolkit to simplify the recognition of hand gestures using hidden Markov models. Their system, dubbed GT2k, runs on top of the HMM toolkit (HTK) used for speech recognition. It abstracts away the complexity of HMMs and their application to speech recognition (it's been shown that speech models do well recognizing gestures, too). You provide feature vectors, a grammar defining classification targets and how they are related, and training examples. The system trains the models, which can then be used for classification.

They give several example applications that use GT2k. The first recognizes postures performed between a camera and an array of IR LEDs, and achieves 99.2% accuracy for 8 classes. They also give examples of blink-prints, mobile sign language recognition (90%), and workshop activity recognition.

Discussion



So first off, it's neat that there is a little toolkit we can use to do hand gesture recognition. Being built on top of an HMM kit for speech recognition isn't too scary, since HMMs pretty much pwn speech rec. It also makes HMMs more available to the masses.

That being said, I don't feel like the authors really applied their toolkit to any example that is truly worthy of the power of an HMM. The driving thing is a simple neural network and is crazy easy with even template matching. The blink print thing, besides being dumb, is just short/long sequence identification and template matching / nearest neighbor. Telesign... their grammar looks like you'd have to specify all possible orderings of words (UGH!). I think GT2K has promise in this area, however. Workshop activity recognition... besides the fact that the sensor data is able to classify activities, which is neat, this application is absurd.

However, again I'd like to clarify that the GT2K is a great idea and I'd like to use it more, hopefully with more worthy applications.

BibTeX



@inproceedings{958452,
author = {Tracy Westeyn and Helene Brashear and Amin Atrash and Thad Starner},
title = {Georgia tech gesture toolkit: supporting experiments in gesture recognition},
booktitle = {ICMI '03: Proceedings of the 5th international conference on Multimodal interfaces},
year = {2003},
isbn = {1-58113-621-8},
pages = {85--92},
location = {Vancouver, British Columbia, Canada},
doi = {http://doi.acm.org/10.1145/958432.958452},
publisher = {ACM},
address = {New York, NY, USA},
}

Friday, February 22, 2008

Lichtenauer - 3D Visual recog. of NGT sign production

J.F. Lichtenauer, G.A. ten Holt, E.A. Hendriks, M.J.T. Reinders. "3D Visual Detection of Correct NGT Sign Production." Thirteenth Annual Conference of the Advanced School for Computing and Imaging, Heijen, The Netherlands, June 13-15 2007.

Summary



Lichtenauer et al. discuss a system for recognizing Dutch sign language (NGT) gestures using two cameras and an image processing approach. The system is initialized semi-automatically to take the skin color of the face. The hands are then located in the image by finding the points that have the same color as the face. Gesture segmentation is manually enforced, with the hands at rest on the table between gestures. The gestures are turned into feature vectors of movement/angle through space (blob tracking) and time, and compared to a reference gesture per class using dynamic time warping. The features are classified independently of one another, and the results per class per feature are summed, giving one average probability per class (across the features). If the probability is above a certain threshold, the gesture is labeled as that class. They report 95% true positive rate and 5% false positive.
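
For reference, the per-feature dynamic time warping distance they compute is the usual DP recurrence; a plain numpy sketch for two 1-D feature trajectories (no warping-window or slope constraints, which the real system may well use):

import numpy as np

def dtw_distance(a, b):
    """Classic dynamic time warping cost between 1-D sequences a and b."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]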

Discussion



This method seems pretty hardcore on the computation, since they're running a classifier for each of the ~5000 features. I don't know if that's how all DTW stuff works, but I think you could do something to dramatically reduce the amount of work.

If you wear a short sleeve shirt, will the algorithm break if it starts trying to track your forearms or elbows? It's just using skin color, so I think it might.

They use the naive Bayes assumption to make all their features independent of each other. I think this is pretty safe to do, especially as it simplifies computation. They do mention that even though some features might contain correlation, they've added features to capture this correlation independently, and extract it out of the space "between" features (that's a hokey way to put it, sorry).

They don't report accuracy, but true positives. This is pretty much bogus, as far as I'm concerned, as it doesn't tell you much about how accurate their system is at recognizing gestures correctly.

BibTex



@proceedings{lightenauer2007ngtSign
,author="J.F. Lichtenauer and G.A. ten Holt and E.A. Hendriks and M.J.T. Reinders"
,title="{3D} Visual Detection of Correct {NGT} Sign Production"
,booktitle="Thirteenth Annual Conference of the Advanced School for Computing and Imaging"
,address="Heijen, The Netherlands"
,year="2007"
,month="June"
}

LaViola - Survey of Haptics

LaViola, J. J. 1999 A Survey of Hand Posture and Gesture Recognition Techniques and Technology. Technical Report. UMI Order Number: CS-99-11., Brown University.


Summary



We read only chapters 3 and 4. LaViola gives a nice summary of the many methods for haptic recognition and the many domains where it can be used.

Template matching (like a $1 for haptics) is easy and has been implemented with good accuracy for small sets of gestures. Feature-based classifiers, like Rubine, have been used for very high accuracy rates, as well as segmentation of gestures. PCA can be used to form "eigenpostures" and to simplify data, possibly, for recognition. Obviously, as we've seen many times in class, neural networks and hidden Markov models can both be used to achieve high accuracy for complex data sets, but both require extensive training and some a priori knowledge of the data set (number of hidden layers/units and number of hidden states for nets and HMMs, respectively). Instance based learning, such as k-nearest neighbors, has also been briefly touched upon in the literature, but not much investigation has been performed. Other techniques, like using formal grammars to describe postures/gestures, are also discussed but not much work has been done in these areas.

The application domains for hand gesture recognition are basically all the stuff we've seen in class: sign language, virtual environments, controlling robots/computer systems, and 3D modelling.

Discussion



This was a very nice overview of the field. I'm most interested in exploring:

  • Template matching methods and feature based recognition (Sturman and Wexelblat)

  • PCA for gesture segmentation

  • Using a k-nearest neighbors approach to classification

  • Defining a constraint grammar to express a posture/gesture



All my ideas (except the last) revolve around representing a posture/gesture with a vector of features. Picking good features might be hard, as it is in sketch rec, but I think it can be done (analogous to PaleoSketch).



BibTeX


@techreport{864649,
author = {Joseph J. LaViola, Jr.},
title = {A Survey of Hand Posture and Gesture Recognition Techniques and Technology},
year = {1999},
source = {http://www.ncstrl.org:8900/ncstrl/servlet/search?formname=detail\&id=oai%3Ancstrlh%3Abrowncs%3ABrownCS%2F%2FCS-99-11},
publisher = {Brown University},
address = {Providence, RI, USA},
}

Komura - Real-Time Locomotion w/ Data Gloves

Komura, T. and Lam, W. 2006. Real-time locomotion control by sensing gloves: Research Articles. Comput. Animat. Virtual Worlds 17, 5 (Dec. 2006), 513-525. DOI= http://dx.doi.org/10.1002/cav.v17:5

Summary



Komura and Lam present a method for mapping the movements of the fingers (wearing a data glove) and the hand onto the control of a character in a 3D game: walking, running, hopping, and turning. They achieve this in two steps. The first is to show the user an on-screen example of the character walking and have the user mimic the action. During this stage, the glove data is calibrated to determine the periodicity of the movements, and a mapping from finger movements to movements of the character on screen is made. The movement of each finger is compared to the movement of each end-point of the figure (legs, chest, etc.), and the finger whose feature vector (velocities and directions) has the smallest angle to a given body part's feature vector is mapped to that body part. The mapping of values (movement of a finger to amount of movement in the body) is made by fitting a B-spline regression. The user can then move his hands around to make the character walk/run/jump with the mapped values. They perform a user study in which people run a character through a set of narrow passages, and find that users are just as fast with the keyboard as with the glove, but tend to be more accurate and have fewer collisions with walls when using the glove. They attribute this to the intuitiveness of using one's hands to control a figure.
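
A sketch of the mapping step as described above: compare each finger's movement signal during the calibration motion to each end-point's movement signal and assign the finger with the smallest angle (largest cosine similarity). The data layout is mine, and this toy version doesn't stop two limbs from grabbing the same finger.

import numpy as np

def map_fingers_to_limbs(finger_traj, limb_traj):
    """finger_traj: dict finger -> (T,) movement signal during calibration;
    limb_traj: dict limb -> (T,) movement signal of that end-point.
    Each limb gets the finger whose signal makes the smallest angle with it."""
    def cosine(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)
    return {limb: max(finger_traj, key=lambda f: cosine(finger_traj[f], signal))
            for limb, signal in limb_traj.items()}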


Discussion



The one thing I did like about this paper was the way they mapped fingers to a set of pre-defined motions and then mapped the movements of the fingers to those of the characters. It seemed neat, but I don't think it has any research merit.

Why do you even have to learn anything? Everything they do is so rigid and pre-defined anyway, like the mapping of 2/4 legs of a dog to each finger with a set way for computing the period delay between front/rear legs. Why not just force a certain leg on the character to be a certain finger and avoid mapping altogether? Maybe you could still fit the B-spline to get a better idea of sensitivity, but the whole cosine thing is completely unnecessary.

They only use one test and it's very limited, so I don't think they can make the claim that "it is possible to conclude that the sensing glove [is] more effective when controlling the character more precisely." I also want standard deviations and significance levels for Tables 1 and 2, though for such a small sample size these might not be meaningful.

BibTeX



@article{1182569,
author = {Taku Komura and Wai-Chun Lam},
title = {Real-time locomotion control by sensing gloves: Research Articles},
journal = {Comput. Animat. Virtual Worlds},
volume = {17},
number = {5},
year = {2006},
issn = {1546-4261},
pages = {513--525},
doi = {http://dx.doi.org/10.1002/cav.v17:5},
publisher = {John Wiley and Sons Ltd.},
address = {Chichester, UK, UK},
}

Wednesday, February 20, 2008

Freeman - Television Control

Freeman, William T. and Craig D. Weissman. "Television Control by Hand Gestures." In Proceedings of the IEEE Intl. Wkshp. on Automatic Face and Gesture Recognition, Zurich, June, 1995.

Summary



Freeman and Weissman present a system for controlling a television (channel and volume) using "hand gestures." The user holds up a hand in front of a television/computer combo. The computer recognizes an open hand using image processing techniques. When an open hand is seen, a menu appears with controls (buttons/slider bar) for the channel and volume. The user moves the hand around and hovers it over the controls to activate them. To stop, the user closes the hand or otherwise removes it from the camera's FOV. The open palm is recognized with a cosine similarity metric (normalized correlation) between a pre-defined image of a palm and every possible offset within the image.
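
That detector boils down to normalized cross-correlation of a palm template at every offset; a slow-but-direct numpy version (a real implementation would use FFTs or an optimized vision library):

import numpy as np

def normalized_correlation_map(image, template):
    """image, template: 2D grayscale arrays. Returns the normalized correlation
    at every valid offset; the maximum (if above a threshold) marks the palm."""
    th, tw = template.shape
    t = template - template.mean()
    t_norm = np.linalg.norm(t)
    out = np.zeros((image.shape[0] - th + 1, image.shape[1] - tw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            patch = image[r:r + th, c:c + tw]
            p = patch - patch.mean()
            denom = np.linalg.norm(p) * t_norm
            out[r, c] = (p * t).sum() / denom if denom > 0 else 0.0
    return out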

Discussion



Not in the mood to write decent prose, so here's a list.

  • Is natural language really that much better? First, it contains a lot of ambiguity that mouse/keyboard don't have. Second, you'd have just as many problems defining a vocabulary of commands using language as you would gestures, especially since there are so many words/synonyms/etc.

  • Their example of a 'complicated gesture' is a goat shadow puppet. Seriously? I think this is a little exaggerated and a lot ridiculous.

  • These aren't really gestures. It's just image tracking that boils down to nothing more than a mouse. What have you saved? Just buy 10 more remotes and glue them to things so you have one in every sitting spot and they can't be lost.

  • I don't know the image rec. research area, so I can't comment too much on their algorithm. But this seems like it would be super slow (taking all possible offsets) and have issues with scaling (what if the hand template is the wrong size, esp too small for the actual hand in the camera image).



BibTeX



@proceedings{freeman1995televisionGestures
,author="William T. Freeman and Craig D. Weissman"
,title="Television Control by Hand Gestures"
,booktitle="IEEE Intl. Wkshp. on Automatic Face and Gesture Recognition"
,address="Zurich"
,year="1995"
,month="June"
}

Wednesday, February 13, 2008

Marsh - Shape your imagination

Marsh, T.; Watt, A., "Shape your imagination: iconic gestural-based interaction," Virtual Reality Annual International Symposium, 1998. Proceedings., IEEE 1998 , vol., no., pp.122-125, 18-18 1998

Summary



Marsh and Watt present findings on a user study where they examine how gestures are used to describe objects in a non-verbal fashion. They describe how iconic gestures (those that immediately and clearly stand for something) fall into two camps: substitutive (hands match the shape or form of object) and virtual (outline or trace a picture of the shape/object).

For their study, they used 12 subjects of varying backgrounds. They had 15 shapes from two categories: primitive (circle, triangle, sphere, cube, etc.) and complex (table, car, French baguette, etc.). The shapes were written on index cards and presented in the same order to each user. The users were then told to describe the shapes using non-verbal communication. Of all the gestures, 75% were virtual. For the 2D shapes, 72% of people used one hand, while the 3D objects were all described with 2 hands. For the complex shapes, iconic gestures were either replaced or accompanied by pantomimic (how the object is used) or deictic (pointing at something) gestures. Some complex shapes were too difficult for users to express (4 failed on chair, 1 each on football, table, and baguette). They also discovered 2D is easier than 3D.

Discussion



I really liked this paper. While it was a little short, I think it was neat that they were able to break down the gestures that people made. It reminded me a lot of Alvarado et al.'s paper with the user study about how people draw. I think it's especially useful to see that if we want to do anything useful with haptics, we have to let users use /both/ hands.

Some things:

  • How did they pick their shapes, especially the complex ones? I mean, come on, French baguette? Although, this is a really good example because it's friggin hard to mime.

  • They note that most of the complex objects are too difficult to express with iconic gestures alone. That's why sign languages aren't that simple to learn. Not everything can be expressed easily with just iconic gestures. This paper was good that it pointed this out and made it clear, even though it seems obvious. It also seems to drive the need for multi-modal input for complex recognition domains.

  • They remark that 3D is harder than 2D. Besides the fact that this claim is obvious and almost a bit silly to make, it does seem that there are 2D shapes that would be very difficult to express. For example: Idaho. I wonder if their comparison between 2D and 3D here is a fair one. Obviously adding another dimension to things is going to make it exponentially more difficult, but they're comparing things like circle to things like French baguette.

  • Finally, who decides if a gesture is iconic or not? Isn't this shaped by experience and perception?



BibTeX



@ARTICLE{658465,
title={Shape your imagination: iconic gestural-based interaction},
author={Marsh, T. and Watt, A.},
journal={Virtual Reality Annual International Symposium, 1998. Proceedings., IEEE 1998},
year={18-18 1998},
volume={},
number={},
pages={122-125},
keywords={computer graphics, graphical user interfaces, 3D computer generated graphical environments, 3D spatial information, human computer interaction, iconic gestural-based interaction, iconic hand gestures, object manipulation, shape manipulation, spatial information},
doi={10.1109/VRAIS.1998.658465},
ISSN={1}, }

Kim - Gesture Rec. for Korean Sign Language

Jong-Sung Kim; Won Jang; Zeungnam Bien, "A dynamic gesture recognition system for the Korean sign language (KSL)," Systems, Man, and Cybernetics, Part B, IEEE Transactions on , vol.26, no.2, pp.354-359, Apr 1996

Summary



Kim et al. present a system for recognizing a subset of gestures in KSL. They say KSL can be expressed with 31 distinct gestures, choosing 25 of them to use in this initial study. They use a Data-Glove, which gives 2 bend values for each finger, and a Polhemus tracker (normal 6 DOF) to get information about each hand.

They recognize signs using the following recipe (a rough sketch of the direction-quantization step follows the list):

  1. Bin the movements along each axis (bins of width = 4 inches) to filter/smooth

  2. Vector quantize the movements of each hand into one of 10 "directional" classes that describe how the hands are moving.

  3. Feed the glove sensor information into a fuzzy min-max neural network and classify which of 14 postures it is, using a rejection threshold just in case it's not any of the postures

  4. Use the direction and posture results to say what the sign is intended to be
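
To make steps 1 and 2 a little more concrete, here's a minimal Python sketch of the binning and direction quantization. The 4-inch bin width comes from the summary above, but the particular set of ten directional classes is my guess, not the paper's.

```python
import numpy as np

# Hypothetical 10 directional classes: 8 compass directions in the
# vertical plane plus "toward body" / "away from body".
DIRECTIONS = {
    0: ( 1,  0,  0), 1: (-1,  0,  0),   # right, left
    2: ( 0,  1,  0), 3: ( 0, -1,  0),   # up, down
    4: ( 1,  1,  0), 5: (-1,  1,  0),   # up-right, up-left
    6: ( 1, -1,  0), 7: (-1, -1,  0),   # down-right, down-left
    8: ( 0,  0,  1), 9: ( 0,  0, -1),   # away from body, toward body
}

def quantize_direction(prev_pos, cur_pos, bin_width=4.0):
    """Bin positions (step 1) and map the movement between bins to one
    of the 10 directional classes (step 2) by best cosine similarity."""
    prev_b = np.floor(np.asarray(prev_pos, float) / bin_width)
    cur_b = np.floor(np.asarray(cur_pos, float) / bin_width)
    move = cur_b - prev_b
    if not move.any():                      # hand stayed within the same bin
        return None
    move = move / np.linalg.norm(move)
    return max(DIRECTIONS,
               key=lambda k: np.dot(move, np.asarray(DIRECTIONS[k], float)
                                    / np.linalg.norm(DIRECTIONS[k])))

# Example: the hand moved mostly to the right -> class 0
print(quantize_direction((0, 0, 0), (9, 1, 0)))
```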



Discussion



They make the remark that many of the classes are not linearly separable. This is a problem in many domains. Support vector machines can sometimes do a very good job at separating data. I wonder why no one has used them so far. Probably because they're fairly complex.

I also like the idea of thinking of gestures as a signal. I don't know why, but this analogy has escaped me so far. There is a technique for detecting "interesting anomalies" in signals using PCA. I wonder if this would work for the segmentation problem?

How do they determine the initial position for the glove coordinates? If they get it wrong, all their measurements will be off and the vector quantization of their movements will probably fail. They should probably just skip this whole initial starting point thing and use change from the last position. Maybe that's what they really mean, but it's unclear.

Also, it seems like their method for filtering/smoothing the position/movement data by binning the values is a fairly hackish technique. There are robust methods for filtering noisy data that should have been used instead.

And finally for their results. They say 85%, which doesn't seem /too/ bad for a first try. But then they try to rationalize that 85% is good enough, saying that "the deaf-mute who [sic] use gestures often misunderstand each other." Well that's a little condescending, now, isn't it? And they also blame everything else besides their own algorithm, and "find that abnormal motions in the gestures and postures, and errors of sensors are partly responsible for the observed mis-classification." So you want things to work perfectly for you? You want life to play fair? News flash: if things were perfect, you would be out of the job and there would be no meaning or reason for the paper you just wrote. Things are hard. Deal with it. Life's not perfect, my sensors won't give me perfect data, and I can't draw a perfectly straight line by hand. That's not an excuse to not excel in your algorithm and account for those imperfections.

Also, how did they combine the movement quantized data (the ten movement classes) with the posture classifications? Postures were neural nets, not the combination, right?

BibTeX



@ARTICLE{485888,
title={A dynamic gesture recognition system for the Korean sign language (KSL)},
author={Jong-Sung Kim and Won Jang and Zeungnam Bien},
journal={Systems, Man, and Cybernetics, Part B, IEEE Transactions on},
year={Apr 1996},
volume={26},
number={2},
pages={354-359},
keywords={data gloves, fuzzy neural nets, pattern recognition, Korean sign language, data-gloves, dynamic gesture recognition system, fuzzy min-max neural network, online pattern recognition},
doi={10.1109/3477.485888},
ISSN={1083-4419}, }

Monday, February 11, 2008

Gesture descriptions for Trumpet Fingerings

I was going to do ASL fingerspelling as well, but others did it so I don't feel the need to repeat what they already said about it. Mine was basically exactly the same as theirs. So, instead, here's just the trumpet stuff.

trumpetFingeringGestures.pdf

Natural Language Descriptions for 5 easy ASL signs

Natural language descriptions of 5 common ASL signs
Joshua Johnston
Haptics
11 February, 2008

HELLO

Put the right hand into the shape of a “B”--all four fingers extended and placed together vertically from the palm, with the thumb bent and crossing the palm. Raise the b-hand and put the tip of the forefinger to your right temple. Move the b-hand away from the head to the right with a quick gesture.

(as if waving)

NAME


Put both hands into the sign for the letter “U”--the fore- and middle finger of each hand straight, extended from the palm, and touching, with the remaining fingers and thumb curled together in front of the palm (like a “2” or “scissors” with the fingers together). Bring the u-hands in front of the body, fingers pointing parallel to the ground, in the shape of an “X.” Tap the right hand's extended fingers on top of the left hand's extended fingers twice.

(as if signing your name on the “X”)

NICE

Put all the fingers (and thumb) together and extended on both hands to form a flat surface. Put the left hand palm up in front of the body and hold it still. Take the right hand and put its palm to the palm of your left hand, then with a smooth motion slide the right hand down the fingers of the left hand and off.

(this sign also means “clean” and “pure”, as if wiping the dirt off one hand with the other)

MEET

Put both hands into the sign for the letter “D”--the forefinger extended vertically with the rest of the fingers and thumb curled in front of the palm. Bring the d-hands together in front of the body, touching the curled fingers together.

(d-hands represent people coming together)

SANDWICH

Put both hands together, palms flat and fingers/thumb extended and together. Bring the tips of the fingers up and touch them to your mouth.

(hands are the bread, and you're eating it)

Cassandra - POMDPs

Anthony Cassandra. "A Survey of POMDP Applications." Presented at the AAAI Fall Symposium, 1998. http://pomdp.com/pomdp/papers/index.shtml, 11 Feb, 2008.

Summary



Not much to say about gesture recognition, which is not surprising since POMDPs are used for artificial intelligence in the area of planning. Think of a robot that has a goal and only a limited visual range (can't see behind obstructions, etc.). A POMDP might be used in this situation to evaluate different actions to take based on the current state of things.

The paper does mention machine vision and gesture recognition. The context here is that the computer uses a POMDP to focus a camera and a fovea (high resolution area for fine-grained vision) on facial expressions, hand movements, etc. The fovea is important because it is limited, and the areas outside it either have a much lower resolution (to reduce computational burden) or cannot be seen at all (outside the FOV).

Discussion



I really don't think POMDPs can be used for our purposes in gesture recognition.

However, this is a nice paper if you want examples of how POMDPs can be used in multiple domains.

That is all.

BibTeX



@UNPUBLISHED{cassandra1998pomdps
,author={Anthony Cassandra}
,title={A Survey of POMDP Applications}
,year={1998}
,note={Presented at the AAAI Fall Symposium}
}

Song - Forward Spotting Accumulative HMMs

Daehwan Kim; Daijin Kim, "An Intelligent Smart Home Control Using Body Gestures," Hybrid Information Technology, 2006. ICHIT'06. Vol 2. International Conference on , vol.2, no., pp.439-446, Nov. 2006

Summary



Kim and Kim present an algorithm for segmenting a stream of gestures and recognizing the segmented gestures. They take a sliding window of postures (observations) from the stream and feed them into an HMM system that has one model per gesture class, and one "Non-Gesture" HMM. They say that a gesture has started if the max probability from one of the gesture HMMs is greater than the probability of the Non-Gesture HMM. They call this the competitive differential observation probability, which is the difference between the max gesture prob and the non-gesture prob (positive means gesture, negative means non-gesture, and crossing 0 means starting/ending a gesture).

Once a gesture is observed to have started, they begin classifying the gesture segments (segmenting the segmented gesture). They feed the segments into the HMMs and get classifications for each segment. Once the gesture is found to have terminated (the CDOP drops below 0, i.e., the gesture stream becomes a non-gesture), they look at the classification results for all the segments and take a majority vote to determine the class for the whole gesture.

So we have a sliding window. Within that window, we decide a gesture starts and later see that it ends. Between the start and end points, we segment the gesture stream further. Say there are 3 segments. Then we'd classify {1}, {12}, and {123}. Pretend {1} and {123} were "OPEN CURTAINS" and {12} was "CLOSE CURTAINS." The majority vote, after the end of the gesture, would rule the gesture as "OPEN CURTAINS."
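
Here's a rough sketch of the CDOP thresholding plus majority vote as I understand it. The `loglik` method on the models is an assumed interface (any trained HMM scoring function would do); using log-likelihoods instead of raw probabilities doesn't change the sign of the difference.

```python
from collections import Counter

def cdop(window, gesture_models, non_gesture_model):
    """Competitive differential observation probability: the best
    gesture model's log-likelihood minus the non-gesture model's.
    Each model is assumed to expose loglik(window) = log P(window | model)."""
    best_ll, best_label = max(
        (model.loglik(window), label) for label, model in gesture_models.items())
    return best_ll - non_gesture_model.loglik(window), best_label

def spot_and_classify(windows, gesture_models, non_gesture_model):
    """Walk the sliding windows; while the CDOP stays above 0 we are inside
    a gesture and collect per-window votes. When it crosses back below 0
    the gesture has ended and the majority label wins."""
    results, votes = [], []
    for window in windows:
        diff, label = cdop(window, gesture_models, non_gesture_model)
        if diff > 0:
            votes.append(label)          # gesture started or continues
        elif votes:
            results.append(Counter(votes).most_common(1)[0][0])
            votes = []                   # gesture ended at the zero crossing
    if votes:                            # stream ended mid-gesture
        results.append(Counter(votes).most_common(1)[0][0])
    return results
```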

They give some results, which seem to show their automatic method performs better than a manual method, but it's not clear what the manual method is. They seem to get about 95% accuracy classifying 8 gestures made with the arms to open/close curtains and turn on/off the lights.

Discussion



So basically they just use a probabilistic significance threshold to say if a gesture has started or not, as determined by the classification of an observation as a non-gesture (like Iba's use of a wait state when recognizing robot gestures). So don't call it the CDOP. Call it a "junk" class or "non-gesture" class. They made it much harder to understand than it is.

When they give their results in Figure 5 and show the curves for manual segmentation, what the heck does \theta mean? This wasn't explained and makes their figure all but useless.

So this seems like a decent method for segmenting gestures...10 years ago. Iba had almost the exact same thing in his robot gesture recognition system, and I'm sure he wasn't the first. Decent results, I think (can't really interpret their graph), but nothing really noteworthy.

The only thing they do differently is take a majority vote over the sub-segments of each segmented gesture. Yeah, confusing. I'm not sure how much this improves recognition, as they did not compare with/without it. It seems like it would only take up more computation time for gains that aren't all that significant.

BibTeX


@ARTICLE{song2006forwardSpottingAccumulativeHMM,
title={An Intelligent Smart Home Control Using Body Gestures},
author={Daehwan Kim and Daijin Kim},
journal={Hybrid Information Technology, 2006. ICHIT'06. Vol 2. International Conference on},
year={Nov. 2006},
volume={2},
number={},
pages={439-446},
doi={10.1109/ICHIT.2006.253644},
ISSN={}, }

Friday, February 8, 2008

Ip - Cyber Composer

Ip, H.H.S.; Law, K.C.K.; Kwong, B., "Cyber Composer: Hand Gesture-Driven Intelligent Music Composition and Generation," Multimedia Modelling Conference, 2005. MMM 2005. Proceedings of the 11th International , vol., no., pp. 46-52, 12-14 Jan. 2005

Summary



Ip et al. describe their Cyber Composer system. The system uses rules of music theory and gesture recognition to allow users to create dynamic music with the use of hand gestures. The system allows for the control of tempo/rhythm, pitch, dynamics/volume, and even the use of a second instrument and harmony. With the help of various theory rules, including chord progression and harmonics, they assert their system can produce "arousing" musical pieces.

Discussion



Not much is given in the way of technical details (well, nothing, actually), but this is a good proof of concept. To me, the gestures seem intuitive, even if they are a little convoluted since the same type of gesture may do many things based on context. This would be a good class project, I think, with a little more gesture recognition and control over the final product. Maybe more like a real composer, where different instrument groups are located in space, and you can point at them and direct them to modify group dynamics. Who knows.

BibTeX



@ARTICLE{ip2005cyberComposer,
title={Cyber Composer: Hand Gesture-Driven Intelligent Music Composition and Generation},
author={ Ip, H.H.S. and Law, K.C.K. and Kwong, B.},
journal={Multimedia Modelling Conference, 2005. MMM 2005. Proceedings of the 11th International},
year={12-14 Jan. 2005},
volume={},
number={},
pages={ 46-52},
doi={10.1109/MMMC.2005.32},
ISSN={1550-5502 }, }

Wednesday, February 6, 2008

Li - Similarity Measure (SVD angular) for Stream Segmentation and Recognition

Li, C. and Prabhakaran, B. 2005. A similarity measure for motion stream segmentation and recognition. In Proceedings of the 6th international Workshop on Multimedia Data Mining: Mining integrated Media and Complex Data (Chicago, Illinois, August 21 - 21, 2005). MDM '05. ACM, New York, NY, 89-94. DOI= http://doi.acm.org/10.1145/1133890.1133901

Summary



Li and Prabhakaran propose a new gesture classification algorithm that is easily generalizable to many input methods. They use the SVD of a motion matrix. A motion matrix has columns that are the features of the data (like the joint measurements from a CyberGlove) and rows that are steps through time. The SVD produces a set of singular vectors and singular values; equivalently, they compute the eigenvectors and eigenvalues of M = A'A, where A is the motion matrix, for computational efficiency. The top k eigenvectors are used (a parameter, with empirical evidence supporting k=6 as enough to perform well). To compare two motion matrices, corresponding eigenvectors are compared with their dot product (the cosine of the angle between the vectors), weighted by their share of the eigenvalues. A value of 0 means that the matrices have nothing in common, as all the eigenvectors are orthogonal. A value of 1 means the matrices have collinear eigenvectors. They call this kWAS, k weighted angular similarity (for k eigenvectors, and the weighted dot product/cosine metric).
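
Here's my reading of kWAS in a few lines of numpy. Treat it as a sketch: the exact weighting scheme in the paper may differ slightly from what I do here.

```python
import numpy as np

def kwas(A, B, k=6):
    """k-weighted angular similarity between two motion matrices
    (rows = time steps, columns = sensor features)."""
    def top_eigs(X):
        # Eigen-decomposition of M = X'X (symmetric, so eigh is appropriate).
        vals, vecs = np.linalg.eigh(X.T @ X)
        order = np.argsort(vals)[::-1][:k]       # largest k eigenvalues
        return vals[order], vecs[:, order]
    va, ua = top_eigs(A)
    vb, ub = top_eigs(B)
    # |cosine| between corresponding eigenvectors, weighted by the
    # (normalized) eigenvalue mass each pair carries.
    cosines = np.abs(np.sum(ua * ub, axis=0))
    weights = (va / va.sum() + vb / vb.sum()) / 2.0
    return float(np.sum(weights * cosines))      # near 1 = similar, near 0 = unrelated

# Example: a motion compared with a noisy copy of itself scores near 1.
rng = np.random.default_rng(0)
A = rng.standard_normal((120, 18))               # e.g. 120 frames x 18 joint sensors
B = A + 0.01 * rng.standard_normal(A.shape)
print(round(kwas(A, B), 3))
```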

Their algorithm works as follows. Start with a library of gesture matrices P and compute the eigenvectors/values for them. Then watch the stream of incoming data, segmenting it into candidate chunks with minimum length l and maximum length L, stepping through the stream with step size \delta. Call the candidate matrices Q and compare each one to all the P. The (Q, P) pairing that has the highest kWAS score is selected as the answer, and classification continues from the end of the winning segment.

They report that their algorithm can recognize CyberGlove gestures (not clear if it's isolated patterns or streams) with 99% accuracy with k=3, and in motion capture data with 100% accuracy with k=4. These figures aren't clear as to what they mean, however.


Discussion



So their method isn't really for segmentation. They still just look at different sliding windows of data and pick the one that works. It does work without requiring holding positions or neutral states, which many other systems impose on users to delineate gestures. However, Iba et al.'s system can do the same thing using hidden Markov models with a built-in wait state.

However, as far as a new classification approach is concerned, this is a nice one because it seems to give decent results and is not yet another HMM.

They never say how they pick delta. I wonder how different values affect accuracy / running time of the algorithm.

Some people might be concerned with the fact that once you do the eigenvectors, you lose temporal information. I can see where this would be a concern for some things. However, most of the time you can get good classification/clustering results without the need for perfect temporal information. It can even be the case that temporal information tends to confuse the issue, making things hard to compute and compare.

BibTeX



@inproceedings{1133901,
author = {Chuanjun Li and B. Prabhakaran},
title = {A similarity measure for motion stream segmentation and recognition},
booktitle = {MDM '05: Proceedings of the 6th international workshop on Multimedia data mining},
year = {2005},
isbn = {1-59593-216-X},
pages = {89--94},
location = {Chicago, Illinois},
doi = {http://doi.acm.org/10.1145/1133890.1133901},
publisher = {ACM},
address = {New York, NY, USA},
}

Hernandez-Rebollar - Accelerometers and Decision Tree for ASL

Hernandez-Rebollar, J.L.; Lindeman, R.W.; Kyriakopoulos, N., "A multi-class pattern recognition system for practical finger spelling translation," Multimodal Interfaces, 2002. Proceedings. Fourth IEEE International Conference on , vol., no., pp. 185-190, 2002

Summary



Rebollar et al. present a new algorithm for classification of ASL finger spelling letters (J and Z, the only letters that move, are statically signed at the ending posture of the gesture). They built their own glove so they don't have to sink a lot of money into the expensive options currently available. Their glove uses 5 accelerometers, one per finger, each measuring along two axes. The y axis is aligned to point at the tip of each finger, and measures flexion and pitch. The x axis gives an idea about roll, yaw, and abduction.

They take the ten measurement values (two axes per finger, 5 fingers) and convert them to a 3D vector. The first dimension is the sum of the x-axis values, the second is the sum of the y-axis values, and the third is the y-axis value of the index finger alone, which they claim is adequate for describing how bent the palm is.
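
If I'm reading that right, the dimensionality reduction is roughly the following (the sensor ordering and the example values are my own, not the paper's):

```python
def to_feature_vector(accel):
    """accel: five (x, y) accelerometer pairs, assumed ordered thumb..pinky.
    Collapses the 10 raw values into the 3D feature the decision tree
    consumes, per my reading of the summary above."""
    xs = [x for x, _ in accel]
    ys = [y for _, y in accel]
    index_y = accel[1][1]            # index finger's y value (flexion/pitch)
    return (sum(xs), sum(ys), index_y)

# Made-up example: a flat hand (sensors near zero) vs. a fist-like posture.
flat = [(0.0, 0.0)] * 5
fist = [(0.1, 0.9), (0.2, 1.0), (0.1, 0.95), (0.15, 0.9), (0.1, 0.85)]
print(to_feature_vector(flat))   # (0.0, 0.0, 0.0)
print(to_feature_vector(fist))   # (0.65, 4.6, 1.0)
```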

The 3D vector is fed into a decision tree. For 21/26 letters, 5 signers doing 10 reps of each letter, they get 100% accuracy. For the I and Y, they get 96%. For U,V, and R, the accuracy is 90%, 78%, and 96%.

Discussion



Again, another paper where they sum all their values to get a global picture. This is a horrible idea because fingers will mask each other. At least sum the squares of the values, so you can see if some are really high compared to others. Or, better yet, just use all 10 dimensions for the decision tree. It's really not that hard.

It was nice to see something besides an HMM, and they do get pretty good results. However, I'm ready for J and Z to move.

I also like their hardware approach. Seems simple and a lot less expensive than dropping 10-30K on a CyberGlove.

BibTeX



@ARTICLE{rebollar2002multiClassFingerSpelling,
title={A multi-class pattern recognition system for practical finger spelling translation},
author={Hernandez-Rebollar, J.L. and Lindeman, R.W. and Kyriakopoulos, N.},
journal={Multimodal Interfaces, 2002. Proceedings. Fourth IEEE International Conference on},
year={2002},
volume={},
number={},
pages={ 185-190},
doi={10.1109/ICMI.2002.1166990},
}

Harling - Hand Tension for Segmentation

Philip A. Harling and Alistair D. N. Edwards. Hand tension as a gesture segmentation cue. In Philip A. Harling and Alistair D. N. Edwards, editors, Progress in Gestural Interaction: Proceedings of Gesture Workshop '96, pages 75--87, Springer, Berlin et al., 1997.

Summary



Harling and Edwards address the problem of segmenting gestures in a stream of data from a power glove. Their assumption is that when we purposefully want our hands to convey information, they will be tense. When the hand is "limp", the user is not trying to convey information.

Tension is measured by imagining rubber bands attached to the tip of the finger, one parallel to the x axis and the other to the y-axis. The rubber bands have certain elastic moduli, and the tension in the system can be solved with physics equations. To get an idea of overall hand tension, the values for each finger are summed.

They evaluate their idea by examining two different phrases in British Sign Language: "My name" and "My name me". They find dips in the 'tension graph' between each gesture, and claim an algorithm could segment at these points of low tension.
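
A toy version of that dip-based segmenter might look like this; the threshold is made up, since the paper just eyeballs the tension graph.

```python
def segment_by_tension(tension, low=0.2):
    """Given a per-frame total hand tension signal, return (start, end)
    index pairs for the high-tension stretches between low-tension dips.
    The 'low' threshold is arbitrary."""
    segments, start = [], None
    for i, t in enumerate(tension):
        if t > low and start is None:          # tension rises: gesture starts
            start = i
        elif t <= low and start is not None:   # tension dips: gesture ends
            segments.append((start, i))
            start = None
    if start is not None:                      # stream ended mid-gesture
        segments.append((start, len(tension)))
    return segments

# Two tense gestures separated by a limp-hand dip.
signal = [0.0, 0.1, 0.8, 0.9, 0.85, 0.1, 0.05, 0.7, 0.9, 0.1]
print(segment_by_tension(signal))   # [(2, 5), (7, 9)]
```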

Discussion



Seems pretty nice. It's good to have an idea of what we can do to solve the segmentation issue. However, I wonder if some gestures are performed with a "limp" hand. Their idea of tension is maximized when the finger is either fully extended or fully closed, so anything where the finger is halfway will not work. Also, perhaps you naturally stand with your hand clenched in your relaxed position, so non-gestures would be tense.

I don't like that they sum the tension in each finger to get a total hand tension. I think we need information per finger, otherwise it seems like you could miss fingers moving in ways that kept the tension at the same level.

Their testing was /not/ very thorough. Another poor results section.

BibTeX



@inproceedings{harling1996handTensionSegmentation
,author = "Philip A. Harling and Alistair D. N. Edwards"
,title = "Hand Tension as a Gesture Segmentation Cue"
,booktitle = "Gesture Workshop"
,pages = "75-88"
,year = "1996"
}

Lee - Interactive Learning HMMs

Lee, Christopher, and Yangsheng Xu. "Online, Interactive Learning of Gesture of Human/Robot Interfaces."

Okay, right off the bat, this paper has nothing to do with robots. Why put it in the title?

Summary



Lee and Xu present an algorithm for classifying gestures with HMMs, evaluating the confidence of each classification, and using correct classifications to update the parameters of the HMM. The user has to pause briefly between gestures to aid in segmentation. To simplify the data, they apply fast Fourier transforms (FFTs) to a sliding window of sensor data from the glove to collapse the window. They then feed the FFT results through vector quantization (using an off-line codebook generated with LBG) to collapse the vector to a one-dimensional symbol. The series of symbols is fed into HMMs, one per class, and the class with the highest Pr(O|model) is selected as the answer. The gesture is then folded into the training set for that HMM and the parameters are updated.

They also introduce a confidence measure for analyzing their system's performance, which is the log of the sum of the ratios of each incorrect HMM's probability for a gesture to the correct HMM's probability. If a gesture is classified correctly, the correct HMM has a much higher probability than the incorrect HMMs, the ratios are small, and the log of their sum is well below 0. If all the probabilities are about the same, the classifier is unsure, the ratios are all around 1, and the log is around 0. They show that starting with one training example, they achieve high and confident classification after only a few testing examples are classified and used to update the system.
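
Here's the confidence measure as I understand their description; the sign convention or exact form in the paper could differ. It takes per-model log-likelihoods as input, which is an assumption about the interface.

```python
import math

def confidence(log_probs, chosen):
    """Confidence of a classification given per-model log-likelihoods.
    Computes log(sum over other models of P(O|model) / P(O|chosen)):
    strongly negative = confident, near 0 = unsure."""
    ratios = [math.exp(lp - log_probs[chosen])
              for label, lp in log_probs.items() if label != chosen]
    return math.log(sum(ratios))

# The chosen model is far more likely than the others: high confidence.
print(confidence({'A': -10.0, 'B': -25.0, 'C': -30.0}, 'A'))   # about -15
```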

However, they're only using a set of ASL letters that are "amenable to VQ clustering."

Discussion



I do like the idea of online training and updating of the model. However, after a few examples have been folded in you lose the benefit, so it's probably better to just have a good set of training data that's used offline before any recognition takes place, simplifying the system and reducing the workload.

I don't like that you have to hold your hand still for a little bit between gestures. I would have liked to have seen a system like the "wait state" HMM discussed in Iba et al., "An architecture for gesture based control of mobile robots." I'd like to see a better handle on the segmentation problem. They do mention using acceleration.

Their training set is too small and easy, picking things that are "amenable to VQ clustering", so I don't give their system much credit.

Monday, February 4, 2008

Chen - Dynamic Gesture Interface w/ HMMs

Chen, Qing.... "A Dynamic Gesture Interface for Virtual Environments Based on Hidden Markov Models." HAVE 2005

Summary



Chen et al. use hidden Markov models (HMMs) to classify gestures (their focus is a simple domain of three gestures). Their algorithm captures the standard deviation of the different bend sensor values on a glove, with the idea that by using the standard deviation they don't have to worry about segmentation. They feed the std data into HMMs and classify that way. Their three gestures are very simple and are used to control three axes of rotation for a virtual 3D cube.

They give no recognition results.

Discussion



I'm not sure these guys are too well versed in machine learning. This paper is pretty weak. I'll just make a laundry list instead of trying to tie all my complaints together in prose.


  • They mention other approaches (Kalman filters, dynamic time warping, FSMs) that have been used, but state they have "very strict assumptions." Okay, like what? Kalman filters and hidden Markov models are closely related state-space techniques, so why should HMMs do better than Kalman filters here?

  • They say (page 2, first par.) that gestures are noisy and even if a person does it the same way, it will still be different. Duh. Too bad. Measurements and data are noisy, just like everything in machine learning. Otherwise, you'd just look it up in a hash table and save yourself a lot of trouble.

  • It's the Expectation-MAXIMIZATION algorithm, not -Modification.

  • They claim to avoid the need for segmentation. Okay, then what are you computing the standard deviation of? You have to have some sort of window of points to do the calculations on. I suppose their assumption is they just get the gesture in a window, not half of one, and things happen by magic.



Weak paper. Do not want. Would not buy from seller again.

Wednesday, January 30, 2008

Iba - Robots

Iba, Soshi, J. Michael Vande Weghe, Christiaan J. J. Paredis, and Pradeep K. Khosla. "An Architecture for Gesture-Based Control of Mobile Robots." Intelligent Robots and Systems, 1999.

Summary



Iba et al. describe a system that uses gestures collected from a CyberGlove (for finger/hand position) and a Polhemus tracker (hand tracking in six degrees of freedom), recognized using hidden Markov models, to control a robot. Their argument for using a glove-based interface is that it can be a more intuitive method for controlling robot movement, etc. Not necessarily for one robot, since a joystick works with a higher degree of accuracy; their primary claim is that groups of robots, where controlling each individual robot becomes intractable and burdensome, are more easily controlled as a group using gestures such as pointing and general motion commands. The commands were open, flat hand to continue motion, pointing to 'go there', wave left or right to turn that direction, and closed fist to stop.

Their hardware samples finger and wrist position and flexion at a rate of 30 Hz. The data gathered is sent to a preprocessor: the 18 data points offered by the glove are reduced by linear combinations to 10 values and then augmented with their first derivatives (change from the last point in time). The resulting 20-dimensional vectors are vector quantized into a codeword. The codeword represents a coarse-grained view of the position/movement of the fingers/hand, with the codewords trained off-line. 'Postures', then, become codewords, and gestures are sequences of codewords.

The last n codewords are fed into an HMM, which contains a method for rejecting (a 'wait' state branching to the HMM for each gesture), and the gesture is classified to the HMM that gives the highest probability (forward algorithm).
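
A minimal sketch of the preprocessing and codeword assignment described above. The projection matrix and codebook here are random stand-ins, not Iba et al.'s actual linear combinations or trained codewords.

```python
import numpy as np

rng = np.random.default_rng(0)
PROJ = rng.standard_normal((10, 18))      # stand-in for their 18 -> 10 linear map
CODEBOOK = rng.standard_normal((32, 20))  # stand-in for their 32 trained codewords

def to_codeword(raw_t, raw_prev):
    """Project 18 glove sensors to 10 values, append first derivatives,
    and snap the 20-d vector to the nearest codeword index."""
    x_t, x_prev = PROJ @ raw_t, PROJ @ raw_prev
    feature = np.concatenate([x_t, x_t - x_prev])                   # 20-d vector
    return int(np.argmin(np.linalg.norm(CODEBOOK - feature, axis=1)))

# A gesture then becomes the sequence of codewords over time,
# which is what the wait-state HMM consumes.
frames = rng.standard_normal((30, 18))                              # ~1 second at 30 Hz
codeword_seq = [to_codeword(frames[i], frames[i - 1]) for i in range(1, 30)]
print(codeword_seq[:5])
```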

They test their algorithm with an HMM both with and without the wait state to show that the wait state helps to reject false positives, which is of concern because you don't want the robot to move if you don't mean it to. Whereas for a false negative, the gesture can simply be repeated. With the wait state, they got 96% true positives, with only 1.6/1000 false positives. Without the wait state, they got 100% true positives but 20/1000 false positives.

Discussion



How did they come up with the best linear combination to use when reducing the glove data from 18 values to 10?

I would like to see details on how they created their codebook. They say they covered 5000 measurements that are representative of the feature space, but the feature space in this case is huge! Say each of the 18 sensors has 255 values. The 6 DOF of the hand tracker are three angular measurements, with 360 values (assuming integral precision), and three real-valued position measurements. Say the tracker is accurate to the inch, and the length of the cord is ten feet. Let's make it easier for them and say you can only stand in front of the tracker, and not behind, so that's a ten-foot half sphere with volume (1/2)*(4/3)*PI*(10^3), roughly 2000 cubic feet. Let's cut that in half because some of the sphere is inaccessible (goes into the floor or above the ceiling), so 1000 cubic feet, or 1000 * 12^3 = 1.728e6 cubic inches. So the number of possible values for the entire space of values coming from the hardware is on the order of (18*255)*(3*360)*(3*1.728e6). My math is probably off, and so are the assumptions about the ranges, but even if I'm off by 3 orders of magnitude, that's still a WHOLE FRIGGIN LOT MORE THAN 5000 POSITIONS. Now, how did they 'cover the entire space' adequately? Maybe they did, I don't know, but I'm skeptical. I suppose my beef is with their claim that they cover the ENTIRE SPACE. I doubt it.

Something like multi-dimensional scaling might tell you which features are important. Or you could use an iterative, interactive process for creating new codewords. Something like starting with their initial book, and then for each quantized vector (or a random sampling), seeing if it is 'close enough' to the others, or fits into a cluster with 'high enough' probability (if your codeword clusters were described by mixtures of Gaussians, the codeword being the means). If it's not good enough, start a new cluster. Maybe they did something like this, but they didn't say.
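
To make that incremental-codebook idea concrete, here's one crude way it could look (the threshold and distance metric are arbitrary choices of mine):

```python
import numpy as np

def grow_codebook(samples, threshold=2.0):
    """Start with the first sample as a codeword; add a new codeword
    whenever a sample lies farther than `threshold` from every existing
    one. A crude stand-in for 'start a new cluster if nothing fits'."""
    codebook = [np.asarray(samples[0], dtype=float)]
    for s in samples[1:]:
        s = np.asarray(s, dtype=float)
        if min(np.linalg.norm(s - c) for c in codebook) > threshold:
            codebook.append(s)
    return codebook

rng = np.random.default_rng(1)
# Two well-separated blobs of 20-d feature vectors -> about two codewords.
blob_a = rng.normal(0.0, 0.1, size=(50, 20))
blob_b = rng.normal(5.0, 0.1, size=(50, 20))
samples = np.vstack([blob_a, blob_b])
print(len(grow_codebook(samples)))   # 2
```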

So aside from those two long, preachy paragraphs, I really liked this algorithm. Quantizing the input into codewords means your HMMs only have to deal with a fixed number (32) of discrete inputs, making them easier to train, and you know exactly what to expect.


BiBTeX




@ARTICLE{iba1999gestureControlledRobots,
title={An architecture for gesture-based control of mobile robots},
author={Iba, S.; Weghe, J.M.V.; Paredis, C.J.J.; Khosla, P.K.},
journal={Intelligent Robots and Systems, 1999. IROS '99. Proceedings. 1999 IEEE/RSJ International Conference on},
year={1999},
volume={2},
number={},
pages={851-857 vol.2},
keywords={data gloves, gesture recognition, hidden Markov models, mobile robots, user interfacesHMM, data glove, gesture-based control, global control, hand gestures, hidden Markov models, local control, mobile robots, wait state},
doi={10.1109/IROS.1999.812786},
ISSN={}, }

Deering - HoloSketch

Deering, Michael F. "HoloSketch: A Virtual Reality Sketching / Animation Tool." ACM Transactions on Computer-Human Interaction (2.3), 1995: 220-38.

Summary



Deering describes a system that combines a 3D mouse, a stereo CRT, and head/eye tracking so the user can draw in three dimensions and view the objects in three dimensions just by moving their head around. The 3D mouse has a digitizer rod attached to it, acting as a wand that is used to draw and manipulate in 3D. Different button/keyboard combinations can be used to change the modality of the drawing program. The user can pull up a 3D context menu to perform different drawing and editing actions, including the drawing of many 3D primitives, drawing operations like coloring, moving, selecting, and resizing, and even setting up animations. The 3D rendering is accurate enough that a physical ruler can be held up to the projected image and the measurements agree.

There are no algorithms or true implementation details presented in this paper, so I don't feel the need to do much summarization. You draw in 3D with a 3D mouse with a 'wand' poking out of it, much like you would in any 2D paint program. You look at the object in true 3D thanks to stereoscopic display and head/eye tracking.

Discussion



I was fairly impressed with this, especially the accuracy in 3D rendering that was obtained. I'd like to see what could be done with this now with modern hardware. Especially with a truly wireless pen, for unconstrained 3D movement. Or even different pens for doing different things, so as to reduce button clutter and complexity. I think this could be a super killer app, especially with sketch recognition capabilities! Or turning the stereo CRT (they have stereo LCD, btw) into wearable glasses for more of a HUD approach--augmented reality.

I bet Josh P. drooled over this paper when he read it. But other than that, since there isn't an algorithm or anything besides interface information, I don't think I have much more to say that's really that useful.

BiBTeX



@article{deering1995holosketch,
author = {Michael F. Deering},
title = {HoloSketch: a virtual reality sketching/animation tool},
journal = {ACM Transactions on Computer-Human Interaction},
volume = {2},
number = {3},
year = {1995},
issn = {1073-0516},
pages = {220--238},
doi = {http://doi.acm.org/10.1145/210079.210087},
publisher = {ACM},
address = {New York, NY, USA},
}

Tuesday, January 29, 2008

Rabiner and Juang -- Tutorial on HMMs

Rabiner, L.; Juang, B., "An introduction to hidden Markov models," ASSP Magazine, IEEE [see also IEEE Signal Processing Magazine] , vol.3, no.1, pp. 4-16, Jan 1986

Summary



Rabiner and Juang give an overview of hidden Markov models and some of the things you can do with them.

HMMs are good for representing temporal sequences of data. The Markov property says that the current state of the system depends only on the immediately preceding state, not on the entire history. HMMs work by holding information about a set of hidden states, with transitions available between the states with different probabilities. Each state has a distribution over outputs. So if you were to use an HMM to generate a sequence of outputs (not how you typically use one for recognition), you'd take a random walk starting at an initial state (chosen by the prior probabilities \pi of the model). At that state you'd choose an output based on the state's output distribution, and then transition to a new state based on the transition probabilities from that state. Repeat until you've output the desired number of symbols.

Some neat things to do with hidden Markov models (a toy forward-algorithm example follows the list):

  1. Given a model and an observation sequence, compute the probability of that sequence occurring under that model. Forward or Backward Algorithm

  2. Given a model and observation sequence, compute the sequence of states through the model that has the highest probability of producing the output (optimal path). Viterbi Algorithm

  3. Given a set of observations, determine the parameters of the model with maximal likelihood. Baum-Welch Algorithm
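
As a concrete example of problem 1, here's a bare-bones forward algorithm for a discrete HMM with toy parameters of my own. A real implementation would work in log space or with scaling to avoid underflow.

```python
import numpy as np

def forward(pi, A, B, obs):
    """P(observation sequence | model) for a discrete HMM.
    pi: initial state probs (N,), A: transition probs (N, N),
    B: emission probs (N, M), obs: list of observed symbol indices."""
    alpha = pi * B[:, obs[0]]              # initialization with the first symbol
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]      # induction step
    return float(alpha.sum())              # termination: sum over ending states

# Toy 2-state, 2-symbol model.
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])
B  = np.array([[0.9, 0.1],
               [0.2, 0.8]])
print(forward(pi, A, B, [0, 1, 0]))        # probability of seeing symbols 0, 1, 0
```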



Discussion



Hidden Markov models are the gold standard for many machine learning classification tasks, including handwriting and speech recognition. While they have many potential powerful uses, they're still not a silver bullet for all tasks, especially if used incorrectly.

BibTeX



@ARTICLE{rabiner1986introHMMs
,title={An introduction to hidden {Markov} models}
,author={L. R. Rabiner and B. H. Juang}
,journal={IEEE ASSP Magazine}
,year={1986}
,month={Jan}
,volume={3}
,number={1}
,pages={4-16}
,ISSN={0740-7467}
}

Wednesday, January 23, 2008

Allen et al -- ASL Finger Spelling

Allen, J.M.; Asselin, P.K.; Foulds, R., "American Sign Language finger spelling recognition system," Bioengineering Conference, 2003 IEEE 29th Annual, Proceedings of , vol., no., pp. 285-286, 22-23 March 2003

Summary



Allen et al. want to create a wearable computer system that is capable of translating ASL finger spelling used by the deaf into both written and spoken forms. Their intention is to lower communication barriers between deaf and hearing members of the community.

Their system consists of a CyberGlove worn by the finger speller. The glove uses sensors to detect bending in the fingers and palm, finger and wrist abduction, thumb crossover, etc. The glove is polled at a controlled sampling rate. The vector of sensor values is fed into a perceptron neural network that has been trained with examples of each of the 24 different letters ('J' and 'Z' require hand motion, so were left out of this study). The classification output given by the neural network is the right letter 90% of the time. Their experiments were based on only one user, however.
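
As a rough sketch of the kind of classifier they describe, here's a perceptron trained on stand-in glove data. Scikit-learn's Perceptron is my substitute for their network, and the sensor count and fake training data are assumptions, not the paper's.

```python
import numpy as np
from sklearn.linear_model import Perceptron

rng = np.random.default_rng(0)
N_SENSORS = 18                                   # assumed glove sensor count
LETTERS = list("ABCDEFGHIKLMNOPQRSTUVWXY")       # the 24 static letters (no J, Z)

# Stand-in training data: in reality these would be recorded CyberGlove
# samples of a signer holding each letter.
X = rng.standard_normal((24 * 50, N_SENSORS))
y = np.repeat(LETTERS, 50)
# Shift each letter's samples by a class-specific offset so they're separable-ish.
X += np.repeat(rng.standard_normal((24, N_SENSORS)), 50, axis=0)

clf = Perceptron(max_iter=1000).fit(X, y)
print(clf.predict(X[:1]))                        # classify one sensor vector -> a letter
```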

Discussion



First, the authors of this paper are very condescending toward the Deaf community. If any Deaf people were to ever read this article, they would be seriously pissed. Obviously I'm not Deaf. Obviously I can't speak for all Deaf people. That being said, the Deaf community is very strong (I capitalize Deaf on purpose, as that's the way Deaf culture sees itself). They work hard to make themselves independent, not needing the help or assistance of the hearing. The motivation for the paper is sound, and technology like this would indeed lower some of the communication barrier.

This doesn't seem too bad as a proof of concept. Motion needs to be incorporated to get the 'J' and 'Z' characters into play. This system also needs to be ***fast***, as the Deaf can finger spell incredibly rapidly, as quickly as you can spell a word verbally. Natural finger spelling is not about making every letter distinct, but about capturing the "shape" of the word (the same way your brain works when it reads words on a page; remember that Cambridge study thing? http://www.jtnimoy.net/itp/cambscramb/). How distinct do the letters have to be for their approach to work? What sampling rate do they use? Can it be done in real time (I guess not, since they say MATLAB stinks at real time)?

Also, I would like to see results on misclassifications. Which letters do poorly (m and n look alike, so do a, e, and s)? They also point out accuracy is user specific. Finger spelling is a set form, so surely there are ways to generalize recognition. Just train on more than one person. Neural nets could also be used to train the 'in-between' stuff and give a little context for the letters before and after a transition.

BiBTeX



@inproceedings{allen2003aslFingerSpelling
,author={Jerome M. Allen and Pierre K. Asselin and Richard Foulds}
,title={{American Sign Language} finger spelling recognition system}
,booktitle={29th Annual IEEE Bioengineering Conference, 2003}
,year={2003}
,month={March}
,pages={285-286}
,doi={10.1109/NEBC.2003.1216106}
}