Wednesday, January 30, 2008

Iba - Robots

Iba, Soshi, J. Michael Vande Weghe, Christiaan J. J. Paredis, and Pradeep K. Khosla. "An Architecture for Gesture-Based Control of Mobile Robots." Intelligent Robots and Systems, 1999.

Summary



Iba et al. describe a system that uses gestures collected from a CyberGlove (finger/hand position) and a Polhemus tracker (hand tracking in six degrees of freedom), recognized with hidden Markov models, to control a robot. Their argument for a glove-based interface is that it can be a more intuitive way to control robot movement. Not necessarily for a single robot, since a joystick works with a higher degree of accuracy. Their primary claim is that groups of robots, where controlling each individual robot becomes intractable and burdensome, are more easily controlled as a group using gestures such as pointing and general motion commands. The commands were: open, flat hand to continue motion; pointing to 'go there'; wave left or right to turn in that direction; and closed fist to stop.

Their hardware samples finger and wrist position and flexion at a rate of 30 Hz. The data is sent to a preprocessor, with the 18 data points offered by the glove reduced by linear combinations to 10 values and then augmented with their first derivatives (the change from the last point in time). The resulting 20-dimensional vectors are vector quantized into codewords, which give a coarse-grained view of the position/movement of the fingers/hand; the codewords themselves are trained off-line. 'Postures', then, become codewords, and gestures are sequences of codewords.
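
Here's a minimal sketch of that preprocessing chain as I read it. The projection matrix, the codebook, and the random "glove sample" below are stand-ins I made up, not the paper's actual values:

```python
import numpy as np

rng = np.random.default_rng(0)

class GlovePreprocessor:
    """Sketch of the chain described above: 18 raw sensor values -> 10 linear
    combinations -> augment with first derivatives -> 20-D vector -> nearest
    codeword. W and the codebook here are random placeholders."""
    def __init__(self, projection, codebook):
        self.W = projection                 # (10, 18) linear combination matrix
        self.codebook = codebook            # (32, 20) off-line-trained codewords
        self.prev = np.zeros(projection.shape[0])

    def to_codeword(self, raw18):
        reduced = self.W @ raw18            # 18 values -> 10
        deriv = reduced - self.prev         # change since the last 30 Hz sample
        self.prev = reduced
        feature = np.concatenate([reduced, deriv])   # 20-dimensional vector
        # Vector quantization: index of the nearest codeword
        return int(np.argmin(np.linalg.norm(self.codebook - feature, axis=1)))

pre = GlovePreprocessor(rng.normal(size=(10, 18)), rng.normal(size=(32, 20)))
print(pre.to_codeword(rng.normal(size=18)))     # one fake glove sample
```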

The last n codewords are fed into an HMM, which contains a rejection mechanism (a 'wait' state branching to the HMM for each gesture), and the input is classified as the gesture whose HMM gives the highest probability (computed with the forward algorithm).
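
Conceptually the decision rule is something like the sketch below. The scorer callables are stand-ins for running the forward algorithm on each trained gesture HMM, and the explicit comparison against a reject score is my simplification of their wait-state mechanism, not their exact formulation:

```python
def classify_gesture(codeword_window, gesture_scorers, reject_scorer):
    """Pick the gesture whose model scores the window highest, unless the
    wait/reject model scores even higher (then do nothing).

    gesture_scorers: dict mapping gesture name -> callable returning the
    (log) forward probability of the codeword window under that gesture's HMM.
    reject_scorer plays the role of the wait-state model."""
    scores = {name: score(codeword_window) for name, score in gesture_scorers.items()}
    best = max(scores, key=scores.get)
    if reject_scorer(codeword_window) >= scores[best]:
        return None   # wait state wins: no command is sent to the robot
    return best
```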

They test their algorithm both with and without the wait state to show that the wait state helps reject false positives, which matter because you don't want the robot to move when you don't mean it to; a false negative, on the other hand, can simply be repeated. With the wait state, they got 96% true positives with only 1.6/1000 false positives. Without the wait state, they got 100% true positives but 20/1000 false positives.

Discussion



How did they come up with the best linear combination to use when reducing the glove data from 18 values to 10?

I would like to see details on how they created their codebook. They say they covered 5000 measurements that are representative of the feature space, but the feature space in this case is huge! Say each of the 18 glove sensors has 255 possible values. The 6 DOF of the hand tracker are three angular measurements, with 360 values each (assuming integral precision), and three real-valued position measurements. Say the tracker is accurate to the inch, and the length of the cord is ten feet. Let's make it easier for them and say you can only stand in front of the tracker, not behind it, so that's a ten-foot half sphere with volume (1/2)*(4/3)*PI*(10^3) = 4/6 * 3 * 1000 ≈ 2000 cubic feet. Let's cut that in half because some of the sphere is inaccessible (goes into the floor or above the ceiling), so 1000 cubic feet, or 1000 * 12^3 = 1.728e6 cubic inches. And since the sensors vary independently, the number of nominal configurations coming from the hardware is really a product: 255^18 * 360^3 * 1.728e6. My range assumptions are rough, but even if I'm off by many orders of magnitude, that's still a WHOLE FRIGGIN LOT MORE THAN 5000 POSITIONS. Now, how did they 'cover the entire space' adequately? Maybe they did, I don't know, but I'm skeptical. I suppose my beef is with their claim that they cover the ENTIRE SPACE. I doubt it.
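
Just to pin down the back-of-the-envelope numbers (all ranges are my rough assumptions from the paragraph above, not measured values):

```python
import math

# Assumed ranges from the paragraph above
glove_levels_per_sensor = 255
glove_sensors = 18
angle_levels = 360                                        # integer degrees per angular DOF
angular_dofs = 3
hemisphere_ft3 = 0.5 * (4 / 3) * math.pi * 10**3          # ~2094 cubic feet
reachable_ft3 = hemisphere_ft3 / 2                        # knock off the inaccessible half
positions_in3 = reachable_ft3 * 12**3                     # ~1.8e6 cubic inches at 1-inch accuracy

# Sensors vary independently, so configurations multiply
configs = (glove_levels_per_sensor ** glove_sensors) * (angle_levels ** angular_dofs) * positions_in3
print(f"{configs:.3e} nominal configurations vs. 5000 codebook samples")
```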

Something like multi-dimensional scaling might tell you which features are important. Or you could use an iterative, interactive process for creating new codewords: start with their initial codebook, and then for each quantized vector (or a random sampling), see whether it is 'close enough' to the existing codewords, or fits into a cluster with 'high enough' probability (if your codeword clusters were described by mixtures of Gaussians, the codewords being the means). If it's not good enough, start a new cluster. Maybe they did something like this, but they didn't say.
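
Here's roughly what I mean, sketched with a simple Euclidean 'close enough' check (the threshold and dimensions are made up; a fancier version would use per-cluster Gaussians as suggested above):

```python
import numpy as np

def grow_codebook(codebook, samples, threshold=1.5):
    """Incrementally add codewords for feature vectors that no existing
    codeword covers. 'threshold' is an arbitrary closeness cutoff."""
    codebook = [np.asarray(c, dtype=float) for c in codebook]
    for x in samples:
        x = np.asarray(x, dtype=float)
        distances = [np.linalg.norm(x - c) for c in codebook]
        if not codebook or min(distances) > threshold:
            codebook.append(x)        # nothing close enough: start a new cluster
    return np.vstack(codebook)

# Toy usage with random 20-dimensional feature vectors
rng = np.random.default_rng(1)
book = grow_codebook(rng.normal(size=(32, 20)), rng.normal(size=(500, 20)) * 3)
print(len(book), "codewords after growing")
```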

So aside from those two long, preachy paragraphs, I really liked this algorithm. Quantizing the input into codewords means your HMMs only have to deal with a fixed number (32) of different symbols, making them discrete and easier to train, and you know exactly what to expect.


BiBTeX




@inproceedings{iba1999gestureControlledRobots,
title={An architecture for gesture-based control of mobile robots},
author={Iba, S. and Vande Weghe, J. M. and Paredis, C. J. J. and Khosla, P. K.},
booktitle={Proceedings of the 1999 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS '99)},
year={1999},
volume={2},
pages={851--857},
keywords={data gloves, gesture recognition, hidden Markov models, mobile robots, user interfaces, HMM, data glove, gesture-based control, global control, hand gestures, local control, wait state},
doi={10.1109/IROS.1999.812786},
}

Deering - HoloSketch

Deering, Michael F. "HoloSketch: A Virtual Reality Sketching / Animation Tool." ACM Transactions on Computer-Human Interaction (2.3), 1995: 220-38.

Summary



Deering describes a system that uses a 3D mouse, a stereo CRT, and head/eye tracking to allow drawing in three dimensions, with the resulting objects viewable in three dimensions just by moving the head around. The 3D mouse has a digitizer rod attached to it, acting as a wand used to draw and manipulate in 3D. Different button/keyboard combinations change the modality of the drawing program. The user can pull up a 3D context menu to perform different drawing and editing actions, including drawing many 3D primitives, operations like coloring, moving, selecting, and resizing, and even setting up animations. The 3D rendering is accurate enough that a physical ruler held up to the projected image measures it correctly.

There are no algorithms or true implementation details presented in this paper, so I don't feel the need to do much summarization. You draw in 3D with a 3D mouse with a 'wand' poking out of it, much like you would in any 2D paint program. You look at the object in true 3D thanks to stereoscopic display and head/eye tracking.

Discussion



I was fairly impressed with this, especially the accuracy of the 3D rendering they obtained. I'd like to see what could be done with it now on modern hardware, especially with a truly wireless pen for unconstrained 3D movement, or even different pens for doing different things, to reduce button clutter and complexity. I think this could be a super killer app, especially with sketch recognition capabilities! Or turning the stereo CRT (stereo LCDs exist now, btw) into wearable glasses for more of a HUD approach--augmented reality.

I bet Josh P. drooled over this paper when he read it. But other than that, since there isn't an algorithm or anything besides interface information, I don't think I have much more to say that's really that useful.

BiBTeX



@article{deering1995holosketch,
author = {Michael F. Deering},
title = {HoloSketch: a virtual reality sketching/animation tool},
journal = {ACM Transactions on Computer-Human Interaction},
volume = {2},
number = {3},
year = {1995},
issn = {1073-0516},
pages = {220--238},
doi = {http://doi.acm.org/10.1145/210079.210087},
publisher = {ACM},
address = {New York, NY, USA},
}

Tuesday, January 29, 2008

Rabiner and Juang -- Tutorial on HMMs

Rabiner, L.; Juang, B., "An introduction to hidden Markov models," ASSP Magazine, IEEE [see also IEEE Signal Processing Magazine] , vol.3, no.1, pp. 4-16, Jan 1986

Summary



Rabiner and Juang give an overview of hidden Markov models and some of the things you can do with them.

HMMs are good for representing temporal sequences of data. The Markov property says that the current state of the system depends only on the previous state, not on the rest of the history. An HMM holds a set of hidden states, with transitions available between the states with different probabilities, and each state has a distribution over outputs. So if you were to use an HMM to generate a sequence of outputs (which is not how you typically use them, and doing something like this is a bad idea), you'd take a random walk: start at an initial state (chosen by the prior probabilities π of the model), choose an output based on that state's output distribution, then transition to a new state based on the transition probabilities from that state. Repeat until you've output the desired number of symbols.
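
A toy version of that random walk, just to make the mechanics concrete (the two-state, three-symbol model is made up, and again, generating sequences is not what you'd normally do with an HMM):

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up two-state, three-symbol HMM
pi = np.array([0.6, 0.4])                 # prior (initial state) probabilities
A  = np.array([[0.7, 0.3],                # A[i, j] = P(next state j | current state i)
               [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1],           # B[i, k] = P(emit symbol k | state i)
               [0.1, 0.3, 0.6]])

def generate(n_symbols):
    """Random walk through the HMM, emitting one symbol at each step."""
    state = rng.choice(len(pi), p=pi)
    out = []
    for _ in range(n_symbols):
        out.append(rng.choice(B.shape[1], p=B[state]))   # emit from this state
        state = rng.choice(len(pi), p=A[state])          # transition to the next state
    return out

print(generate(10))
```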

Some neat things to do with hidden Markov models:

  1. Given a model and an observation sequence, compute the probability of that sequence occurring under the model. Forward or backward algorithm (a quick sketch of the forward algorithm appears after this list).

  2. Given a model and an observation sequence, compute the sequence of states through the model that has the highest probability of producing the output (the optimal path). Viterbi algorithm.

  3. Given a set of observations, determine the model parameters that maximize the likelihood of the observations. Baum-Welch algorithm.
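
As a concrete example of problem 1, here's a bare-bones forward algorithm for a discrete-output HMM (my own sketch of the standard recursion, using the same toy model as in the generation example above):

```python
import numpy as np

def forward_probability(pi, A, B, observations):
    """P(observation sequence | model) via the forward algorithm.
    pi: initial state probs (N,), A: transitions (N, N), B: emissions (N, M),
    observations: sequence of symbol indices."""
    alpha = pi * B[:, observations[0]]       # alpha_1(i) = pi_i * b_i(o_1)
    for o in observations[1:]:
        alpha = (alpha @ A) * B[:, o]        # alpha_{t+1}(j) = sum_i alpha_t(i) a_ij * b_j(o_{t+1})
    return alpha.sum()

# Toy two-state, three-symbol model
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(forward_probability(pi, A, B, [0, 1, 2, 1]))
```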



Discussion



Hidden Markov models are the gold standard for many machine learning classification tasks, including handwriting and speech recognition. While they have many potentially powerful uses, they're still not a silver bullet for all tasks, especially if used incorrectly.

BibTeX



@ARTICLE{rabiner1986introHMMs
,title={An introduction to hidden {Markov} models}
,author={L. R. Rabiner and B. H. Juang}
,journal={IEEE ASSP Magazine}
,year={1986}
,month={Jan}
,volume={3}
,number={1}
,pages={4-16}
,ISSN={0740-7467}
}

Wednesday, January 23, 2008

Allen et al -- ASL Finger Spelling

Allen, J.M.; Asselin, P.K.; Foulds, R., "American Sign Language finger spelling recognition system," Bioengineering Conference, 2003 IEEE 29th Annual, Proceedings of , vol., no., pp. 285-286, 22-23 March 2003

Summary



Allen et al. want to create a wearable computer system that is capable of translating ASL finger spelling used by the deaf into both written and spoken forms. Their intention is to lower communication barriers between deaf and hearing members of the community.

Their system consists of a CyberGlove worn by the finger speller. The glove uses sensors to detect bending in the fingers and palm, finger and wrist abduction, thumb crossover, etc. The glove is polled at a controlled sampling rate, and the vector of sensor values is fed into a perceptron neural network trained with examples of each of the 24 motionless letters ('J' and 'Z' require hand motion, so were left out of this study). The network's classification output is the right letter 90% of the time. Their experiments were based on only one user, however.
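
Something in the spirit of their setup, sketched below. The data here is random stand-in data rather than real CyberGlove readings, and scikit-learn's MLPClassifier fills in for whatever perceptron network they actually trained:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Stand-in data: 24 letters ('J' and 'Z' excluded), 18 sensor values each.
# Real training data would be repeated CyberGlove samples of each handshape.
n_letters, n_sensors, samples_per_letter = 24, 18, 50
X = np.vstack([rng.normal(loc=i, scale=0.5, size=(samples_per_letter, n_sensors))
               for i in range(n_letters)])
y = np.repeat(np.arange(n_letters), samples_per_letter)

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000).fit(X, y)
print("training accuracy:", clf.score(X, y))   # illustrative only; real accuracy needs held-out data
```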

Discussion



First, the authors of this paper are very condescending toward the Deaf community. If any Deaf people were to ever read this article, they would be seriously pissed. Obviously I'm not Deaf. Obviously I can't speak for all Deaf people. That being said, the Deaf community is very strong (I capitalize Deaf on purpose, as that's the way Deaf culture sees itself). They work hard to make themselves independent, not needing the help or assistance of the hearing. The motivation for the paper is sound, and technology like this would indeed lower some of the communication barrier.

This doesn't seem too bad as a proof of concept. Motion needs to be incorporated to get the 'J' and 'Z' characters into play. The system also needs to be ***fast***, as the Deaf can finger spell incredibly rapidly, as quickly as you can spell a word verbally. Natural finger spelling is not about making every letter distinct, but about capturing the "shape" of the word (the same way your brain works when it reads words on a page; remember that Cambridge study thing? http://www.jtnimoy.net/itp/cambscramb/). How distinct do the letters have to be for their approach to work? What sampling rate do they use? Can it be done in real time? (I guess not, since they say MATLAB stinks at real time.)

Also, I would like to see results on misclassifications. Which letters do poorly ('m' and 'n' look alike; so do 'a', 'e', and 's')? They also point out that accuracy is user specific. Finger spelling is a set form, so surely there are ways to generalize recognition; just train on more than one person. Neural nets could also be used to learn the 'in-between' stuff and give a little context for the letters before and after a transition.

BiBTeX



@inproceedings{allen2003aslFingerSpelling
,author={Jerome M. Allen and Pierre K. Asselin and Richard Foulds}
,title={{American Sign Language} finger spelling recognition system}
,booktitle={29th Annual IEEE Bioengineering Conference, 2003}
,year={2003}
,month={March}
,pages={285-286}
,doi={10.1109/NEBC.2003.1216106}
}

Tuesday, January 22, 2008

Krueger -- Environmental Technology

Krueger, Myron W. "Environmental technology: making the real world virtual". Communications of the ACM (36.7), July 1993: pp. 36-37.

Summary



Not much to this one. It's filler for a glossy magazine. The interesting points are the use of gesture and hand input for virtual environments. VIDEOPLACE and VIDEODESK use image capture (cameras) to track movement in the environment and allow for the interaction with a virtual world, as well as collaboration with others.

Discussion



Not really anything to say here.

BiBTeX



@article{krueger1993environmentalTechnology
,author = {Myron W. Krueger}
,title = {Environmental technology: making the real world virtual}
,journal = {Communications of the ACM}
,volume = {36}
,number = {7}
,year = {1993}
,issn = {0001-0782}
,pages = {36--37}
,doi = {http://doi.acm.org/10.1145/159544.159563}
,publisher = {ACM}
,address = {New York, NY, USA}
}

Deller et al -- Flexible Gesture Recognition

Deller, Matthias, et al. "Flexible Gesture Recognition for Immersive Virtual Environments." In Proceedings of the Tenth International Conference on Information Visualization, 2006 (IV'2006), pp. 563-568, July 2006.

Summary



Deller et al. seek to create a flexible gesture recognition engine for glove-based (or other hand-tracking) interfaces. Their goal is accurate gesture recognition in an immersive 3D environment, where the user can naturally use their hands with minimal "cognitive load" to distract them. They want an engine that works regardless of the environment the gloves/hand-tracking system is deployed in or what kind of hardware is used. They list several current approaches to hand tracking and gesture recognition, most of which they cite as needing expensive hardware or fancy image processing techniques (if cameras are involved).

Their approach is to abstract gesture recognition to a higher level (gasp, the use of basic programming paradigms!). Regardless of how the data is captured (gloves or image processing), it is treated as a sequence of postures. A posture is a position of the fingers/orientation of the hand held for a certain amount of time (the glove is constantly polled), and a sequence of postures defines a gesture. During training, postures are performed to produce templates; many examples of each posture can be given, even per user, yielding an 'average' template. These templates form a posture library.

When the system sees a posture (an orientation of the fingers and glove held for a certain amount of time), it preprocesses by filtering and smoothing the data to reduce noise (especially in the hand orientation data, which the hardware they used was bad at determining). The smoothed data is sent to the recognizer, which compares the posture to every template in the library using a distance metric on the bend-vector (the values of the five finger bend sensors), flagging as candidates those whose distance is below a threshold. Orientation data is then used to weight the match, if orientation is important for that posture. Sequences of postures make a gesture.
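
My reading of that matching step, boiled down to a sketch (the threshold, the scoring of orientation, and the template values below are all invented for illustration):

```python
import numpy as np

def match_posture(bend, orientation, templates, threshold=0.8):
    """templates: dict name -> (template_bend, template_orientation or None).
    Flag candidates whose bend-vector distance is under the threshold, then
    penalize by orientation disagreement when orientation matters for that posture."""
    best_name, best_score = None, -np.inf
    for name, (t_bend, t_orient) in templates.items():
        dist = np.linalg.norm(np.asarray(bend) - np.asarray(t_bend))
        if dist > threshold:
            continue                              # not a candidate
        score = -dist
        if t_orient is not None:                  # orientation-sensitive posture
            score -= np.linalg.norm(np.asarray(orientation) - np.asarray(t_orient))
        if score > best_score:
            best_name, best_score = name, score
    return best_name

# Toy templates: five bend values per posture, optional orientation vector
templates = {
    "point": (np.array([0.1, 0.9, 0.9, 0.9, 0.9]), None),
    "fist":  (np.array([0.9, 0.9, 0.9, 0.9, 0.9]), np.array([0.0, 0.0, 1.0])),
}
print(match_posture([0.15, 0.85, 0.9, 0.92, 0.88], [0.0, 0.1, 0.9], templates))
```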

They tested the system empirically in an immersive environment, but give no hard results. :(

Discussion



So it's the first paper of the semester and already we see the phrase "the most natural way." I immediately sensed red flags. But I think I might agree here that the most natural way to interact with your environment is through touch. Our brains are geared, after all, toward tactile dexterity. We have opposable thumbs, and our use of tools is supposedly what makes us different from the monkeys and sea-horses, ad nauseam, ad infinitum. So lower the red flags. Hands are good. Pens? Maybe not, but that's a debatable issue.

Distance metrics make me uneasy, especially when you start throwing around averages and thresholds. I think this method is a good candidate for using Gaussian distributions to model the positions of the five fingers. Since you're providing multiple examples of each posture, just keep track of the average bend for each finger and the covariance. This gives a probability for matching against the template library and seems a little more robust than raw distances. You might also be able to fold orientation into the same vector as the bend sensors with this approach, just as extra dimensions. For dimensions where orientation is not important, set the standard deviation to (effectively) infinity, so any variation does not affect the probability (or only marginally).
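
Sketching that Gaussian idea (diagonal covariance for simplicity; the training examples and the 'don't care' dimension below are made up):

```python
import numpy as np
from scipy.stats import multivariate_normal

class GaussianPosture:
    """Model a posture as a Gaussian over the bend (and, optionally,
    orientation) values seen in its training examples."""
    def __init__(self, examples, ignore_dims=()):
        examples = np.asarray(examples, dtype=float)
        self.mean = examples.mean(axis=0)
        var = examples.var(axis=0) + 1e-6          # diagonal covariance, floored
        var[list(ignore_dims)] = 1e6               # 'std -> infinity' for don't-care dims
        self.dist = multivariate_normal(self.mean, np.diag(var))

    def logpdf(self, x):
        return self.dist.logpdf(np.asarray(x, dtype=float))

# Toy usage: ten noisy examples of a 'point' posture, last dimension ignored
rng = np.random.default_rng(3)
examples = rng.normal(loc=[0.1, 0.9, 0.9, 0.9, 0.9, 0.5], scale=0.05, size=(10, 6))
point = GaussianPosture(examples, ignore_dims=(5,))
print(point.logpdf([0.12, 0.88, 0.9, 0.91, 0.9, 0.0]))
```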


  • Cognitive burden: holding the gesture for 300-600 ms. Is there a study on this? Would like to see some results. Seems user dependent, especially if the user is a "power-user" or "n00blet."


  • What happens when more than one posture is below the distance matching threshold? Do they just pick the lowest distance?


  • In section 5 they mention the "normal consumer grade computer". Granted, you can get a quad core, 4 GiB RAM, 256 MiB graphics card rig from Dell for $1500. But "normal consumer grade" is probably closer to the $300 Acer/e-Machines mom and dad buy you from Wal-Mart. Specifications would be nice for their target machine.



BiBTeX



@inproceedings{deller2006flexibleGesture
,author={Matthias Deller and Achim Ebert and Michael Bender and Hans Hagen}
,title={Flexible Gesture Recognition for Immersive Virtual Environments}
,booktitle={Tenth International Conference on Information Visualization, 2006 (IV'06)}
,year={2006}
,month={July}
,pages={563-568}
,doi={10.1109/IV.2006.55}
,ISSN={1550-6037},
}