Tuesday, January 22, 2008

Deller et al -- Flexible Gesture Recognition

Deller, Matthias, et al. "Flexible Gesture Recognition for Immersive Virtual Environments." In Proceedings of the Tenth International Conference on Information Visualization, 2006 (IV'2006), pp. 563-568, July 2006.

Summary



Deller et al. seek to create a flexible gesture recognition engine for glove-based (or other hand-tracking) interfaces. Their goal is accurate gesture recognition in an immersive 3D environment, where the user can use their hands naturally with minimal "cognitive load" to distract them. They want an engine that works regardless of the environment the glove/hand-tracking system is deployed in or what kind of hardware is used. They survey several current approaches to hand tracking and gesture recognition, most of which they cite as requiring expensive hardware or heavyweight image processing techniques (if cameras are involved).

Their approach is to abstract the gesture recognition to a higher level (gasp, the use of basic programming paradigms!). Regardless of how the data is captured (gloves or image processing), it is treated as a sequence of postures. A posture is a position of the fingers and orientation of the hand that is held for a certain amount of time (the glove is constantly polled). A sequence of postures defines a gesture. During training, a posture is performed to produce a template; many examples of the posture can be recorded, even per user, and are combined into an 'average' template. These templates form a posture library.
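The training step could be sketched roughly like this (the function name, data layout, and sample values are all my assumptions, not the paper's):

```python
import numpy as np

def build_template(samples):
    """Average several recorded bend-vectors (one per training
    repetition) into a single posture template.

    samples: list of length-5 sequences, one value per finger
    bend sensor.
    """
    return np.mean(np.asarray(samples, dtype=float), axis=0)

# Hypothetical training data: three repetitions of a "point" posture.
point_samples = [
    [0.90, 0.10, 0.80, 0.85, 0.90],
    [0.95, 0.05, 0.75, 0.90, 0.85],
    [0.85, 0.15, 0.80, 0.80, 0.90],
]
# The posture library maps posture names to averaged templates.
library = {"point": build_template(point_samples)}
```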

When the system sees a posture (an orientation of fingers and glove held for a certain amount of time), it preprocesses by filtering and smoothing the data to reduce noise (especially in the hand orientation data, which their hardware was bad at determining). The smoothed data is sent to the recognizer, which compares the posture to every template in the library using a distance metric on the bend-vector (the values of the five finger bend sensors), flagging as candidates those templates whose distance falls below a threshold. Orientation data is then used to weight the match, if orientation is important for that posture. Sequences of recognized postures make up a gesture.
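A minimal sketch of the distance-threshold matching step, assuming a Euclidean metric and a made-up threshold (the paper doesn't specify either):

```python
import numpy as np

def match_posture(bend_vector, library, threshold=0.5):
    """Compare a smoothed bend-vector against every template in the
    library; return (distance, name) candidates whose Euclidean
    distance falls below the threshold, nearest first.
    """
    bend = np.asarray(bend_vector, dtype=float)
    candidates = []
    for name, template in library.items():
        dist = float(np.linalg.norm(bend - np.asarray(template)))
        if dist < threshold:
            candidates.append((dist, name))
    return sorted(candidates)

# Hypothetical two-posture library.
library = {
    "point": np.array([0.90, 0.10, 0.80, 0.85, 0.90]),
    "fist":  np.array([0.90, 0.90, 0.90, 0.90, 0.90]),
}
matches = match_posture([0.88, 0.12, 0.79, 0.84, 0.91], library)
```

Only "point" survives the threshold here; "fist" is too far away on the second finger alone.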

The system was tested empirically in a demo environment, but no hard results are given. :(

Discussion



So it's the first paper of the semester and already we see the phrase "the most natural way." I immediately sensed red flags. But I think I might agree here that the most natural way to interact with your environment is through touch. Our brains are geared, after all, toward tactile dexterity. We have opposable thumbs, and our use of tools is supposedly what makes us different from the monkeys and sea-horses, ad nauseum, ad infinitum. So lower the red flags. Hands are good. Pens? Maybe not, but that's a debatable issue.

Distance metrics make me uneasy, especially when you start throwing around averages and thresholds. I think this method is a good candidate for modeling the positions of the five fingers with Gaussian distributions. Since multiple examples of each posture are already provided, just keep track of the average bend for each finger and the covariance. This yields a probability of matching against the template library and seems a little more robust than raw distances. You may also be able to fold orientation into the same vector as the bend sensors with this approach, as an extra dimension. For postures where orientation is not important, send the standard deviation to infinity, so variation there does not affect the probability (or does so only marginally).
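Here's a rough sketch of that Gaussian idea (mine, not the paper's), using an axis-aligned Gaussian per posture and an infinite standard deviation for don't-care dimensions:

```python
import numpy as np

def gaussian_log_score(observation, mean, std):
    """Log-likelihood (up to an additive constant) of an observed
    bend/orientation vector under an axis-aligned Gaussian template.
    Dimensions with std = np.inf are treated as irrelevant and
    contribute nothing to the score. Higher is a better match.
    """
    obs = np.asarray(observation, dtype=float)
    mean = np.asarray(mean, dtype=float)
    std = np.asarray(std, dtype=float)
    finite = np.isfinite(std)          # skip don't-care dimensions
    z = (obs[finite] - mean[finite]) / std[finite]
    return -0.5 * np.sum(z ** 2) - np.sum(np.log(std[finite]))

# Six dimensions: five finger bends plus one orientation value.
mean = [0.90, 0.10, 0.80, 0.85, 0.90, 0.0]
std = [0.05, 0.05, 0.05, 0.05, 0.05, np.inf]  # orientation ignored
```

With this scoring, an observation whose orientation differs wildly from the template still scores identically as long as the finger bends match, which is the "set the std to infinity" trick in action.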


  • Cognitive burden: holding the gesture for 300-600 ms. Is there a study on this? I'd like to see some results. It seems user-dependent, especially if the user is a "power-user" or a "n00blet."


  • What happens when more than one posture is below the distance matching threshold? Do they just pick the lowest distance?


  • In section 5 they mention the "normal consumer grade computer". Granted, you can get a quad core, 4 GiB RAM, 256 MiB graphics card rig from Dell for $1500. But "normal consumer grade" is probably closer to the $300 Acer/e-Machines mom and dad buy you from Wal-Mart. Specifications would be nice for their target machine.



BiBTeX



@inproceedings{deller2006flexibleGesture
,author={Matthias Deller and Achim Ebert and Michael Bender and Hans Hagen}
,title={Flexible Gesture Recognition for Immersive Virtual Environments}
,booktitle={Tenth International Conference on Information Visualization, 2006 (IV'06)}
,year={2006}
,month={July}
,pages={563--568}
,doi={10.1109/IV.2006.55}
,issn={1550-6037}
}

4 comments:

Brandon said...

I think your probability/Gaussian distribution approach sounds interesting. Hard to say whether it would be an improvement or not though since no actual accuracy rates were given. =/

Grandmaster Mash said...

I'd rather not have computer specifications listed for each paper I see, unless the computers they use are extremely unordinary. Let's not backtrack 3 decades to when the processor you have actually makes a difference.

- D said...

I still think computer specs are important in some cases, especially if you're claiming your computer is "consumer grade" and available to the masses.

Paul Taele said...

I liked the paper as it was. It was a convenient choice for a paper to introduce the kinds of topics we'll be expecting for the course. Maybe too convenient, though... It had quite a textbook-like tone, which is fine in a textbook that doesn't need results. For a research paper, how it got away without publishing them is beyond me.