Summary
They're trying to build a device-independent recognition framework, so you can plug in data from any device. The first layer is the raw data, and each sensor value is application dependent. The second layer is a set of predicates that combine raw values into postures. The third layer is a set of temporal predicates that describe changes in posture over time. The fourth layer is a set of gestural templates that assign temporal patterns of postures to specific gestures.
Sensor values from layer 1 are mapped to predicates by hand for each device. Templates are added once, combining predicates. They use a bunch of neural networks to combine predicates into temporal data, and a bunch more neural networks to combine temporal data into gestures. All of these networks are trained with a whole lot of data.
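To make the four layers concrete, here is a rough sketch of how the pieces stack. It is not the paper's implementation: the paper uses neural networks for the upper layers, and this swaps in plain boolean rules so the data flow is easy to follow. The sensor names (`index_flex`, `thumb_flex`), the 0.5 threshold, the predicate names, and the gesture templates are all made up for illustration.

```python
from typing import Callable, Dict, List, Sequence

RawFrame = Dict[str, float]    # layer 1: one reading per sensor, device specific
Predicates = Dict[str, bool]   # layers 2 and 3: named true/false facts


def glove_predicates(frame: RawFrame) -> Predicates:
    """Layer 2 for a hypothetical flex-sensor glove (hand-written, per device)."""
    return {
        "index_bent": frame["index_flex"] > 0.5,  # 0.5 is an arbitrary threshold
        "thumb_bent": frame["thumb_flex"] > 0.5,
    }


def temporal_predicates(postures: Sequence[Predicates]) -> Predicates:
    """Layer 3: describe how the postures change across the window."""
    first, last = postures[0], postures[-1]
    return {
        "index_closing": (not first["index_bent"]) and last["index_bent"],
        "thumb_held_bent": all(p["thumb_bent"] for p in postures),
    }


# Layer 4: each gesture template lists the temporal predicates it requires.
TEMPLATES: Dict[str, Predicates] = {
    "curl_index": {"index_closing": True, "thumb_held_bent": True},
    "open_hand": {"index_closing": False, "thumb_held_bent": False},
}


def recognize(
    frames: Sequence[RawFrame],
    to_predicates: Callable[[RawFrame], Predicates],
) -> List[str]:
    """Run raw frames up through all four layers and return matching gestures."""
    postures = [to_predicates(f) for f in frames]   # layer 1 -> layer 2
    observed = temporal_predicates(postures)        # layer 2 -> layer 3
    return [
        name
        for name, required in TEMPLATES.items()     # layer 3 -> layer 4
        if all(observed.get(key) == value for key, value in required.items())
    ]


frames = [
    {"index_flex": 0.1, "thumb_flex": 0.8},
    {"index_flex": 0.9, "thumb_flex": 0.9},
]
print(recognize(frames, glove_predicates))  # prints ['curl_index']
```

Only `glove_predicates` knows anything about the device. Supporting a different device would mean writing another function like it, while `temporal_predicates`, `TEMPLATES`, and `recognize` stay the same.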
Discussion
They leave anything with temporal data out of their experiments, so they only test the ASL letters excluding Z and J. This is pretty cheesy. Also, they are shooting for device independence, but you still have to map and train the connections between raw sensor values and predicates by hand, for each device. I understand you'd have to do this for any application, but it seems to defeat their purpose. I guess their benefits come at the higher levels, where you use predicates regardless of how they were constructed.
This seems crazy complicated for /very/ little accuracy. Using neural networks to classify the static ASL letters (all but Z and J), they only get 67% accuracy. Other approaches are able to get close to 95-99% for the same data. I guess things are a little too complex.
 
 
1 comment:
I see what you did there. I doubt it.