Monday, April 14, 2008

Fels - Glove-Talk II

S. Sidney Fels and Geoffrey E. Hinton. "Glove-TalkII—A Neural-Network Interface which Maps Gestures to Parallel Formant Speech Synthesizer Controls." IEEE Transactions on Neural Networks, vol. 9, no. 1, January 1998.

Summary



The input devices are a CyberGlove with a Polhemus 6-DOF tracker, a Contact Glove that measures contacts between the fingers and thumb, and a foot pedal. Three neural networks extract speech parameters from these devices and feed them to a parallel formant speech synthesizer. Hand height controls pitch; the pedal controls volume.
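To make the direct controls concrete, here's a toy mapping from normalized hand height to pitch. The paper's actual pitch range and mapping shape aren't given here, so the numbers below are made up:

    def pitch_from_height(z_norm, f_min=80.0, f_max=300.0):
        """Hypothetical linear map from normalized hand height to F0 in Hz."""
        z = min(max(z_norm, 0.0), 1.0)      # clamp to [0, 1]
        return f_min + z * (f_max - f_min)

    print(pitch_from_height(0.5))  # -> 190.0 Hz at mid height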

A V/C network decides whether the user is trying to make a vowel sound or a consonant sound. Its inputs are the finger flex values, it has five sigmoid feed-forward hidden units, and its output is the probability that the user intends a vowel. The user signals a vowel by keeping all fingers unbent and the hand open.
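Here's roughly what that network looks like as code. This is a minimal sketch: the weights W1, b1, W2, b2 and the input dimension D would come from training, so everything below is illustrative rather than the paper's exact net:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def vc_network(flex, W1, b1, W2, b2):
        """P(vowel) for the current hand shape.

        flex   : (D,) finger flex values from the CyberGlove
        W1, b1 : weights/bias of the 5 sigmoid hidden units, shapes (5, D), (5,)
        W2, b2 : weights/bias of the sigmoid output unit, shapes (5,), scalar
        """
        hidden = sigmoid(W1 @ flex + b1)   # five hidden activations
        return sigmoid(W2 @ hidden + b2)   # scalar probability of a vowel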

A second network identifies which vowel the user intends. Vowels are laid out by XY hand position, as measured by the Polhemus; an RBF network matches the current position against the vowel targets and outputs the corresponding parameters for the speech synth.
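A normalized Gaussian RBF lookup in that XY space might look like this. The centers, widths, and per-vowel parameter vectors would come from training; this is a sketch, not the paper's exact formulation:

    import numpy as np

    def vowel_network(xy, centers, sigmas, params):
        """Blend stored vowel parameters by RBF activation.

        xy      : (2,) hand position from the Polhemus
        centers : (K, 2) one center per vowel target in the XY plane
        sigmas  : (K,) RBF widths
        params  : (K, P) synthesizer parameters for each vowel
        """
        d2 = np.sum((centers - xy) ** 2, axis=1)   # squared distance to each target
        act = np.exp(-d2 / (2.0 * sigmas ** 2))    # Gaussian activations
        act /= act.sum()                           # normalize into blend weights
        return act @ params                        # interpolated vowel parameters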

The last network uses the Contact Glove data to determine which fingers are touching the thumb. Each consonant phoneme is mapped to a distinct hand configuration, and the network pattern-matches the current configuration against them: the input is flex values, the output is consonant speech synth parameters.
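That matching could be sketched the same way as the vowel net, only the "space" is hand configuration rather than XY position. Again, the prototypes and parameters here are placeholders, not the paper's values:

    import numpy as np

    def consonant_network(flex, protos, sigmas, params):
        """Soft-match the current flex vector against stored consonant shapes.

        flex   : (D,) current flex/contact values
        protos : (K, D) one stored hand configuration per consonant
        sigmas : (K,) match widths
        params : (K, P) synthesizer parameters for each consonant
        """
        d2 = np.sum((protos - flex) ** 2, axis=1)  # distance to each prototype
        act = np.exp(-d2 / (2.0 * sigmas ** 2))
        act /= act.sum()
        return act @ params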

After 100 hours of training by one poor sap, who seems to have provided around 2000 input examples, the user can produce "intelligible and somewhat natural sounding" speech, with the added bonus that he "finds it difficult to speak quickly, pronounce polysyllabic words, and speak spontaneously."

Discussion



First, a caveat: this is a neat idea. I've not seen gesture recognition applied to something like this. As far as the idea goes, I'd give it an 8.5/10. It's also important to remember that humans, using their mouth parts and vocal tract, take what...five years?...to learn to produce decent speech. So of course something like this will come with a high training cost.

Second, the approach here is poor. The system is far too complicated, with all the pedals and hand-wavey motions. One obvious simplification is to remove the second glove (the Contact Glove) completely. The authors don't really say what it's used for, and it doesn't seem to do much, especially if the pedal can control stops, etc.

As for the vowel and consonant networks, they're basically performing a nearest-neighbor lookup. Why not do exactly that and make things much simpler? The one thing the networks buy you is function fitting: they interpolate the synthesizer parameters, so the speech blends smoothly as the hand moves from one sound to another. But I think nearest neighbor would work well; something like the sketch below.
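For comparison, the nearest-neighbor version is a few lines. This is my sketch, not anything from the paper; "examples" would be the recorded training inputs and "params" the synth parameters stored with them:

    import numpy as np

    def nearest_neighbor_params(query, examples, params):
        """Return the synth parameters of the closest stored example."""
        i = np.argmin(np.sum((examples - query) ** 2, axis=1))
        return params[i]

The tradeoff is exactly the one above: this switches hard between stored sounds instead of interpolating between them.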

There are also ways to compute the centers and variances for the RBFs in their networks from the training data itself, for example by clustering. No need for hand-picked or hard-coded values; see the sketch below.
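For example, k-means over the recorded training inputs gives the centers, and the spread of each cluster gives a variance. This sketch uses scikit-learn, which the authors obviously didn't have, purely to show the idea:

    import numpy as np
    from sklearn.cluster import KMeans

    def fit_rbf_centers(X, k):
        """Estimate RBF centers and widths from training data.

        X : (N, D) recorded training inputs
        k : number of RBF units
        """
        km = KMeans(n_clusters=k, n_init=10).fit(X)
        centers = km.cluster_centers_
        # width of each unit = RMS distance of its assigned points
        sigmas = np.array([
            np.sqrt(np.mean(np.sum((X[km.labels_ == j] - centers[j]) ** 2, axis=1)))
            for j in range(k)
        ])
        return centers, np.maximum(sigmas, 1e-3)   # floor to avoid zero widths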

So if the idea gets 8.5/10, their execution gets a 3/10.
