Thursday, November 8, 2007

Adler and Davis -- Speech and Sketching

Adler, A., and R. Davis. "Speech and sketching: An empirical study of multimodal interaction." EUROGRAPHICS, 2007.

Summary

Adler and Davis set up a user study to analyze the ways in which speech and sketching interact in a multimodal interface. They built software that runs on two Tablet PCs and mirrors in real time on one what is drawn on the other, offering a multitude of pen colors, highlighting colors, the ability to erase, and pressure-based line thickness. This let a participant (one of 18 students) and an experimenter interact with each other. The participants completed four drawings, including electronics schematics, a floor plan, and a project. The schematics were available for study before the drawing began but not during it, keeping all four tasks as free-form as possible.

The authors recorded the drawings, video, and audio, and synchronized them all. They then labeled all the data for analysis. Among their main findings:

  • Creation/writing strokes accounted for the vast majority of the strokes (90%) and of the ink (80%) in the sketch
  • Color changes were used to denote different groupings of strokes and to add emphasis (a toy sketch of exploiting this follows the list)
  • Most of the speech was broken and repetitive. This would make it hard for a full speech recognizer to analyze, but the repetitions provided clues about the user's intent.
  • Speech occurred at the same time the referenced objects were being drawn
  • Objects were mentioned in speech in the same order they were drawn
  • The open-ended nature of the sketching and speech interaction meant that even simple questions from the experimenter evoked a great deal of speech and clarification from the participant
  • Parts of the speech that did not relate to the drawing itself still gave hints about the user's intentions
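If a recognizer wanted to act on the color observation above, the simplest move I can imagine is to treat pen color as a grouping hint. Here's a minimal toy sketch of that idea; the data format and names are my own assumptions, not anything from the paper:

    # Toy sketch (my assumption, not the authors' system): treat pen color
    # as a hint that strokes belong together, per the color-grouping finding.
    from collections import defaultdict

    # Each stroke: (stroke_id, color) -- toy data.
    strokes = [(1, "black"), (2, "black"), (3, "red"), (4, "red"), (5, "black")]

    groups = defaultdict(list)
    for stroke_id, color in strokes:
        groups[color].append(stroke_id)

    print(dict(groups))  # {'black': [1, 2, 5], 'red': [3, 4]}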


Analyzing these results, the authors found that, for the most part, speech began before sketching when considering phrase groups broken up by pauses in the speech. However, when looking at word groups (saying "diode" while drawing one), the sketch usually came first. Additionally, the time difference between speech and sketch onsets was statistically significant.
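To make those two alignment measures concrete, here is a minimal sketch of how the onset comparison could be computed, assuming the speech and sketch events have already been timestamped and paired up (the pairing, which the authors did by hand-labeling, is the hard part). The data and names are illustrative, not from the paper:

    # Hypothetical sketch: measure how far sketch onsets lag speech onsets
    # at two granularities. All data here is made up for illustration.
    from statistics import mean

    # Each pair: (speech_onset_sec, sketch_onset_sec) for one matched unit,
    # i.e. a phrase group or a word group paired with its strokes.
    phrase_pairs = [(1.2, 2.0), (5.0, 5.9), (9.4, 9.1)]   # toy data
    word_pairs   = [(2.3, 1.8), (6.1, 5.7), (9.9, 10.2)]  # toy data

    def mean_offset(pairs):
        """Mean of (sketch onset - speech onset); positive means speech led."""
        return mean(sketch - speech for speech, sketch in pairs)

    print(f"phrase groups: mean offset {mean_offset(phrase_pairs):+.2f}s")
    print(f"word groups:   mean offset {mean_offset(word_pairs):+.2f}s")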

So, overall, giving the user colors to play with and listening to their speech gives a recognition system much more to work with, and a better shot at doing a good job.

Discussion

I found it interesting that the participants gave up more information than was needed or asked for when answering the experimenter's questions, as if they wanted to make sure the experimenter understood completely. I wonder if users would give up as much information if a human were not present. I hypothesize that a person would just expect a computer to magically understand what was meant and would not talk to it. Even in a Wizard-of-Oz experiment, with a human secretly posing as the computer, I think the participant would speak much less frequently.

I don't think the seemingly contradictory information in Tables 1 and 2 is surprising. If my speech is supposed to be informative, I'm going to include extra information that isn't directly related to the actual shapes I'm drawing. But as I draw things, I will put words in my speech to let the observer know what I just drew.

I wonder how often pauses in the user's speech accurately reflected boundaries between phrase groups and drawings. How often did users just flow from one thing to the next? My guess is that pauses are a pretty darn good indicator, much as pen speed is for corner finding, but I'm just wondering. Also, what is the reason for the differences in the speech/sketch alignments when comparing word groups and phrase groups?
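As a thought experiment on the pause question, a minimal pause-based segmenter might look like the following. The threshold, data format, and example are all my own assumptions, not from the paper:

    # Hypothetical sketch: split a timestamped word stream into phrase
    # groups wherever the silence between words exceeds a threshold.
    PAUSE_THRESHOLD = 0.5  # seconds; would need tuning on real transcripts

    # Each word: (text, start_sec, end_sec) -- toy data.
    words = [("this", 0.0, 0.2), ("is", 0.25, 0.4), ("a", 0.45, 0.5),
             ("diode", 0.55, 1.0), ("and", 2.1, 2.3), ("here", 2.35, 2.6)]

    def phrase_groups(words, threshold=PAUSE_THRESHOLD):
        groups, current = [], [words[0]]
        for prev, word in zip(words, words[1:]):
            if word[1] - prev[2] > threshold:  # long pause: new phrase group
                groups.append(current)
                current = []
            current.append(word)
        groups.append(current)
        return groups

    for group in phrase_groups(words):
        print(" ".join(w[0] for w in group))
    # -> "this is a diode" then "and here"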

The authors say that complete unrestricted speech recognition (narrative) is unduly difficult. Well, wasn't unrestricted sketch recognition the same way a decade ago? Things are hard. That's why PhDs have jobs. If things were easy, you wouldn't have to study for ten years just to get good at them. Unrestricted speech recognition is unduly hard right now, and wasn't a tractable option for this paper, but it will come. Understanding "enough" of the input is good for now, but what about the rest, like "junk" DNA? I bet it's important too.

2 comments:

Brian David Eoff said...

Taking the whole JFK approach, paraphrasing: "we choose to do these things not because they are easy, but because they are hard." Davis providing the follow-up comments about how Adler would do question construction by way of the physics engine was nice.

Grandmaster Mash said...

After your comment on my blog, I can definitely imagine you calling a voice menu system at 3am.

"Hello? Will you be my friend?"
"PARA ESPANOL, PRENSA DOS"

On a more serious note, I do think user behavior will change either when a wizard is not present or when other non-wizards are watching them. That sentence does not read like it is actually serious.