Summary
The authors examine the relationship between gestures and meaning. They look for a correspondence between particular gestures and topics, irrespective of the speaker making the gesture. If gestures are speaker-independent and depend only on topic, this could improve gesture recognition accuracy.
They set up a topic-author model. Gesture features are extracted from a series of conversations about different topics, in which the speaker gestures to accompany his or her speech. They model gesture features with normal distributions, and the topic and speaker gesture distributions with multinomials drawn from Dirichlet distributions (the Dirichlet compound multinomial, or Polya distribution). Having learned the parameters of their models, they use Bayesian inference and statistical significance tests to determine that 12% of all gestures belong to specific topics. Thus, if we have prior information about the topic (i.e., from speech), we can use that contextual information to improve gesture recognition.
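To make the model description concrete, here is a minimal sketch of a generative process of this kind. It is my own illustration, not the authors' implementation: the function names, the one-dimensional feature, the per-gesture switch between a topic distribution and a speaker distribution, and all parameter values are assumptions. The default topic_prob of 0.12 only echoes the 12% figure from the summary above.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw a probability vector from a Dirichlet via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_gestures(n_gestures, n_forms=5, n_topics=3, n_speakers=4,
                      topic_prob=0.12, alpha=0.5, seed=0):
    """Sample gestures from a toy topic/speaker mixture (all values hypothetical)."""
    rng = random.Random(seed)
    prior = [alpha] * n_forms
    # One multinomial over gesture forms per topic and per speaker,
    # each drawn from a symmetric Dirichlet prior.
    topic_dists = [sample_dirichlet(prior, rng) for _ in range(n_topics)]
    speaker_dists = [sample_dirichlet(prior, rng) for _ in range(n_speakers)]
    # Each gesture form emits a 1-D feature from its own normal distribution.
    form_means = [rng.gauss(0.0, 5.0) for _ in range(n_forms)]

    gestures = []
    for _ in range(n_gestures):
        topic = rng.randrange(n_topics)
        speaker = rng.randrange(n_speakers)
        # Assumed switch: the gesture is either topic-specific or speaker-specific.
        from_topic = rng.random() < topic_prob
        dist = topic_dists[topic] if from_topic else speaker_dists[speaker]
        form = rng.choices(range(n_forms), weights=dist)[0]
        feature = rng.gauss(form_means[form], 1.0)
        gestures.append({"topic": topic, "speaker": speaker,
                         "from_topic": from_topic, "form": form,
                         "feature": feature})
    return gestures
```

Inference in the paper runs in the opposite direction: given observed features, it estimates which gestures are better explained by the topic than by the speaker. The sketch only shows the forward sampling step that such inference would invert.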
Discussion
The paper's purpose is to look for a link between gestures and topic. They find one, but this isn't too surprising given their limited dataset. Furthermore, many of their videos (from which gestures and speech were extracted) were very limited in scope. My hypothesis is that with data of more general scope, the percentage of topic-specific gestures would drop.
It's true that about 10% of word occurrences (about 80% of the vocabulary; both numbers are from memory) in large corpora are topic-specific. Such words are called content-carrying, since they can identify the topic of a document. However, I don't think there are that many distinct gestures, and gestures are reused across topics far more than words are.