For each of the two actions (aiming/shooting and reloading) we train a separate HMM with 5 states. To train the HMMs we annotated 2 minutes of the video data, which contained 13 aiming/shooting actions and 6 reloading actions. Everything that is neither aiming/shooting nor reloading is modeled by a third class, the ``other'' class (10 sequences in total). The sequences were split into a training set of 7 aiming actions, 4 reloading actions and 3 ``other'' sequences; the remaining sequences (6 aiming, 2 reloading and 7 ``other'') were used as the test set. Notably, the actions differ greatly in length (between 0.3 s and 2.25 s). Table 2 shows the confusion matrix of the three action classes.
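The classification scheme described above (one HMM per class, with a test sequence assigned to the class whose model explains it best) can be illustrated with a minimal sketch. It assumes discrete observation symbols and a maximum-likelihood decision rule, neither of which is specified in the text; the toy models use 2 states instead of 5, and all parameter values, the symbol alphabet and the names `log_likelihood`, `classify` and `MODELS` are hypothetical.

```python
import math


def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM."""
    n = len(start)
    # Start distribution weighted by the emission of the first symbol.
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    log_p = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate through the transition matrix, then weight by emission.
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][obs[t]]
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the model cannot produce this sequence
        log_p += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_p


def classify(obs, models):
    """Return the class whose HMM gives the sequence the highest likelihood."""
    return max(models, key=lambda name: log_likelihood(obs, *models[name]))


# Hypothetical symbols: 0 = arm extended, 1 = hands near body, 2 = neutral.
# Each model is (start, trans, emit); the values are illustrative, not trained.
TRANS = [[0.7, 0.3], [0.0, 1.0]]
MODELS = {
    "aiming": ([1.0, 0.0], TRANS, [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]]),
    "reloading": ([1.0, 0.0], TRANS, [[0.1, 0.8, 0.1], [0.05, 0.9, 0.05]]),
    "other": ([1.0, 0.0], TRANS, [[0.2, 0.2, 0.6], [0.1, 0.2, 0.7]]),
}

print(classify([0, 0, 0, 0], MODELS))  # aiming
print(classify([1, 1, 1], MODELS))     # reloading
```

Because the forward recursion sums over all state paths, sequences of very different length can be scored by the same model, which matters here given the spread in action durations.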
Aiming is relatively distinctive with respect to reloading and ``other'', since the arm is stretched out during aiming; this probably explains the perfect recognition of the aiming sequences. Reloading and ``other'', however, are difficult to distinguish, since the reloading action takes place in a very small region of the image (close to the body) and is sometimes barely visible.
These preliminary results are encouraging, but they were obtained on perfectly segmented data and a very small set of actions. However, one of the intrinsic properties of HMMs is that they can deal with unsegmented data. Furthermore, enlarging the task vocabulary will enable the use of language and context models, which can be applied at different levels and should help the recognition of single tasks.
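One common way in which HMMs handle unsegmented data is to join the per-class models into a single composite HMM and decode the most likely state path with the Viterbi algorithm; the class labels along that path give the segmentation. The sketch below illustrates this idea only and is not the method used in the experiments above. The composite construction, the switching probability `p_switch`, the single-state toy models and the names `segment` and `MODELS` are all assumptions.

```python
import math


def _log(x):
    """Logarithm that maps zero probability to minus infinity."""
    return math.log(x) if x > 0.0 else float("-inf")


def segment(obs, models, p_switch=0.1):
    """Viterbi decoding over a composite HMM; returns one class label per frame.

    models maps a class name to (start, trans, emit). From any state, the
    composite model stays in its class with probability 1 - p_switch and
    otherwise enters another class through that class's start distribution.
    """
    names = list(models)
    k = len(names)
    states = [(c, i) for c in names for i in range(len(models[c][0]))]

    def log_trans(src, dst):
        (c, i), (d, j) = src, dst
        if c == d:
            return _log((1.0 - p_switch) * models[c][1][i][j])
        return _log(p_switch / (k - 1) * models[d][0][j])

    def log_emit(state, symbol):
        c, i = state
        return _log(models[c][2][i][symbol])

    # Initialisation: every class is equally likely at the first frame.
    score = [_log(models[c][0][i] / k) + log_emit((c, i), obs[0])
             for c, i in states]
    back = []
    for symbol in obs[1:]:
        prev, score, pointers = score, [], []
        for dst in states:
            best = max(range(len(states)),
                       key=lambda s: prev[s] + log_trans(states[s], dst))
            pointers.append(best)
            score.append(prev[best] + log_trans(states[best], dst)
                         + log_emit(dst, symbol))
        back.append(pointers)

    # Backtrack from the best final state to recover the full path.
    s = max(range(len(states)), key=lambda idx: score[idx])
    path = [s]
    for pointers in reversed(back):
        s = pointers[s]
        path.append(s)
    path.reverse()
    return [states[s][0] for s in path]


# Hypothetical single-state models over three discrete symbols.
MODELS = {
    "aiming": ([1.0], [[1.0]], [[0.9, 0.05, 0.05]]),
    "reloading": ([1.0], [[1.0]], [[0.05, 0.9, 0.05]]),
    "other": ([1.0], [[1.0]], [[0.05, 0.05, 0.9]]),
}

labels = segment([0, 0, 0, 1, 1, 1, 2, 2, 2], MODELS)
print(labels)
```

In such a composite model, `p_switch` plays the role of a very simple context model: lowering it penalizes frequent class changes, which is the kind of constraint a richer language model over tasks would refine.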