next up previous
Next: User Tasks Up: The Patrol Task Previous: Apparatus

   
Location

User location often provides valuable clues to the user's context. For example, if the user is in his supervisor's office, he is probably in an important meeting and does not want to be interrupted for phone calls or e-mail except for emergencies. By gathering data over many days, the user's motions throughout the day might be modeled. This model may then be used to predict when the user will be in a certain location and for how long [9]. Such information is invaluable for network caching in case the user's wireless network does not provide coverage everywhere on a campus.

Today, most outdoor positioning is performed in relation to the Global Positioning System (GPS). Differential systems can obtain accuracies of less than one meter, and update rates of one second are common. However, indoor systems require different methods. Current systems such as active badges [21,15,4,10] and beacon architectures [6,14,18] require increased infrastructure for higher accuracy. This increased infrastructure implies increased installation and maintenance. However, in the Patrol task, we attempt to determine location based solely on the images provided by the Patrol hat cameras, which are fixed-cost on-body equipment.

The Patrol environment consists of 14 rooms that are defined by their strategic importance to the players. The rooms' boundaries were not chosen to simplify the vision task but are based on the long standing conventions of game play. The playing areas include hallways, stairwells, classrooms, and mirror image copies of these classrooms whose similarities and ``institutional'' decor make the recognition task difficult. However, four of the possible rooms have relatively distinct coloration and luminance combinations, though two of these are not often traveled.

Hidden Markov models (HMM's) were chosen to represent the environment due to their potential language structure and excellent discrimination ability for varying time domain processes. For example, rooms may have distinct regions or lighting that can be modeled by the states in an HMM. In addition, the previous known location of the user helps to limit his current possible location. By observing the video stream over several minutes and knowing the physical layout of the building, many possible paths may be hypothesized and the most probable chosen based on the observed data. Prior knowledge about the mean time spent in each area may also be used to weight the probability of a given hypothesis. HMM's fully exploit these attributes. A full review of HMM's is not appropriate here, but the reader should see [17,2,11] for HMM implementation details and tutorials.

As a first attempt, the mean colors of three video patches are used to construct a feature vector in real-time. One patch is taken from approximately the center of the image of the forward looking camera. The means of the red, green, blue, and luminance pixel values are determined, creating a four element vector. This patch varies significantly due to the continual head motion of the player. The next patch is derived from the downward looking camera in the area just to the front of the player and out of range of average hand and foot motion. This patch represents the coloration of the floors. Finally, since the nose is always in the same place relative to the downward looking camera, a patch is sampled from the nose. This patch provides a hint at lighting variations as the player moves through a room. Combined, these patches provide a 12 element feature vector.

Approximately 45 minutes of Patrol video were analyzed for this experiment. Processing occurs at 10 frames per second on an SGI O2. Missed frames are filled by simply repeating the last feature vector up to that point. The video is then subsampled to six frames per second to create a manageable database size for HMM analysis. The video is hand annotated using a VLAN system to provide the training database and a reference transcription for the test database. Whenever the player steps into a new area, the video frame number and area name are recorded. Both the data and the transcription are converted to Entropic's HTK [23] format for training and testing.

For this experiment, 24.5 minutes of video, comprising 87 area transitions, are used for training the HMMs. As part of the training, a statistical (bigram) grammar is generated. This ``grammar'' is used in testing to weight those rooms which are considered next based on the current hypothesized room. An independent 19.3 minutes of video, comprising 55 area transitions, are used for testing. Note that the computer must segment the video at the area transitions as well as label the areas properly.

Table 1 demonstrates the accuracies of the different methods tested. For informative purposes, accuracy rates are reported both for testing on the training data and the independent test set. Accuracy is calculated by

\begin{displaymath}Acc = \frac{N - D - S - I}{N} \end{displaymath}

where N is the total number of areas in the test set, D (deletions) is the number of area changes not detected, S (substitutions) is the number of areas falsely labeled, and I (insertions) is the number of area transitions falsely detected. Note that, since all errors are counted against the accuracy rate, it is possible to get large negative accuracies by having many insertions, as shown by several entries of the table.
 
Table: Patrol area recognition accuracy
method training set independent  
    test set  
1-state HMM 20.69% -1.82%  
2-state HMM 51.72% 21.82%  
3-state HMM 68.97% 81.82%  
4-state HMM 65.52% 76.36%  
5-state HMM 79.31% 40.00%  
Nearest Neighbor -400% -485.18%  
 

The simplest method for determining the current room is to determine the smallest Euclidean distance between a test feature vector with the means of the feature vectors comprising the different room examples in the training set. In actuality, the mean of 200 video frames surrounding a given point in time is compared to the room classifications. Since the average time spent within an area is approximately 600 video frames (or 20 seconds), this window should smooth the data such that the resulting classification shouldn't change due to small variations in a given frame. However, many insertions still occur, causing the large negative accuracies shown in Table 1.

Given the nearest neighbor method as a comparison, it is easy to see how the time duration and contextual properties of the HMM's improve recognition. Table 1 shows that the accuracy of the HMM system, when tested on the training data, improves as more states are used in the HMM. This results from the HMM's overfitting the training data. Testing on the independent test set shows that the best model is a 3-state HMM, which achieves 82% accuracy. The topology for this HMM is shown in Figure 4. In some cases accuracy on the test data is better than the training data. This effect is due to the grammar which limits the possible transitions between areas. Once a wrong turn has been made, the system can pass through many areas before converging again with the correct path. The longer the test path, the higher the potential for being misled for extended periods of time.

  
Figure: HMM topology for Patrol.
\begin{figure}
\centerline{
\epsfysize=.3in
\epsfbox{/mas/vision/users/testarne/mypapers/images/3state-hmm.ps}
}
\end{figure}

Accuracy is but one way of evaluating the methods. Another important attribute is how well the system determines when the player has entered a new area. Figure 5 compares the 3-state HMM and nearest neighbor methods to the hand-labeled video. Different rooms are designated by two letter identifiers for convenience. As can be seen, the 3-state HMM system tends to be within a few seconds of the correct transition boundaries while the nearest neighbor system oscillates between many hypotheses. Changing the size of the averaging window might improve accuracy for the nearest neighbor system. However, the constantly changing pace of the Patrol player necessitates a dynamically changing window. This constraint would significantly complicate the method. In addition, a larger window would result in less distinct transition boundaries between areas.

  
Figure: Typical detection of Patrol area transitions.
\begin{figure}
\centerline{
\epsfysize=1in
\epsfbox{/mas/vision/users/testarne/mypapers/images/patrol-transitions.ps}
}
\end{figure}

As mentioned earlier, one of the strengths of the HMM system is that it can collect evidence over time to hypothesize the player's path through several areas. How much difference does this incorporation of context make on recognition? To determine this, the test set was segmented by hand, and each area was presented in isolation to the 3-state HMM system. At face value this should be a much easier task since the system does not have to segment the areas as well as recognize them. However, the system only achieved 49% accuracy on the test data and 78% accuracy on the training data. This result provides striking evidence of the importance of using context in this task and hints at the importance of context in other user activities.

While the current accuracy rate of 82% is good, several significant improvements can be made. Optical flow or inertial sensors could limit frame processing to those times when the player is moving forward. This would eliminate much of the variation, often caused by stand-offs and firefights, between examples of moving through a room. Similarly, the current system could be combined with optical flow to compensate for drift in inertial trackers and pedometers. Windowing the test data to the size of a few average rooms could improve HMM accuracies as well. Additionally, instead of the average color of video patches, color histograms could be used as feature vectors. Finally, all these techniques could be applied to create an automatic map of a new building as the Patrol player explored it.


next up previous
Next: User Tasks Up: The Patrol Task Previous: Apparatus
Thad Starner
1998-09-22