Today, most outdoor positioning is performed in relation to the Global Positioning System (GPS). Differential systems can obtain accuracies of less than one meter, and update rates of one second are common. However, indoor systems require different methods. Current systems such as active badges [21,15,4,10] and beacon architectures [6,14,18] require increased infrastructure for higher accuracy. This increased infrastructure implies increased installation and maintenance. However, in the Patrol task, we attempt to determine location based solely on the images provided by the Patrol hat cameras, which are fixed-cost on-body equipment.
The Patrol environment consists of 14 rooms that are defined by their strategic importance to the players. The rooms' boundaries were not chosen to simplify the vision task but are based on the long standing conventions of game play. The playing areas include hallways, stairwells, classrooms, and mirror image copies of these classrooms whose similarities and ``institutional'' decor make the recognition task difficult. However, four of the possible rooms have relatively distinct coloration and luminance combinations, though two of these are not often traveled.
Hidden Markov models (HMM's) were chosen to represent the environment due to their potential language structure and excellent discrimination ability for varying time domain processes. For example, rooms may have distinct regions or lighting that can be modeled by the states in an HMM. In addition, the previous known location of the user helps to limit his current possible location. By observing the video stream over several minutes and knowing the physical layout of the building, many possible paths may be hypothesized and the most probable chosen based on the observed data. Prior knowledge about the mean time spent in each area may also be used to weight the probability of a given hypothesis. HMM's fully exploit these attributes. A full review of HMM's is not appropriate here, but the reader should see [17,2,11] for HMM implementation details and tutorials.
As a first attempt, the mean colors of three video patches are used to construct a feature vector in real-time. One patch is taken from approximately the center of the image of the forward looking camera. The means of the red, green, blue, and luminance pixel values are determined, creating a four element vector. This patch varies significantly due to the continual head motion of the player. The next patch is derived from the downward looking camera in the area just to the front of the player and out of range of average hand and foot motion. This patch represents the coloration of the floors. Finally, since the nose is always in the same place relative to the downward looking camera, a patch is sampled from the nose. This patch provides a hint at lighting variations as the player moves through a room. Combined, these patches provide a 12 element feature vector.
Approximately 45 minutes of Patrol video were analyzed for this experiment. Processing occurs at 10 frames per second on an SGI O2. Missed frames are filled by simply repeating the last feature vector up to that point. The video is then subsampled to six frames per second to create a manageable database size for HMM analysis. The video is hand annotated using a VLAN system to provide the training database and a reference transcription for the test database. Whenever the player steps into a new area, the video frame number and area name are recorded. Both the data and the transcription are converted to Entropic's HTK [23] format for training and testing.
For this experiment, 24.5 minutes of video, comprising 87 area transitions, are used for training the HMMs. As part of the training, a statistical (bigram) grammar is generated. This ``grammar'' is used in testing to weight those rooms which are considered next based on the current hypothesized room. An independent 19.3 minutes of video, comprising 55 area transitions, are used for testing. Note that the computer must segment the video at the area transitions as well as label the areas properly.
Table 1 demonstrates the accuracies of the different
methods tested. For informative purposes, accuracy rates are reported
both for testing on the training data and the independent test set.
Accuracy is calculated by
The simplest method for determining the current room is to determine the smallest Euclidean distance between a test feature vector with the means of the feature vectors comprising the different room examples in the training set. In actuality, the mean of 200 video frames surrounding a given point in time is compared to the room classifications. Since the average time spent within an area is approximately 600 video frames (or 20 seconds), this window should smooth the data such that the resulting classification shouldn't change due to small variations in a given frame. However, many insertions still occur, causing the large negative accuracies shown in Table 1.
Given the nearest neighbor method as a comparison, it is easy to see
how the time duration and contextual properties of the HMM's improve
recognition. Table 1 shows that the accuracy of
the HMM system, when tested on the training data, improves as more
states are used in the HMM. This results from the HMM's
overfitting the training data. Testing on the independent test
set shows that the best model is a 3-state HMM, which achieves 82%
accuracy. The topology for this HMM is shown in Figure
4. In some cases accuracy on the test data is
better than the training data. This effect is due to the grammar
which limits the possible transitions between areas. Once a wrong
turn has been made, the system can pass through many areas before
converging again with the correct path. The longer the test path, the
higher the potential for being misled for extended periods of time.
Accuracy is but one way of evaluating the methods. Another important
attribute is how well the system determines when the player has
entered a new area. Figure 5 compares the 3-state
HMM and nearest neighbor methods to the hand-labeled video. Different
rooms are designated by two letter identifiers for convenience. As
can be seen, the 3-state HMM system tends to be within a few seconds
of the correct transition boundaries while the nearest neighbor system
oscillates between many hypotheses. Changing the size of the
averaging window might improve accuracy for the nearest neighbor
system. However, the constantly changing pace of the Patrol player
necessitates a dynamically changing window. This
constraint would significantly complicate the method. In addition, a
larger window would result in less distinct transition boundaries
between areas.
As mentioned earlier, one of the strengths of the HMM system is that it can collect evidence over time to hypothesize the player's path through several areas. How much difference does this incorporation of context make on recognition? To determine this, the test set was segmented by hand, and each area was presented in isolation to the 3-state HMM system. At face value this should be a much easier task since the system does not have to segment the areas as well as recognize them. However, the system only achieved 49% accuracy on the test data and 78% accuracy on the training data. This result provides striking evidence of the importance of using context in this task and hints at the importance of context in other user activities.
While the current accuracy rate of 82% is good, several significant improvements can be made. Optical flow or inertial sensors could limit frame processing to those times when the player is moving forward. This would eliminate much of the variation, often caused by stand-offs and firefights, between examples of moving through a room. Similarly, the current system could be combined with optical flow to compensate for drift in inertial trackers and pedometers. Windowing the test data to the size of a few average rooms could improve HMM accuracies as well. Additionally, instead of the average color of video patches, color histograms could be used as feature vectors. Finally, all these techniques could be applied to create an automatic map of a new building as the Patrol player explored it.