Next: Conclusions Up: Video Surveillance of Interactions Previous: Parser


Experimental Results

Here we show results of running the system on data collected in a parking lot at Carnegie Mellon University. The system runs in real time, processing data from either a live video feed or a video tape. The tracker and the event generator run on a 175 MHz R10000 SGI O2 machine. The parser runs on a 200 MHz R4400 SGI Indy.

The tracker runs at approximately 12 fps on 160x120 images. It generally produced unbroken tracks, except under occlusion and extreme lighting changes. The events were mapped using a hand-coded probabilistic classifier of object type (e.g. car or person), which was based on the aspect ratio of the object.
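As a rough illustration of an aspect-ratio classifier of this kind, the sketch below maps a bounding box to soft car/person scores. The logistic form, the function name `classify_by_aspect_ratio` and the parameters `midpoint` and `sharpness` are assumptions made for this example; the hand-coded classifier used in the system is not specified at this level of detail.

```python
import math


def classify_by_aspect_ratio(width, height, midpoint=1.0, sharpness=4.0):
    """Return hypothetical P(car) and P(person) for a bounding box.

    Wide boxes (width > height) lean towards "car", tall boxes towards
    "person". Both `midpoint` and `sharpness` are illustrative values.
    """
    if width <= 0 or height <= 0:
        raise ValueError("bounding box must have positive size")
    ratio = width / height
    # Logistic squashing of the log aspect ratio around `midpoint`.
    p_car = 1.0 / (1.0 + math.exp(-sharpness * math.log(ratio / midpoint)))
    return {"car": p_car, "person": 1.0 - p_car}
```

A soft score, rather than a hard label, lets the parser weigh an ambiguous object type against the grammatical context.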

The parser requires the interaction structure to be described to it as a Stochastic Context-Free Grammar. A partial listing of the grammar employed by our system for the parking lot monitoring task is shown in figure 3. Labels in capitals are the non-terminals, while the terminals, or primitives, are written in small letters. Square brackets enclose the probability associated with each production rule. These probabilities reflect the typicality of the corresponding production rule and of the sequence of primitives that it represents.
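The notation can be made concrete with a small sketch of how such rules might be stored and sanity-checked. The rules below are hypothetical and are not the grammar of figure 3; only the primitive names and the capitals/small-letters convention are taken from the text.

```python
import math
from collections import defaultdict

# Hypothetical toy rules: (left-hand side, right-hand side, probability).
# These are NOT the production rules used by the system.
RULES = [
    ("DROP-OFF", ("car-enter", "car-stop", "PERSON-OUT", "car-exit"), 1.0),
    ("PERSON-OUT", ("person-found", "person-exit"), 0.7),
    ("PERSON-OUT", ("person-found", "SKIP", "person-exit"), 0.3),
    ("SKIP", ("car-enter",), 0.5),
    ("SKIP", ("person-enter",), 0.5),
]


def is_terminal(symbol):
    """Terminals (primitives) are written in small letters."""
    return symbol.islower()


def check_normalized(rules, tol=1e-9):
    """For each non-terminal, report whether its rule probabilities sum to 1."""
    totals = defaultdict(float)
    for lhs, _rhs, prob in rules:
        totals[lhs] += prob
    return {lhs: abs(total - 1.0) <= tol for lhs, total in totals.items()}


def derivation_probability(rule_probs):
    """Probability of a derivation: the product of the rule probabilities used."""
    return math.prod(rule_probs)
```

For example, a derivation that applies the first `DROP-OFF` rule and the first `PERSON-OUT` rule would score `derivation_probability([1.0, 0.7])` under these illustrative values.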

The high-level non-terminals (CAR-THROUGH, PERSON-THROUGH, PERSON-IN, CAR-OUT, CAR-PICK and DROP-OFF) have semantic action blocks associated with them, which are omitted from the figure for brevity. Each such action is a simple script that outputs the corresponding label (such as DROP-OFF) together with all the available data related to the non-terminal (e.g. the starting and ending video frame or time-stamp). The semantic action is invoked when the final state is reached and the resulting maximum-probability parse includes the corresponding non-terminal.
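A minimal sketch of this reporting step follows, assuming the maximum-probability parse is available as a tree. The `ParseNode` type, its field names and the function `run_semantic_actions` are hypothetical and do not reflect the actual parser's data structures.

```python
from dataclasses import dataclass, field

# High-level non-terminals named in the text.
HIGH_LEVEL = {
    "CAR-THROUGH", "PERSON-THROUGH", "PERSON-IN",
    "CAR-OUT", "CAR-PICK", "DROP-OFF",
}


@dataclass
class ParseNode:
    """Hypothetical node of the maximum-probability parse tree."""
    label: str
    start_frame: int
    end_frame: int
    children: list = field(default_factory=list)


def run_semantic_actions(root):
    """Walk the parse in pre-order and report every high-level non-terminal."""
    reports = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.label in HIGH_LEVEL:
            reports.append((node.label, node.start_frame, node.end_frame))
        # Push children in reverse so they are visited left to right.
        stack.extend(reversed(node.children))
    return reports
```

Running the actions only on the final maximum-probability parse means that labels belonging to discarded interpretations are never reported.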

The production rule probabilities have been manually set to plausible values for this domain. Learning these probabilities is an interesting problem that we leave for future work. However, we observed that the grammatical and spatial consistency requirements eliminate the majority of incorrect interpretations. As a result, the system is quite insensitive to the precise values of these probabilities.

Figure 3: A DROP-OFF branch of a simplified grammar describing interactions in a parking lot.
\begin{figure}
{\small {\tt\begin{verbatim}TRACK: CAR-TRACK [.5]
| PERSON-T...
...OST: person-lost [.7]
| SKIP person-lost [.3]\end{verbatim}}
}
\end{figure}

The test data consisted of approximately 15 minutes of video showing several high-level events, such as drop-off and pick-up. The events were staged in the real environment, with real traffic present concurrently with the staged events. The only reason for staging the events was to obtain more examples within 15 minutes of video. The drop-offs and pick-ups were performed by people unfamiliar with the system. The resulting parses were output in real time.

In figures 5 a)-e) we show a sequence of 5 consecutive detections of high-level events. This sequence demonstrates the capability of the system to parse concurrent activities and interactions. The main event in the sequence is the DROP-OFF. While monitoring this activity, the system also detected unrelated high-level events: 2 instances of CAR-THROUGH and a PERSON-THROUGH event. Figure 5 f) shows the temporal extent of the activities shown iconically in figures 5 a)-e).

Figure 4: Results of track mapping on one of the runs of the system. Two subsets of events, outlined in the picture, correspond to DRIVE-IN and DROP-OFF. Interpretation of this data is shown in figure 5.
\begin{figure}
\psfig{figure=Figures/Server_labelCut.eps,width=6.5in}\end{figure}

Figure 5: a) A car passed through the scene while the DROP-OFF was performed. The corresponding track is shown by a sequence of white pixels. b) A person passing through. c) A person left the car and exited the scene. At this moment the system has enough information to emit the DRIVE-IN label. d) The car leaves the scene. The conditions for DROP-OFF are now satisfied and the label is emitted. e) Before the car performing the DROP-OFF exits the scene, it yields to another car passing through, which is shown here. f) Temporal extent of the actions shown in a)-e). Actions related to people are shown in white. The top line of the picture corresponds to label a), the bottom one to label e). Car primitives are drawn in black. The figure clearly demonstrates the concurrency of events. In this figure, primitive events are abbreviated as follows: ce - car-enter, cs - car-stop, cx - car-exit, pe - person-enter, pf - person-found, px - person-exit.
\begin{figure}
\psfig{figure=Figures/FullCaptureAnnotated.eps,width=6.5in}\end{figure}

All of these parses can be traced down to the primitives, which hold the track data. Consequently, the complete track can be reconstructed, as shown by the white traces in figures 5 a)-e). In the longest segment of video, the event generator produces between 150 and 200 events; the exact count depends on the reaction of the tracker to video noise. After tuning the environment map used by the event generator to convert tracks to events, all the high-level interactions were correctly detected.
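The track reconstruction can be sketched as a walk from a parsed non-terminal down to its primitives, collecting the track samples stored at the leaves. The `Node` type and the `(frame, x, y)` sample layout are assumptions made for this illustration, not the system's actual representation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical parse node; only primitives (leaves) carry track samples."""
    label: str
    children: list = field(default_factory=list)
    track: list = field(default_factory=list)  # (frame, x, y) tuples


def reconstruct_track(node):
    """Collect the track samples of all primitives under `node`, ordered by frame."""
    if not node.children:
        return list(node.track)
    samples = []
    for child in node.children:
        samples.extend(reconstruct_track(child))
    # Tuples sort by their first element, i.e. the frame number.
    return sorted(samples)
```

Drawing the returned samples over a frame would give a trace of the kind shown in white in the figures.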


yuri ivanov
1999-02-05