The work presented in this paper spans a broad range, from low-level vision algorithms to high-level techniques used for natural language understanding.
The tracking component of the system uses an adaptive background mixture model, which is similar to that used by Friedman and Russell ([10]). Their method attempts to explicitly classify the pixel values into three separate, predetermined distributions corresponding to the color of the road, the shadows, and the cars. Unfortunately, it is not clear what behavior their system would exhibit for pixels which did not contain these three distributions.
Pfinder ([18]) uses a multi-class statistical model for the tracked objects, but the background model is a single Gaussian per pixel. After an initialization period without objects, the system reports good results indoors. Ridder et al. ([15]) modeled each pixel with a Kalman filter, which made their system more robust to lighting changes in the scene. While this method does have a pixel-wise automatic threshold, it still recovers slowly and does not handle bimodal backgrounds well.
Contextual labeling in our system is performed by a stochastic parser derived from that developed by Stolcke in [17], as previously described in [2]. We extended standard Stochastic Context-Free Grammar (SCFG) parsing to include (1) uncertain input symbols and (2) temporal interval primitives that need to be parsed in a temporally consistent manner. To allow for noisy input, we used the robust grammar approach, similar to that of Aho and Peterson ([1]). We also extended the SCFG parser to accept multi-valued input strings, which allow for the correction of substitution errors, similar to the work of Bunke ([6]).
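The idea of multi-valued input can be sketched as follows. This is a hedged illustration rather than the parser used in this work: it substitutes a probabilistic CYK (Viterbi) recursion over a grammar in Chomsky normal form for the Earley-style parser of [17], omits temporal interval handling, and uses a hypothetical function name (`viterbi_cyk`) and a toy grammar invented for the example. Each input position carries a likelihood for every candidate terminal instead of a single hard decision.

```python
from collections import defaultdict


def viterbi_cyk(observations, lexical, binary, start="S"):
    """Best-parse probability for a sequence of multi-valued input symbols.

    observations: list of dicts {terminal: likelihood}, one per input position
    lexical:      dict {(A, terminal): P(A -> terminal)}
    binary:       dict {(A, B, C): P(A -> B C)}
    """
    n = len(observations)
    best = defaultdict(float)  # (i, j, A) -> best probability of A over [i, j)

    # Spans of width 1: weight each lexical rule by the terminal's likelihood.
    for i, candidates in enumerate(observations):
        for (lhs, terminal), p_rule in lexical.items():
            p = p_rule * candidates.get(terminal, 0.0)
            if p > best[(i, i + 1, lhs)]:
                best[(i, i + 1, lhs)] = p

    # Wider spans: combine two adjacent sub-spans with a binary rule.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for split in range(i + 1, j):
                for (lhs, left, right), p_rule in binary.items():
                    p = p_rule * best[(i, split, left)] * best[(split, j, right)]
                    if p > best[(i, j, lhs)]:
                        best[(i, j, lhs)] = p
    return best[(0, n, start)]


# Toy grammar (assumed): S -> A B, A -> 'a', B -> 'b'.
lexical = {("A", "a"): 1.0, ("B", "b"): 1.0}
binary = {("S", "A", "B"): 1.0}

# The second detector slightly prefers 'a', which would be a substitution error.
soft = [{"a": 0.9, "b": 0.1}, {"a": 0.6, "b": 0.4}]
hard = [{"a": 1.0}, {"a": 1.0}]  # same input after a hard per-symbol decision

p_soft = viterbi_cyk(soft, lexical, binary)
p_hard = viterbi_cyk(hard, lexical, binary)
```

With hard decisions the string "a a" has no parse under the toy grammar, so `p_hard` is zero. With multi-valued input the less likely reading of the second symbol is still available, and the grammar selects it, which is how a substitution error is corrected.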
The work in [8] is fully devoted to the syntactic analysis of interactions and cooperative deterministic processes. It is related to our solution in that it formulates problems similar to ours, cast in terms of cooperative grammatical systems. In contrast, we describe interactions by a single stochastic grammar and use a single parser in an attempt to avoid the computational complexity of the complete Cooperative Distributed (CD) Grammar Systems.
In the area of monitoring long-term complex activities, Courtney ([7]) developed a system that detects activities in a closed environment. The activities include a person leaving an object in a room or taking it out of the room. Perhaps the most complete general solution is described by Brill et al. ([5]), who are working on an Autonomous Video Surveillance system. Brand ([3]) showed the results of detecting manipulations in video using a non-probabilistic grammar. This technique requires relatively high-quality low-level detectors.
Oliver and Rosario ([14]) developed a system for detecting interactions between people, which modeled the interactions using a Coupled Hidden Markov Model (CHMM) ([4]). In the course of the latter research, a multi-agent simulation was used to produce synthetic data for training the CHMM that models the interaction. The relation to our work is that their representation of the multi-agent simulation can be viewed as a structured, stochastic, grammar-like description of the interactions.