Research in visual surveillance is quickly approaching the area of complex activities, which are framed by extended context. As more methods of identifying simple movements become available, the importance of contextual methods increases. In such approaches, activities and movements are not only recognized at the moment of detection; their interpretation and labeling are also affected by the temporally extended context in which the events take place (e.g., see [2]). For example, in our application, which observes objects of two classes, cars and people, the detected class of an object may be uncertain. However, if the object behaves like a car when participating in an interaction, entering interactions with other objects in a way characteristic of a car, then the belief about its class label is reinforced.
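The reinforcement of a class label described above can be illustrated with a single Bayesian update. This is only a minimal sketch and not the classifier used in the system: the function name `update_class_belief` and all probability values are hypothetical and chosen purely for illustration.

```python
def update_class_belief(prior_car, p_behavior_given_car, p_behavior_given_person):
    """Return P(car | observed behavior) for a two-class (car/person) problem."""
    prior_person = 1.0 - prior_car
    joint_car = prior_car * p_behavior_given_car
    joint_person = prior_person * p_behavior_given_person
    # Bayes' rule: normalize the joint probability over both classes.
    return joint_car / (joint_car + joint_person)


# Hypothetical values: the tracker is only 60% sure the object is a car,
# but the observed interaction is far more typical of cars than of people.
prior = 0.6
posterior = update_class_belief(prior, p_behavior_given_car=0.8,
                                p_behavior_given_person=0.2)
```

With these assumed numbers the belief that the object is a car rises above its prior, which is the qualitative effect that contextual evidence is meant to have.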
The monitoring system we describe in this paper is an example of an end-to-end implementation that adapts to the physical features of the monitored environment and exhibits a degree of contextual awareness. Adaptation to the environment is achieved by a tracker based on Adaptive Background Mixture Models [16]. It robustly tracks separate objects in environments with significant lighting variation, repetitive motions, and long-term scene changes. The tracking sequences it produces are probabilistically classified and mapped into a set of pre-determined discrete events. These events correspond to objects of different types performing different actions.
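To make the idea of an adaptive background model concrete, the following is a simplified sketch of an online mixture of Gaussians for a single grayscale pixel. It is written in the spirit of adaptive background mixture models and is not the tracker of [16]: the class name `PixelMixture`, all parameter values, the variance floor, and the use of one learning rate for weights, means, and variances are assumptions made for brevity.

```python
import math


class PixelMixture:
    """Simplified online mixture of Gaussians for one grayscale pixel."""

    def __init__(self, k=3, alpha=0.05, init_var=36.0, min_var=4.0,
                 match_sigmas=2.5, bg_threshold=0.7):
        self.k = k                      # maximum number of components
        self.alpha = alpha              # learning rate (assumed value)
        self.init_var = init_var        # variance given to a new component
        self.min_var = min_var          # floor that keeps sigma away from zero
        self.match_sigmas = match_sigmas
        self.bg_threshold = bg_threshold
        self.components = []            # each entry: [weight, mean, variance]

    def _background_indices(self):
        # Rank components by weight / sigma; the top-ranked ones whose
        # cumulative weight first exceeds bg_threshold model the background.
        order = sorted(
            range(len(self.components)),
            key=lambda i: self.components[i][0] / math.sqrt(self.components[i][2]),
            reverse=True)
        background, cumulative = set(), 0.0
        for i in order:
            background.add(i)
            cumulative += self.components[i][0]
            if cumulative > self.bg_threshold:
                break
        return background

    def update(self, value):
        """Absorb one observation; return True if it looks like foreground."""
        matched = None
        for i, (_, mean, var) in enumerate(self.components):
            if abs(value - mean) <= self.match_sigmas * math.sqrt(var):
                matched = i
                break
        if matched is None:
            # No component explains the value: add one, or replace the weakest.
            new = [self.alpha, float(value), self.init_var]
            if len(self.components) < self.k:
                self.components.append(new)
                matched = len(self.components) - 1
            else:
                matched = min(range(self.k), key=lambda i: self.components[i][0])
                self.components[matched] = new
        else:
            # Raise the matched weight, decay the others, adapt mean/variance.
            for i, comp in enumerate(self.components):
                comp[0] = (1 - self.alpha) * comp[0] + (self.alpha if i == matched else 0.0)
            comp = self.components[matched]
            comp[1] += self.alpha * (value - comp[1])
            comp[2] = max(self.min_var,
                          (1 - self.alpha) * comp[2] + self.alpha * (value - comp[1]) ** 2)
        total = sum(c[0] for c in self.components)
        for comp in self.components:
            comp[0] /= total            # keep the weights normalized
        return matched not in self._background_indices()


model = PixelMixture()
for _ in range(50):
    model.update(100)                   # a stable background intensity
is_foreground = model.update(200)       # a sudden, unexplained intensity
```

Because every component keeps adapting, a value that persists long enough gains weight and is eventually absorbed into the background, which is how such a model can follow gradual lighting and scene changes.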
Contextual information in the system is propagated via a stochastic parsing mechanism. Stochastic parsing handles noisy and uncertain classifications of the primitive events by integrating them into a coherent interpretation. Parallel input streams and consistency checking allow for detection of high-level interactions between multiple objects. The system demonstrates integration of low-level visual information with high-level structural knowledge, which serves as a contextual constraint. The system is capable of maintaining concurrent interpretations when multiple activities are taking place simultaneously. Furthermore, the system allows for interpretation of activities involving multiple objects, such as interactions between cars and people during PICK-UP and DROP-OFF.
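The way structural knowledge can resolve uncertain event labels may be illustrated with a toy scoring example. This is not a stochastic parser and not the grammar used in the system: the event names, the two event sequences, the prior probabilities, and the likelihood values below are all hypothetical. The sketch only shows how per-event uncertainty and a prior over activities combine into a single ranked interpretation.

```python
from math import prod

# Hypothetical activity models: (prior probability, expected event sequence).
ACTIVITIES = {
    "PICK-UP": (0.5, ["car-stop", "person-lost", "car-exit"]),
    "DROP-OFF": (0.5, ["car-stop", "person-found", "car-exit"]),
}


def interpret(activities, observations):
    """Return normalized scores of each activity given uncertain event labels.

    `observations` is a list of dicts mapping an event label to the
    likelihood that the classifier assigned to it at that time step.
    """
    scores = {}
    for name, (prior, expected) in activities.items():
        if len(expected) != len(observations):
            continue                    # sequence lengths must agree in this toy
        likelihood = prod(obs.get(label, 0.0)
                          for label, obs in zip(expected, observations))
        scores[name] = prior * likelihood
    total = sum(scores.values())
    if total == 0.0:
        return {}                       # no activity explains the observations
    return {name: s / total for name, s in scores.items()}


# Hypothetical noisy classifications for three consecutive events.
observed = [
    {"car-stop": 0.9, "person-stop": 0.1},
    {"person-lost": 0.6, "person-found": 0.4},
    {"car-exit": 0.8, "person-exit": 0.2},
]
beliefs = interpret(ACTIVITIES, observed)
```

A full stochastic parser generalizes this idea by sharing partial derivations among hypotheses instead of enumerating complete sequences.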
The remainder of this paper is organized as follows. Section 2 introduces the problem domain and gives a brief overview of previous and ongoing research in building automated surveillance systems. Section 3 gives an overview of the system and details of its implementation. It describes the system components, a tracker (section 3.1), an event generator (section 3.2), and a parser (section 3.3), at the level of detail necessary for general understanding. Results of the system working on the surveillance data are shown in section 4, followed by conclusions in section 5.