The tracker used in our system is based on an adaptive mixture-of-Gaussians technique described in detail in [16]. In this approach, each pixel is modeled by a separate mixture of K Gaussians as follows:
\[
P(X_t) = \sum_{i=1}^{K} \omega_{i,t}\, \eta(X_t;\, \mu_{i,t}, \Sigma_{i,t})
\]
where $\omega_{i,t}$ is an estimate of the $i$th mixture coefficient for time $t$, $X_t$ is the current pixel value, and $\mu_{i,t}$ and $\Sigma_{i,t}$ are the mean and covariance of the corresponding Gaussian component $\eta$.
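As an illustration of the per-pixel model, the following minimal sketch evaluates a mixture density for a single pixel. It assumes a scalar (grayscale) pixel value and one scalar variance per component; the function names `gaussian_pdf` and `mixture_density` are hypothetical and are not taken from [16].

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance, evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixture_density(x, weights, means, variances):
    """Weighted sum of K Gaussian densities for one pixel value x.

    Assumption: scalar pixel values, so each covariance reduces to a variance.
    """
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))
```

For color pixels the same sum applies with a multivariate Gaussian in place of `gaussian_pdf`.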
If the current pixel value, $X_t$, is found to be well modeled by one of the mixture components ($X_t$ is within 2.5 standard deviations of that component's mean), the weights, $\omega_{i,t}$, and the parameters of the corresponding component are re-estimated.
If no component matches, the least likely component of the mixture is replaced by a new one, with its mean $\mu_{i,t}$ set to $X_t$ and a high initial variance $\sigma_{i,t}^2$.
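The match-or-replace step can be sketched as below. This section does not give the re-estimation equations (they are in [16]), so the sketch assumes a common online form with a single learning rate; `alpha`, `init_var`, `init_weight`, and the function name `update_pixel_model` are all hypothetical, pixel values are assumed scalar, the first matching component is used, and "least likely" is taken to mean lowest weight.

```python
import math


def update_pixel_model(x, weights, means, variances,
                       alpha=0.05, init_var=900.0, init_weight=0.05,
                       match_sigmas=2.5):
    """Update one pixel's mixture in place; return the matched or new component index."""
    matched = None
    for i, (m, v) in enumerate(zip(means, variances)):
        # A component matches when x lies within 2.5 standard deviations of its mean.
        if abs(x - m) <= match_sigmas * math.sqrt(v):
            matched = i
            break

    if matched is None:
        # No match: replace the lowest-weight component (assumed reading of
        # "least likely") with a new one centered on x with a high variance.
        matched = min(range(len(weights)), key=lambda i: weights[i])
        means[matched] = x
        variances[matched] = init_var
        weights[matched] = init_weight
    else:
        # Assumed exponential-forgetting update; the cited work may differ.
        for i in range(len(weights)):
            hit = 1.0 if i == matched else 0.0
            weights[i] = (1.0 - alpha) * weights[i] + alpha * hit
        means[matched] = (1.0 - alpha) * means[matched] + alpha * x
        d = x - means[matched]
        variances[matched] = (1.0 - alpha) * variances[matched] + alpha * d * d

    # Keep the mixture coefficients summing to one.
    total = sum(weights)
    for i in range(len(weights)):
        weights[i] /= total
    return matched
```

Calling this once per pixel per frame lets a component that keeps matching gain weight, while a value that matches nothing starts a new low-weight, high-variance component.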
The next step is to determine whether the pixel $X_t$ belongs to the background. To do that, we sort all the components of the mixture in order of decreasing ratio $\omega_{i,t}/\sigma_{i,t}$.
This ratio effectively assigns higher importance to the
mixture components that received the most evidence and have the
lowest variance. The intuitive meaning of this ratio is that the
components which correspond to background typically have more
observations attributed to them and those observations vary little.
Then, after the components are sorted, we can set a threshold, $T$, which separates the components responsible for background pixels from those modeling the foreground, as follows:
\[
B = \arg\min_{b} \left( \sum_{i=1}^{b} \omega_{i,t} > T \right)
\]
where $B$ means that the first $B$ components of the sorted mixture are found ``responsible'' for the background. Now, if the pixel $X_t$ is best modeled by one of the ``background'' components, it is marked as belonging to the background.
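The ranking and thresholding just described can be sketched as follows. The function name `background_components` and the default value of `T` are hypothetical; the sketch takes per-component weights and standard deviations for one pixel and returns the indices of the components treated as background, in ranked order.

```python
def background_components(weights, sigmas, T=0.7):
    """Return indices of the components deemed background, best-ranked first.

    Components are ranked by decreasing weight/sigma; the shortest ranked
    prefix whose cumulative weight exceeds T is kept.
    """
    order = sorted(range(len(weights)),
                   key=lambda i: weights[i] / sigmas[i],
                   reverse=True)
    selected = []
    cumulative = 0.0
    for i in order:
        selected.append(i)
        cumulative += weights[i]
        if cumulative > T:
            break
    return selected
```

A pixel is then marked as background when the component that best models it is in the returned list. A small `T` keeps only the dominant component, while a larger `T` admits several components, which allows a multi-modal background.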
Finally, foreground pixels are segmented into regions by a two-pass connected-components algorithm.
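A two-pass labeling of this kind can be sketched as below. The text does not state the connectivity used, so the sketch assumes 4-connectivity, and it resolves label equivalences with a small union-find table; the function name `label_regions` is hypothetical.

```python
def label_regions(mask):
    """Label 4-connected foreground regions of a binary mask (a list of rows).

    Returns (labels, count), where labels has the same shape as mask,
    background is 0, and regions are numbered 1..count.
    """
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    labels = [[0] * cols for _ in range(rows)]
    parent = [0]  # parent[k] is the union-find parent of provisional label k

    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]  # path halving
            k = parent[k]
        return k

    # First pass: assign provisional labels and record equivalences.
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c]:
                continue
            up = labels[r - 1][c] if r > 0 else 0
            left = labels[r][c - 1] if c > 0 else 0
            if up == 0 and left == 0:
                parent.append(len(parent))  # new label is its own root
                labels[r][c] = len(parent) - 1
            elif up and left:
                a, b = find(up), find(left)
                lo, hi = min(a, b), max(a, b)
                parent[hi] = lo
                labels[r][c] = lo
            else:
                labels[r][c] = up or left

    # Second pass: replace each label by its root, renumbered consecutively.
    remap = {}
    for r in range(rows):
        for c in range(cols):
            if labels[r][c]:
                root = find(labels[r][c])
                labels[r][c] = remap.setdefault(root, len(remap) + 1)
    return labels, len(remap)
```

Switching to 8-connectivity would only require also inspecting the two upper diagonal neighbors in the first pass.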
Establishing correspondence of foreground regions between frames is accomplished using a linearly predictive multiple hypotheses tracking algorithm which incorporates both region position and size. We have implemented an on-line method for seeding and maintaining sets of Kalman filters, modeling the dynamics of foreground regions. Details of this process can be found in [16]. Essentially, for each frame, the parameters of the existing dynamical models are estimated; those models are used to explain observed foreground regions, and, finally, new models are hypothesized based on foreground regions which were not explained by any existing model.
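The per-frame loop (predict with existing models, explain observed regions, hypothesize new models for the rest) can be illustrated with a deliberately simplified sketch. It is not the method of [16]: it swaps the Kalman filters for a constant-velocity linear predictor with a fixed correction gain, and the multiple hypotheses search for greedy nearest-neighbor matching. The names `Track`, `step`, `gain`, and `gate` are hypothetical, and combining position and size in one Euclidean distance is an assumption.

```python
import math


class Track:
    """Hypothetical track: region state (x, y, size) plus a per-frame velocity."""

    def __init__(self, obs):
        self.state = list(obs)
        self.velocity = [0.0, 0.0, 0.0]

    def predict(self):
        # Linear prediction of the next state.
        return [s + v for s, v in zip(self.state, self.velocity)]

    def correct(self, obs, gain=0.5):
        # Blend prediction and observation with a fixed gain (a stand-in
        # for the Kalman gain, which would be computed from covariances).
        predicted = self.predict()
        new_state = [p + gain * (o - p) for p, o in zip(predicted, obs)]
        self.velocity = [n - s for n, s in zip(new_state, self.state)]
        self.state = new_state


def step(tracks, observations, gate=30.0):
    """Process one frame in place; return the observations that seeded new tracks."""
    unexplained = list(observations)
    for track in tracks:
        if not unexplained:
            break
        predicted = track.predict()
        nearest = min(unexplained, key=lambda o: math.dist(o, predicted))
        if math.dist(nearest, predicted) <= gate:
            track.correct(nearest)
            unexplained.remove(nearest)
    # Regions not explained by any existing model seed new models.
    tracks.extend(Track(o) for o in unexplained)
    return unexplained
```

In this sketch a track with no matching region simply keeps its last state; a fuller implementation would let it coast on its prediction for a few frames and then drop it.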
Our system adapts robustly to lighting changes, repetitive motions of scene elements, cluttered regions, slow-moving objects, and the introduction or removal of objects from the scene. Slowly moving objects take longer to be incorporated into the background because their color has a larger variance than the background. Repetitive variations are also learned, and a model for the background distribution is generally maintained even if it is temporarily replaced by another distribution, which leads to faster recovery when objects are removed.