Ramesh Raskar
In an interactive head-tracked display system, temporally adjacent frames are very similar when the viewpoint used to render them changes gradually. This frame-to-frame coherence can be exploited to avoid conventional rendering in most frames [Azuma95][Costella93].
More recently, graphics programmers have realized that the majority of frames can be generated by performing an image warp to interpolate nearby rendered frames [McMillan95][Mark96]. Image warping and interpolation itself has been widely studied [Wolberg].
There has also been work on prediction: using predictions of various parameters, the graphics system computes the picture for the next frame with reduced latency.
A Kalman filter has been used here at UNC-CS to predict the motion of a user wearing an HMD [Welch96]. Similarly, many signal-processing and filtering ideas have been used to predict user motion.
However, with the partial exception of [Costella93], there has not been much focus on prediction at the pixel level. If we want the graphics system to show a very detailed, say ray-traced, image every frame, predicting user motion is not enough: the latency budget of the virtual environment is much smaller than the time it would take to draw such a detailed image.
My idea is to predict in image space. The motion of each pixel is predicted using the history of its motion and information about the user's head motion obtained from the head tracker. Assuming the correspondence problem is solved for successive frames and optical flow can be computed, the technique attempts to improve the apparent frame rate.
It is worthwhile to compare this technique with other current methods. The first set of methods reduces latency by image warping. In their immersive system, [McMillan95] precompute correspondences manually and then allow the user 6 DOF by first reading the user's view and then warping the image to achieve a higher frame rate. [Mark96] avoid frame-by-frame correspondence by maintaining Z-buffer values. Regan and Pose [Regan94] and Talisman [Torborg96] both attempt to render parts of the frame separately. Although all these systems reduce the delay between user motion and frame update, all assume that rendering time is a small part of the total latency (where latency is the sum of the delay in finding the user's location and the time required to present the image to the user). The question is, can we compute the image for the next instant even before we find out where the user is going to be?
In the second set of methods, the latency of the tracking system is reduced by predicting the user's motion [Welch96][Azuma95].
The two, image warping and prediction, have not been used simultaneously. Part of the reason is that prediction is usually done in user space, while image warping is obviously done in image space. Combining the two, prediction in image space and warping around the predicted points, is exactly the goal of this project.
Input to the system: frames of detailed rendered images and head-tracker data up to time (i-1)
Output: the image for the i'th frame
The intended system will work in the following way: compute correspondences (optical flow) between the most recent frames, use a Kalman filter per feature point, together with the head-tracker data, to predict where each feature will appear in the next frame, and then warp the latest image so that its features land at the predicted positions.
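As a minimal sketch of this per-frame loop (the function names compute_optical_flow, predict_features, and warp_image, and the constant NUM_FEATURES, are my own placeholders, not code from the actual implementation):

    /* Placeholder stubs for the three computationally intensive parts. */
    #define NUM_FEATURES 64

    typedef struct { float x, y; } Point2D;

    static void compute_optical_flow(const Point2D prev[], Point2D curr[], int n) { (void)prev; (void)curr; (void)n; }
    static void predict_features(const Point2D curr[], Point2D next[], int n)     { (void)curr; (void)next; (void)n; }
    static void warp_image(const Point2D from[], const Point2D to[], int n)       { (void)from; (void)to; (void)n; }

    void generate_next_frame(Point2D prev[], Point2D curr[], Point2D next[])
    {
        /* 1. Track each feature from frame (i-2) to frame (i-1). */
        compute_optical_flow(prev, curr, NUM_FEATURES);
        /* 2. Predict where each feature will be in frame i (per-feature
         *    Kalman filter, optionally tuned with head-tracker data). */
        predict_features(curr, next, NUM_FEATURES);
        /* 3. Warp frame (i-1) so its features land at the predicted positions. */
        warp_image(curr, next, NUM_FEATURES);
    }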
All three parts of the system appear to be computationally intensive. Correspondence and optical-flow computation is known to be a non-trivial problem. Using hundreds of Kalman filters to predict the screen position of each feature in the next frame is also likely to slow down the operation. Image warping is a per-pixel operation and is very expensive.
However, we can use various ideas to break these parts down into more manageable subparts. I used OpenGL on an SGI to implement all three parts.
Optical Flow
Image Warping
Existing image-warping techniques approximate the shift rather than interpolating it. Almost all need a well-defined mesh or grid of control points to start with.
A closer examination of the situation at hand, however, suggests that "smoothing" evaluators are not desirable. Say the control points are located at the corners of a projected polygon. If the control points move, we want the points inside the polygon to move in an affine way. Similarly, one should not introduce unnecessary control points with zero displacement into the image, which is usually done only to complete a mesh.
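To illustrate what "moving in an affine way" means (this is only a sketch, not part of the implementation): a point inside a triangle of control points keeps its barycentric coordinates with respect to the displaced corners.

    typedef struct { float x, y; } Point2D;

    /* Warp point p from triangle (a,b,c) to the displaced triangle (a2,b2,c2)
     * by preserving its barycentric coordinates. */
    Point2D affine_warp(Point2D p, Point2D a, Point2D b, Point2D c,
                        Point2D a2, Point2D b2, Point2D c2)
    {
        float det   = (b.x - a.x) * (c.y - a.y) - (c.x - a.x) * (b.y - a.y);
        float beta  = ((p.x - a.x) * (c.y - a.y) - (c.x - a.x) * (p.y - a.y)) / det;
        float gamma = ((b.x - a.x) * (p.y - a.y) - (p.x - a.x) * (b.y - a.y)) / det;
        float alpha = 1.0f - beta - gamma;
        Point2D q;
        q.x = alpha * a2.x + beta * b2.x + gamma * c2.x;
        q.y = alpha * a2.y + beta * b2.y + gamma * c2.y;
        return q;
    }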
Moreover, we cannot afford a software program that applies this warp pixel by pixel.
The best solution I found with the given hardware was texture warping. Given a set of control points S1 in Image 1 that are displaced to S2 in Image 2, the image is warped as follows: the control points are triangulated, each triangle is texture-mapped with its piece of Image 1 (using the S1 positions as texture coordinates), and each triangle is then rendered at its displaced S2 positions.
Thus a single Image 1 is fragmented into multiple texture-mapped triangles, and the triangles sum up to Image 2. How do we put these triangles together so that they appear as a single image? One easy solution is to render them on a planar rectangle that is viewed face-on. However, the difficult part is how to treat this set of texture-mapped triangles as one unified texture for the next iteration.
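The per-triangle warp looks roughly like the following OpenGL sketch, assuming Image 1 is already bound as the current texture and an orthographic (screen-space) projection is in place; the function name and the pixel-to-texture-coordinate scaling are my own illustration, not literal code from the implementation.

    #include <GL/gl.h>

    typedef struct { float x, y; } Point2D;

    void draw_warped_triangle(const Point2D s1[3],   /* positions in Image 1 (pixels) */
                              const Point2D s2[3],   /* predicted positions in Image 2 */
                              float tex_w, float tex_h)
    {
        int i;
        glEnable(GL_TEXTURE_2D);
        glBegin(GL_TRIANGLES);
        for (i = 0; i < 3; i++) {
            /* S1 says where the triangle's pixels live in the source image ... */
            glTexCoord2f(s1[i].x / tex_w, s1[i].y / tex_h);
            /* ... S2 says where they should appear in the new frame. */
            glVertex2f(s2[i].x, s2[i].y);
        }
        glEnd();
    }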
I used the idea of reading pixels directly from the frame buffer and assigning them to texture memory. This is a two-pass algorithm: first the segmented triangles are rendered into the frame buffer and displayed; in the second pass, the frame buffer is read into texture memory and used as the texture for the next iteration. This is a slightly slow process, but since it is done in hardware, it is expected to be much faster than a per-pixel software solution. Note that the warp of each triangle is an affine transformation, composed of translation, rotation, and scaling (including shear), which is exactly what texture mapping achieves (although without considering perspective distortion).
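A minimal sketch of this second pass, assuming a power-of-two window (the size WIN_W x WIN_H and the function name are placeholders), reads the frame buffer back and loads it as the texture for the next warp. On OpenGL 1.1 the round trip through host memory could also be replaced by a single glCopyTexImage2D call.

    #include <GL/gl.h>

    #define WIN_W 512   /* assumed power-of-two window size */
    #define WIN_H 512

    void framebuffer_to_texture(void)
    {
        static GLubyte pixels[WIN_W * WIN_H * 3];

        /* Read back the composited, warped image from the frame buffer ... */
        glReadPixels(0, 0, WIN_W, WIN_H, GL_RGB, GL_UNSIGNED_BYTE, pixels);

        /* ... and hand it to texture memory as the source for the next iteration. */
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGB, WIN_W, WIN_H, 0,
                     GL_RGB, GL_UNSIGNED_BYTE, pixels);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    }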
Kalman Filter
This is the main part of this project. However, due to lack of time I managed to implement only a linear model in which acceleration is treated as white Gaussian noise. A Kalman filter is assigned to each pre-determined feature point. A simple program can identify the corners and areas of high intensity change in the 3D world; the 2D projections of these features are then treated as feature points, or control points, in screen space. The program I have written can handle only finitely many feature points to start with. For the time being I consider the identification of feature points and the assignment of Kalman filters a static process. As I develop this system further, I will consider dynamic inclusion and exclusion of feature points.
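A minimal sketch of such a per-feature filter is shown below: a linear Kalman filter with state (position, velocity) along one screen axis, treating acceleration as white Gaussian noise. One such filter would run for the x and one for the y coordinate of each feature point; the structure layout and the noise parameters q and r are my own placeholders, not the values used in the implementation.

    typedef struct {
        float s[2];       /* state: [position (pixels), velocity (pixels/frame)] */
        float P[2][2];    /* state covariance */
        float q;          /* acceleration (process) noise intensity */
        float r;          /* measurement noise variance */
    } Kalman1D;

    /* Predict the state one step ahead (dt in frames). */
    void kalman_predict(Kalman1D *k, float dt)
    {
        float p00 = k->P[0][0], p01 = k->P[0][1], p10 = k->P[1][0], p11 = k->P[1][1];

        k->s[0] += dt * k->s[1];                  /* position advances by velocity */

        /* P = F P F' + Q, with F = [[1 dt],[0 1]] and white-acceleration Q. */
        k->P[0][0] = p00 + dt * (p10 + p01) + dt * dt * p11 + k->q * dt * dt * dt * dt / 4.0f;
        k->P[0][1] = p01 + dt * p11 + k->q * dt * dt * dt / 2.0f;
        k->P[1][0] = p10 + dt * p11 + k->q * dt * dt * dt / 2.0f;
        k->P[1][1] = p11 + k->q * dt * dt;
    }

    /* Correct the prediction with the measured screen coordinate z. */
    void kalman_update(Kalman1D *k, float z)
    {
        float S  = k->P[0][0] + k->r;             /* innovation variance */
        float K0 = k->P[0][0] / S;                /* Kalman gain for position ... */
        float K1 = k->P[1][0] / S;                /* ... and for velocity */
        float y  = z - k->s[0];                   /* innovation */

        k->s[0] += K0 * y;
        k->s[1] += K1 * y;

        /* P = (I - K H) P with H = [1 0]. */
        k->P[1][0] -= K1 * k->P[0][0];
        k->P[1][1] -= K1 * k->P[0][1];
        k->P[0][0] *= (1.0f - K0);
        k->P[0][1] *= (1.0f - K0);
    }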
The other important part missing from my implementation is the use of tracker data to tune the Kalman filter. Note that even a simple linear Kalman filter is better than an estimator based on inertia alone, because the Kalman filter computes the best estimate of the velocity; under its model assumptions it is the optimal filter [Brown].
How to introduce the tracker data into the filter-update process is actually the most exciting part of the project. It is not obvious how input from a 6-DOF tracker can unambiguously influence the motion of feature points. However, the newly developed SCAAT (Single Constraint At A Time) approach [Welch96] is suitable for such underdetermined systems. For example, a user motion to the left suggests a motion to the right for every control point, but a user motion forward does not really tell us how the control points will be displaced (since their displacement depends on their z-values). I hope to improve the system during the next month.