Ramesh Raskar
In an interactive head-tracked display system, temporally adjacent frames are very similar when the viewpoint used to render them changes gradually. This frame-to-frame coherence can be exploited to avoid conventional rendering in most frames [Azuma95][Costella93].
More recently, graphics programmers have realized that the majority of frames can be generated by performing an image warp to interpolate nearby rendered frames [McMillan95][Mark96]. Image warping and interpolation itself has been widely studied [Wolberg].
There has also been work on prediction: using predictions of various parameters, the graphics system computes the picture for the next frame with reduced latency.
A Kalman filter has been used here at UNC-CS to predict the motion of a user wearing an HMD [Welch96]. Similarly, many signal-processing and filtering ideas have been used to predict user motion.
However, with the partial exception of [Costella93], there has not been much focus on prediction at the pixel level. If we want the graphics system to show a very detailed, say ray-traced, image every frame, predicting user motion is not enough: the latency budget of the virtual environment is much smaller than the time it would take to draw such a detailed image.
My idea is to predict in image space. The motion of each pixel is predicted using the history of its motion and information about the user's head motion obtained from the head tracker. Assuming the correspondence problem is solved for successive frames and optical flow can be computed, the technique attempts to improve the apparent frame rate.
It is worthwhile to compare this technique with other current methods. The first set of methods reduces latency by image warping. In their immersive system, [McMillan95] precompute correspondences manually and then allow the user 6 DOF by first reading the user's view and then warping the image to achieve a higher frame rate. [Mark96] avoid frame-by-frame correspondence by maintaining Z-buffer values. Regan and Pose [Regan94] and Talisman [Torborg96] both attempt to render parts of the frame separately. Although all these systems reduce the delay between user motion and frame update, all assume that rendering time is a small part of the total latency (where latency is the sum of the delay in finding the user's location and the time required to present the image to the user). The question is, can we compute the image for the next instant even before we find out where the user is going to be?
In the second set of methods, the latency of the tracking system is reduced by predicting the user's motion [Welch96][Azuma95].
The two, image warping and prediction, have not been used simultaneously. Part of the reason is that prediction is usually done in user space, while image warping is obviously done in image space. Combining the two, prediction in image space and warping around the predicted points, is exactly the goal of this project.
Input to the system: frames of detailed rendered images and head-tracker data up to time (i-1)
Output: the image for the i'th frame
The intended system will work in the following way: compute correspondences (optical flow) between the most recent frames, use a Kalman filter per feature point, together with the head-tracker data, to predict where each feature will appear in the next frame, and then warp the latest image so that its features land at the predicted positions.
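As a minimal sketch of this per-frame loop (the function names compute_optical_flow, predict_features, and warp_image, and the constant NUM_FEATURES, are my own placeholders, not code from the actual implementation):

    /* Placeholder stubs for the three computationally intensive parts. */
    #define NUM_FEATURES 64

    typedef struct { float x, y; } Point2D;

    static void compute_optical_flow(const Point2D prev[], Point2D curr[], int n) { (void)prev; (void)curr; (void)n; }
    static void predict_features(const Point2D curr[], Point2D next[], int n)     { (void)curr; (void)next; (void)n; }
    static void warp_image(const Point2D from[], const Point2D to[], int n)       { (void)from; (void)to; (void)n; }

    void generate_next_frame(Point2D prev[], Point2D curr[], Point2D next[])
    {
        /* 1. Track each feature from frame (i-2) to frame (i-1). */
        compute_optical_flow(prev, curr, NUM_FEATURES);
        /* 2. Predict where each feature will be in frame i (per-feature
         *    Kalman filter, optionally tuned with head-tracker data). */
        predict_features(curr, next, NUM_FEATURES);
        /* 3. Warp frame (i-1) so its features land at the predicted positions. */
        warp_image(curr, next, NUM_FEATURES);
    }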
All three parts of the system appear to be computationally intensive. Correspondence and optical-flow computation is known to be a non-trivial problem. Using hundreds of Kalman filters to predict the screen position of each feature in the next frame is also likely to slow down the operation. Image warping is a per-pixel operation and is very expensive.
However, we can use various ideas to break these parts down into more manageable subparts. I used OpenGL on an SGI to implement all three parts.
Optical Flow
Image Warping
Existing image-warping techniques approximate the shift rather than interpolating it. Almost all need a well-defined mesh or grid of control points to start with.
A closer examination of the situation at hand, however, suggests that "smoothing" evaluators are not desirable. Say the control points are located at the corners of a projected polygon. If the control points move, we want the points inside the polygon to move in an affine way. Similarly, one should not introduce unnecessary control points with zero displacement into the image, which is usually done only to complete a mesh.
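To illustrate what "moving in an affine way" means (this is only a sketch, not part of the implementation): a point inside a triangle of control points keeps its barycentric coordinates with respect to the displaced corners.

    typedef struct { float x, y; } Point2D;

    /* Warp point p from triangle (a,b,c) to the displaced triangle (a2,b2,c2)
     * by preserving its barycentric coordinates. */
    Point2D affine_warp(Point2D p, Point2D a, Point2D b, Point2D c,
                        Point2D a2, Point2D b2, Point2D c2)
    {
        float det   = (b.x - a.x) * (c.y - a.y) - (c.x - a.x) * (b.y - a.y);
        float beta  = ((p.x - a.x) * (c.y - a.y) - (c.x - a.x) * (p.y - a.y)) / det;
        float gamma = ((b.x - a.x) * (p.y - a.y) - (p.x - a.x) * (b.y - a.y)) / det;
        float alpha = 1.0f - beta - gamma;
        Point2D q;
        q.x = alpha * a2.x + beta * b2.x + gamma * c2.x;
        q.y = alpha * a2.y + beta * b2.y + gamma * c2.y;
        return q;
    }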
Moreover, we cannot afford a software program that applies this warp pixel by pixel.
The best solution I found with the given hardware was texture warping. Given a set of control points S1 in Image 1 that are displaced to S2 in Image 2, the image is warped as follows: the control points are triangulated, each triangle is texture-mapped with its piece of Image 1 (using the S1 positions as texture coordinates), and each triangle is then rendered at its displaced S2 positions.
Thus a single Image 1 is fragmented into multiple texture-mapped triangles, and the triangles sum up to Image 2. How do we put these triangles together so that they appear as a single image? One easy solution is to render them on a planar rectangle that is viewed face-on. However, the difficult part is how to treat this set of texture-mapped triangles as one unified texture for the next iteration.
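The per-triangle warp looks roughly like the following OpenGL sketch, assuming Image 1 is already bound as the current texture and an orthographic (screen-space) projection is in place; the function name and the pixel-to-texture-coordinate scaling are my own illustration, not literal code from the implementation.

    #include <GL/gl.h>

    typedef struct { float x, y; } Point2D;

    void draw_warped_triangle(const Point2D s1[3],   /* positions in Image 1 (pixels) */
                              const Point2D s2[3],   /* predicted positions in Image 2 */
                              float tex_w, float tex_h)
    {
        int i;
        glEnable(GL_TEXTURE_2D);
        glBegin(GL_TRIANGLES);
        for (i = 0; i < 3; i++) {
            /* S1 says where the triangle's pixels live in the source image ... */
            glTexCoord2f(s1[i].x / tex_w, s1[i].y / tex_h);
            /* ... S2 says where they should appear in the new frame. */
            glVertex2f(s2[i].x, s2[i].y);
        }
        glEnd();
    }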
I used the idea of reading pixels directly from the frame buffer and assigning them to texture memory. This is a two-pass algorithm: first the segmented triangles are rendered into the frame buffer and displayed; in the second pass, the frame buffer is read into texture memory and used as the texture for the next iteration. This is a slightly slow process, but since it is done in hardware, it is expected to be much faster than a per-pixel software solution. Note that the warp of each triangle is an affine transformation, composed of translation, rotation, and scaling (including shear), which is exactly what texture mapping achieves (although without considering perspective distortion).
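A minimal sketch of this second pass, assuming a power-of-two window (the size WIN_W x WIN_H and the function name are placeholders), reads the frame buffer back and loads it as the texture for the next warp. On OpenGL 1.1 the round trip through host memory could also be replaced by a single glCopyTexImage2D call.

    #include <GL/gl.h>

    #define WIN_W 512   /* assumed power-of-two window size */
    #define WIN_H 512

    void framebuffer_to_texture(void)
    {
        static GLubyte pixels[WIN_W * WIN_H * 3];

        /* Read back the composited, warped image from the frame buffer ... */
        glReadPixels(0, 0, WIN_W, WIN_H, GL_RGB, GL_UNSIGNED_BYTE, pixels);

        /* ... and hand it to texture memory as the source for the next iteration. */
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGB, WIN_W, WIN_H, 0,
                     GL_RGB, GL_UNSIGNED_BYTE, pixels);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    }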
Kalman Filter
This is the main part of this project. However, due to lack of time I managed to implement only a linear model in which acceleration is treated as white Gaussian noise. A Kalman filter is assigned to each pre-determined feature point. A simple program can identify the corners and areas of high intensity change in the 3D world; the 2D projections of these features are then treated as feature points, or control points, in screen space. The program I have written can handle only finitely many feature points to start with. For the time being I consider the identification of feature points and the assignment of Kalman filters a static process. As I develop this system further, I will consider dynamic inclusion and exclusion of feature points.
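A minimal sketch of such a per-feature filter is shown below: a linear Kalman filter with state (position, velocity) along one screen axis, treating acceleration as white Gaussian noise. One such filter would run for the x and one for the y coordinate of each feature point; the structure layout and the noise parameters q and r are my own placeholders, not the values used in the implementation.

    typedef struct {
        float s[2];       /* state: [position (pixels), velocity (pixels/frame)] */
        float P[2][2];    /* state covariance */
        float q;          /* acceleration (process) noise intensity */
        float r;          /* measurement noise variance */
    } Kalman1D;

    /* Predict the state one step ahead (dt in frames). */
    void kalman_predict(Kalman1D *k, float dt)
    {
        float p00 = k->P[0][0], p01 = k->P[0][1], p10 = k->P[1][0], p11 = k->P[1][1];

        k->s[0] += dt * k->s[1];                  /* position advances by velocity */

        /* P = F P F' + Q, with F = [[1 dt],[0 1]] and white-acceleration Q. */
        k->P[0][0] = p00 + dt * (p10 + p01) + dt * dt * p11 + k->q * dt * dt * dt * dt / 4.0f;
        k->P[0][1] = p01 + dt * p11 + k->q * dt * dt * dt / 2.0f;
        k->P[1][0] = p10 + dt * p11 + k->q * dt * dt * dt / 2.0f;
        k->P[1][1] = p11 + k->q * dt * dt;
    }

    /* Correct the prediction with the measured screen coordinate z. */
    void kalman_update(Kalman1D *k, float z)
    {
        float S  = k->P[0][0] + k->r;             /* innovation variance */
        float K0 = k->P[0][0] / S;                /* Kalman gain for position ... */
        float K1 = k->P[1][0] / S;                /* ... and for velocity */
        float y  = z - k->s[0];                   /* innovation */

        k->s[0] += K0 * y;
        k->s[1] += K1 * y;

        /* P = (I - K H) P with H = [1 0]. */
        k->P[1][0] -= K1 * k->P[0][0];
        k->P[1][1] -= K1 * k->P[0][1];
        k->P[0][0] *= (1.0f - K0);
        k->P[0][1] *= (1.0f - K0);
    }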
The other important part missing from my implementation is the use of tracker data to tune the Kalman filter. Note that even a simple linear Kalman filter is better than an estimator based on inertia alone, because the Kalman filter computes the best estimate of the velocity; under its model assumptions it is the optimal filter [Brown].
How to introduce the tracker data into the filter-update process is actually the most exciting part of the project. It is not obvious how input from a 6-DOF tracker can unambiguously influence the motion of feature points. However, the newly developed SCAAT (Single Constraint At A Time) approach [Welch96] is suitable for such underdetermined systems. For example, a user motion to the left suggests a motion to the right for every control point, but a user motion forward does not really tell us how the control points will be displaced (since their displacement depends on their z-values). I hope to improve the system during the next month.