Deep Learning Alchemy to close the Perception Loop in Vision: Part 1 - Image Alignment with Pytorch

Tristan Swedish - 12 Mar 2018

Computer Vision has undergone a revolution of sorts with the widespread adoption of Deep Learning methods. While this is an incredibly exciting and active area of research, I believe that Deep Learning is necessary but not sufficient to fully solve vision.

Deep Learning has trouble with some types of alignment and planning problems. These include things like knowing how you’ve moved around in the world or what’s the best way to move your body to avoid running into objects. These tasks generally require an iterative solver to come up with increasingly refined solutions, and traditional methods that don’t use deep learning at all remain state of the art.

How we (Deep Learning People) think about perception today.

It’s not clear if an entirely end-to-end approach that trains a big Neural Network using tons of input/output pairs will ever catch up. Of course, we could talk about RNNs or other Deep Learning architectures that could maybe learn iterative refinement, or even Reinforcement Learning methods that may uncover feedback based refinement rules. While these methods may turn out to be the best way to implement these ideas, it’s benefifical to conceptually separate the idea of the “intuition” and the “solver” part of AI systems.

With this distinction, we can think about optimization loops that operate in different contexts. This separation of optimization or feedback loops characterizes many modern cognitive models that attempt to explain animal (including human) behavior. It’s this loopy, ad-hoc, engineering that I term “Deep Learning Alchemy.” It’s liberating to mix the best of both worlds, and an area I hope to explore in this blog a bit more.

Deep Learning Alchemy for Perception. Multiple optimization loops in different contexts, here shown with different colors. In this post, we focus mainly on the green loop, but the system we design considers the slower blue and orange loops as well.

The motivating example used in this series is the problem of automatically estimating the motion of a single camera as it moves through the world. This is called “monocular visual odometry” and has applications to Robotics, Augmented/Mixed/Virtual Reality, 3D games and graphics, as well as things like image stabilization.

Part 1 - Image Alignment with Pytorch

In Part 1, our goal is to make Pytorch do all the heavy lifting. Pytorch is a popular deep-learning library, but it also can do much more. In this post we’ll make an automatic image alignment algorithm. Here is our end result:

The final result showing image alignment for two images in the red and green channels.

The target image is in the red channel, and the aligned image is in the green channel, well aligned images appear yellow. The six numbers at the bottom are the different parameters (3D rotation and translation respectively) we’ll estimate automatically.

There have been a few papers published that inspired me quite a bit, specifically please refer to this work from Jakob Engel in Prof Cremers’ group at TUM: LSD-SLAM and DSO. We’ll start from scratch in a way though, and implement everything in python.

6-DOF Image Alignment for Monocular Visual Odometry

Let’s just dive right into the 6 degrees of freedom (DOF) case. Given one image, imagine how other images would look if we moved the camera around a little bit. Concretely, take this image:

Source Image

Now, on a piece of paper, try sketching out how the scene would appear to change if you moved forward, backward, or side to side. You can keep it super simple, here’s what I came up with:

Imagined perspective after moving around

Now, let’s look at another image:

Target Image

What sketch matches up with this image? If you match this to the imagined motion, does this make sense? To me, it looks like the new image most closely matches the imagined “backward” image. Do you agree?

Here’s the crux of our approach:

We will estimate motion by warping a source image to match a target image as closely as possible.

This image matching is called 6-DOF image alignment. This means that 6-DOF image alignment allows us to estimate the full pose change of the camera for each captured image. Knowing the pose change gives us the camera motion between the two images!

The image formation model

We want to align images, so let’s talk briefly about how we form an image in the first place. The simplest way to describe how a camera works mathematically is called the “pinhole camera” model: 1

We call “” our image formation model because it converts points in the world, , which exist in 3D space into coordinates for each pixel in an image. We have which captures the camera focal length, the perspective projection , and rotation matrix and translation vector .

Let’s take a step back and check out the equation above in picture form:

Illustration of pinhole cameras and the source and target images in 3D space

This means that if we know , we can generate images from different perspectives by changing and , which correspond to a 6-DOF coordinate transformation. We parameterize with six numbers, , as described later.

The Forward Warp

So let’s implement . An overview of what we’ll want to do is as follows:

  1. Estimate Depth: Estimate the scene depth at each pixel in the image.

  2. Backprojection: Produce a set of points in 3D ( in equation above) by projecting each pixel in the image out according to its depth.

  3. Coordinate Transform: Transform this set of points using and being sure to keep track which points correspond to each pixel in the original image.

  4. Remap: Re-project this set of points onto a new image plane.

We’ll implement this in Pytorch, a sketch of the code is here: code. This gist doesn’t contain the iteration loop for refinement, I plan to release a more friendly python package once this post series ends that can be run out of the box. For now, the code is provided as an example only, you would need to provide the optimization heuristics yourself.

1. Estimating Depth

Let’s assume we know something about the depth of the scene. For now, I’ll use a pre-trained Deep Neural Network 2 that makes a per-pixel depth prediction. Think of this as a prediction of depth learned from many examples, and encodes heuristics like “walls are generally straight” and “chairs have a somewhat standard size”.

Predicted depth by neural network. Light regions are further away.

Depth has been mapped to pixel values in the image above, with the fully white pixels equal to 30 meters. This is a big deal, since we can ignore some of ugly details like scale ambiguities in our estimates. This is a perfect application of using machine learning for useful inference, and greatly simplifies things for us.

2. Back-projection

With depth, we can generate a set of points in 3D by projecting them out into the world to the known depth distance from our camera center. Every pixel in our original image corresponds to a single ray direction, so we just have to follow that ray path for the length indicated by our predicted depth.

# x_hat are 3D unit direction vectors for each pixel location u,v
x = x_hat * depth.reshape(1,h*w)

We’re now fully in 3D land, I used the original image and depth map to visualize the setup. 3

We can visualize the 3D structure of the scene after estimating the per-pixel depth

3. Coordinate Transformation

Next we want to transform our points using and . We will define this transformation using Pytorch.

# Earlier code establishes x and p as torch Variables, R derived from p
# (see Appendix: r = p[:3])
x2 =,x) + p[3:6].expand_as(x)

We know that 6DOF means we only need 6 numbers, and 3 of those correspond to translation. But is a rotation matrix with 9 numbers. These 9 numbers are certainly related to each other in some way, right?

We can find the rotation matrix using Rodrigues’ Formula. We define a 3D vector whose direction points along an axis of rotation and magnitude cooresponds to the angle in radians to rotate along that axis. This is called the “angle-axis” representation of rotation. It turns out we can define any rotation we’d like this way! 4

Click for details on Rodrigue's Formula

Rodrigue's Formula

The definition is as follows: $$ \theta = |r| $$ $$ \hat{r} = \frac{r}{|r|} $$ $$ R = \cos(\theta) I + (1 - \cos(\theta) \hat{r}\hat{r}^\intercal) + \sin(\theta) [\hat{r}]_{\times} $$ There's some notation that may be new to you. First, \(\hat{r}\hat{r}^\intercal\) is the outer-product of \(\hat{r}\), which forms a \(3 \times 3\) matrix. Also, \([\hat{r}]_{\times}\) is another \(3 \times 3\) matrix and is called the skew-symmetric matrix of \(\hat{r}\), which performs a cross product on right multiplies, like \(\hat{r} \times v\). You'll notice that any vector \(r\) will give us a valid rotation matrix, this is pretty awesome for reasons you'll see later!

Given 6 parameters “p” associated with and , and a set of points in 3D “x”, we calculate the points in new coordinates after transformation.

4. Remap

After performing our coordinate transformation, we then project the points back onto a plane using the standard pinhole camera projection . scales each point position in and by it’s inverse distance from the camera.

# K represents the intrinsic camera matrix (focal length)
x3 =,x2)
piv = x3[1]/x3[2]
piu = x3[0]/x3[2]
accum = cv2.remap(img,piu,piv,cv2.INTER_LINEAR)

In this final projection step, we calculate the new image coordinate of the source image pixel, for “u” and “v”.

By warping pixel coordinates defined by meshgrid, we know the image coordinates of each source pixel on our new image. We use OpenCV’s “cv2.remap()” function to perform the image warp. We do this because “remap()” is super fast, and it performs interpolation for us automatically. The problem is that remap also doesn’t properly render occluded sections of our scene, or allow for “holes” that should form for warps that reveal occluded parts of the scene. For now, let’s consider these minor effects for small pose changes. 5

Now, we have a newly warped image:

Warped image as if moving camera backward 1.5 meters.

Here we’ve warped the input image above to appear as if moving the camera backward. You’ll notice the black areas are regions where there was no visual information. The image is warped in a very irregular way, due to the change in perspective and scene geometry.

Closing the Loop: Solving Alignment Automatically

Now that we can perform 6DOF image warping, let’s try to do this automatically!

First, we need to define our problem, following a classic approach, we write our image alignment problem like so:

We take world points from the perspective to one of our images, perform a coordinate transformation, and then project those points onto an image plane. We want the intensity values of our transformed image to closely match our other image , so we minimize the square error by finding the coordinate transform defined by 6 parameters .

To visualize this more concretely, let’s combine the two images above. Here, I’ve put each image into a different color channel (red and green respectively), similar pixel values should appear yellow:

Starting alignment is not very accurate.

It doesn’t line up very well, but we can refine our estimate by solving the minimization problem defined above.

So how do we solve this minimization problem?

This turns out to be a classic non-linear least squares problem. Typically, we solve these problems by “Linearizing” them. Linearizing means that we try and find a plane that approximates our function for a given input. I’ve included more details in the pop-down below that lays out a rough derivation, but you actually don’t need to do any math to get this working. 6

Click for details on non-linear least squares

Non-linear Least Squares

The natural way to approximate a function at a point is to determine its Taylor Approximation. Taylor Polynomial Approximation represents functions to increasingly higher orders by taking derivatives. The first order approximation is linear by definition, and is the sum of the function value at the point and it's first derivative. That brings us to the linearized version of our warped image function above: $$ I(W(x,a)) \approx I(W(x,\theta)) + \nabla I \frac{\partial W}{\partial \theta }(a - \theta) $$ With such an approximation, we can estimate what our function value would look like if we made a small change to the parameters. Let's set \(a = \theta + \epsilon_\theta\) to remind us that we are interested in a small region (sometimes called the \(\epsilon\)-ball) around \(\theta\). $$ ([I(W(x,\theta)) + J_{\theta} \epsilon_\theta] - I_0(u,v))^2 $$ Now, we can minimize our approximation by taking the derivative and setting the expression to zero. We then rearrange to solve for \(\epsilon_\theta\). $$ \frac{\partial}{\partial \theta} ([I(W(x,\theta)) + J_{\theta} \epsilon_\theta] - I_0(u,v))^2 = 0 $$ $$ 2 (\frac{\partial}{\partial \theta} I(W(x,\theta)) + \frac{\partial}{\partial \theta} J_{\theta} \epsilon_\theta - \frac{\partial}{\partial \theta} I_0(x))^{\intercal} ([I(W(x,\theta)) + J_{\theta} \epsilon_\theta] - I_0(u,v)) = 0 $$ Remove terms not dependent on \(\theta\), divide both sides by -2: $$ J_{\theta}^{\intercal} ( - I(W(x,\theta)) - J_{\theta} \epsilon_\theta + I_0(u,v)) = 0 $$ distribute and rearrange, $$ J_{\theta}^\intercal J_{\theta} \epsilon_\theta = J_{\theta}^{\intercal} (I_0(u,v) - I(W(x,\theta))) $$ solve for \(\epsilon_\theta\): $$ \epsilon_\theta = \sum_x (J_{\theta}^\intercal J_{\theta} )^{-1} J_{\theta}^\intercal (I_0(u,v) - I(W(x,\theta))) $$ For multivariate functions, the derivative of all outputs with regard to every input forms a matrix, \(J_\theta\), called the Jacobian.

Let’s just try to get a little intuition. We want to determine the Jacobian of our equation above. The Jacobian is the matrix of partial derivatives for every output with respect to all the inputs. We ultimately care about how changing our rotation and translation parameters, , will change the intensity of our image. It turns out that we can “factor” our Jacobian, or decompose it into a set of matrices that when multiplied together form the complete Jacobian. For our case, it turns out we can factor the Jacobian like this:

Intuitively, we separate the Jacobian into the image intensity gradient and the derivative of the warping function. We can approximate the intensity gradient of our image by using Sobel gradient kernels. We convolve our image with two kernels, one sensitive to changes in “u”, and another sensitive to “v”. This produces two images:

graduimg = cv2.Sobel(img,cv2.CV_32F,1,0,ksize=int(31))
gradvimg = cv2.Sobel(img,cv2.CV_32F,0,1,ksize=int(31))

The gradient for each direction (u and v respectively).

In the images above red correspond to high values and blue to low values. You’ll notice that regions of the image with strong intensity gradients (like edges) form local “slopes” that go from high to low values. In this way, the intensity gradients are points in the direction we want to nudge “u” and “v” to either increase or decrease the image intensity of a pixel.

As you can see above, we can calculate fairly easily, now we need to figure out how changing changes “u” and “v”. In other words, we need to determine .

This is where things get hairy, as we stare blankly at and wonder how we differentiate something like that. Most textbooks kinda skim over this step, and it’s sometimes a huge pain, or even impossible (if our warping function is non-differentiable). The good news is that Pytorch can do much of this for us.

Finding the Jacobians

As hinted above, the fantastically good news is that we can use Pytorch to do all the dirty work. All we need to do is define our forward model in terms of the underlying parameters and then perform backprop.

“Wait, haven’t we already done that?”

Yes! We just have to do a backward pass through the Pytorch graph we defined above and we get .

# grad_mask chooses the input value "x" we care about
# we assume pi is defined for u and then v
pi.backward(grad_mask, retain_variables=True)
# p are our original values "\theta", n codes for "x"
# we do the same thing for "v" (see the full gist)
Jac_u[n,:] =

So this get’s us our Jacobians for in terms of our parameters p! Now, we just have to write the final update equation for the standard non-linear least squares solution and we can calculate how to update our parameters to run our optimization. This takes a bit of vectorization, but in python it looks like this:

# construct full dW/dp Jacobian
Wp = np.vstack([Jac_u.numpy(),Jac_v.numpy()])
# We flatten the gradients above to be a 1D vector
delI = np.hstack([np.diag(graduimg_flat),np.diag(gradvimg_flat)])
# get Jacobian
Jac =
# residual error
r = (accum - img).reshape(-1,1)
# solve for parameter update
delP = pinv(Jac).dot(r)
p = p - delP

So that’s it! Well, almost. This works OK when the images are already fairly well aligned since it’s dependent on the pixels generally lying within the “region of influence” of . In other words, this update technique is susceptible to “local minima” and while it’s great at locking in the solution, we’d like to get somewhat close first.

We will add a few more little tricks to get our final result.

Sneaky tricks and areas for improvement

One thing you may notice if you run the gist is that pytorch is actually really slow since we define a computational graph and calculate the gradient for each pixel individually. We can subsample the pixels to speed this up, and use a number of heuristics to choose “good” pixels. The primary heuristic is to avoid pixels near intensity gradients, these pixels tend to dominate the error and lead to very noisy estimates of the Jacobian. Furthermore, inspired by SGD, we calculate the Jacobian for a new set of points each time.

Another thing that works well is to randomly initialize the parameters by calculating the square error for a bunch of different . This is fast, and we can then select the set of that seem to get us close before performing the gradient updates that lock in the solution. We can call this bootstrap routine at regular intervals to further lock in. This generates the “jumps” seen in the updates as the random guessing breaks us out of local minima.

Great! Let’s see our final result:

The final result for a somewhat big camera movement. The random guessing helps put the image in a good region before the gradient based method is able to improve alignment.

The numbers on the bottom of the frame correspond to each of the parameters, . The first three are values of the angle-axis vector in degrees… roughly these correspond to the angle rotated along each axis x,y,z using a right-hand coordinate system with the thumb (z) pointing along the optical axis. The last three numbers are the translation vector along x,y,z in meters. As you can see, the result is reasonable.

Frame differences that are less extreme work even better (partly because depth estimate errors play less of an important role):

Close alignment, observe the image get drawn towards a better alignment.

Closing thoughts

We could solve the non-linear least squares problem any number of different ways. The ceres solver documentation covers the general concepts well. In our case, we combine some random guessing “bootstrapping” with the Hessian update used above. In the future, it would be interesting to try a direct Gradient Descent method, or some hybrid (also known as Levenberg-Marquardt). And while there are fancier and more efficient solvers we could employ, the approach we use in the post gets us surprisingly far.

There may be ways to speed up our code significantly by being a bit smarter about how we use pytorch (caching the backward calculation). Even better, we can determine the derivative analytically and write all our math using numpy. We could also greatly accelerate our image warping and gradient operations using OpenGL shaders.

Another thing we’d like to do is update our depth estimates. Once an alignment solution is found, there may be some disagreement between the images along the epipolar lines. This indicates that we can further refine our depth estimates beyond what the neural network estimated since there are geometrical constraints we have not until now considered. Before reading the next post, consider how such an update could be achieved when overlaying the lines associated with these epipolar constraints:

Aligned images with epipoles drawn on. Observe what parts of error can be explained as lying along the epipoles (suggesting refinements to the depth estimate would improve results).

And I think we’re done here! Tune into part 2 as we address some of these points, as well as other approaches such as feature based methods!

1 There are more complex models that capture nuanced imaging effects such as spatial distortion, and aperture effects like depth of field and resolution limits imposed by diffraction. The beauty of the approach to the problem we will take in this that we can extend it to incorporate these effects by including them in our forward model W. [↩]
2 I used the pretrained model here: FCRN-DepthPrediction. I won't detail how to get this running since there's plenty of other resources, let's assume for each image we spit out a numpy array with the output from this network. I had to scale the output to account for the difference in the focal length between the kinect and my smartphone. I just tried a few settings until the depth estimates made sense. In principle we could train our own pytorch model here too, but this approach works surprisingly well! [↩]
3 I used the three.js library for webGL. [↩]
4 There are also more sophisticated representations that simplify certain operations, such as quarternions, or compositional operations on SO(3) using the so(3) Lie Algebra. These all are very related to angle-axis through Euler's formula, and is a fascinating mathematical topic! [↩]
5 A better approach might be to render this out using OpenGL and perform the interpolation and occlusion using a shader program. I already did this to visualize the scene in webGL for the animated GIF, but to keep things simple, remap() works just fine. [↩]
6 There are many libraries that solve this problem. I think it's worth going through the mathematical motivation though! [↩]