This time I'll discuss a basic implementation of a Structure from Motion method, following the steps Hartley and Zisserman show in "The Bible" book: "Multiple View Geometry". I will show how simply their linear method can be implemented in OpenCV.
I treat this as a kind of tutorial, or a toy example, of how to perform Structure from Motion in OpenCV.
See related posts on using Qt instead of FLTK, triangulation and decomposing the essential matrix.
Let's get down to business...
Getting a motion map
The basic thing when doing reconstruction from pairs of images, is that you know the motion: How much "a pixel has moved" from one image to the other. This gives you the ability to reconstruct it's distance from the camera(s). So our first goal is to try and understand that from a pair of two images.
In calibrated horizontal stereo rigs this is called Disparity, and it refers to the horizontal motion of a pixel. And OpenCV actually has some very good tools to recover horizontal disparity, that can be seen in this sample.
But in our case we don't have a calibrated rig as we are doing monocular (one camera) depth reconstruction, or in other words: Structure from motion.
You can go about getting a motion map in many different ways, but two canonical ways are: optical flow and feature matching.
Also, I will stick to what OpenCV has to offer, but obviously there is a whole lot of work.
In optical flow you basically try to "track the pixels" from image 1 to 2, usually assuming a pixel can move only within a certain window in which you will search. OpenCV offers some ways to do optical flow, but I will focus on the newer and nicer one: Farenback's method for dense optical flow.
The word dense means we look for the motion for every pixel in the image. This is usually costly, but Farneback's method is linear which is easy to solve, and they have a rocking implementation of it in OpenCV so it basically flies.
Running the function on two images will provide a motion map, however my experiments show that this map is wrong in a fair bit of the times... To cope with that, I am doing an iterative operation, also leveraging the fact the this OF method can use an initial guess.
An example of using Farneback method exists in the samples directory of OpenCV's repo: here.
The other way of getting motion is matching features between the two images.
In each image we extract salient features and invariant descriptors, and then match the two sets of features.
It's very easily done in OpenCV and widely covered by examples and tutorials.
This method however, will not provide a dense motion map. It will provide a very sparse one at best... so that depth reconstruction will also be sparse. We may talk about how to overcome that by hacking some segmentation methods, like superpixels and graph-cuts, in a different post.
A hybrid method
Another way that I am working on to get motion is a hybrid between Feature Matching and Optical Flow.
Basically the idea is to perform feature matching at first, and then O-F. When the motion is big, and features move quite a lot in the image, O-F sometimes fails (because pixel movement is usually confined to a search window).
After we get features pairs, we can try to recover a global movement in the image. We use that movement as an initial guess for O-F.
Once we have a motion map between the two images, it should pose no problem to recover the motion of the camera. The motion is described in the 3x4 matrix P, which is combined of two elements: P = [R|t], which are the Rotational element R and Translational element t.
H&Z give us a bunch of ways of recovering the P matrices for both cameras in Chapter 9 of their book. The central method being - using the Fundamental Matrix. This special 3x3 matrix encodes the epipolar constraint between the images, to put simply: for each point x in image 1 and corresponding point x' in image 2 the following equation holds: x'Fx = 0.
How does that help us? Well H&Z also prove that if you have F, you can infer the two P matrices. And, if you have (sufficient) point matches between images, which we have, you can find F! Hurray!
This is simply visible in the linear sense. F has 9 entries (but only 8 degrees of freedom), so if we have enough point pairs, we can solve for F in a least squares sense. But... F is better estimated in a more robust way, and OpenCV takes care of all of this for us in the function findFundamentalMat. There are several methods for recovering F there, linear and non-linear.
However, H&Z also point to a problem with using F right away - projective ambiguity. This means that the recovered camera matrices may not be the "real" ones, but instead have gone through some 3D projective transformation. To cope with this, we will use the Essential Matrix instead, which is sort of the same thing (holds epiploar constraint over points) but for calibrated cameras. Using the Essential matrix removes the projective ambiguity and provides a Metric (or Singular) Reconstruction, which means the 3D points are true up to scaling alone, and not up to a projective transformation.
cv::FileStorage fs; fs.open("camera_calibration.yml",cv::FileStorage::READ); fs["camera_matrix"]>>K; Mat F = findFundamentalMat(imgpts1, imgpts2, FM_RANSAC, 0.1, 0.99, status); Mat E = K.t() * F * K; //according to HZ (9.12)
Now let's assume one camera is P = [I|0], meaning it hasn't moved or rotated, getting the second camera matrix, P' = [R|t], is done as follows:
SVD svd(E); Matx33d W(0,-1,0, //HZ 9.13 1,0,0, 0,0,1); Matx33d Winv(0,1,0, -1,0,0, 0,0,1); Mat_ R = svd.u * Mat(W) * svd.vt; //HZ 9.19 Mat_ t = svd.u.col(2); //u3 P1 = Matx34d(R(0,0), R(0,1), R(0,2), t(0), R(1,0), R(1,1), R(1,2), t(1), R(2,0), R(2,1), R(2,2), t(2));
Looks good, now let's move on to reconstruction.
Reconstruction via Triangulation
Once we have two camera matrices, P and P', we can recover the 3D structure of the scene. This can be seen simply if we think about it using ray intersection. We have two points in space of the camera centers (one in 0,0,0 and one in t), and we have the location in space of a point both on the image plane of image 1 and on the image plane of image 2. If we simply shoot a ray from from one camera center through the respective point and another ray from the other camera - the intersection of the two rays must be the real location of the object in space.
In real life, none of that works. The rays usually will not intersect (so H&Z refer to the mid-point algorithm, which they dismiss as a bad choice), and ray intersection in general is inferior to other triangulation methods.
H&Z go on to describe their "optimal" triangulation method, which optimizes the solution based on the error from reprojection of the points back to the image plane.
I have implemented the linear triangulation methods they present, and wrote a post about it not long ago: Here.
I also added the Iterative Least Squares method that Hartley presented in his article "Triangulation", which is said to perform very good and very fast.
A word of notice, many many times the reconstruction will fail because the Fundamental matrix came out wrong. The results will just look aweful, and nothing like a true reconstruction. To cope with this, you may want to insert a check that will make sure the two P matrices are not completely bogus (you could check for a reasonable rotation for example). If the P matrices, that are derived from the F matrix, are strange, then you can discard this F matrix and compute a new one.
Toolbox and Framework
I created a small toolbox of the various methods I spoke about in this post, and created a very simple UI. It basically allows you to load two images and then try the different methods on them and get the results.
It's using FLTK3 for the GUI, and PCL (VTK backend) for visualization of the result 3D point cloud.
It also includes a few classes with a simple API that let's you get the features matches, motion map, camera matrices from the motion, and finally the 3D point cloud.
Code & Where to go next
The code, as usual, is up for grabs at github:
Now, that have a firm grasp of SfM 🙂 you can go on to visit the following projects, which implement a much more robust solution:
And Wikipedia points to some interesting libraries and code as well: http://en.wikipedia.org/wiki/Structure_from_motion