Implementing PTAM: stereo, tracking and pose estimation for AR with OpenCV [w/ code]

Hi

Been working hard at a project for school the past month, implementing one of the more interesting works I've seen in the AR arena: Parallel Tracking and Mapping (PTAM) [PDF]. This is a work by George Klein [homepage] and David Murray from Oxford university, presented in ISMAR 2007.

When I first saw it on youtube [link] I immediately saw the immense potential - mobile markerless augmented reality. I thought I should get to know this work a bit more closely, so I chose to implement it as a part of advanced computer vision course, given by Dr. Lior Wolf [link] at TAU.

The work is very extensive, and clearly is a result of deep research in the field, so I set to achieve a few selected features: Stereo initialization, Tracking, and small map upkeeping. I chose not to implement relocalization and full map handling.

This post is kind of a tutorial for 3D reconstruction with OpenCV 2.0. I will show practical use of the functions in cvtriangulation.cpp, which are not documented and in fact incomplete. Furthermore I'll show how to easily combine OpenCV and OpenGL for 3D augmentations, a thing which is only briefly described in the docs or online.

Here are the step I took and things I learned in the process of implementing the work.

Update: A nice patch by yazor fixes the video mismatching - thanks! and also a nice application by Zentium called "iKat" is doing some kick-ass mobile markerless augmented reality.

Preparations...

Before going straight to coding, I had to prepare a few things.

  • A working compilation of OpenCV - not trivial with the new version 2.0.
  • A calibrated camera.
  • Test data

Compiling OpenCV 2.0 proved to be a bit tricky. Even though the sourceforge project offers binary release for Win32, I compiled the whole thing from source. It turned out the binary release doesn't contain .lib files, and anyway has compatibility issues between MS VS 2005 and 2008 - something about the embedded manifest [google]. I downloaded the freshest source from SVN, and compiled it, but it didn't solve the debug-release problem, so I was left with using the release dlls even for debug evironment.

Initially I thought I'll try an uncalibrated camera approach, but soon abandoned it. I had to calibrate my cameras, which I did  very easily using OpenCV's "calibration.cpp", which strangely is not built when building all examples - it has to be built manually. But everything went smoothly, and I soon got a calibration matrix (focal length, center of projection) and radial distortion coefficients.

Getting Test Data

For the test data I wanted to get a few views of a planar scene, where the first two views are separated only by a translation of ~5cm, as K&M do in the PTAM article. This known translation is helpful when trying to triangulate the initial features in the scene. When you have prior knowledge of where the cameras are, you can simply intersect the epipolar lines between the two views and recover the 3D position of the points - up to a scale. Keep in mind you must also have feature correspondence: a point on image A must be correlated to a point in image B.

To achieve this I set up a small program that uses Optical Flow to track some 2D features in the scene, and grab a few screens + feature vectors. See 'capture_data.cpp'.

Stereo Initialization

Now that I have 2 views with feature correspodence:

I would like to triangulate the features. This is possible, as I discussed earlier, since I know the rotation (none), translation (5cm on -x axis) and camera calibration parameters (focal length, center of projection).

Triangulation

For triangulation, OpenCV has only recently added a couple of functions that implement triangulation [link] as shown by Hartly & Zisserman [PDF, page 12]. However, these functions are not formally documented, and in fact they are missing some important parts. This is how I used cvTriangulation(), which is the key function:

//this function will initialize the 3D features from two views
void stereoInit() {

//first load camera intrinsic parameters
FileStorage fs("cam_work.out",CV_STORAGE_READ);
FileNode fn = fs["camera_matrix"];
camera_matrix = Mat((CvMat*)fn.readObj(),true);

fn = fs["distortion_coefficients"];
distortion_coefficients = Mat((CvMat*)fn.readObj(),true);

//vector<Point2d> points[2]; //these Point2d vectors hold the 2D features, double precision, from the 2 views

//get copy of points
_points[0] = points[0];
_points[1] = points[1];
Mat pts1M(_points[0]), pts2M(_points[1]); //very easy in OpenCV 2.0 to convert vector<> to Mat.

//Undistort points
Mat tmp,tmpOut;
pts1M.convertTo(tmp,CV_32FC2);  //undistort takes only floats not doubles, so convert to Point2f
undistortPoints(tmp,tmpOut,camera_matrix,distortion_coefficients);
tmpOut.convertTo(pts1M,CV_64FC2);  //go back to double precision

pts2M.convertTo(tmp,CV_32FC2);
undistortPoints(tmp,tmpOut,camera_matrix,distortion_coefficients);
tmpOut.convertTo(pts2M,CV_64FC2);

vector<uchar> tri_status; //this will hold the status for each point, a good point will have 1, bad - 0

//now triangulate
triangulate(_points[0],_points[1],tri_status);

}

void triangulate(vector<Point2d>& points1, vector<Point2d>& points2, vector<uchar>& status) {

	//Convert points to 1-channel, 2-rows, double precision - This is important - see the code
...

	Mat ___tmp(2,pts1Mt.cols,CV_64FC1,__d);
...
	Mat ___tmp1(2,pts2Mt.cols,CV_64FC1,__d1);
...

	CvMat __points1 = ___tmp, __points2 = ___tmp1;

	//projection matrices
	double P1d[12] = {	-1,0,0,0,
						0,1,0,0,
						0,0,1,0 };	//Identity, but looking into -z axis
	Mat P1m(3,4,CV_64FC1,P1d);
	CvMat* P1 = &(CvMat)P1m;
	double P2d[12] = {	-1,0,0,-5,
						0,1,0,0,
						0,0,1,0 };  //Identity rotation, 5cm -x translation, looking into -z axis
	Mat P2m(3,4,CV_64FC1,P2d);
	CvMat* P2 = &(CvMat)P2m;

	float _d[1000] = {0.0f};
	Mat outTM(4,points1.size(),CV_32FC1,_d);
	CvMat* out = &(CvMat)outTM;

//using cvTriangulate with the created structures
	cvTriangulatePoints(P1,P2,&__points1,&__points2,out);

//we should check the triangulation result by reprojecting 3D->2D and checking distance
	vector<Point2d> projPoints[2] = {points1,points2};

	double point2D_dat[3] = {0};
	double point3D_dat[4] = {0};
	Mat twoD(3,1,CV_64FC1,point2D_dat);
	Mat threeD(4,1,CV_64FC1,point3D_dat);

	Mat P[2] = {Mat(P1),Mat(P2)};

	int oc = out->cols, oc2 = out->cols*2, oc3 = out->cols*3;

	status = vector<uchar>(oc);

	//scan all points, reproject 3D->2D, and keep only good ones
	for(int i=0;i<oc;i++) {
		double W = out->data.fl[i+oc3];
        point3D_dat[0] = out->data.fl[i] / W;
        point3D_dat[1] = out->data.fl[i+oc] / W;
        point3D_dat[2] = out->data.fl[i+oc2] / W;
        point3D_dat[3] = 1;

        bool push = true;
        /* !!! Project this point for each camera */
        for( int currCamera = 0; currCamera < 2; currCamera++ )
        {
            //reproject! using the P matrix of the current camera
			twoD = P[currCamera] * threeD;

            float x,y;
            float xr,yr,wr;
 	x = (float)projPoints[currCamera][i].x;
	y = (float)projPoints[currCamera][i].y;

            wr = (float)point2D_dat[2];
            xr = (float)(point2D_dat[0]/wr);
            yr = (float)(point2D_dat[1]/wr);

            float deltaX,deltaY;
            deltaX = (float)fabs(x-xr);
            deltaY = (float)fabs(y-yr);

			//printf("error from cam %d (%.2f,%.2f): %.6f %.6f\n",currCamera,x,y,deltaX,deltaY);

			if(deltaX > 0.01 || deltaY > 0.01) {
				push = false;
			}
        }
		if(push) {
			// A good 3D reconstructed point, add to known world points

			double s = 7;
			Point3d p3d(point3D_dat[0]/s,point3D_dat[1]/s,point3D_dat[2]/s);
			//printf("%.3f %.3f %.3f\n",p3d.x,p3d.y,p3d.z);
			points1Proj.push_back(p3d);
			status[i] = 1;
		} else {
			status[i] = 0;
		}

	}
}

OK, now that I have (hopefully) triangulated 3D features from the initial state: 2 views of a planar scene with 5cm translation on the X axis - I can move on the pose estimation.

Pose Estimation

Theoretically, if I know the 3D position of features in the world and their respective 2D position in the image, it should be easy to recover the position of the camera, because there are a rotation matrix and translation vector that define this transformation. Practically in OpenCV, finding the position of an object using 3D-2D correlation is done by using the solvePnP() [link] function.

Since I have an initial guess of the rotation and translation - from the first 2 frames - I can "help" the function estimate the new ones.

void findExtrinsics(vector<Point2d>& points, vector<double>& rv, vector<double>& tv) {
	//estimate extrinsics for these points

	Mat rvec(rv),tvec(tv);

//initial "guess", in case it wasn't already supplied
	if(rv.size()!=3) {
		rv = vector<double>(3);
		rvec = Mat(rv);
		double _d[9] = {1,0,0,
						0,-1,0,
						0,0,-1};
		Rodrigues(Mat(3,3,CV_64FC1,_d),rvec);
	}
	if(tv.size()!=3) {
		tv = vector<double>(3);
		tv[0]=0;tv[1]=0;tv[2]=0;
		tvec = Mat(tv);
	}

	//create a float rep  of points
	vector<Point2f> v2(points.size());
	Mat tmpOut(v2);
	Mat _tmpOut(points);
	_tmpOut.convertTo(tmpOut,CV_32FC2);

	solvePnP(points1projMF,tmpOut,camera_matrix,distortion_coefficients,rvec,tvec,true);

	printf("frame extrinsic:\nrvec: %.3f %.3f %.3f\ntvec: %.3f %.3f %.3f\n",rv[0],rv[1],rv[2],tv[0],tv[1],tv[2]);

//the output of the function is a Rodrigues form of rotation, so convert to regular rot-matrix
	Mat rotM(3,3,CV_64FC1); ///,_r);
	Rodrigues(rvec,rotM);
	double* _r = rotM.ptr<double>();
	printf("rotation mat: \n %.3f %.3f %.3f\n%.3f %.3f %.3f\n%.3f %.3f %.3f\n",
		_r[0],_r[1],_r[2],_r[3],_r[4],_r[5],_r[6],_r[7],_r[8]);
}

After getting the extrinsic parameters of the camera, the next step is plugging in the visualization!

Integrating OpenGL

Generally, it should be possible to create a 3D scene that matches exactly the true world scene, where the triangulated features appear in the scene aligned exactly with the world. I was not able to achieve that, but I got pretty close:

It's basically what you do in augmented reality, you align the virtual camera's position and rotation with the results you get from the vision part of the system. In the pose estimation we ended with a 3D rotation vector (Rodrigues form) and 3D translation vector which is used as-is, so only the rotation vector should be converted to 3x3 matrix using the Rodrigues() function.

This is the OpenGL glut display() function that draws the scene:

void display(void)
{
	glClearColor(1.0f, 1.0f, 1.0f, 0.5f);
	glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);	// Clear Screen And Depth Buffer

	//draw the background - the frame from the camers
	glMatrixMode(GL_PROJECTION);
	glPushMatrix();
	gluOrtho2D(0.0,352.0,288.0,0.0);
	glMatrixMode(GL_MODELVIEW);
	glPushMatrix();
	glDisable(GL_DEPTH_TEST);
	glDrawPixels(352,288,GL_RGB,GL_UNSIGNED_BYTE,backPxls.data);
	glEnable(GL_DEPTH_TEST);
	glPopMatrix();
	glMatrixMode(GL_PROJECTION);
	glPopMatrix();

    const double t = glutGet(GLUT_ELAPSED_TIME) / 1000.0;
	a = t*20.0;

	glMatrixMode(GL_MODELVIEW);
	glLoadIdentity();

//use the camera position 3D vector
	curCam[0] = cam[0]; curCam[1] = cam[1]; curCam[2] = cam[2];
//there seems to be some kind of offset...
	glTranslated(-curCam[0]+0.5,-curCam[1]+0.7,-curCam[2]);

//and the 3x3 rotation matrix
	double _d[16] = {	rot[0],rot[1],rot[2],0,
						rot[3],rot[4],rot[5],0,
						rot[6],rot[7],rot[8],0,
						0,	   0,	  0		,1};
	glMultMatrixd(_d);

//flip the rotation on the x-axis
	glRotated(180,1,0,0);

	//draw the 3D feature points
	glPushMatrix();
	glColor4d(1.0,0.0,0.0,1.0);
	for(unsigned int i=0;i<points1Proj.size();i++) {
		glPushMatrix();
glTranslated(points1Proj[i].x,points1Proj[i].y,points1Proj[i].z);
		glutSolidSphere(0.03,15,15);
		glPopMatrix();
	}
	glPopMatrix();

	glutSwapBuffers();

	if(!running) {
		glutLeaveMainLoop();
	}

	Sleep(25);
}

This pretty much coveres my work, in a very concise way. The complete source code will reveal all I have done, and will provide a better copy-and-paste ground for your own projects.

Things not covered in this work

Initially I tried to implement a very crucial part of the PTAM work - pairing the 3D map with 2D features in the image. This allows them to re-align the map in every frame (when the tracking is bad) so the pose estimation does not "loose grip". In essence, they keep a visual identity for each map feature, very similar to a descriptor like SURF or SIFT, so at any point they can find where in the new image are the features and recover the camera pose from the 2D-3D correspondence. I ran into a problem utilizing OpenCV's SURF functionality, it seems to have a bug when trying to compute the descriptor for user-given feature points.

Another thing I chose not to implement is creating a full map of the surroundings. I wanted to achieve a simple working solution for a small map (essentially a single frame), and see how it works. In the original work by K&M they constantly add more and more features to the map untill it has covered the whole surrounding room.

Code and Working the Program

As usual my code is available for checkout from the blog's SNV repo:

svn checkout http://morethantechnical.googlecode.com/svn/trunk/ptam ptam

To get the stereo initialization you must press [spacebar] twice: Once when the camera has stabilized and the features are stable, and another time when the camera has translated and again stabilized.
This marks the 2 keyframes that will be used for stereo init and triangulation.
From that point on, the 3D scene will start and the track-and-estimate stage begins. Try not to move the camera violently as the optical flow may suffer.

Thanks Lior for your help getting the hang of these subjects, and the opportunity to meddle with a subject I long gone wanted to explore.

I hope everyone will enjoy and learn from my enjoyment and learning.

Bye!

Roy.

Share