
Implementing PTAM: stereo, tracking and pose estimation for AR with OpenCV [w/ code]

Hi
Been working hard at a project for school the past month, implementing one of the more interesting works I’ve seen in the AR arena: Parallel Tracking and Mapping (PTAM) [PDF]. This is a work by Georg Klein [homepage] and David Murray from Oxford University, presented at ISMAR 2007.
When I first saw it on YouTube [link] I immediately saw the immense potential – mobile markerless augmented reality. I thought I should get to know this work a bit more closely, so I chose to implement it as part of an advanced computer vision course given by Dr. Lior Wolf [link] at TAU.
The work is very extensive, and clearly the result of deep research in the field, so I set out to achieve a few selected features: stereo initialization, tracking, and small-map upkeep. I chose not to implement relocalization and full map handling.
This post is kind of a tutorial for 3D reconstruction with OpenCV 2.0. I will show practical use of the functions in cvtriangulation.cpp, which are undocumented and in fact incomplete. Furthermore, I’ll show how to easily combine OpenCV and OpenGL for 3D augmentations, something that is only briefly described in the docs or online.
Here are the steps I took and the things I learned in the process of implementing the work.
Update: A nice patch by yazor fixes the video mismatching – thanks! Also, a nice application by Zentium called “iKat” is doing some kick-ass mobile markerless augmented reality.

Preparations…

Before going straight to coding, I had to prepare a few things.

  • A working compilation of OpenCV – not trivial with the new version 2.0.
  • A calibrated camera.
  • Test data.

Compiling OpenCV 2.0 proved to be a bit tricky. Even though the SourceForge project offers a binary release for Win32, I compiled the whole thing from source. It turned out the binary release doesn’t contain .lib files, and in any case has compatibility issues between MS VS 2005 and 2008 – something about the embedded manifest [google]. I downloaded the freshest source from SVN and compiled it, but that didn’t solve the debug-release problem, so I was left using the release DLLs even in the debug environment.
Initially I thought I’d try an uncalibrated camera approach, but I soon abandoned it. I had to calibrate my cameras, which I did very easily using OpenCV’s “calibration.cpp”, which strangely is not built when building all the examples – it has to be built manually. But everything went smoothly, and I soon got a calibration matrix (focal length, center of projection) and radial distortion coefficients.
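For reference, this is roughly what the calibration boils down to if you call the C++ API yourself instead of using calibration.cpp. This is only a minimal sketch with made-up variable names, an assumed 9×6 chessboard with 2.5cm squares, and header/signature details that vary a bit between OpenCV versions:

//Sketch: calibrate from a set of chessboard images (hypothetical "boardImages")
#include <opencv2/opencv.hpp>
#include <vector>
using namespace cv;
using namespace std;

void calibrateFromChessboards(const vector<Mat>& boardImages, Mat& cameraMatrix, Mat& distCoeffs) {
	Size boardSize(9,6);         //inner corners of the chessboard (assumption)
	float squareSize = 0.025f;   //square size in meters (assumption)
	vector<vector<Point2f> > imagePoints;
	vector<vector<Point3f> > objectPoints;
	//the board's corners in its own coordinate frame (z = 0 plane)
	vector<Point3f> corners3d;
	for(int i=0;i<boardSize.height;i++)
		for(int j=0;j<boardSize.width;j++)
			corners3d.push_back(Point3f(j*squareSize,i*squareSize,0));
	//collect the detected 2D corners from every view where the board was found
	for(size_t k=0;k<boardImages.size();k++) {
		vector<Point2f> corners;
		if(findChessboardCorners(boardImages[k],boardSize,corners)) {
			imagePoints.push_back(corners);
			objectPoints.push_back(corners3d);
		}
	}
	vector<Mat> rvecs,tvecs; //per-view extrinsics, not needed afterwards
	calibrateCamera(objectPoints,imagePoints,boardImages[0].size(),cameraMatrix,distCoeffs,rvecs,tvecs);
	//cameraMatrix now holds the focal length + principal point, distCoeffs the distortion coefficients
}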

Getting Test Data

For the test data I wanted to get a few views of a planar scene, where the first two views are separated only by a translation of ~5cm, as K&M do in the PTAM article. This known translation is helpful when triangulating the initial features in the scene: when you have prior knowledge of where the cameras are, you can simply intersect the epipolar lines between the two views and recover the 3D position of the points – up to scale. Keep in mind you must also have feature correspondence: a point in image A must be matched to its corresponding point in image B.
To achieve this I set up a small program that uses optical flow to track some 2D features in the scene and grab a few screens + feature vectors. See ‘capture_data.cpp’.
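The core of such a capture tool is just a couple of calls – roughly like this (a sketch with made-up names, assuming the same OpenCV headers and namespaces as the calibration sketch above; this is not the actual contents of capture_data.cpp):

//Sketch: track features from the previous frame into the current one with pyramidal LK optical flow
void trackFeatures(const Mat& prevGray, const Mat& currGray, vector<Point2f>& prevPts, vector<Point2f>& currPts) {
	if(prevPts.empty())
		goodFeaturesToTrack(prevGray,prevPts,400,0.01,10); //detect up to 400 corners to start with
	vector<uchar> status; vector<float> err;
	calcOpticalFlowPyrLK(prevGray,currGray,prevPts,currPts,status,err);
	//keep only successfully tracked points, so prevPts[i] <-> currPts[i] stays a valid correspondence
	vector<Point2f> p0,p1;
	for(size_t i=0;i<status.size();i++) {
		if(status[i]) { p0.push_back(prevPts[i]); p1.push_back(currPts[i]); }
	}
	prevPts = p0; currPts = p1;
}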

Stereo Initialization

Now that I have 2 views with feature correspondence:

I would like to triangulate the features. This is possible, as I discussed earlier, since I know the rotation (none), translation (5cm on -x axis) and camera calibration parameters (focal length, center of projection).

Triangulation

For triangulation, OpenCV has only recently added a couple of functions that implement triangulation [link] as shown by Hartley & Zisserman [PDF, page 12]. However, these functions are not formally documented, and in fact they are missing some important parts. This is how I used cvTriangulatePoints(), which is the key function:

//this function will initialize the 3D features from two views
void stereoInit() {
	//first load the camera intrinsic parameters
	FileStorage fs("cam_work.out",CV_STORAGE_READ);
	FileNode fn = fs["camera_matrix"];
	camera_matrix = Mat((CvMat*)fn.readObj(),true);
	fn = fs["distortion_coefficients"];
	distortion_coefficients = Mat((CvMat*)fn.readObj(),true);

	//vector<Point2d> points[2]; //these Point2d vectors hold the 2D features, double precision, from the 2 views
	//get a copy of the points
	_points[0] = points[0];
	_points[1] = points[1];
	Mat pts1M(_points[0]), pts2M(_points[1]); //very easy in OpenCV 2.0 to convert a vector<> to a Mat

	//undistort the points
	Mat tmp,tmpOut;
	pts1M.convertTo(tmp,CV_32FC2);  //undistortPoints takes floats, not doubles, so convert to Point2f
	undistortPoints(tmp,tmpOut,camera_matrix,distortion_coefficients);
	tmpOut.convertTo(pts1M,CV_64FC2);  //go back to double precision
	pts2M.convertTo(tmp,CV_32FC2);
	undistortPoints(tmp,tmpOut,camera_matrix,distortion_coefficients);
	tmpOut.convertTo(pts2M,CV_64FC2);

	vector<uchar> tri_status; //status for each point: a good point will have 1, a bad one 0
	//now triangulate
	triangulate(_points[0],_points[1],tri_status);
}
void triangulate(vector<Point2d>& points1, vector<Point2d>& points2, vector<uchar>& status) {
	//Convert points to 1-channel, 2-rows, double precision - This is important - see the code
...
	Mat ___tmp(2,pts1Mt.cols,CV_64FC1,__d);
...
	Mat ___tmp1(2,pts2Mt.cols,CV_64FC1,__d1);
...
	CvMat __points1 = ___tmp, __points2 = ___tmp1;
	//projection matrices
	double P1d[12] = {	-1,0,0,0,
						0,1,0,0,
						0,0,1,0 };	//Identity, but looking into -z axis
	Mat P1m(3,4,CV_64FC1,P1d);
	CvMat* P1 = &(CvMat)P1m;
	double P2d[12] = {	-1,0,0,-5,
						0,1,0,0,
						0,0,1,0 };  //Identity rotation, 5cm -x translation, looking into -z axis
	Mat P2m(3,4,CV_64FC1,P2d);
	CvMat* P2 = &(CvMat)P2m;
	float _d[1000] = {0.0f}; //output buffer for a 4 x N matrix, so this assumes at most 250 points
	Mat outTM(4,points1.size(),CV_32FC1,_d);
	CvMat* out = &(CvMat)outTM;
//using cvTriangulate with the created structures
	cvTriangulatePoints(P1,P2,&__points1,&__points2,out);
//we should check the triangulation result by reprojecting 3D->2D and checking distance
	vector<Point2d> projPoints[2] = {points1,points2};
	double point2D_dat[3] = {0};
	double point3D_dat[4] = {0};
	Mat twoD(3,1,CV_64FC1,point2D_dat);
	Mat threeD(4,1,CV_64FC1,point3D_dat);
	Mat P[2] = {Mat(P1),Mat(P2)};
	int oc = out->cols, oc2 = out->cols*2, oc3 = out->cols*3;
	status = vector<uchar>(oc);
	//scan all points, reproject 3D->2D, and keep only good ones
	for(int i=0;i<oc;i++) {
		double W = out->data.fl[i+oc3];
        point3D_dat[0] = out->data.fl[i] / W;
        point3D_dat[1] = out->data.fl[i+oc] / W;
        point3D_dat[2] = out->data.fl[i+oc2] / W;
        point3D_dat[3] = 1;
        bool push = true;
        /* Reproject this point into each camera */
        for( int currCamera = 0; currCamera < 2; currCamera++ )
        {
            //reproject using the P matrix of the current camera;
            //twoD wraps point2D_dat, so the product lands in that buffer
            twoD = P[currCamera] * threeD;
            float x,y;
            float xr,yr,wr;
            x = (float)projPoints[currCamera][i].x; //observed 2D point
            y = (float)projPoints[currCamera][i].y;
            wr = (float)point2D_dat[2];             //reprojected homogeneous point
            xr = (float)(point2D_dat[0]/wr);
            yr = (float)(point2D_dat[1]/wr);
            float deltaX,deltaY;
            deltaX = (float)fabs(x-xr);             //reprojection error
            deltaY = (float)fabs(y-yr);
			//printf("error from cam %d (%.2f,%.2f): %.6f %.6f\n",currCamera,x,y,deltaX,deltaY);
			if(deltaX > 0.01 || deltaY > 0.01) {
				push = false;
			}
        }
		if(push) {
			// A good 3D reconstructed point - add it to the known world points
			double s = 7;	//hard-coded scale factor to bring the points into convenient scene units
			Point3d p3d(point3D_dat[0]/s,point3D_dat[1]/s,point3D_dat[2]/s);
			//printf("%.3f %.3f %.3f\n",p3d.x,p3d.y,p3d.z);
			points1Proj.push_back(p3d);
			status[i] = 1;
		} else {
			status[i] = 0;
		}
	}
}

OK, now that I have (hopefully) triangulated 3D features from the initial state – 2 views of a planar scene with a 5cm translation on the X axis – I can move on to pose estimation.

Pose Estimation

Theoretically, if I know the 3D position of features in the world and their respective 2D positions in the image, it should be easy to recover the position of the camera, because there is a rotation matrix and a translation vector that define this transformation. Practically, in OpenCV, finding the position of an object from 3D-2D correspondences is done using the solvePnP() [link] function.
Since I have an initial guess of the rotation and translation – from the first 2 frames – I can “help” the function estimate the new ones.

void findExtrinsics(vector<Point2d>& points, vector<double>& rv, vector<double>& tv) {
	//estimate extrinsics for these points
	Mat rvec(rv),tvec(tv);
//initial "guess", in case it wasn't already supplied
	if(rv.size()!=3) {
		rv = vector<double>(3);
		rvec = Mat(rv);
		double _d[9] = {1,0,0,
						0,-1,0,
						0,0,-1};
		Rodrigues(Mat(3,3,CV_64FC1,_d),rvec);
	}
	if(tv.size()!=3) {
		tv = vector<double>(3);
		tv[0]=0;tv[1]=0;tv[2]=0;
		tvec = Mat(tv);
	}
	//create a float representation of the points
	vector<Point2f> v2(points.size());
	Mat tmpOut(v2);
	Mat _tmpOut(points);
	_tmpOut.convertTo(tmpOut,CV_32FC2);
	solvePnP(points1projMF,tmpOut,camera_matrix,distortion_coefficients,rvec,tvec,true);
	printf("frame extrinsic:\nrvec: %.3f %.3f %.3f\ntvec: %.3f %.3f %.3f\n",rv[0],rv[1],rv[2],tv[0],tv[1],tv[2]);
//the output of the function is a Rodrigues form of rotation, so convert to regular rot-matrix
	Mat rotM(3,3,CV_64FC1);
	Rodrigues(rvec,rotM);
	double* _r = rotM.ptr<double>();
	printf("rotation mat: \n %.3f %.3f %.3f\n%.3f %.3f %.3f\n%.3f %.3f %.3f\n",
		_r[0],_r[1],_r[2],_r[3],_r[4],_r[5],_r[6],_r[7],_r[8]);
}

After getting the extrinsic parameters of the camera, the next step is plugging in the visualization!

Integrating OpenGL

In principle, it should be possible to create a 3D scene that exactly matches the true world, with the triangulated features aligned exactly with their real-world counterparts. I was not able to achieve that, but I got pretty close:

It’s basically what you do in augmented reality: you align the virtual camera’s position and rotation with the results you get from the vision part of the system. The pose estimation ended with a 3D rotation vector (in Rodrigues form) and a 3D translation vector; the translation is used as-is, and only the rotation vector needs to be converted to a 3×3 matrix using the Rodrigues() function.
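The only subtle point is the matrix layout: OpenGL expects a column-major 4×4 modelview, while OpenCV’s rotation matrix is row-major, so the 3×3 block has to be written out transposed. Here is a minimal sketch, assuming rotM and tvec are the cv::Mat outputs of the pose estimation above and that the project’s OpenCV/OpenGL headers are already included; it deliberately ignores the OpenCV-to-OpenGL axis flip that the glRotated(180,1,0,0) call below (and yazor’s patch) take care of:

//Sketch: build an OpenGL modelview array from OpenCV's rotation matrix and translation vector
void setModelviewFromPose(const Mat& rotM, const Mat& tvec) {
	const double* r = rotM.ptr<double>();
	const double* t = tvec.ptr<double>();
	double m[16] = {	r[0], r[3], r[6], 0,	//each OpenGL column is a row of rotM (transpose)
						r[1], r[4], r[7], 0,
						r[2], r[5], r[8], 0,
						t[0], t[1], t[2], 1 };	//translation goes into the last column
	glMatrixMode(GL_MODELVIEW);
	glLoadMatrixd(m);
}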
This is the OpenGL glut display() function that draws the scene:

void display(void)
{
	glClearColor(1.0f, 1.0f, 1.0f, 0.5f);
	glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);	// Clear Screen And Depth Buffer
	//draw the background - the frame from the camera
	glMatrixMode(GL_PROJECTION);
	glPushMatrix();
	gluOrtho2D(0.0,352.0,288.0,0.0);
	glMatrixMode(GL_MODELVIEW);
	glPushMatrix();
	glDisable(GL_DEPTH_TEST);
	glDrawPixels(352,288,GL_RGB,GL_UNSIGNED_BYTE,backPxls.data);
	glEnable(GL_DEPTH_TEST);
	glPopMatrix();
	glMatrixMode(GL_PROJECTION);
	glPopMatrix();
    const double t = glutGet(GLUT_ELAPSED_TIME) / 1000.0;
	a = t*20.0;
	glMatrixMode(GL_MODELVIEW);
	glLoadIdentity();
//use the camera position 3D vector
	curCam[0] = cam[0]; curCam[1] = cam[1]; curCam[2] = cam[2];
//there seems to be some kind of offset...
	glTranslated(-curCam[0]+0.5,-curCam[1]+0.7,-curCam[2]);
//and the 3x3 rotation matrix
	double _d[16] = {	rot[0],rot[1],rot[2],0,
						rot[3],rot[4],rot[5],0,
						rot[6],rot[7],rot[8],0,
						0,	   0,	  0		,1};
	glMultMatrixd(_d);
//flip the rotation on the x-axis
	glRotated(180,1,0,0);
	//draw the 3D feature points
	glPushMatrix();
	glColor4d(1.0,0.0,0.0,1.0);
	for(unsigned int i=0;i<points1Proj.size();i++) {
		glPushMatrix();
		glTranslated(points1Proj[i].x,points1Proj[i].y,points1Proj[i].z);
		glutSolidSphere(0.03,15,15);
		glPopMatrix();
	}
	glPopMatrix();
	glutSwapBuffers();
	if(!running) {
		glutLeaveMainLoop();
	}
	Sleep(25);
}

This pretty much covers my work, in a very concise way. The complete source code will reveal everything I have done, and will provide better copy-and-paste ground for your own projects.

Things not covered in this work

Initially I tried to implement a very crucial part of the PTAM work – pairing the 3D map with 2D features in the image. This allows them to re-align the map in every frame (when the tracking is bad) so the pose estimation does not “lose grip”. In essence, they keep a visual identity for each map feature, very similar to a descriptor like SURF or SIFT, so at any point they can find where the features are in the new image and recover the camera pose from the 2D-3D correspondence. I ran into a problem using OpenCV’s SURF functionality; it seems to have a bug when trying to compute the descriptor for user-given feature points.
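For what it’s worth, the idea looks roughly like this with OpenCV’s features2d API. This is only a sketch with assumed variable names: it presumes the map stores one descriptor row per 3D point, the SURF classes have moved between the features2d and nonfree modules across OpenCV versions, and it does not work around the descriptor bug I hit:

//Sketch: re-find the map's 3D points in a new frame by descriptor matching, then recover the pose
void relocalize(const Mat& frameGray, const Mat& mapDescriptors, const vector<Point3f>& mapPoints3d,
				const Mat& camera_matrix, const Mat& distortion_coefficients, Mat& rvec, Mat& tvec) {
	//detect keypoints and compute their descriptors in the new frame
	SurfFeatureDetector detector(400);
	SurfDescriptorExtractor extractor;
	vector<KeyPoint> keypoints; Mat descriptors;
	detector.detect(frameGray,keypoints);
	extractor.compute(frameGray,keypoints,descriptors);
	//match each map descriptor against the new frame's descriptors
	BFMatcher matcher(NORM_L2);
	vector<DMatch> matches;
	matcher.match(mapDescriptors,descriptors,matches);
	//build the 2D-3D correspondence lists and solve for the camera pose
	vector<Point3f> obj; vector<Point2f> img;
	for(size_t i=0;i<matches.size();i++) {
		obj.push_back(mapPoints3d[matches[i].queryIdx]);
		img.push_back(keypoints[matches[i].trainIdx].pt);
	}
	solvePnP(obj,img,camera_matrix,distortion_coefficients,rvec,tvec);
}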
Another thing I chose not to implement is creating a full map of the surroundings. I wanted to achieve a simple working solution for a small map (essentially a single frame), and see how it works. In the original work by K&M they constantly add more and more features to the map until it covers the whole surrounding room.

Code and Working the Program

As usual, my code is available for checkout from the blog’s SVN repo:

svn checkout http://morethantechnical.googlecode.com/svn/trunk/ptam ptam

To get the stereo initialization you must press [spacebar] twice: once when the camera has stabilized and the features are stable, and a second time after the camera has translated and stabilized again.
This marks the 2 keyframes that will be used for stereo init and triangulation.
From that point on, the 3D scene will start and the track-and-estimate stage begins. Try not to move the camera violently, as the optical flow may suffer.
Thanks Lior for your help getting the hang of these subjects, and for the opportunity to meddle with a subject I had long wanted to explore.
I hope everyone will enjoy and learn from my enjoyment and learning.
Bye!
Roy.

23 replies on “Implementing PTAM: stereo, tracking and pose estimation for AR with OpenCV [w/ code]”

Great job and thanks for sharing. It is such a pain to run the original PTAM under Windows – don’t know why it was so slow; this looks much more impressive! 1 comment + 1 question, please.
Comment: it would be of great help if you added a description of how to run this package, such as pressing spacebar twice to save two frames and let the program go; otherwise you get a broken error if the user simply presses run and does nothing. It took me some time to figure this out…
Question: still studying, but wondering: why do you flip the ptam window w.r.t. the “aaa” window – any specific reason?

Patch to correct the “view mismatching” issue:
1. in main.cpp function start_opengl(), change cvFlip line to:
cvFlip(&cvMat(VIDEO_HEIGHT,VIDEO_WIDTH,CV_8UC3,backPxls.data),0,1);
2. in opngl_stereo.cpp function display():
replace this part of the code:
curCam[0] = cam[0]; curCam[1] = cam[1]; curCam[2] = cam[2];
glTranslated(-curCam[0]+0.5,-curCam[1]+0.7,-curCam[2]);
//double _d[16];
//glGetDoublev(GL_MODELVIEW_MATRIX,_d);
double _d[16] = { rot[0],rot[1],rot[2],0,
rot[3],rot[4],rot[5],0,
rot[6],rot[7],rot[8],0,
0, 0, 0 ,1};
glMultMatrixd(_d);
glRotated(180,1,0,0);
with this new code:
double m[16] = { rot[0],-rot[3],-rot[6],0,
rot[1],-rot[4],-rot[7],0,
rot[2],-rot[5],-rot[8],0,
cam[0],-cam[1],-cam[2],1};
glLoadMatrixd(m);
This will correct the mismatching issue; other visible errors are mainly due to calibration noise.

Why am I getting the run-time error
“The variable threadhandle is being used without initialised”???
I am curious to see the results.. pretty good job… but someone help me here..

Hi, I am getting the error
“The variable Threadhandle is being used without being initialised”. Any help?

Hi Samantha
It means exactly that: the variable ThreadHandle is not initialized, and that’s true. You can fix it by setting it to NULL when it’s declared.

Hi Roy,
I have one other question here..
I see that you are multiplying the modelview matrix with the inverse of the camera transformation (rv and tv) that you found…
But why the negative signs? Does just transposing do the job?

Hi. I am struggling to find out how to show an image from the camera, obtained with OpenCV, in an OpenGL window. I managed to make it by getting the image and showing it as a texture, but the process is extremely slow. Could you please help me out?

Does your triangulation method really work? With what OpenCV versions? Because undistortPoints gives me normalized coordinates – I don’t see that covered in your code.
Greetings.

Dear Roy,
It was fascinating to read about your recent project. I have a great opportunity that I would like to discuss with you incorporating such innovative technology.
Please feel free to contact me.
I really look forward to touching base with you soon.
Best wishes!!

Hi Roy and Arnon,
I’m currently working on an idea in the field of PTAM, and would be really happy to discuss it with you.
Can we please meet sometime?
Thanks,
Ohad Rozen

Hi Roy,
When you add a new 3D point, I notice that you divide point3D_dat by a constant s (which is hard-coded to 7). Could you tell me how you determine this number? And is it something to do with the 5cm translation on the X axis between the 2 views? If I apply a rotation in the camera motion, how is it going to affect this constant?
Thanks,
Simon

When I press the “spacebar” the second time, the program crashes. OpenCV Error:
Assertion failed (CV_IS_MAT(_cameraMatrix) && _cameraMatrix->rows == 3 && _cameraMatrix->cols == 3) in unknown function.

Hi Roy, I hope you can give us some help here.
When I press “spacebar” the second time, the program stops (crashes).
Access violation 0x00000000.
Thanks

@houssam This seems like a memory access problem (i.e. an invalid pointer). If you are able to run it in debug mode and pinpoint the problem, that’s best, but if not, just start commenting out code (start from main()) and see where it breaks.
Keep in mind this project is years old now, using old versions of OpenCV where the API has since completely changed. So be prepared to rewrite most of it.


Hello Sir,
I tried the above code but I keep getting the error
“Failed to open input file (some file name): wrong file extension”
I get this error even though I give the correct path to record3.wmv.

Hi, the code suddenly stops when I press space the second time.
Error:
OpenCV Error: Assertion failed (CV_IS_MAT(_cameraMatrix) && _cameraMatrix->rows == 3 && _cameraMatrix->cols == 3) in unknown function, file ..\..\..\opencv\modules\imgproc\src\undistort.cpp, line 282
How do I resolve this?
