Near realtime face detection on the iPhone w/ OpenCV port [w/code,video]

iphone + opencv = win

Hi,
OpenCV is by far my favorite CV/image processing library. When I found an OpenCV port to the iPhone, and saw that someone had even tried to get it to do face detection, I just had to try it for myself.

In this post I'll try to run through the steps I took in order to get OpenCV running on the iPhone, and then how to get OpenCV's face detection to play nice with iPhoneOS's image buffers and video feed (not yet OS 3.0!). Then I'll talk a little about optimization.

Update: Apple officially supports camera video pixel buffers in iOS 4.x using AVFoundation, here's sample code from Apple developer.
Update: I do not have the xcodeproj file for this project, please don't ask for it. Please see here for compiling OpenCV for the iPhone SDK 4.3.

Let's begin

Cross compiling OpenCV on iPhoneOS

The good people @ computer-vision-software.com have posted a guideline on how to compile OpenCV for the iPhone and link it as static libraries, and I followed it. I did have to recompile it with one change - OpenCV needed zlib linkage, and the OpenCV configure script wasn't able to configure the makefiles to compile zlib as well. So I downloaded zlib from the net, and just added all the files to the Xcode project to compile and link. If you're trying to recreate this, remember to configure/build zlib before adding the files to Xcode so you get a zconf.h file. Now OpenCV linked perfectly.
All in all it was really not a big deal to compile OpenCV for iPhoneOS. I imagined it would be much harder...

OK moving on to

Plain vanilla face detection

So the first step is to just get OpenCV to detect a single face on a single image. But let's make it harder and use UIImage.
So first, I took OCV's facedetect.c example, and added it to the project as is. Then I added two helper functions to set up and tear down the structs and the statically held memory (things that are done in the main function of the example).

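// Load the Haar cascade from the given XML path and allocate working storage.
// cascade and storage are the static variables already declared in facedetect.c.
void init_detection(char* cascade_location) {
	cascade = (CvHaarClassifierCascade*)cvLoad( cascade_location, 0, 0, 0 );
	storage = cvCreateMemStorage(0);
}

// Working images, kept between frames by the optimized detect_and_draw further down.
static IplImage *gray = 0, *small_img = 0;

// Free everything that init_detection and the detection itself allocated.
void release_detection() {
	if (storage)
	{
		cvReleaseMemStorage(&storage);
	}
	if (cascade)
	{
		cvReleaseHaarClassifierCascade(&cascade);
	}
	cvReleaseImage(&gray);
	cvReleaseImage(&small_img);
}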

The detect_and_draw function remains exactly the same at this point. I just take the XML files of the haarcascades, and add them to the project's resources.
Now I initialize the detection structs from the UIView or UIViewController that will do the detection. The main NSBundle will find the path to the XML file:

NSString* myImage = [[NSBundle mainBundle] pathForResource:@"haarcascade_frontalface_alt" ofType:@"xml"];
char* chars = (char*)malloc(512);
[myImage getCString:chars maxLength:512 encoding:NSUTF8StringEncoding];
init_detection(chars);
free(chars); // the path is only needed while cvLoad reads the file

Awesome, now let's face-detect already! For that all we need is to add a picture of someone to the project's resources, load it, convert it to IplImage* and hand it over to detect_and_draw - simple.
I used a couple of helper functions from the informative post I mentioned earlier:

- (void)manipulateOpenCVImagePixelDataWithCGImage:(CGImageRef)inImage openCVimage:(IplImage *)openCVimage;
- (CGContextRef)createARGBBitmapContext:(CGImageRef)inImage;
- (IplImage *)getCVImageFromCGImage:(CGImageRef)cgImage;
-(CGImageRef)getCGImageFromCVImage:(IplImage*)cvImage;

Now it's just a matter of putting it together:

IplImage* im = [self getCVImageFromCGImage:[UIImage imageNamed:@"a_picture.jpg"].CGImage];
detect_and_draw(im);
UIImage* result = [UIImage imageWithCGImage:[self getCGImageFromCVImage:im]];

UIImageView* imv = [[UIImageView alloc] initWithImage:result];
[self addSubview:imv];
[imv release];

Just remember those externs if you don't use a header file (the found_face parameter of detect_and_draw belongs to the optimized version shown further down):

extern "C" void detect_and_draw( IplImage* img, CvRect* found_face );
extern "C" void init_detection(char* cascade_location);
extern "C" void release_detection();

Sweet. But detecting a face in a single photo is not so difficult - we want video and real-time face detection! So let's do that.

Tying it up with video feed from the iPhone camera (no OS 3.0 yet)

This step was so amazingly simple, it was borderline funny. I used the well-known camera frame grabbing code from Norio Numora. Of course, to align it with OS 3.0 you must plug it into the API Apple provides, and not this wily hack, but it's really a plug-and-play situation. I use it in many of my projects that use the iPhone camera, until video on OS 3.0 is finalized.
So all I needed was to set everything up, make a timer that fires every so-and-so milliseconds, and send the frame to detection:

- (id)initWithNibName:(NSString *)nibNameOrNil bundle:(NSBundle *)nibBundleOrNil {
    if (self = [super initWithNibName:nibNameOrNil bundle:nibBundleOrNil]) {
        // Initialization code
		ctad = [[CameraTestAppDelegate alloc] init];
		[ctad doInit];
		
		NSString* myImage = [[NSBundle mainBundle] pathForResource:@"haarcascade_frontalface_alt" ofType:@"xml"];
		char* chars = (char*)malloc(512); 
		[myImage getCString:chars maxLength:512 encoding:NSUTF8StringEncoding];
		init_detection(chars);
		free(chars); // the path is only needed while cvLoad reads the file
		
		[self.view addSubview:[ctad getPreviewView]];
		[self.view sendSubviewToBack:[ctad getPreviewView]];
		
		repeatingTimer = [NSTimer scheduledTimerWithTimeInterval:0.0909 target:self selector:@selector(doDetection:) userInfo:nil repeats:YES];
	}
	return self;
}

-(void)doDetection:(NSTimer*) timer {
	if([ctad getPixelData]) {
		if(!im) {
			im = cvCreateImageHeader(cvSize([ctad getVideoSize].width,[ctad getVideoSize].height), 8, 4);
		}
		cvSetData(im, [ctad getPixelData],[ctad getBytesPerRow]);
		CvRect r;
		detect_and_draw(im,&r);
		if(r.width > 0 && r.height > 0) {
			// CvRect fields are ints, so print them with %d
			NSLog(@"Face: %d,%d,%d,%d", r.x, r.y, r.width, r.height);
		}
	}
}

Note that, for optimization's sake, I only create the IplImage header once (the if block is entered only on the first frame), and on every frame after that I just point the IplImage at the buffer I got from the camera. This way the IplImage shares the camera's buffer, so there is also a little memory optimization there.
From that point on you can take it anywhere you like. Add stuff to faces, mark the face in the image, etc.

But... there's the issue of performance. This method will get you very, very bad timings: in the area of 5-15 seconds (!!) for a single frame, which is horrendous. And I promised near-realtime performance. So without further ado,

Optimizing the hell out of the detection algorithm

Well, the guys at computer-vision-software.com have done some work in the field of optimizing OpenCV's Haar-based detection, but never released code. Their method was based on the fact that the iPhone's CPU can handle integers far better than floating-point numbers, so they set out to change the algorithm to use integers. I also did that, and found that it only shaves a few milliseconds off the total time. The far more influential factors are the window size of the feature scan, the scaling factor of the window size, and the resulting number of passes.
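
To illustrate the integers-instead-of-floats idea, here is a minimal sketch of a fixed-point weighted sum, which is roughly the kind of arithmetic a Haar feature boils down to. It is not the actual change made to OpenCV: the 16-bit fractional format and the names to_fixed and weighted_sum_fixed are assumptions made up for this example.

```c
#include <stdint.h>

/* Hypothetical fixed-point format: 16 fractional bits. */
#define FP_SHIFT 16
#define FP_ONE   (1 << FP_SHIFT)

/* Convert a float weight to fixed point once, outside the hot loop. */
static int32_t to_fixed(float w) {
    return (int32_t)(w * FP_ONE + (w >= 0 ? 0.5f : -0.5f));
}

/* Weighted sum of integer rectangle sums using only integer math.
   The 64-bit accumulator avoids overflow; the final division drops
   the fractional bits (truncating toward zero). */
static int32_t weighted_sum_fixed(const int32_t *rect_sums,
                                  const int32_t *weights_fp, int n) {
    int64_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int64_t)rect_sums[i] * weights_fp[i];
    return (int32_t)(acc / FP_ONE);
}
```

The weights are converted once up front, so the per-window work is integer multiplies and adds only.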

Let me explain a little bit how the detection works in OpenCV. First you set the minimal size of the window. Then you specify a scale factor. OpenCV uses this scale factor to do multiple passes over the image to scan for feature hits. It takes the window size, say 30x30, and the factor, say 1.1, and keeps multiplying the window size by the factor until it reaches the size of the image. So for a 256x256 image you get: 30x30 scan, 33x33, 36x36, 39x39, 43x43... 244x244 - a total of 23 passes, for one frame! This is way too much... It is done to get finer results, and it may be fine for resource-abundant systems, but that is not our case.
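
The pass count described above can be reproduced with a few lines of C. This is a simplified model of the description in this post, not OpenCV's exact loop, and count_scan_passes is a name made up for the sketch.

```c
/* Count detection passes as described above: start from the minimal
   window size and keep multiplying by the scale factor while the
   window still fits in the image. */
static int count_scan_passes(double min_window, double scale_factor,
                             double image_size) {
    int passes = 0;
    if (min_window <= 0.0 || scale_factor <= 1.0)
        return 0; /* the loop below would never terminate */
    for (double w = min_window; w <= image_size; w *= scale_factor)
        passes++;
    return passes;
}
```

For a 30x30 window on a 256x256 image, count_scan_passes(30, 1.1, 256) gives the 23 passes mentioned above, while a factor of 1.2 gives 12 and a factor of 1.5 gives 6.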

So the first thing I did was slash those scans. There is, as expected, a very strong impact on the quality of the results, but the times get close to acceptable. After all my optimizations I got the timing down to as low as ~120ms.
I optimized a few things:

  • The size of the input image, originally ~300x400, was scaled down by a factor of 1.5.
  • The scale factor for cvHaarDetectObjects: I played with values ranging from 1.2 to 1.5, with pleasing timings.
  • The ROI (region of interest) in the IplImage to scan was set every frame to the previous frame's detection (the location of the face) plus some padding on the sides to allow movement of the face from frame to frame. This decreases the scanned area from the whole image to just a small portion that contains the known face. Of course, if a face was not found the ROI is reset.
  • I changed the internals of the cvHaarDetectObjects algorithm to do far fewer float multiplications, turning them into integer multiplications.
  • It dawned on me just the other day that I can also optimize the size of the search window, and not keep it constant (30x30) from frame to frame. If the last frame found a 36x36 face, the next detection should also try for a 36x36 object. I haven't tried it yet.
  • Memory optimization: don't alloc buffers every frame, share buffers, etc.
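
The untried search-window idea from the list could look something like the sketch below. Everything here is hypothetical: the name next_min_window and the slack parameter (which lets the face shrink a little between frames) are assumptions, and the result would be passed as the minimal window size instead of the constant 30.

```c
/* Hypothetical helper: choose the minimal search window for the next frame.
   prev_face_size is the side of the last detected face in small_img
   pixels, or 0 if no face was found. */
static int next_min_window(int prev_face_size, int default_min, double slack) {
    if (prev_face_size <= 0)
        return default_min;                        /* no face: fall back */
    int candidate = (int)(prev_face_size * slack); /* allow some shrinking */
    return candidate > default_min ? candidate : default_min;
}
```

With a slack of 0.75, a 64x64 face in the previous frame would start the next search at 48x48, skipping the smaller windows entirely.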

So first, the most influential change is in the detection phase:

void detect_and_draw( IplImage* img, CvRect* found_face )
{
	static CvRect prev;
	
	if(!gray) {
		gray = cvCreateImage( cvSize(img->width,img->height), 8, 1 );
		small_img = cvCreateImage( cvSize( cvRound (img->width/scale),
							 cvRound (img->height/scale)), 8, 1 );
	}

	if(prev.width > 0 && prev.height > 0) {
		cvSetImageROI(small_img, prev);

		CvRect tPrev = cvRect(prev.x * scale, prev.y * scale, prev.width * scale, prev.height * scale);
		cvSetImageROI(img, tPrev);
		cvSetImageROI(gray, tPrev);
	} else {
		cvResetImageROI(img);
		cvResetImageROI(small_img);
		cvResetImageROI(gray);
	}
	
    cvCvtColor( img, gray, CV_BGR2GRAY );
    cvResize( gray, small_img, CV_INTER_LINEAR );
    cvEqualizeHist( small_img, small_img );
    cvClearMemStorage( storage );

	CvSeq* faces = mycvHaarDetectObjects( small_img, cascade, storage,
		1.2, 0, 0
		|CV_HAAR_FIND_BIGGEST_OBJECT
		|CV_HAAR_DO_ROUGH_SEARCH
		//|CV_HAAR_DO_CANNY_PRUNING
		//|CV_HAAR_SCALE_IMAGE
		,
		cvSize(30, 30) );
		
	if(faces->total>0) {
		CvRect* r = (CvRect*)cvGetSeqElem( faces, 0 );
		int startX,startY;
		if(prev.width > 0 && prev.height > 0) {
			r->x += prev.x;
			r->y += prev.y;
		}
		startX = MAX(r->x - PAD_FACE,0);
		startY = MAX(r->y - PAD_FACE,0);
		int w = small_img->width - startX - r->width - PAD_FACE_2;
		int h = small_img->height - startY - r->height - PAD_FACE_2;
		int sw = r->x - PAD_FACE, sh = r->y - PAD_FACE;
		prev = cvRect(startX, startY, 
					  r->width + PAD_FACE_2 + ((w < 0) ? w : 0) + ((sw < 0) ? sw : 0),
					  r->height + PAD_FACE_2 + ((h < 0) ? h : 0) + ((sh < 0) ? sh : 0));
		printf("found face (%d,%d,%d,%d) setting ROI to (%d,%d,%d,%d)\n",r->x,r->y,r->width,r->height,prev.x,prev.y,prev.width,prev.height);
		found_face->x = (int)((double)r->x * scale);
		found_face->y = (int)((double)r->y * scale);
		found_face->width = (int)((double)r->width * scale);
		found_face->height = (int)((double)r->height * scale);
	} else {
		prev.width = prev.height = found_face->width = found_face->height = 0;
	}
}

As you can see I keep the previous face in prev, and use it to set the ROI of the images for the next frame. Note that the small_img is a scaled-down version of the input image, so the detection results must be scaled-up to match the real size of the input.
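
The padding arithmetic in detect_and_draw is a bit dense, so here is a standalone sketch of what it amounts to in the common cases: grow the face rectangle by some padding and clip it to the image, then scale a result back up to input coordinates. It assumes PAD_FACE_2 is twice PAD_FACE (the post does not show their definitions), and Rect, padded_roi and scale_up are stand-in names so the example compiles without OpenCV.

```c
/* Minimal stand-in for CvRect. */
typedef struct { int x, y, width, height; } Rect;

static int clampi(int v, int lo, int hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Grow face by pad pixels on every side, then clip to the image. */
static Rect padded_roi(Rect face, int pad, int img_w, int img_h) {
    int x0 = clampi(face.x - pad, 0, img_w);
    int y0 = clampi(face.y - pad, 0, img_h);
    int x1 = clampi(face.x + face.width + pad, 0, img_w);
    int y1 = clampi(face.y + face.height + pad, 0, img_h);
    Rect roi = { x0, y0, x1 - x0, y1 - y0 };
    return roi;
}

/* Map a rectangle found in the scaled-down image back to input coordinates. */
static Rect scale_up(Rect r, double scale) {
    Rect out = { (int)(r.x * scale), (int)(r.y * scale),
                 (int)(r.width * scale), (int)(r.height * scale) };
    return out;
}
```

Clipping both corners and deriving the size from them keeps the ROI inside the image on all four sides without the per-side correction terms.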

Now, I could bore you with the details of how I changed cvHaarDetectObjects to use more integers, but I won't. Anyway, it's all in the code, which is freely available, so you can diff it against OpenCV's cvhaar.cpp and see the changes. In short, what I did was:

  • Commented out image scaling and Canny pruning.
  • In cvSetImagesForHaarClassifierCascade, which fires many times for each frame and is responsible for scaling/shifting/rotating the Haar classifiers to get better detection, I changed the weights and sizes to be integers rather than floats.
  • In cvRunHaarClassifierCascade, which calculates the score for a single Haar feature hit, I changed the result calculation to use integers instead of floats.
  • I played around with integer-oriented calculations of the sqrt function that cvRunHaarClassifierCascade uses (it fires many, many times each frame), but that actually caused a slow-down on the device. It turns out the standard library (math.h) implementation is the best.
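
For reference, this is what an integer-oriented square root typically looks like. The post does not say which variants were tried, so this classic bit-by-bit version (isqrt32 is a name made up here) is only an example of the genre, not the code that was benchmarked.

```c
#include <stdint.h>

/* Classic bit-by-bit integer square root: returns floor(sqrt(n)). */
static uint32_t isqrt32(uint32_t n) {
    uint32_t result = 0;
    uint32_t bit = 1u << 30;   /* largest power of four that fits in 32 bits */
    while (bit > n)
        bit >>= 2;             /* start at the largest power of four <= n */
    while (bit != 0) {
        if (n >= result + bit) {
            n -= result + bit;
            result = (result >> 1) + bit;
        } else {
            result >>= 1;
        }
        bit >>= 2;
    }
    return result;
}
```

A routine like this loops up to 16 times per call, which may be part of why the plain sqrt from math.h came out ahead on the device.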

Well guys, that's pretty much everything I discovered in the field. Please keep working on it. I'm anxious to see true real-time face detection on the iPhone.

Time for video proof? You bet

Here's proof that all I wrote here is not total BS.

Code

Code is, as usual, available in the Google Code SVN repo:
http://code.google.com/p/morethantechnical/source/browse/#svn/trunk/FaceDetector-iPhone

OK, 'Till next time, enjoy
Roy.
