
Hand gesture recognition via model fitting in energy minimization w/OpenCV

[Image: hands with model fitted]

Hi,
Just wanted to share a thing I made: a simple 2D hand pose estimator, using skeleton model fitting. Basically there has been a crap load of work on hand pose estimation, but I was inspired by this ancient work. The problem is that when you set out to find a good solution, everything is very hard to understand and implement. In such cases I like to take inspiration from a method and just set out with my own implementation. This way I understand what’s going on, simplify it, and share it with you!
Anyway, let’s get down to business.
Edit (6/5/2014): Also see some of my other work on hand gesture recognition using smart contours and particle filters

A bit about energy minimization problems

A dear friend revealed before me the wonders of energy minimization problems a while back, and ever since I have been trying to find uses for that method. Basically, it means trying to find a minimum (ideally the global one) of a complicated energy function (usually with many parameters) by following the function’s gradient downhill. Such methods are often called Gradient Descent, and are used mostly for non-linear systems that can’t be solved easily using a least-squares variant.
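To make the idea concrete, here is a minimal sketch of fixed-step gradient descent on a toy two-parameter energy. It is not what tnc.c does internally; the names `gradient_descent` and `toy_gradient` are made up for this illustration, and the step size and iteration count are arbitrary.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper: plain fixed-step gradient descent.
// grad(x) must return the gradient of the energy at x (same size as x).
std::vector<double> gradient_descent(
        std::vector<double> x,
        const std::function<std::vector<double>(const std::vector<double>&)>& grad,
        double step, int iterations) {
    assert(step > 0.0);
    for (int it = 0; it < iterations; ++it) {
        const std::vector<double> g = grad(x);
        assert(g.size() == x.size());
        for (std::size_t i = 0; i < x.size(); ++i) {
            x[i] -= step * g[i];  // move against the gradient, i.e. downhill
        }
    }
    return x;
}

// Toy energy E(x, y) = (x - 3)^2 + (y + 1)^2, given here by its analytic gradient.
// Its only minimum is at (3, -1).
std::vector<double> toy_gradient(const std::vector<double>& p) {
    return { 2.0 * (p[0] - 3.0), 2.0 * (p[1] + 1.0) };
}
```

Calling `gradient_descent({0.0, 0.0}, toy_gradient, 0.1, 200)` walks from the origin towards (3, -1). On a bumpy energy like the hand fitting one, the same walk can get stuck in a local minimum, which is why a good starting point matters.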
A lot of work in computer vision was done using energy functions (I believe the most seminal was Snakes, over 10,000 citations), usually having two terms: Internal energy and External energy. The equilibrium between the two terms should result in a low-energy system – our optimal result. So we would like to formulate the terms in our system such that when they are 0 – they describe the system as we want it.
Following the works with active contours, I believe the external energy function should have to do with how the hand model fits to the hand blob, and the internal energy will have to do with how “comfortable” the hand is with this configuration.

The hand model

Let’s see what a 2D model of a hand might look like

Kinda looks like a rake… huh?
There are some parts that practically can’t change much, e.g. the palm (orange), and some that might change drastically, e.g. the fingers (red). Each finger has joints (blue circles), and a tip (bigger blue circle).

typedef struct finger_data {
	Point2d origin_offset;		//base of finger relative to center of hand
	double a;					//angle
	vector<double> joints_a;	//angles of joints
	vector<double> joints_d;	//bone lengths
} FINGER_DATA;

typedef struct hand_data {
	FINGER_DATA fingers[5];		//fingers
	double a;					//angle of whole hand
	Point2d origin;				//center of palm
	Point2d origin_offset;		//offset from center for optimization
	double size;				//relative size of hand = length of a finger
} HAND_DATA;

At first I thought, since I’m only interested in the tips of the fingers, to use Inverse Kinematics to guide the tips to a certain point and let the joints find their own minimal energy position, following this article. But I abandoned this method because of complications.
I also had to simplify this model, for real-time estimation and also for better results. So in the end I ended up with a very rigid model, one that allows only one joint per finger and no angular movement.
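To show how joint angles and bone lengths like the ones in the structs above turn into 2D points, here is a rough sketch of walking along one finger from its base to its tip. This is not the project's `newTip` function: `Vec2` and `finger_points` are hypothetical names, and I am assuming each joint angle is measured relative to the previous bone.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec2 { double x, y; };  // stand-in for cv::Point2d, to keep the sketch dependency-free

// Hypothetical helper: returns the position of every joint of one finger.
// The last returned point is the fingertip.
//   base            finger base, in hand coordinates
//   base_angle      direction of the finger, in radians
//   joint_angles[i] extra bend at joint i, relative to the previous bone (assumption)
//   bone_lengths[i] length of bone i
std::vector<Vec2> finger_points(Vec2 base, double base_angle,
                                const std::vector<double>& joint_angles,
                                const std::vector<double>& bone_lengths) {
    assert(joint_angles.size() == bone_lengths.size());
    std::vector<Vec2> pts;
    Vec2 p = base;
    double a = base_angle;
    for (std::size_t i = 0; i < bone_lengths.size(); ++i) {
        a += joint_angles[i];                    // accumulate the bend
        p.x += bone_lengths[i] * std::cos(a);    // step along the bone
        p.y += bone_lengths[i] * std::sin(a);
        pts.push_back(p);
    }
    return pts;
}
```

With one bone per finger and all joint angles fixed at zero, this collapses to the rigid "rake" model: each tip is simply the base plus a bone length along the finger direction.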

Using tnc.c

tnc.c is a “library”, essentially one C file, that implements a truncated Newton minimizer with a line search, able to find a minimum point of a multivariate function. I’m not certain of the algorithm details, and they’re not so important, as it can be replaced with any other similar library. But tnc.c has a great advantage: it is dead simple. One function starts the gradient descent, calling back a function of yours to calculate the energy and its gradients.
So basically I had to write just one very short function:

static int my_f(double x[], double *f, double g[], void *state) {
	DATA_FOR_TNC* d_ptr = (DATA_FOR_TNC*)state;
	DATA_FOR_TNC new_data = *d_ptr;
	mapVecToData(x, new_data.hand); //evaluate the energy at the state the search is examining
	*f = calc_Energy(new_data,*d_ptr);
	//calc gradients
	{
		double _x[SIZE_OF_HAND_DATA];
		for(int i=0;i<SIZE_OF_HAND_DATA;i++) {
			memcpy(_x, x, sizeof(double)*SIZE_OF_HAND_DATA); //reset variables
			_x[i] = _x[i] + EPSILON; //change only one variable
			mapVecToData(_x, new_data.hand);
			double E_epsilon = calc_Energy(new_data,*d_ptr);
			g[i] = ((E_epsilon - *f) / EPSILON); //calc the gradient for this variable change
		}
	}
	return 0;
}

This function is called by tnc.c on every iteration of the search. double x[] is the state of the variables the search is now examining, double* f is the energy for this state, double g[] are the gradients (same size as x[]), and void* state is a user-defined variable that is carried along the process.
So what I did is simply nudge each parameter in turn by a small EPSILON, to test how it affects the energy in the system. I measure the energy with the nudged parameter, subtract the energy of the “natural” setup (without any changes to the parameters), divide by EPSILON, and I get the gradient for this parameter.
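The same trick works for any energy function, not just the hand one. Here is a standalone sketch of that forward-difference gradient on a toy energy; `numeric_gradient` and `toy_energy` are names I made up for the illustration, and the choice of `eps` is up to you.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper: forward-difference estimate of the gradient of `energy` at x.
// g[i] = (E(x + eps * e_i) - E(x)) / eps, where e_i is the i-th unit vector.
std::vector<double> numeric_gradient(
        const std::function<double(const std::vector<double>&)>& energy,
        const std::vector<double>& x, double eps) {
    assert(eps > 0.0);
    const double e0 = energy(x);          // energy of the unchanged state
    std::vector<double> g(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        std::vector<double> xp = x;       // reset variables
        xp[i] += eps;                     // change only one variable
        g[i] = (energy(xp) - e0) / eps;   // slope along that variable
    }
    return g;
}

// Toy energy E(x, y) = x^2 + 3y, whose exact gradient is (2x, 3).
double toy_energy(const std::vector<double>& p) {
    return p[0] * p[0] + 3.0 * p[1];
}
```

At the point (2, 5) with `eps = 1e-6`, the estimate comes out very close to the exact gradient (4, 3). The price is one extra energy evaluation per parameter on every iteration, which is part of why a model with many joints gets heavy.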
The energy function came out a bit different in the end:

static double calc_Energy(DATA_FOR_TNC& d, DATA_FOR_TNC& orig_d) {
	double _sum = 0.0;
	//external energy: how close are the joints to the hand blob? (how well do they fit to it)
	vector<Point2d> joints;
	Mat tips(5,1,CV_64FC2);
	for (int j=0; j<5; j++) {
		FINGER_DATA f = d.hand.fingers[j];
		Point2d _newTip = newTip(f,d.hand,joints); //get joints for this finger
		for (int i=0; i<joints.size(); i++) { //for each joint find how far it is from the blob
			//signed distance to the contour: positive inside the blob, negative outside
			double ds = pointPolygonTest(d.contour, joints[i]+getHandOrigin(d.hand), true);
			ds += 5; //tolerate joints up to 5 pixels outside the blob
			ds = 1 * ((ds < 0) ? -1 : 1) * (ds*ds) ; //signed square
			_sum -= (ds > 0) ? 0 : 100*ds; //penalize only joints that are still outside
		}
		tips.at<Point2d>(j,0) = _newTip;
	}
	//laziness of fingers - joints should strive to be as they were in the natural pose
	vector<double> _angles;
//	for (int j=0; j<5; j++) {
//		FINGER_DATA f = d.hand.fingers[j];
//		FINGER_DATA of = orig_d.hand.fingers[j];
////		_angles.push_back(f.a - of.a);
//		for (int i=0; i<f.joints_d.size(); i++) {
////			_angles.push_back(f.joints_a[i] - of.joints_a[i]);
//			_angles.push_back(f.joints_d[i] - of.joints_d[i]);
//		}
//	}
	_angles.push_back(d.hand.a-orig_d.hand.a); //the angle of the hand should be as it was before
	_sum  += 10000*norm(Mat(_angles));
	if(_sum < 0) return 0;
	return _sum;
}

You’ll notice the commented out section. The “laziness of fingers” turned out not to give good results… A different metric is needed! I have not found it yet, maybe you have a good idea?
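For reference, this is my reading of what the uncommented part of calc_Energy adds up to, written as a formula. The symbols are my own shorthand: d_j is the signed distance of joint j from the blob contour as returned by pointPolygonTest (positive inside, negative outside), a is the current hand angle and a_0 the hand angle before the optimization.

```latex
E = \underbrace{100 \sum_{j \in \text{joints}} \bigl[\min(0,\; d_j + 5)\bigr]^2}_{\text{external: joints outside the blob}}
  \;+\; \underbrace{10000\,\lvert a - a_0 \rvert}_{\text{internal: hand rotation}}
```

So a joint costs nothing as long as it is inside the blob or less than 5 pixels outside it, and the cost grows quadratically beyond that. Both terms are non-negative, so the final clamp to zero never kicks in.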
Starting tnc.c is very simple: Allocating the vectors for X and gradients, initializing the model from the blob, and calling the simple_tnc convenience method. simple_tnc starts tnc with some default parameters that don’t affect the outcome (at least in my tries).

void estimateHand(Mat& mymask) {
	double _x[SIZE_OF_HAND_DATA] = {0};
	Mat X(1,SIZE_OF_HAND_DATA,CV_64FC1,_x); //wraps _x, no copy
	double f;
	Mat gradients(Size(SIZE_OF_HAND_DATA,1),CV_64FC1,Scalar(0));
	//d is the DATA_FOR_TNC instance, declared elsewhere (not shown here)
	initialize_hand_data(d, mymask);
	mapDataToVec((double*)X.data, d.hand);
	simple_tnc(SIZE_OF_HAND_DATA, (double*)X.data, &f, (double*)gradients.data, my_f, (void*)&d, 1, 0);
	mapVecToData((double*)X.data, d.hand);
	d.hand.origin = getHandOrigin(d.hand); //move to new position
}

Results and Discussion

Here are my results so far:

It’s not perfect, but it’s a start. Tracking and estimating an open hand is pretty good, even with some orientation change. But when the fingers are closed… that’s where problems start.
Sometimes the joints “hover” over the black area to “land” in a white area so they “fit”, but they should not do that. One easy thing to do to counter this is to measure the distance of the whole bone, and not just the joint.
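I have not implemented this yet, but a rough sketch of the idea could look like the code below: sample a few points along the bone and penalize every sample that falls outside the blob. Everything here is hypothetical (`Vec2`, `bone_penalty`, the toy blob), and `signed_dist` is assumed to follow the same sign convention as pointPolygonTest with measureDist=true, positive inside and negative outside.

```cpp
#include <cassert>
#include <cmath>
#include <functional>

struct Vec2 { double x, y; };  // stand-in for cv::Point2d

// Hypothetical helper: average squared "outsideness" of a bone from a to b,
// sampled at `samples` evenly spaced points including both ends.
double bone_penalty(Vec2 a, Vec2 b, int samples,
                    const std::function<double(Vec2)>& signed_dist) {
    assert(samples >= 2);
    double sum = 0.0;
    for (int k = 0; k < samples; ++k) {
        const double t = static_cast<double>(k) / (samples - 1);
        const Vec2 p{ a.x + t * (b.x - a.x), a.y + t * (b.y - a.y) };
        const double d = signed_dist(p);
        if (d < 0.0) sum += d * d;  // only samples outside the blob are penalized
    }
    return sum / samples;
}

// Toy blob made of two discs of radius 5, centered at (-20, 0) and (20, 0),
// with a black gap between them.
double two_discs_signed_dist(Vec2 p) {
    const double d1 = 5.0 - std::sqrt((p.x + 20.0) * (p.x + 20.0) + p.y * p.y);
    const double d2 = 5.0 - std::sqrt((p.x - 20.0) * (p.x - 20.0) + p.y * p.y);
    return d1 > d2 ? d1 : d2;
}
```

A bone from (-20, 0) to (20, 0) has both of its ends inside white areas, so a joints-only test sees nothing wrong, while `bone_penalty` with 5 samples catches the three samples hovering over the gap. The cost is more distance queries per energy evaluation.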
The model right now doesn’t use all the joints possible, because that is too heavy computationally. Plus the energy does not depend on (or change) the angle of the fingers. So this is a very very simple model of a hand…
But, it is a good start! All the other stuff I have seen online is just basic high-curvature points counting and color-based or feature-based segmentation and tracking… My model actually tries to fit an articulated and precise model of a hand to the image.

How did you get such nice blobs?!

You ask. They are beautiful aren’t they… nice and clean, easy for tracking and model fitting. It’s no magic though…
Well, I took part in a project in the Media Lab, called DepthJS, that uses the MS Kinect to control web pages. I wrote the computer-vision part. So all the code is there, you can grab it, I just plugged it into this little project. It is based on this very simple example of using OpenCV2.X and libfreenect.
Wow, this was a longie.. I hope you learned something and got inspired. I got to do a second overview of the project, and I’m inspired. Inspiration all around!
Code is obviously yours for the taking:
Please contribute your own views, thoughts, code, rants in the comments and on the GitHub page.