A simple object classifier with Bag-of-Words using OpenCV 2.3 [w/ code]

Just wanted to share some code I've been writing.
So I wanted to create a food classifier, for a cool project down in the Media Lab called FoodCam. It's basically a camera that people put free food under, and they can send an email alert to the entire building to come eat (by pushing a huge button marked "Dinner Bell"). Really a cool thing.

OK let's get down to business.

I followed a very simple technique described in this paper. I know, you say, "A paper? Really? I'm not gonna read that boring technical stuff, give me the bottom line, man.. geez." Well, you are right, except that this paper IS the bottom line: it's dead simple. It's almost a tutorial. It is also referenced by the OpenCV documentation.

Edit (6/5/2014): Another great read for selecting the best color-space and invariant features is this paper by van de Sande et al.

The method is simple:
- Extract features of choice from training set that contains all classes.
- Create a vocabulary of features by clustering the features (k-means, etc.). Let's say 1000 words long.
- Train your classifiers (SVMs, Naive-Bayes, boosting, etc.) on the training set again (preferably a different one). This time, match each feature in the image to its closest cluster in the vocabulary, and create a histogram of responses for each image to the words in the vocabulary; it will be a 1000-entry vector. Create a sample-label dataset for the training.
- When you get an image you haven't seen, run the classifier and it should, god willing, give you the right class.
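
To make the histogram step concrete, here is a minimal sketch in plain C++ with no OpenCV involved. The names (Feature, nearestWord, bowHistogram) are made up for illustration, and normalizing by the number of features is an assumption of this sketch, not something taken from the code in this post.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

typedef std::vector<float> Feature;  // hypothetical: one descriptor as a flat vector

// Squared Euclidean distance between two features of equal length.
float squaredDistance(const Feature& a, const Feature& b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Index of the vocabulary word (cluster center) closest to the feature.
std::size_t nearestWord(const Feature& f, const std::vector<Feature>& vocabulary) {
    assert(!vocabulary.empty());
    std::size_t best = 0;
    float bestDist = std::numeric_limits<float>::max();
    for (std::size_t w = 0; w < vocabulary.size(); ++w) {
        const float d = squaredDistance(f, vocabulary[w]);
        if (d < bestDist) { bestDist = d; best = w; }
    }
    return best;
}

// Histogram of word occurrences for one image, one bin per vocabulary word.
// Assumption: bins are divided by the feature count so they sum to 1.
std::vector<float> bowHistogram(const std::vector<Feature>& imageFeatures,
                                const std::vector<Feature>& vocabulary) {
    std::vector<float> hist(vocabulary.size(), 0.0f);
    for (std::size_t i = 0; i < imageFeatures.size(); ++i)
        hist[nearestWord(imageFeatures[i], vocabulary)] += 1.0f;
    if (!imageFeatures.empty())
        for (std::size_t w = 0; w < hist.size(); ++w)
            hist[w] /= static_cast<float>(imageFeatures.size());
    return hist;
}
```

With a 1000-word vocabulary, bowHistogram returns the 1000-entry vector mentioned above, and that vector is what the classifier gets to see instead of the raw features.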

Turns out those crafty guys at Willow Garage have done pretty much all the heavy lifting, so it's up to us to pick the fruit of their hard work. OpenCV 2.3 comes packed with a set of classes, whose names start with BOW for Bag Of Words, that help a lot with implementing this method.

Starting with the first step:

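// computing descriptors
Ptr<DescriptorExtractor> extractor(
   new OpponentColorDescriptorExtractor(
      Ptr<DescriptorExtractor>(new SurfDescriptorExtractor())
   )
);
SurfFeatureDetector detector(400); //Hessian threshold
vector<KeyPoint> keypoints;
Mat descriptors;

//start with zero rows; each image appends its descriptors below
Mat training_descriptors(0,extractor->descriptorSize(),extractor->descriptorType());

while(..loop a directory? a file?..) {
   Mat img = imread(filepath);
   detector.detect(img, keypoints);
   extractor->compute(img, keypoints, descriptors);
   training_descriptors.push_back(descriptors); //collect features from all images
}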

Let's go create a vocabulary then. Luckily, OpenCV has taken care of that and provides a simple API:

BOWKMeansTrainer bowtrainer(1000); //num clusters
bowtrainer.add(training_descriptors); //the descriptors collected in the first step
Mat vocabulary = bowtrainer.cluster();

Boom. Vocabulary.
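
If you're curious what "clustering" does to the features, here is a rough sketch of plain k-means (Lloyd's algorithm) in standard C++. It is not how OpenCV implements it; the function name and the idea of passing in the starting centers by hand are assumptions made to keep the example short and deterministic.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

typedef std::vector<float> Point;  // hypothetical: one descriptor as a flat vector

// Squared Euclidean distance between two points of equal dimension.
static float dist2(const Point& a, const Point& b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t d = 0; d < a.size(); ++d) {
        const float diff = a[d] - b[d];
        sum += diff * diff;
    }
    return sum;
}

// Runs a fixed number of k-means iterations from the given starting centers
// and returns the final centers (the "vocabulary").
std::vector<Point> kmeans(const std::vector<Point>& data,
                          std::vector<Point> centers, int iterations) {
    assert(!data.empty() && !centers.empty());
    const std::size_t dims = data[0].size();
    for (int it = 0; it < iterations; ++it) {
        std::vector<Point> sums(centers.size(), Point(dims, 0.0f));
        std::vector<int> counts(centers.size(), 0);
        // Assignment step: each point goes to its nearest center.
        for (std::size_t i = 0; i < data.size(); ++i) {
            std::size_t best = 0;
            float bestDist = std::numeric_limits<float>::max();
            for (std::size_t c = 0; c < centers.size(); ++c) {
                const float d = dist2(data[i], centers[c]);
                if (d < bestDist) { bestDist = d; best = c; }
            }
            for (std::size_t d = 0; d < dims; ++d) sums[best][d] += data[i][d];
            ++counts[best];
        }
        // Update step: each center moves to the mean of its points.
        // Centers that attracted no points are left where they are.
        for (std::size_t c = 0; c < centers.size(); ++c) {
            if (counts[c] == 0) continue;
            for (std::size_t d = 0; d < dims; ++d)
                centers[c][d] = sums[c][d] / counts[c];
        }
    }
    return centers;
}
```

Each returned center plays the role of one "word" in the vocabulary.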
Now, let's train us some SVM classifiers!
We're gonna train a 2-class SVM, in a 1-vs-all kind of way. Meaning we train an SVM that can say "yes" or "no" when choosing between one class and the rest of the classes, hence 1-vs-all.
But first, we need to scour the training set for our histograms (the responses to the vocabulary, remember?):

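map<string,Mat> classes_training_data; //class name -> one histogram per row
vector<string> classes_names;

Ptr<FeatureDetector > detector(new SurfFeatureDetector());
Ptr<DescriptorMatcher > matcher(new BruteForceMatcher<L2<float> >());
Ptr<DescriptorExtractor > extractor(new OpponentColorDescriptorExtractor(Ptr<DescriptorExtractor>(new SurfDescriptorExtractor())));
Ptr<BOWImgDescriptorExtractor> bowide(new BOWImgDescriptorExtractor(extractor,matcher));
bowide->setVocabulary(vocabulary); //the vocabulary we just clustered

#pragma omp parallel for schedule(dynamic,3)
for(..loop a directory?..) {
   //filepath and class_ are set by the elided loop;
   //the rest is declared per iteration so threads don't share it
   vector<KeyPoint> keypoints;
   Mat response_hist;
   Mat img = imread(filepath);
   detector->detect(img, keypoints);
   bowide->compute(img, keypoints, response_hist);

   #pragma omp critical
   {
      if(classes_training_data.count(class_) == 0) { //not yet created...
         classes_training_data[class_].create(0,response_hist.cols,response_hist.type());
         classes_names.push_back(class_);
      }
      classes_training_data[class_].push_back(response_hist);
   }
}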

Now, two things:
First, notice I'm keeping the training data for each class separately. We will need this later for creating the 1-vs-all samples-labels matrices.
Second, I use OpenMP multi-threading to run the calculation in parallel, and hence faster, on multi-core machines (like the one I used). It cuts the running time by a whole lot. OpenMP is a gem, use it more. Just a couple of #pragma directives and you're multi-threading.
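
Here is the same pattern boiled down to a self-contained sketch: the expensive per-image work runs in parallel, and only the write to the shared map sits inside a critical section. The helper fakeHistogram and the function groupByClass are hypothetical stand-ins, not part of the FoodCam code. With GCC you need the -fopenmp flag to actually get threads; without it the pragmas are ignored and the loop simply runs serially.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

typedef std::map<std::string, std::vector<std::vector<float> > > PerClassData;

// Hypothetical stand-in for the expensive per-image work
// (in the real code this is where the BoW histogram is computed).
static std::vector<float> fakeHistogram(int imageIndex) {
    return std::vector<float>(3, static_cast<float>(imageIndex));
}

// Groups one histogram per image under its class label.
PerClassData groupByClass(const std::vector<std::string>& labels) {
    PerClassData perClass;
    const int n = static_cast<int>(labels.size());
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n; ++i) {
        // Runs in parallel: every thread works on its own local histogram.
        std::vector<float> hist = fakeHistogram(i);

        // Only one thread at a time may touch the shared map.
        #pragma omp critical
        {
            perClass[labels[i]].push_back(hist);
        }
    }
    return perClass;
}
```

Note that the order of histograms within a class depends on thread scheduling, which doesn't matter for training.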

Alright, data gotten, let's get training:

#pragma omp parallel for schedule(dynamic)
for (int i=0;i<classes_names.size();i++) {
   string class_ = classes_names[i];
   cout << omp_get_thread_num() << " training class: " << class_ << ".." << endl;

   //response_cols and response_type: columns and type of the response histograms
   Mat samples(0,response_cols,response_type);
   Mat labels(0,1,CV_32FC1);

   //copy class samples and label
   cout << "adding " << classes_training_data[class_].rows << " positive" << endl;
   samples.push_back(classes_training_data[class_]);
   Mat class_label = Mat::ones(classes_training_data[class_].rows, 1, CV_32FC1);
   labels.push_back(class_label);

   //copy rest samples and label
   for (map<string,Mat>::iterator it1 = classes_training_data.begin(); it1 != classes_training_data.end(); ++it1) {
      string not_class_ = (*it1).first;
      if(not_class_.compare(class_)==0) continue; //skip class itself
      samples.push_back(classes_training_data[not_class_]);
      class_label = Mat::zeros(classes_training_data[not_class_].rows, 1, CV_32FC1);
      labels.push_back(class_label);
   }

   cout << "Train.." << endl;
   Mat samples_32f; samples.convertTo(samples_32f, CV_32F);
   if(samples.rows == 0) continue; //phantom class?!
   CvSVM classifier;
   classifier.train(samples_32f,labels);

   //do something with the classifier, like saving it to file
}

Again, I parallelize, although the process is not too slow.
Note how I build the samples and the labels, where each time I put in the positive samples and mark the labels '1', and then I put the rest of the samples and label them '0'.
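
Stripped of OpenCV, the 1-vs-all bookkeeping looks roughly like this. The TrainingSet struct and the oneVsAll function are names invented for this sketch; the point is only the ordering: positives first with label 1, then everyone else with label 0.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

typedef std::vector<float> Histogram;
typedef std::map<std::string, std::vector<Histogram> > PerClassData;

// Hypothetical container: samples[i] is labeled by labels[i].
struct TrainingSet {
    std::vector<Histogram> samples;
    std::vector<float> labels;  // 1 = the positive class, 0 = all other classes
};

// Builds the sample/label set for one binary "positive vs. the rest" classifier.
TrainingSet oneVsAll(const PerClassData& perClass, const std::string& positive) {
    TrainingSet out;
    PerClassData::const_iterator pos = perClass.find(positive);
    assert(pos != perClass.end());

    // Positive samples first, labeled 1.
    out.samples.insert(out.samples.end(), pos->second.begin(), pos->second.end());
    out.labels.insert(out.labels.end(), pos->second.size(), 1.0f);

    // Then the samples of every other class, labeled 0.
    for (PerClassData::const_iterator it = perClass.begin(); it != perClass.end(); ++it) {
        if (it->first == positive) continue;  // skip the class itself
        out.samples.insert(out.samples.end(), it->second.begin(), it->second.end());
        out.labels.insert(out.labels.end(), it->second.size(), 0.0f);
    }
    return out;
}
```

Calling oneVsAll once per class name gives you one training set per SVM.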

Moving on to .... testing the classifiers!
Nothing seems to me like more fun than creating a confusion matrix! Not really, but let's see how it's done:

map<string,map<string,int> > confusion_matrix; // confusion_matrix[classA][classB] = number_of_times_A_voted_for_B;
map<string,CvSVM> classes_classifiers; //This we created earlier

vector<string> files; //load up with images
vector<string> classes; //load up with the respective classes

for(..loop over a directory?..) {
   Mat img = imread(files[i]), response_hist;

   vector<KeyPoint> keypoints;
   detector->detect(img, keypoints);
   bowide->compute(img, keypoints, response_hist);

   //ask every classifier; the lowest decision value is treated as the best match
   float minf = FLT_MAX; string minclass;
   for (map<string,CvSVM>::iterator it = classes_classifiers.begin(); it != classes_classifiers.end(); ++it) {
      float res = (*it).second.predict(response_hist,true);
      if (res < minf) {
         minf = res;
         minclass = (*it).first;
      }
   }
   confusion_matrix[minclass][classes[i]]++; //predicted class, then true class
}

When you take a look in my files, you will find a much more complicated way of doing this. But this is the core idea: look in the image for the response histogram to the vocabulary of features (rather, feature-cluster centers), run it by all the classifiers and take the one with the best score. Simple.
Consider making this parallel as well. No reason for it to be serial.
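
For reference, here is the voting and the confusion-matrix counting as a small standalone sketch. The function names (bestClass, record, accuracy) are invented for illustration, and "lowest score wins" is an assumption carried over from the loop above, not a general rule for SVM outputs.

```cpp
#include <cassert>
#include <limits>
#include <map>
#include <string>

typedef std::map<std::string, std::map<std::string, int> > ConfusionMatrix;

// Picks the class whose classifier gave the lowest score
// (assumption: lower means a better match, as in the loop above).
std::string bestClass(const std::map<std::string, float>& scores) {
    assert(!scores.empty());
    std::string best;
    float minScore = std::numeric_limits<float>::max();
    for (std::map<std::string, float>::const_iterator it = scores.begin();
         it != scores.end(); ++it) {
        if (it->second < minScore) {
            minScore = it->second;
            best = it->first;
        }
    }
    return best;
}

// Counts one test image: first key is the predicted class, second the true class.
void record(ConfusionMatrix& cm, const std::string& predicted,
            const std::string& actual) {
    ++cm[predicted][actual];
}

// Fraction of recorded images where the predicted class equals the true class.
float accuracy(const ConfusionMatrix& cm) {
    int total = 0, correct = 0;
    for (ConfusionMatrix::const_iterator row = cm.begin(); row != cm.end(); ++row) {
        for (std::map<std::string, int>::const_iterator col = row->second.begin();
             col != row->second.end(); ++col) {
            total += col->second;
            if (row->first == col->first) correct += col->second;
        }
    }
    return total == 0 ? 0.0f : static_cast<float>(correct) / total;
}
```

A big count off the diagonal, say under [pizza][salad], tells you which two classes the classifiers keep mixing up.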

That about covers it.


Lately I've been pushing stuff to GitHub using git rather than SVN on Google Code. Dunno why, it's just like that.
Get the whole thing at:
https://github.com/royshil/FoodcamClassifier

Follow the build instructions, they're a breeze, and then follow the running instructions. It's basically a series of command-line programs you run to get through each step, and in the end you have a "predictor" service that takes an image and produces a prediction.

Edit (6/5/2014): The dataset can be downloaded from: http://www.media.mit.edu/~roys/shared/foodcamimages.zip

OK guys, have fun classifying stuff!