
Jan 25

10 lines-of-code OCR HTTP service with Python, Tesseract and Tornado

Hi

I believe that every builder-hacker should have their own little Swiss-army-knife server that does everything they need, but as a webservice. You can do basically anything as a service nowadays: image/audio/video manipulation, mock-cloud data storage, offloading heavy computation, and so on.
Tornado, the lightweight Python webserver, is perfect for this, and since so many projects these days have Python bindings (see python-tesseract), it should be a breeze to integrate them with minimal work.
Let's see how it's done.

Putting it together

I owe the simplicity of this work to the simplicity of Tornado's API. Really clean, just a couple of entry points to write code.
Since the code is extremely short, I'll just pour it in and go over it:


import tornado.httpserver
import tornado.ioloop
import tornado.web
import pprint
import Image
from tesseract import image_to_string
import StringIO

class MainHandler(tornado.web.RequestHandler):
    def get(self):
        self.write('<html><body>Send us a file!<br/><form enctype="multipart/form-data" action="/" method="post">'
                   '<input type="file" name="the_file">'
                   '<input type="submit" value="Submit">'
                   '</form></body></html>')

    def post(self):
        self.set_header("Content-Type", "text/plain")
        # echo the name of the first uploaded file (Python 2: items() returns a list)
        self.write("You sent a file with name " + self.request.files.items()[0][1][0]['filename'] + "\n")
        # make a "memory file" using StringIO, open with PIL and send to tesseract for OCR
        self.write(image_to_string(Image.open(StringIO.StringIO(self.request.files.items()[0][1][0]['body']))))

application = tornado.web.Application([
    (r"/", MainHandler),
])

if __name__ == "__main__":
    http_server = tornado.httpserver.HTTPServer(application)
    http_server.listen(8888)
    tornado.ioloop.IOLoop.instance().start()

That's it, and most of it is just garnish. The final version also displays the image on screen.

In the main block, Tornado is set up to listen on port 8888, and the application configuration tells it to answer requests on the root ("/") with our special handler: MainHandler. MainHandler, in turn, takes care of incoming GET and POST requests. All of this was taken from the "Hello World" example in Tornado's documentation.

The service answers POST requests that carry an image file and routes that file to be processed. All attached files are in self.request.files, so I just pick the first one.
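To make that chain of indexes less cryptic: Tornado's self.request.files is a dictionary keyed by form field name, and each value is a list of uploaded files exposing 'filename', 'content_type' and 'body'. Here is a minimal sketch of a helper that picks the first file. The name first_uploaded_file is made up for illustration, and the plain dict below only stands in for what Tornado would hand the handler.

```python
def first_uploaded_file(files):
    """Return the first uploaded file from a Tornado-style files mapping.

    `files` maps form field names to lists of file entries; each entry
    supports dict-style access to 'filename', 'content_type' and 'body'.
    Returns None when nothing was uploaded.
    """
    for uploads in files.values():
        if uploads:
            return uploads[0]
    return None


# Stand-in for self.request.files after submitting the upload form
# (hypothetical data, not a real request).
fake_files = {
    "the_file": [
        {
            "filename": "scan.png",
            "content_type": "image/png",
            "body": b"\x89PNG\r\n\x1a\n",
        },
    ]
}

upload = first_uploaded_file(fake_files)
print(upload["filename"])
```

Returning None for an empty upload lets the handler answer with an error message instead of failing on an index lookup.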

Now Tesseract, as you probably already know, is an open-source OCR engine that was originally built by HP and later picked up by Google. It is a good choice because it is free and comes with a set of languages already trained.
But I needed a Python binding for it, and did not feel like writing one of my own. So I googled and found this small, humble project: python-tesseract. It has a very narrow API: just one function, which basically calls the tesseract command line. But it works like a charm.
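For a feel of what "calling the tesseract command line" involves, here is a rough Python 3 sketch of the general approach. It is not python-tesseract's actual source, and the names build_tesseract_command and ocr_file are hypothetical. It relies only on the standard command-line form `tesseract <image> <output_base> [-l <lang>]`, which writes the recognized text to `<output_base>.txt`, and it assumes the executable is reachable as `tesseract`.

```python
import os
import subprocess
import tempfile


def build_tesseract_command(image_path, output_base, lang=None, tesseract_cmd="tesseract"):
    """Build the argument list for: tesseract <image> <output_base> [-l <lang>]."""
    command = [tesseract_cmd, image_path, output_base]
    if lang is not None:
        command += ["-l", lang]
    return command


def ocr_file(image_path, lang=None):
    """Run the tesseract executable on an image file and return the text.

    Tesseract writes its result to <output_base>.txt, so the output is
    directed into a temporary directory and read back from there.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        output_base = os.path.join(tmp_dir, "result")
        subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
        with open(output_base + ".txt", encoding="utf-8") as result_file:
            return result_file.read()


# Only builds the command; nothing is executed here.
print(build_tesseract_command("scan.png", "result", lang="eng"))
```

Keeping the command construction in its own function makes it easy to inspect what would be run without having Tesseract installed.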

So all I needed to do was take the file off the POST request, wrap a StringIO around it so it looks like a file, open it with PIL's Image.open, and send it to python-tesseract to get a string back. Then I just write the string to the HTTP response.

	self.write(image_to_string(Image.open(StringIO.StringIO(self.request.files.items()[0][1][0]['body']))))
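The same one-liner, unpacked into steps. This is a sketch for Python 3, where the uploaded body is bytes and the in-memory file is io.BytesIO rather than StringIO.StringIO. The helper name ocr_image_bytes is hypothetical, and the image opener and OCR function are passed in as arguments so the snippet does not depend on which imaging or OCR package is installed.

```python
import io


def ocr_image_bytes(body, open_image, image_to_string):
    """Run OCR on raw image bytes without touching the disk.

    body            -- the uploaded file contents (bytes)
    open_image      -- callable taking a file-like object, e.g. PIL's Image.open
    image_to_string -- callable turning the opened image into text
    """
    memory_file = io.BytesIO(body)   # file-like wrapper around the bytes
    image = open_image(memory_file)  # decode the image from the "memory file"
    return image_to_string(image)    # hand the image to the OCR function
```

Inside the handler, the call would then look like `ocr_image_bytes(body, Image.open, image_to_string)`, with `body` taken from the uploaded file.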

To get it to actually run you must:

  • set up the $PYTHONPATH variable so Python can find both python-tesseract and Tornado,
  • change $TESSDATA_PREFIX to where you put your training data for Tesseract,
  • change the path to the tesseract executable in the first code line of python-tesseract's tesseract.py.
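A small preflight check can catch a missing piece before the server starts. The sketch below is an assumption-laden convenience, not part of the original code: check_ocr_setup is a made-up name, and it looks for an executable called `tesseract` on your PATH, which the setup above does not strictly require since python-tesseract is given an explicit path.

```python
import os
import shutil


def check_ocr_setup(environ=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means all good."""
    environ = os.environ if environ is None else environ
    problems = []
    # Assumption: the executable is named "tesseract" and lives on PATH.
    if which("tesseract") is None:
        problems.append("tesseract executable not found on PATH")
    tessdata = environ.get("TESSDATA_PREFIX")
    if not tessdata:
        problems.append("TESSDATA_PREFIX is not set")
    elif not os.path.isdir(tessdata):
        problems.append("TESSDATA_PREFIX does not point to a directory: " + tessdata)
    return problems


if __name__ == "__main__":
    for problem in check_ocr_setup():
        print("WARNING:", problem)
```

Running it prints one warning per problem found and nothing at all when the environment looks sane.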

Now all you need is to start your server, send image requests to it, and you'll get back the text in the images.
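If you would rather send images from a script than from the browser form, here is a minimal Python 3 client sketch that uses only the standard library. It assumes the server is running on localhost:8888 and reuses the form's field name, the_file. The function names encode_multipart and ocr_via_service are made up for this example.

```python
import os
import urllib.request
import uuid


def encode_multipart(field_name, filename, data, boundary=None):
    """Encode one file as a multipart/form-data body.

    Returns a (content_type_header, body_bytes) pair.
    """
    boundary = boundary or uuid.uuid4().hex
    head = (
        "--{b}\r\n"
        'Content-Disposition: form-data; name="{n}"; filename="{f}"\r\n'
        "Content-Type: application/octet-stream\r\n"
        "\r\n"
    ).format(b=boundary, n=field_name, f=filename).encode("utf-8")
    tail = "\r\n--{b}--\r\n".format(b=boundary).encode("utf-8")
    return "multipart/form-data; boundary=" + boundary, head + data + tail


def ocr_via_service(image_path, url="http://localhost:8888/"):
    """POST an image file to the OCR service and return the response text."""
    with open(image_path, "rb") as image_file:
        content_type, body = encode_multipart(
            "the_file", os.path.basename(image_path), image_file.read()
        )
    # Supplying data makes urllib send a POST request.
    request = urllib.request.Request(url, data=body, headers={"Content-Type": content_type})
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With the server up, `print(ocr_via_service("scan.png"))` would print whatever the service writes back for that image; scan.png is just a placeholder file name.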

Code

Grab it off the SVN:
svn checkout http://morethantechnical.googlecode.com/svn/trunk/tesserver/ tesserver

Enjoy,
Roy.

  • http://www.openocr.net Traun Leyden

    I created something very similar, except it's written in Go, runs in Docker containers, and is designed to make it easy to spawn multiple worker processes on a cluster of machines.

    It's open source under the Apache 2 license: http://www.openocr.net