Creating a searchable PDF with open source tools: Ghostscript, hocr2pdf and tesseract-ocr

I bet creating searchable PDFs has been done many times over; even so, I'd like to share the way I did it recently with strictly open source tools. The pipeline is simple: Ghostscript (gs) splits the PDF into one image per page, Tesseract OCR extracts the text from each image, hocr2pdf combines each image with its text into a searchable single-page PDF, and gs then merges the pages back into one PDF. If you're creating a PDF from scanned books, the unpaper project may also help.
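Since the pipeline depends on three external commands, it can help to check for them before starting. The following is only a sketch: the function name `missing_tools` is made up for illustration, and it assumes the tools are installed under the command names gs, tesseract and hocr2pdf.

```shell
#!/bin/sh

# Hypothetical helper: print every command from the argument list that is
# not found on PATH, and return non-zero if at least one is missing.
missing_tools() {
  rc=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      rc=1
    fi
  done
  return $rc
}

# Report (on stderr) which of the three pipeline tools are absent.
if ! missing=$(missing_tools gs tesseract hocr2pdf); then
  echo "Missing tools:" $missing >&2
fi
```

In a real script you would probably want to stop right after the report instead of continuing.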

Edit 5/21/2014: I've had a good experience using Scantailor, which is available on Homebrew for the Mac. I've also submitted hocr2pdf to Homebrew as part of the exact-image library (the formula is named "exact-image").

A script

Please excuse the Bash, but DOS or other types of scripts should work similarly.

#!/bin/sh

# bash tut: http://linuxconfig.org/bash-scripting-tutorial
# Linux PDF,OCR: http://blog.konradvoelkel.de/2013/03/scan-to-pdfa/

y="$(pwd)/$1"
echo "Will create a searchable PDF for $y"

x=$(basename "$y")
name=${x%.*}

# work in a scratch directory named after the input file
mkdir "$name"
cd "$name" || exit 1

# splitting to individual pages
gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=jpeg -r300 -dTextAlphaBits=4 -o out_%04d.jpg -f "$y"

# process each page
for f in *.jpg; do
  base=${f%.*}

  # extract text as hOCR (older Tesseract writes $base.html; newer releases
  # spell the option --psm and write $base.hocr instead)
  tesseract -l eng -psm 3 "$f" "$base" hocr

  # remove the "<?xml" line, it disturbed hocr2pdf
  grep -v "<?xml" "$base.html" > "$base.noxml"
  rm "$base.html"

  # create a searchable page
  hocr2pdf -i "$f" -s -o "$base.pdf" < "$base.noxml"
  rm "$base.noxml"
  rm "$f"
done

# combine all pages back to a single file
# from http://www.ehow.com/how_6874571_merge-pdf-files-ghostscript.html
gs -dCompatibilityLevel=1.4 -dNOPAUSE -dQUIET -dBATCH -sDEVICE=pdfwrite -sOutputFile="../${name}_searchable.pdf" *.pdf

cd ..
rm -rf "$name"

Usage is quite simple:

./make_searchable.sh my_non_searchable.pdf
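
To convert several files in one go, the script can be wrapped in a loop. This is a rough sketch rather than part of the original script: the helper names `searchable_name` and `make_all_searchable` are invented here, and it assumes make_searchable.sh sits in the current directory next to the PDFs.

```shell
#!/bin/sh

# Hypothetical helper: the output file name the script produces for an input,
# i.e. the input's base name without extension plus "_searchable.pdf".
searchable_name() {
  base=$(basename "$1")
  echo "${base%.*}_searchable.pdf"
}

# Hypothetical helper: run the script on every PDF in the current directory,
# skipping earlier results and files that were already converted.
make_all_searchable() {
  for pdf in *.pdf; do
    [ -e "$pdf" ] || continue                          # no PDFs: glob stayed literal
    case "$pdf" in *_searchable.pdf) continue ;; esac  # output of an earlier run
    [ -e "$(searchable_name "$pdf")" ] && continue     # already converted
    ./make_searchable.sh "$pdf"
  done
}
```

Calling `make_all_searchable` from the directory that holds the PDFs then processes them one after another.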