
Creating a searchable PDF with opensource tools ghostscript, hocr2pdf and tesseract-ocr

How to create searchable PDFs with open source tools: Ghostscript, tesseract-ocr and hocr2pdf

I bet creating searchable PDFs has been done many times over. Even so, I’d like to share the way I did it recently with strictly open source tools. The pipeline is simple: Ghostscript (gs) splits the PDF into one image per page, tesseract-ocr extracts the text from each image, hocr2pdf combines each page image with its text into a searchable single-page PDF, and Ghostscript then merges those pages back into one PDF. If you’re creating a PDF from scanned books, this project may also be of help: unpaper
Edit 5/21/2014: I’ve had a good experience with Scantailor, which is available through Homebrew on the Mac. I’ve also submitted hocr2pdf to Homebrew as part of the exact-image library (the formula is named “exact-image”).
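
Since the pipeline depends on three separate command-line tools, it can save some head-scratching to check that they are all installed first. This is only a minimal sketch: the function name missing_tools is made up for illustration, and it assumes the tools are installed under the command names gs, tesseract and hocr2pdf.

```shell
#!/bin/sh
# Print the name of every listed tool that is not found on PATH.
# (missing_tools is a hypothetical helper, not part of any of the tools.)
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# gs, tesseract and hocr2pdf are the three commands the pipeline calls
missing=$(missing_tools gs tesseract hocr2pdf)
if [ -n "$missing" ]; then
  echo "Not installed or not on PATH:" $missing
fi
```

If nothing is printed, all three commands were found.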

A script

Please excuse the shell script; the same steps should translate to a DOS batch file or any other scripting language.

#!/bin/sh
# bash tut: http://linuxconfig.org/bash-scripting-tutorial
# Linux PDF,OCR: http://blog.konradvoelkel.de/2013/03/scan-to-pdfa/
# resolve the input to an absolute path (accepts relative and absolute arguments)
case "$1" in
  /*) y="$1" ;;
  *)  y="$(pwd)/$1" ;;
esac
echo "Will create a searchable PDF for $y"
x=$(basename "$y")
name=${x%.*}
# work inside a temporary directory named after the input file
mkdir "$name"
cd "$name" || exit 1
# splitting to individual pages (-o already implies -dBATCH -dNOPAUSE)
gs -dSAFER -sDEVICE=jpeg -r300 -dTextAlphaBits=4 -o out_%04d.jpg -f "$y"
# process each page
for f in *.jpg; do
  base=${f%.*}
  # extract text as hOCR; tesseract 4 and later spell this option --psm
  tesseract -l eng -psm 3 "$f" "$base" hocr
  # older tesseract versions write $base.html, 3.03 and later write $base.hocr
  [ -f "$base.hocr" ] && mv "$base.hocr" "$base.html"
  # remove the "<?xml" line, it disturbed hocr2pdf
  grep -v "<?xml" "$base.html" > "$base.noxml"
  rm "$base.html"
  # create a searchable page
  hocr2pdf -i "$f" -s -o "$base.pdf" < "$base.noxml"
  rm "$base.noxml"
  rm "$f"
done
# combine all pages back to a single file
# from http://www.ehow.com/how_6874571_merge-pdf-files-ghostscript.html
gs -dCompatibilityLevel=1.4 -dNOPAUSE -dBATCH -q -sDEVICE=pdfwrite -sOutputFile="../${name}_searchable.pdf" *.pdf
cd ..
rm -rf "$name"

Usage is quite simple:

./make_searchable.sh my_non_searchable.pdf
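
To process a whole folder of PDFs, a small wrapper could call the script once per file. The following is a rough sketch, not something I have run against a real scanner workflow: process_folder and the MAKE_SEARCHABLE variable are names invented here, and it assumes make_searchable.sh is on your PATH (or that MAKE_SEARCHABLE holds its absolute path). It changes into the folder first and passes bare file names, so the results land in that same folder.

```shell
#!/bin/sh
# Hypothetical wrapper: run the script above on every PDF in one folder.
# MAKE_SEARCHABLE is an assumed variable; use an absolute path or a command on PATH.
MAKE_SEARCHABLE=${MAKE_SEARCHABLE:-make_searchable.sh}

process_folder() {
  (
    # the subshell keeps the caller's working directory unchanged
    cd "$1" || exit 1
    for pdf in *.pdf; do
      # skip when the glob matched nothing
      [ -e "$pdf" ] || continue
      # skip files that are already the output of an earlier run
      case "$pdf" in
        *_searchable.pdf) continue ;;
      esac
      "$MAKE_SEARCHABLE" "$pdf"
    done
  )
}

# run only when a folder was given on the command line
if [ $# -gt 0 ]; then
  process_folder "$1"
fi
```

Saved as, say, make_folder_searchable.sh (again a made-up name), it would be called with the folder as its only argument: ./make_folder_searchable.sh scans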

6 replies on “Creating a searchable PDF with opensource tools ghostscript, hocr2pdf and tesseract-ocr”

Hello, is there a way to point the script at a folder containing many “my_non_searchable.pdf” files, so that every PDF ends up searchable?
Document -> Scanner -> Dropbox -> Raspberry -> NAS folder (non_searchable) -> Raspberry 2 OCR -> NAS folder (searchable).

Hello Roy, I tried that changing the language to por (portuguese), but it doesn’t work…. it takes a long time, creates the new PDF, but there is no “text” inside it.
Even if I process it in english…. have you had some issue like that?
Tks!
