Nov 21

Creating a searchable PDF with open source tools: Ghostscript, hocr2pdf and tesseract-ocr

Creating searchable PDFs has surely been done many times over, but I'd like to share how I did it recently using strictly open source tools. The pipeline is simple: Ghostscript (gs) splits the PDF into one image per page, Tesseract OCR extracts the text, hocr2pdf combines each page image with its recognized text into a searchable PDF page, and Ghostscript bundles the pages back into a single PDF. If you're creating a PDF from scanned books, the unpaper project may also help.

Edit 5/21/2014: I've had a good experience using Scantailor, which is available through Homebrew on the Mac. I've also submitted hocr2pdf to Homebrew as part of the exact-image library (the formula is named "exact-image").
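
Before running the pipeline it can help to confirm that the three tools are actually on the PATH. The snippet below is only a sketch: `missing_tools` is a hypothetical helper name, and it assumes the binaries are installed under their usual names `gs`, `tesseract` and `hocr2pdf`.

```shell
#!/bin/sh
# Print the name of every given tool that cannot be found on PATH,
# one per line. Prints nothing when all tools are available.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Assumed binary names for Ghostscript, Tesseract and ExactImage's hocr2pdf.
missing=$(missing_tools gs tesseract hocr2pdf)
if [ -n "$missing" ]; then
  # $missing is left unquoted on purpose so the names print on one line.
  echo "Missing tools:" $missing >&2
fi
```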

A script

Please excuse the shell script; a DOS batch file or any other scripting language should work similarly.

#!/bin/sh

# bash tut: http://linuxconfig.org/bash-scripting-tutorial
# Linux PDF,OCR: http://blog.konradvoelkel.de/2013/03/scan-to-pdfa/

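# resolve the input to an absolute path (it may already be absolute)
case "$1" in
  /*) y="$1" ;;
  *)  y="`pwd`/$1" ;;
esac
echo "Will create a searchable PDF for $y"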

x=`basename "$y"`
name=${x%.*}

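# work in a fresh scratch directory; stop if it cannot be created or entered,
# since files in the working directory are deleted later on
mkdir "$name" || exit 1
cd "$name" || exit 1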

# splitting to individual pages
gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=jpeg -r300 -dTextAlphaBits=4 -o out_%04d.jpg -f "$y"

# process each page
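for f in *.jpg; do
  # extract text as hOCR (newer Tesseract releases spell the flag --psm)
  tesseract -l eng -psm 3 "$f" "${f%.*}" hocr

  # remove the "<?xml" line, it disturbed hocr2pdf
  # (newer Tesseract releases write .hocr instead of .html; adjust the name if so)
  grep -v "<?xml" "${f%.*}.html" > "${f%.*}.noxml"
  rm "${f%.*}.html"

  # create a searchable page
  hocr2pdf -i "$f" -s -o "${f%.*}.pdf" < "${f%.*}.noxml"
  rm "${f%.*}.noxml"
  rm "$f"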
done

# combine all pages back to a single file
# from http://www.ehow.com/how_6874571_merge-pdf-files-ghostscript.html
gs -dCompatibilityLevel=1.4 -dNOPAUSE -dQUIET -dBATCH -sDEVICE=pdfwrite -sOutputFile="../${name}_searchable.pdf" *.pdf

cd ..
rm -rf "$name"

Usage is quite simple:

./make_searchable.sh my_non_searchable.pdf
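
To process a whole folder of PDFs, a small wrapper along the following lines may be enough. This is a sketch under assumptions: `convert_folder` is a hypothetical helper name, the converter is passed as an absolute path, and the wrapper changes into the folder first so that the script above sees plain file names and writes its `_searchable.pdf` results next to the originals.

```shell
#!/bin/sh
# Run a converter command on every PDF in a folder.
#   $1: converter command (e.g. the absolute path to make_searchable.sh)
#   $2: folder containing the PDFs
convert_folder() {
  converter=$1
  dir=$2
  (
    # The subshell keeps the caller's working directory unchanged.
    cd "$dir" || exit 1
    for pdf in *.pdf; do
      [ -e "$pdf" ] || continue          # the glob matched nothing
      case "$pdf" in
        *_searchable.pdf) continue ;;    # output of an earlier run, skip it
      esac
      "$converter" "$pdf"
    done
  )
}

# Example call (the folder path is a placeholder):
#   convert_folder "$(pwd)/make_searchable.sh" /path/to/non_searchable
```

Skipping files that already end in `_searchable.pdf` lets the wrapper be re-run on the same folder without converting its own output again.
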
  • Sascha

    Hello, is there a way to select a folder with many non-searchable PDFs, so that every PDF becomes searchable at the end?

    Document -> Scanner -> Dropbox -> Raspberry -> NAS folder (non_searchable) -> Raspberry 2 OCR -> NAS folder (searchable).

  • Pingback: OCR on PDFs in OS X with free, open source tools - FAQs System()

  • Henrique Gomide (http://dicasempesquisa.tumblr.com)

    Hey Roy,

    Thank you so much! With this script I was able to help a friend of mine who studies the history of psychology!

  • Martin Wildam (http://it-tactics.blogspot.com)

    Hi, I tried this several times with different combinations of parameters, but I always get results where the hidden text does not line up with the image.

  • Lucas

    Hello Roy, I tried that after changing the language to por (Portuguese), but it doesn't work: it takes a long time and creates the new PDF, but there is no "text" inside it.
    The same happens even if I process it in English. Have you had an issue like that?

    Thanks!

  • Roy (http://www.morethantechnical.com)

    @Lucas
    You should make sure Tesseract has the Portuguese language files from here
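
One way to check whether the language data is installed is to ask Tesseract for its language list. This is a minimal sketch: `lang_listed` is a hypothetical helper name, and it assumes `tesseract --list-langs` prints one language code per line (stderr is merged in because some releases print the list there).

```shell
#!/bin/sh
# Read a language listing on stdin (one code per line) and succeed only
# if the requested code appears as a whole line.
lang_listed() {
  grep -qx -- "$1"
}

# Only query Tesseract when it is actually installed.
if command -v tesseract >/dev/null 2>&1; then
  if tesseract --list-langs 2>&1 | lang_listed por; then
    echo "por language data found"
  else
    echo "por language data missing" >&2
  fi
fi
```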