Make your PDFs searchable using OCR for free on macOS

Requirements and basic usage
Advanced: handling encrypted PDFs + Deep Search
End result

Ever encountered a scanned PDF where search (Cmd+F) does not work and you cannot select any text either?

What you need is OCR, which adds a searchable text layer to the PDF. There are many paid options for the Mac, but if you are willing to use the command line, you can get good results quickly by processing files in parallel.

All for free!

Requirements and basic usage

First, install Homebrew and then install ocrmypdf.

brew install ocrmypdf

Usage example:

ocrmypdf input.pdf output.pdf

Usage example for ocrmypdf

This was tested on macOS Catalina (10.15.4).
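If some pages of a PDF already contain text, ocrmypdf stops with an error unless you tell it how to treat them. Here is a minimal sketch of a wrapper for that case. The function names `ocr_name` and `ocr_one` are made up for illustration, and the `-ocr.pdf` suffix is just a naming choice; `--skip-text` is the ocrmypdf option that leaves pages with existing text untouched.

```shell
# Hypothetical helper: turn "scan.pdf" into "scan-ocr.pdf".
ocr_name() {
  printf '%s-ocr.pdf\n' "${1%.pdf}"
}

# Hypothetical wrapper: OCR one file and write the result next to it.
ocr_one() {
  in="$1"
  out="$(ocr_name "$in")"
  # --skip-text: do not OCR pages that already have a text layer
  ocrmypdf --skip-text "$in" "$out"
}

# Example call (commented out so that pasting this block runs nothing):
# ocr_one "my scan.pdf"    # writes "my scan-ocr.pdf"
```

Quoting `"$in"` and `"$out"` keeps file names with spaces in one piece.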

Advanced: handling encrypted PDFs + Deep Search

Sometimes the PDFs are encrypted, and ocrmypdf only complains after the fact. In those cases, you can run these commands to decrypt the PDFs and then apply OCR. They process all PDFs in the current folder and its subfolders, i.e. with deep search:

# delete any existing output so the run can be redone (e.g. half-processed files left over from an earlier error)
# -print0 in find + -0 in xargs -> handle file names with spaces
find . -name "*-decrypted.pdf" -print0 | xargs -0 -I{} rm -f {}
find . -name "*-ocr.pdf" -print0 | xargs -0 -I{} rm -f {}
# decrypt, then apply OCR, in parallel on all available cores (macOS only; on Linux replace sysctl -n hw.logicalcpu with nproc)
# -I{} already runs one command per file, so -n 1 is not needed
# skip *-decrypted.pdf in the first step so files created during the run are not picked up again
find . -name "*.pdf" ! -name "*-decrypted.pdf" -print0 | xargs -0 -P "$(sysctl -n hw.logicalcpu)" -I{} qpdf --decrypt {} {}-decrypted.pdf
find . -name "*-decrypted.pdf" -print0 | xargs -0 -P "$(sysctl -n hw.logicalcpu)" -I{} ocrmypdf {} {}-ocr.pdf
# remove intermediate files (decrypted PDFs)
find . -name "*-decrypted.pdf" -print0 | xargs -0 -I{} rm -f {}
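The comment above notes that `sysctl -n hw.logicalcpu` only works on macOS and that Linux needs `nproc` instead. One way to avoid editing the commands per platform is a small helper that picks whichever is available. This is only a sketch: `cpu_count` is a made-up name, and the fallback of 1 is an assumption for systems that have neither tool.

```shell
# Hypothetical helper: print the number of logical CPU cores.
cpu_count() {
  if command -v nproc >/dev/null 2>&1; then
    nproc                        # Linux (GNU coreutils)
  elif command -v sysctl >/dev/null 2>&1; then
    sysctl -n hw.logicalcpu      # macOS
  else
    echo 1                       # assumed fallback: run one job at a time
  fi
}

# Example use with the OCR step (commented out so that pasting this block runs nothing):
# find . -name "*-decrypted.pdf" -print0 | xargs -0 -P "$(cpu_count)" -I{} ocrmypdf --jobs 1 {} {}-ocr.pdf
```

The `--jobs 1` in the example is optional. ocrmypdf can already use several cores for a single file, so limiting each process to one job while xargs runs many files at once may keep the machine from being oversubscribed. Whether that is faster will depend on how many files you have and how many pages each one contains.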

If everything goes as planned, you will have parallel processing of OCR for your PDFs:

Parallel processing of OCR

All cores loaded

Tip: As a sidenote, I believe this uses AVX instructions, because this i9-7900X topped out at 4.0GHz on 10 cores, while the OS requested 4.5GHz. I run a negative AVX offset of -5, because even with watercooling, you can’t tame this toaster.

End result

And you will have proper text selection and search, even in images with captions:

Text selection now working

Tell me how it goes for you!
