Friday, 18 February 2011

OCR of scanned PDFs in Linux

There still seems to be no quick, ready-made solution, but I found a few interesting scripts.

This script, based on Tesseract, worked well for me. It requires Tesseract and Ghostscript to be installed, and it produces a set of ASCII text files from the PDF. The OCR engine is the same one used by Google, so you can expect it to work pretty well.
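To give an idea of what such a script does, here is a minimal Python sketch of the same two-step approach: Ghostscript renders each page of the PDF to an image, then Tesseract turns each image into a text file. This is not the linked script, just my own illustration. The function names, the `page-%03d.tif` naming pattern, the 300 dpi resolution and the `eng` language are all assumptions of mine, and it expects the `gs` and `tesseract` commands to be on the PATH.

```python
import subprocess
from pathlib import Path


def gs_command(pdf, out_dir, dpi=300):
    """Build the Ghostscript call that renders every PDF page to a grayscale TIFF."""
    # Hypothetical naming scheme: page-001.tif, page-002.tif, ...
    pattern = str(Path(out_dir) / "page-%03d.tif")
    return ["gs", "-dNOPAUSE", "-dBATCH", "-q", "-sDEVICE=tiffgray",
            f"-r{dpi}", f"-sOutputFile={pattern}", str(pdf)]


def tesseract_command(image, lang="eng"):
    """Build the Tesseract call; Tesseract itself appends .txt to the output base."""
    image = Path(image)
    return ["tesseract", str(image), str(image.with_suffix("")), "-l", lang]


def ocr_pdf(pdf, out_dir, dpi=300, lang="eng"):
    """Render the PDF to images, OCR each one, and return the text file paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(gs_command(pdf, out_dir, dpi), check=True)
    texts = []
    for image in sorted(out_dir.glob("page-*.tif")):
        subprocess.run(tesseract_command(image, lang), check=True)
        texts.append(image.with_suffix(".txt"))
    return texts
```

Calling `ocr_pdf("scan.pdf", "out")` should leave one text file per page in the `out` directory, next to the intermediate TIFF images. A higher `dpi` usually helps recognition at the cost of speed and disk space.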

A slightly less convenient solution can be found in this article, which also uses a shell script based on Tesseract.

There is also another solution that uses other OCR engines.

There also seems to be a potentially elegant GUI solution in OCRFeeder, but I haven't tried it yet. I'll let you know how it works; for now I am just bookmarking these links.
