Showing posts with label pdf. Show all posts

Friday, 18 February 2011

OCR of scanned PDFs in Linux

It seems there is still no quick-and-ready solution, but I found a few interesting scripts.

This script, based on Tesseract, worked well for me. It requires Tesseract and Ghostscript to be installed, and it produces a number of ASCII text files from the PDF. Given that the OCR engine is the same one used by Google, you can be assured it works pretty well.
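For the curious, here is a rough sketch of what such a pipeline looks like. This is not the linked script, only my guess at the general shape: Ghostscript renders each page to a TIFF image, then Tesseract is run on each image. The function names, the 300 dpi default and the page_%03d.tif naming are my own assumptions.

```python
import os
import shlex


def rasterize_command(pdf_path, out_dir, dpi=300):
    """Ghostscript command that renders every PDF page to a grayscale TIFF."""
    # Ghostscript expands %03d to the page number (page_001.tif, page_002.tif, ...).
    pattern = f"{out_dir}/page_%03d.tif"
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=tiffgray",
        f"-r{dpi}",
        f"-sOutputFile={pattern}",
        pdf_path,
    ]


def ocr_command(image_path, lang="eng"):
    """Tesseract command; Tesseract appends .txt to the output base name."""
    output_base = os.path.splitext(image_path)[0]
    return ["tesseract", image_path, output_base, "-l", lang]


if __name__ == "__main__":
    # Only print the commands; nothing is executed here.
    print(shlex.join(rasterize_command("scan.pdf", "pages")))
    print(shlex.join(ocr_command("pages/page_001.tif")))
```

To actually run a step, pass one of these lists to subprocess.run(cmd, check=True). You end up with one text file per page, which matches what the script above gave me.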

A somewhat less comfortable solution can be found in this Linux.com article, which also uses a shell script based on Tesseract.

Another solution using other engines.

There also seems to be a potentially elegant GUI solution, OCRFeeder, but I haven't tried it yet. I'll let you know how it works; for now I am just bookmarking these links.

Monday, 12 May 2008

EPS files and pdflatex

There is this odd quirk in LaTeX. The latex executable compiles your .tex files into the old-fashioned DVI format. As such, it accepts by default only .eps (Encapsulated PostScript) images. pdflatex compiles your .tex files into the standard PDF format. For some mysterious reason, pdflatex accepts raster formats like .png and .jpg, but does not accept .eps!

Sometimes you want the best of both worlds. An undergraduate in my lab, after some googling, found that you can make pdflatex accept .eps files happily:

1) Install the texlive-extra packages, or any other package containing the epstopdf utility.

2) Insert the following code in your .tex file:

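% Define an \ifpdf switch that is true when running under pdfTeX.
\newif\ifpdf
\ifx\pdfoutput\undefined
\pdffalse % \pdfoutput is unknown: not pdfTeX
\else
\pdfoutput=1 % pdfTeX: ask for PDF output
\pdftrue
\fi
\ifpdf
\usepackage{graphicx}
\usepackage{epstopdf}
% Treat .eps as .pdf: run epstopdf on the file (needs -shell-escape).
\DeclareGraphicsRule{.eps}{pdf}{.pdf}{`epstopdf #1}
\pdfcompresslevel=9 % maximum stream compression
\else
\usepackage{graphicx}
\fi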

3) Compile using pdflatex with the -shell-escape command line option.

It seems to work.
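If you drive the compilation from a script, the two commands involved can be sketched like this. The function names and file names are made up for illustration; the only things taken from the recipe above are the epstopdf call that the graphics rule runs for each figure and the -shell-escape option.

```python
import shlex


def epstopdf_command(eps_file):
    """The conversion the graphics rule asks the shell to run for one figure."""
    return ["epstopdf", eps_file]


def pdflatex_command(tex_file, shell_escape=True):
    """pdflatex invocation; -shell-escape lets TeX call epstopdf itself."""
    cmd = ["pdflatex"]
    if shell_escape:
        cmd.append("-shell-escape")
    cmd.append(tex_file)
    return cmd


if __name__ == "__main__":
    # Only print the commands; nothing is executed here.
    print(shlex.join(epstopdf_command("figure.eps")))
    print(shlex.join(pdflatex_command("paper.tex")))
```

Keep in mind that -shell-escape lets the document run arbitrary commands, so use it only on .tex files you trust.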

Tuesday, 18 September 2007

chm2pdf 0.0.3 is out!

Sorry for the long delay, but this little script is not dead. chm2pdf 0.0.3 is out: now with image support! (Intra-PDF links are still dead, though I have received suggestions on how to fix them.)

Download it here.

The script logic changed quite a lot, and it is still very rough around the edges (0.1 is still relatively far off). I'd like every one of you to test it as much as possible, since it is very likely that bugs and quirks will occur. My knowledge of the CHM file format is tentative at best.

Let me know how it works for you.

As for the xtopdf collaboration, I have to get back in touch with the xtopdf developer to see where things stand. In any case, I'd like to have chm2pdf in a decent state (0.1) first.

Monday, 6 August 2007

chm2pdf merges with xtopdf?

xtopdf is a neat collection of Python utilities, currently in development, whose aim is to convert every sensible document format into a PDF file. Nice project. I'm beginning talks with its main developer to see if my little chm2pdf can find a place in there.

In the meantime chm2pdf has already received 31 downloads! A really unexpected success, but only a little feedback so far...

Thursday, 2 August 2007

Announcing chm2pdf

Again the world of interconverting formats, but this time it is a little something I did myself. I needed to print a CHM file, but printing it page by page is of course painstaking. How I would have liked a PDF of that file... but how?

Well, I scratched my itch and wrote a very small and raw, but functional, CHM to PDF command line converter. You can find it here and it's called, quite obviously, chm2pdf. It is a small Python script that glues together chmlib (via pychm), pdftk and htmldoc. Installation and usage should be straightforward. Functionality is still pretty limited (images are not converted yet, for example) but hey, it's version 0.0.2 ...
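To give an idea of the gluing involved, here is a minimal sketch of the htmldoc step only. It is not the actual chm2pdf code: it assumes the HTML pages have already been extracted from the CHM file, and the function name and file names are hypothetical.

```python
import shlex


def htmldoc_command(html_pages, output_pdf):
    """htmldoc call that renders the given HTML pages, in order, into one PDF."""
    if not html_pages:
        raise ValueError("need at least one HTML page")
    # --webpage treats the input as plain web pages (no book structure);
    # -f names the output file.
    return ["htmldoc", "--webpage", "-f", output_pdf, *html_pages]


if __name__ == "__main__":
    # Only print the command; nothing is executed here.
    pages = ["extracted/index.html", "extracted/chapter1.html"]
    print(shlex.join(htmldoc_command(pages, "manual.pdf")))
```

The order of the pages in the list is the order they get in the PDF, so a real converter has to recover that order from the CHM table of contents.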

Let me know what you think about it, and drop me a mail with any feedback.

Thursday, 17 May 2007

EPS to PDF: how to avoid clipping

Again, the fancy world of interconverting formats. I had to convert a bunch of Encapsulated PostScript (.eps) files (generated by Inkscape) into PDF pages, for work. No problem, I initially thought, there is ps2pdf that will help me.

Problem is, ps2pdf has the nasty habit of using a fixed page size by default, clipping everything that goes beyond the limits of the page. No matter if most of the drawing is outside the page: ps2pdf will silently and mercilessly cut most of it.

On top of this, the ps2pdf documentation is bad by almost any standard. The reason is that ps2pdf is a script that relies on Ghostscript, so the ps2pdf docs are (mostly) Ghostscript docs.

After almost 90 minutes of googling, I found what I needed. To scale an arbitrary EPS file so that it fits a PDF page of the default Ghostscript page size, instead of being clipped, just type:

ps2pdf -dEPSFitPage file.eps file.pdf
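Since I had a bunch of files, a small helper that builds this command for each of them is handy. The following is only a sketch; the function name and the fit_page parameter are my own invention. As far as I can tell Ghostscript also has -dEPSCrop, which shrinks the page to the drawing's bounding box instead of scaling the drawing to the page, so the sketch offers that as an alternative.

```python
import os
import shlex


def eps_to_pdf_command(eps_path, fit_page=True):
    """ps2pdf call that converts one EPS file without clipping the drawing."""
    pdf_path = os.path.splitext(eps_path)[0] + ".pdf"
    # -dEPSFitPage scales the drawing to the page;
    # -dEPSCrop sets the page to the EPS bounding box instead.
    flag = "-dEPSFitPage" if fit_page else "-dEPSCrop"
    return ["ps2pdf", flag, eps_path, pdf_path]


if __name__ == "__main__":
    # Only print the commands; nothing is executed here.
    for name in ["figure1.eps", "figure2.eps"]:
        print(shlex.join(eps_to_pdf_command(name)))
```

Pass each list to subprocess.run(cmd, check=True) to do the actual conversion.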

Monday, 5 March 2007

Merge PDF files on Linux

Sometimes you need to merge PDF files made by someone else into a single file. Your fellow Windows or Mac users will probably do it with Adobe Acrobat or the like. On Linux you don't need the Adobe software (although you can have it if you like). You can do it from the command line; it's Linux after all, isn't it?


The first method uses convert, from the ImageMagick toolkit. You probably already have it installed on your favourite Linux distribution; if not, almost every Linux distribution has a package for it. Even if you don't need to edit PDF files you should have it: it is a wonderful Swiss Army knife for command line image processing. The syntax for merging PDF files is simple:


convert file1.pdf file2.pdf file3.pdf out.pdf


(That is, the last file name is the output file name.) This usually works, but 1) it is slow and 2) I found it sometimes has issues with image resolution. convert rasterizes each page instead of copying it, which probably accounts for both. So I looked for another solution, and I found pdftk. The name stands for "PDF ToolKit", and it really is one. It is a free (open source under the GNU GPL), wonderful utility that with a bit of magic allows you to manage PDF files from the command line. It works on Linux, Windows and Mac. pdftk can merge PDF documents, split PDF pages into a new document, rotate PDF pages or documents, decrypt and encrypt, fill PDF forms, apply a background watermark or a foreground stamp, burst a PDF document into single pages... whatever.


The syntax for merging with pdftk is almost as simple:


pdftk file1.pdf file2.pdf file3.pdf cat output out.pdf

You just need to add the magic "cat output" between the input PDFs and the output file name. Compared with convert, it is truly fast and in my experience gives better results. It may also come in handy when I have to work with PDF files in other ways. KDE users may also like PDF Concat, a Kommander script that acts as a simple pdftk frontend for merging PDFs.
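If you merge files from a script, both command lines are easy to build. Here is a small sketch; the function names are made up, and the density parameter is my own addition. It maps to convert's -density option, which sets the resolution used when the PDF pages are rasterized and may help with the resolution issues.

```python
import shlex


def pdftk_merge_command(inputs, output):
    """pdftk call: input files, then the magic 'cat output', then the result."""
    if not inputs:
        raise ValueError("need at least one input PDF")
    return ["pdftk", *inputs, "cat", "output", output]


def convert_merge_command(inputs, output, density=None):
    """ImageMagick call: the last file name is the output file."""
    if not inputs:
        raise ValueError("need at least one input PDF")
    cmd = ["convert"]
    if density is not None:
        # Must come before the inputs so it applies when they are read.
        cmd += ["-density", str(density)]
    return cmd + [*inputs, output]


if __name__ == "__main__":
    # Only print the commands; nothing is executed here.
    files = ["file1.pdf", "file2.pdf", "file3.pdf"]
    print(shlex.join(pdftk_merge_command(files, "out.pdf")))
    print(shlex.join(convert_merge_command(files, "out.pdf", density=150)))
```

Pass either list to subprocess.run(cmd, check=True) to perform the merge.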