PDFs are a one-way street

PDF is a file format that offers significant advantages. The software to read PDFs is free, and the formatting of a PDF remains intact even if the reader does not have the same fonts on their device. PDFs are also easy to create. All you need is a source file, i.e. a file created in any app. Free plugins are available for Windows, while Mac users don’t need them as they can save files as PDFs from any application.

This makes PDF (Wikipedia) an ideal format for distributing texts such as manuals, newsletters, brochures, press releases, reports, annual reports, ebooks, etc. The recipient doesn’t need to have Word, Affinity Publisher, InDesign, Excel, Pages, etc., and can read PDFs on Windows, Linux, macOS, iOS, iPadOS, and Android.

The problem

PDFs also come with serious disadvantages. If you need to make more than the most basic of edits to a PDF, you have no choice but to open the source file in the program in which it was created, modify the file, and create a new PDF.

If you need to modify a PDF you have not created yourself (i.e. you don’t have the source file), then you will have to convert it to another format. This conversion is anything but straightforward. Creating a PDF is easy; the road back to an editable file is a tiresome struggle. The more complex the layout, the greater the chaos when you copy a page and paste it into a text editor or word processor.

In the image next to this, the light blue numbers indicate the order in which the various text elements are pasted. Everything gets mixed up, leaving you with a text document that is not very useful. If the PDF contains columns and multiple text boxes, it can happen that headings that were at the top end up somewhere in the middle of the text, and the order of paragraphs gets completely shuffled. If there are tables in the PDF, the chaos after copying and pasting is even greater.

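The reason is that PDF is a destructive format. Words, lines, and paragraphs generally do not exist as structural units in a PDF. Instead, coordinates are stored for each element on the page, and there is no continuous text flow. Any app that extracts text has to guess the reading order from those positions.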
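To illustrate why the pasted order gets mixed up, here is a minimal sketch in Python. It does not read a real PDF; the `Fragment` type, the coordinates, and the function name are all made up for the example, and for simplicity the vertical position is measured from the top of the page. It only shows what happens when positioned pieces of text from a two-column page are ordered naively, top to bottom and left to right.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """A piece of text with a position (hypothetical, for illustration only)."""
    x: float  # distance from the left edge of the page
    y: float  # distance from the top edge of the page
    text: str


# A made-up two-column page: each column has two lines.
fragments = [
    Fragment(300, 120, "Right column, line 2"),
    Fragment(50, 100, "Left column, line 1"),
    Fragment(300, 100, "Right column, line 1"),
    Fragment(50, 120, "Left column, line 2"),
]


def naive_reading_order(frags):
    """Order fragments row by row, ignoring the fact that there are columns."""
    return [f.text for f in sorted(frags, key=lambda f: (f.y, f.x))]


for line in naive_reading_order(fragments):
    print(line)
```

The lines of the two columns come out interleaved, which is roughly the kind of shuffle you see after copying and pasting a page with columns.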

The solutions

There are now many applications that allow you to convert a PDF to an editable format (such as Word, RTF, or plain text). Usually, the results leave much to be desired. Complex algorithms try to preserve the text flow and layout, but each PDF is different, so the results are a mixed bag. Click in a new Word document after your conversion, and you will see to your dismay that all paragraphs are in separate text boxes, requiring a lot of cutting and pasting to create an editable document.

The best option, I found, is to start from scratch. That means copying as many paragraphs in one go as you can and pasting them as plain text into the editor of your choice. If that doesn’t yield good results, OCR, or optical character recognition (Wikipedia), is another option. This technology was invented to recognize text scanned from paper but can also be used for PDFs. In this case, the internal structure of the PDF (or lack thereof) does not play a role during the recognition process, and depending on the complexity of the layout, you get a more readable text that requires less post-processing. I have achieved the best results with ABBYY FineReader, which is available for both Mac and Windows. The Mac version is available in the App Store for € 79.99.

Postscript 05-14-2023: I am currently working on the translation of Barbra Streisand’s memoirs. Copying from the manuscript is a challenge because the numbering in the margins gets copied as well. The solution is TextSniper, an app that allows you to capture anything on your screen as an image and then convert it to text using OCR. OCR for a whole page takes less than a second and has been error-free so far. I use TextSniper daily for all the text that is not copyable: an option in an app, a part of a web page that is copy-protected, text on an image, etc. The app is cheaper than ABBYY FineReader and also easier to use.

If you need full control over the conversion from PDF to text, you can still copy and paste manually. This usually means that you have to copy a page in parts to prevent everything from getting mixed up. If you see the selection jumping to a distant part of the text while selecting with the mouse, you know you need to select less text.
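Copying a page in parts amounts to handling each column on its own. The sketch below shows that idea in Python on made-up data: the `(x, y, text)` tuples, the `boundary_x` value, and the function name are assumptions for illustration, not something taken from a real PDF tool. The vertical position is again measured from the top of the page.

```python
def column_reading_order(frags, boundary_x):
    """Read the left column top to bottom, then the right column.

    frags is a list of (x, y, text) tuples; boundary_x is the horizontal
    position that separates the two columns (chosen by hand here).
    """
    left = [f for f in frags if f[0] < boundary_x]
    right = [f for f in frags if f[0] >= boundary_x]

    def top_to_bottom(f):
        return f[1]

    ordered = sorted(left, key=top_to_bottom) + sorted(right, key=top_to_bottom)
    return [f[2] for f in ordered]


# A made-up two-column page with the pieces in arbitrary order.
page = [
    (300, 120, "Right column, line 2"),
    (50, 100, "Left column, line 1"),
    (300, 100, "Right column, line 1"),
    (50, 120, "Left column, line 2"),
]

for line in column_reading_order(page, boundary_x=200):
    print(line)
```

This only works when you know where the columns are, which is exactly what you decide by eye when you select a smaller part of the page with the mouse.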

You can paste into Word or another word processor, but there is a risk that you will paste the underlying code of the PDF into your document. You can prevent this by pasting without formatting (a feature available in most word processors) or, even better, by pasting into a text editor. In Windows, this is Notepad; on Mac, it’s TextEdit. In my opinion, working in a simple text editor beats working in apps like Microsoft Word hands down.

Even when pasting small text portions, there is often still some post-processing needed. Line breaks or hard returns are often inserted where they shouldn’t be, hyphens need to be removed from broken words, and in some words, spaces appear where they shouldn’t, making “financial” become “fi nancial.”
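These clean-up steps are repetitive enough to script. What follows is a rough sketch in Python using only the standard `re` module; the function name `clean_pasted_text` and the `known_fixes` list are my own inventions for the example, and the rules are deliberately simple. Split words like “fi nancial” are handled with an explicit list rather than a general rule, because a general rule would also glue together words that belong apart.

```python
import re


def clean_pasted_text(text, known_fixes=None):
    """Tidy up text pasted from a PDF (a simple sketch, not a complete solution)."""
    # Rejoin words that were hyphenated across a line break: "docu-\nment" -> "document".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn single line breaks into spaces but keep blank lines as paragraph breaks.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces or tabs into a single space.
    text = re.sub(r"[ \t]{2,}", " ", text)
    # Apply hand-picked replacements for words split by a stray space.
    for wrong, right in (known_fixes or {}).items():
        text = text.replace(wrong, right)
    return text


pasted = "The fi nancial docu-\nment was\nlong.\n\nNew paragraph."
print(clean_pasted_text(pasted, known_fixes={"fi nancial": "financial"}))
```

One caveat: the first rule also removes hyphens that should stay, so a compound such as “well-known” broken across two lines would come out as “wellknown”. A quick read-through afterwards remains necessary.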

With find-and-replace commands and a few macros, the copy-and-paste process can still be relatively quick, although the road from PDF to editable is nowhere near as easy as it could be.


Copyright © 10-02-2012 Theo van der Ster
