# Portable Document Format

### On using a thousand dollar computer to simulate a one cent piece of paper with zero day exploits

Usefulness: 🔧
Novelty: 💡
Uncertainty: 🤪 🤪 🤪
Incompleteness: 🚧 🚧 🚧

Portable Document Format, the obscure and inconvenient format beloved of academics, bureaucracies and… Adobe, I suppose? It has the wonderful feature of being a better format than Microsoft Word, in much the same way that sticking your hand in a blender is better than sticking your hand in a woodchipper.

## Extracting data

Tabula

Tabula is a tool for liberating data tables locked inside PDF files.

pdfplumber also exists but I have not used it.

Camelot is an OpenCV-backed table extractor. It has a browser-based GUI, Excalibur.

There are both open (Tabula, pdfplumber) and closed-source tools that are widely used to extract data tables from PDFs. They either give a nice output or fail miserably. There is no in between. This is not helpful since everything in the real world, including PDF table extraction, is fuzzy.

tl;dr Tabula if you want the easiest possible experience at the cost of some power, otherwise Camelot/Excalibur.

camelot -o table.csv -f csv lattice file.pdf
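The same extraction can be driven from Python, which gives some handle on the fuzziness. This is a minimal sketch, assuming the `camelot-py` package is installed and that the tables have ruled lines (hence `lattice`). The helper names `good_tables` and `extract_tables` and the 90% threshold are made up for illustration; `read_pdf`, `parsing_report` and `to_csv` are Camelot's own.

```python
def good_tables(reports, min_accuracy=90.0):
    """Return indices of tables whose parsing report clears the threshold.

    `reports` is a list of dicts shaped like Camelot's `parsing_report`,
    e.g. {"accuracy": 99.0, "whitespace": 12.2, "order": 1, "page": 1}.
    """
    return [i for i, r in enumerate(reports) if r["accuracy"] >= min_accuracy]


def extract_tables(pdf_path, out_prefix="table", min_accuracy=90.0):
    """Extract lattice-style tables and save the plausible ones as CSV."""
    import camelot  # imported lazily; needs `pip install camelot-py`

    tables = camelot.read_pdf(pdf_path, pages="all", flavor="lattice")
    reports = [t.parsing_report for t in tables]
    written = []
    for i in good_tables(reports, min_accuracy):
        out_path = f"{out_prefix}-{i}.csv"
        tables[i].to_csv(out_path)  # one CSV per accepted table
        written.append(out_path)
    return written
```

Calling `extract_tables("file.pdf")` would write `table-0.csv`, `table-1.csv` and so on for whichever tables pass; the rejected ones are the candidates for a second attempt with different settings.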

I would like to have a PDF reader that supports the same annotations across Linux, macOS and Android. Is that feasible?

I’ve always hated the gigantic Adobe Acrobloat reader, but it has become useless to me now since there no longer seems to be a Linux version.

The KDE (Okular) and GNOME (Evince) defaults are both passable but have clunky annotation exchange. Evince in particular has a terrible UI for even viewing annotations: it summarises the text of each annotation in the sidebar as … my name and the date? I already know my name. It’s the content of the PDF I am concerned with here. I’m not saying there is no conceivable use case for this, just not a plausibly common one.

mupdf might be an option, although the Linux versions look outdated.

sudo add-apt-repository ppa:ubuntuhandbook1/apps
sudo apt update
sudo apt install mupdf mupdf-tools

There is cross-platform support from the commercial app Foxit.

## Command line tips

### Reduce size of bloated PDF

gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.5 \
-dPDFSETTINGS=/ebook \
-dNOPAUSE -dQUIET -dBATCH \
-sOutputFile=output.pdf input.pdf

Or, wrapped up into a nice little script, ShrinkPDF (90 is the DPI here):

./shrinkpdf.sh in.pdf out.pdf 90

There is also cpdf and the GUI version Densify.
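If you are calling the Ghostscript incantation above from a script, it may be less error-prone to assemble the argument list in Python. A minimal sketch, assuming `gs` is on the `PATH`; the function names `shrink_command` and `shrink` are hypothetical, while the preset names are the values Ghostscript documents for `-dPDFSETTINGS`.

```python
import subprocess

# The values Ghostscript documents for -dPDFSETTINGS.
PRESETS = ("screen", "ebook", "printer", "prepress", "default")


def shrink_command(src, dst, preset="ebook", compat="1.5"):
    """Build the gs argument list for re-distilling `src` into `dst`."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset {preset!r}; expected one of {PRESETS}")
    return [
        "gs",
        "-sDEVICE=pdfwrite",
        f"-dCompatibilityLevel={compat}",
        f"-dPDFSETTINGS=/{preset}",
        "-dNOPAUSE", "-dQUIET", "-dBATCH",
        f"-sOutputFile={dst}",
        src,
    ]


def shrink(src, dst, preset="ebook"):
    """Run Ghostscript; raises CalledProcessError if gs exits non-zero."""
    subprocess.run(shrink_command(src, dst, preset), check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with file names that contain spaces, which PDFs from bureaucracies usually do.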

### Concatenate or split PDFs

This Ghostscript command works to concatenate PDFs:

gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite \
-dPDFSETTINGS=/prepress -sOutputFile=output.pdf input*.pdf

You can also split PDFs with Ghostscript, but usually you want a GUI to see what you are splitting, no?

PDFMix and PDF shuffler have both been recommended to me for this.
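For the non-GUI route, splitting with Ghostscript comes down to its `-dFirstPage` and `-dLastPage` switches. A minimal sketch, assuming `gs` is on the `PATH` and that you already know the page count (Ghostscript is not asked for it here); `page_ranges` and `split_command` are hypothetical helper names.

```python
def page_ranges(n_pages, chunk):
    """Split pages 1..n_pages into consecutive (first, last) ranges."""
    if n_pages < 1 or chunk < 1:
        raise ValueError("n_pages and chunk must both be at least 1")
    return [
        (first, min(first + chunk - 1, n_pages))
        for first in range(1, n_pages + 1, chunk)
    ]


def split_command(src, dst, first_page, last_page):
    """Build the gs argument list that copies one page range of `src` to `dst`."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("need 1 <= first_page <= last_page")
    return [
        "gs", "-dBATCH", "-dNOPAUSE", "-q",
        "-sDEVICE=pdfwrite",
        f"-dFirstPage={first_page}",
        f"-dLastPage={last_page}",
        f"-sOutputFile={dst}",
        src,
    ]
```

Each list returned by `split_command` can be handed to `subprocess.run(..., check=True)`; looping over `page_ranges(10, 4)` would give one output file each for pages 1 to 4, 5 to 8 and 9 to 10.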

### EPS to PDF

EPS to PDF conversion:

ps2pdf14 -dEPSCrop Logo.eps

### PDF to SVG

pdf2svg generates editable vector diagrams from the PDF.

### HTML to PDF

See weasyprint, below.

### Diffing PDFs

The use case here is, they say, a (presumably scientific) review. “You reviewed version A of a paper, and receive version B, and wonder what the changes are.” The tool is pdfdiff.

## Crop marks

There are at least two options.

None of them makes it clear which of TrimBox, BleedBox, CropBox or ArtBox is what you truly want. This might clarify it slightly but I lost focus around here.

### Method A

You can add crop marks to a PDF document with different PDF tools, e.g. pdftk:

1. Export the first page with crop marks to a PDF file (your_cropmark.pdf)
2. Join it with your PDF document (your_document.pdf) in the command line:
pdftk your_document.pdf multistamp your_cropmark.pdf output result.pdf

### Method B

NOTE: you can set PDF cropping values with Ghostscript for printing:

Create a plain text file, pdfmark.txt, with the right cropping values (e.g. this is a 5mm crop of A4):

[/CropBox [14.17 14.17 581.1 827.72] /PAGES pdfmark

Alternatively, use the command line

gs -c "[/CropBox  [14.17 14.17 581.1 827.72] /PAGES pdfmark" \

Now, convert your_document.pdf using the previous file (pdfmark.txt):

gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
$OPTIONS \
-c .setpdfwrite \
-sOutputFile=result.pdf \
-f your_document.pdf \
pdfmark.txt
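The CropBox numbers in pdfmark.txt are just millimetres converted to PDF points (72 per inch), given as lower-left and upper-right corners. A small sketch of that arithmetic, so you can generate the line for other page sizes or margins; `crop_box` and `crop_pdfmark` are hypothetical helper names, and a uniform margin on all four sides is assumed.

```python
MM_TO_PT = 72 / 25.4  # PDF units are points: 72 per inch, 25.4 mm per inch


def crop_box(width_mm, height_mm, margin_mm):
    """Return (llx, lly, urx, ury) in points for a uniform inset margin."""
    margin = margin_mm * MM_TO_PT
    return (
        round(margin, 2),
        round(margin, 2),
        round(width_mm * MM_TO_PT - margin, 2),
        round(height_mm * MM_TO_PT - margin, 2),
    )


def crop_pdfmark(box):
    """Format a box as the pdfmark line that Ghostscript reads."""
    return "[/CropBox [{} {} {} {}] /PAGES pdfmark".format(*box)


# A4 is 210 x 297 mm; a 5 mm crop reproduces the values used above.
print(crop_pdfmark(crop_box(210, 297, 5)))
# prints: [/CropBox [14.17 14.17 581.1 827.72] /PAGES pdfmark
```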

## Color conversion

Nightmares. Colour management is generally complicated. Ghostscript colour management specifically is complicated, and has many moving parts that change rapidly; e.g. the -dUseCIEColor option was removed in Ghostscript 9, because it is apparently a broken noob feature. Its replacement is broken documentation.

### CMYK

CMYK colour conversion of an RGB PDF with Ghostscript:

PDF to PDF example.

gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
-sColorConversionStrategy=CMYK \
-sColorConversionStrategyForImages=CMYK \
-dProcessColorModel=/DeviceCMYK \
-dCompatibilityLevel=1.5 \
-sOutputFile=result_cmyk.pdf \
your_document.pdf

### Greyscale

gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
-sColorConversionStrategy=Gray \
-dProcessColorModel=/DeviceGray \
-dCompatibilityLevel=1.5 \
-sOutputFile=result_gray.pdf \
your_document.pdf

## Programmatic editing and generation

So many of these!

Weasyprint seems the cleanest. It converts HTML+CSS into PDF, and is written in pure Python. It can be used from the command line or programmatically.

It is based on various libraries but not on a full rendering engine like Blink, Gecko or WebKit. The CSS layout engine is written in Python, designed for pagination, and meant to be easy to hack on.

pip install weasyprint
weasyprint https://weasyprint.org/ weasyprint.pdf
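The programmatic route looks roughly like the following. A minimal sketch, assuming `weasyprint` is installed; `HTML(string=...).write_pdf(...)` is WeasyPrint's API, while `build_html`, `render_pdf` and the page CSS are illustrative choices of mine.

```python
from html import escape

# Paged-media CSS; WeasyPrint reads the @page rule for size and margins.
PAGE_CSS = "@page { size: A4; margin: 2cm }"


def build_html(title, paragraphs):
    """Assemble a tiny HTML document, escaping the text content."""
    body = "".join(f"<p>{escape(p)}</p>" for p in paragraphs)
    return (
        "<!DOCTYPE html><html><head>"
        f"<title>{escape(title)}</title><style>{PAGE_CSS}</style>"
        f"</head><body><h1>{escape(title)}</h1>{body}</body></html>"
    )


def render_pdf(html, out_path):
    """Render an HTML string to a PDF file with WeasyPrint."""
    from weasyprint import HTML  # imported lazily; needs `pip install weasyprint`

    HTML(string=html).write_pdf(out_path)
```

Something like `render_pdf(build_html("Report", ["First paragraph."]), "report.pdf")` would then produce a one-page A4 PDF.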

svglib provides a pure Python library that can convert SVG to PDF, and a command line utility for the same, svg2pdf. Thus one can, e.g., add SVGs to PDFs in reportlab.

reportlab is the famed monstrous classic way of programmatically generating PDFs from code. It includes a modicum of typesetting. It doesn’t edit PDFs so much, but it generates them pretty well. Its integration with other things is often weak; if you thought that inserting LaTeX equations, HTML snippets etc. would be simple, think again. On the other hand it has fancy features such as its own chart generation library. On the third hand, there are already better charting libraries that it doesn’t use. Litmus test: Use it if the following feels to you like a natural way to print two columns:

from reportlab.platypus import (
BaseDocTemplate,
Frame,
Paragraph,
PageBreak,
PageTemplate )
from reportlab.lib.styles import getSampleStyleSheet
import random

words = (
"lorem ipsum dolor sit amet consetetur "
"sadipscing elitr sed diam nonumy eirmod "
"tempor invidunt ut labore et").split()

styles=getSampleStyleSheet()
Elements=[]

doc = BaseDocTemplate(
'basedoc.pdf',
showBoundary=1)

#Two Columns

frame1 = Frame(
doc.leftMargin,
doc.bottomMargin,
doc.width/2-6,
doc.height,
id='col1')
frame2 = Frame(
doc.leftMargin+doc.width/2+6,
doc.bottomMargin,
doc.width/2-6,
doc.height,
id='col2')

Elements.append(
Paragraph(
" ".join([random.choice(words) for i in range(1000)]),
styles['Normal']))

# register the two-column page template before building
doc.addPageTemplates([
PageTemplate(id='TwoCol',frames=[frame1,frame2]),
])

#start the construction of the pdf
doc.build(Elements)

pdfrw is a Python library and utility that reads and writes PDF files:

• Operations include subsetting, merging, rotating, modifying metadata, etc. […]
• Can be used either standalone, or in conjunction with reportlab to reuse existing PDFs in new ones
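Subsetting and merging with pdfrw might look like this. A minimal sketch, assuming `pdfrw` is installed; `PdfReader`, `PdfWriter`, `addpages` and `write` are pdfrw's own, while `parse_page_spec`, `subset_pdf`, `merge_pdfs` and the "1-3,5" page syntax are made up for illustration.

```python
def parse_page_spec(spec):
    """Turn a 1-based spec such as "1-3,5" into 0-based page indices."""
    indices = []
    for part in spec.split(","):
        first, _, last = part.strip().partition("-")
        start = int(first)
        end = int(last) if last else start
        if start < 1 or end < start:
            raise ValueError(f"bad page range: {part!r}")
        indices.extend(range(start - 1, end))
    return indices


def subset_pdf(src, spec, out_path):
    """Write only the pages of `src` named in `spec` to `out_path`."""
    from pdfrw import PdfReader, PdfWriter  # needs `pip install pdfrw`

    pages = PdfReader(src).pages
    writer = PdfWriter()
    writer.addpages([pages[i] for i in parse_page_spec(spec)])
    writer.write(out_path)


def merge_pdfs(sources, out_path):
    """Concatenate several PDFs, in order, into `out_path`."""
    from pdfrw import PdfReader, PdfWriter

    writer = PdfWriter()
    for src in sources:
        writer.addpages(PdfReader(src).pages)
    writer.write(out_path)
```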

Here is a gentle HOWTO. You can use it to put matplotlib plots in reportlab PDFs, getting the best of two bad worlds.

PyPDF2 is another Python PDF library that looks messier.

Scribus is a reasonable open-source desktop publishing tool. If your content is not amenable to automatic layout, it is a good choice, e.g. for posters. It includes a Python API, albeit a reputedly quirky one, which is AFAICT Python 2. For all that, it’s a simple and interactive way of generating PDFs programmatically, so might be worth it.