The Living Thing / Notebooks :

PDFs

Using a thousand dollar computer to simulate a one cent piece of paper with zero day exploits

Command line tips

Reduce size of bloated PDF:

gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.5 \
    -dPDFSETTINGS=/ebook \
    -dNOPAUSE -dQUIET -dBATCH \
    -sOutputFile=output.pdf input.pdf

Or, wrapped up into a nice little script, ShrinkPDF (90 is the DPI here):

./shrinkpdf.sh in.pdf out.pdf 90

There is also cpdf and the GUI version Densify.
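
If the Ghostscript shrink command above needs to be driven from Python, say to batch over a directory, a thin wrapper is enough. This is a minimal sketch, assuming gs is on the PATH; the function names shrink_command and shrink_pdf are made up for illustration, and the preset names are the standard Ghostscript -dPDFSETTINGS values.

```python
import subprocess

# Values accepted by Ghostscript's -dPDFSETTINGS option.
PRESETS = ("screen", "ebook", "printer", "prepress", "default")


def shrink_command(src, dst, preset="ebook"):
    """Build the gs argument list from the notes above (hypothetical helper)."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset {preset!r}; expected one of {PRESETS}")
    return [
        "gs",
        "-sDEVICE=pdfwrite",
        "-dCompatibilityLevel=1.5",
        f"-dPDFSETTINGS=/{preset}",
        "-dNOPAUSE", "-dQUIET", "-dBATCH",
        f"-sOutputFile={dst}",
        src,
    ]


def shrink_pdf(src, dst, preset="ebook"):
    """Run Ghostscript; raises CalledProcessError if gs exits non-zero."""
    subprocess.run(shrink_command(src, dst, preset), check=True)


if __name__ == "__main__":
    # Only prints the command, so it is safe to run without gs installed.
    print(" ".join(shrink_command("input.pdf", "output.pdf")))
```

Passing the arguments as a list rather than a shell string means file names with spaces need no quoting.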

This works to concatenate PDFs:

gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite \
    -dPDFSETTINGS=/prepress -sOutputFile=output.pdf input*.pdf
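
One catch with input*.pdf: the shell expands the glob in lexicographic order, so input10.pdf lands before input2.pdf and the pages come out shuffled. A rough sketch of one way around it is to sort the names "naturally" in Python and build the same gs command. The helper names natural_key and concat_command are invented here, not part of any tool.

```python
import re
from pathlib import Path


def natural_key(path):
    """Split the file name into text and integer chunks so 'input2' sorts before 'input10'."""
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r"(\d+)", Path(path).name)]


def concat_command(inputs, dst):
    """Build the gs concatenation command with the inputs in natural order."""
    ordered = sorted((str(p) for p in inputs), key=natural_key)
    return ["gs", "-dBATCH", "-dNOPAUSE", "-q", "-sDEVICE=pdfwrite",
            "-dPDFSETTINGS=/prepress", f"-sOutputFile={dst}", *ordered]


if __name__ == "__main__":
    cmd = concat_command(Path(".").glob("input*.pdf"), "output.pdf")
    # Inspect the order first; to execute, pass cmd to subprocess.run(cmd, check=True).
    print(" ".join(cmd))
```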

EPS to PDF conversion:

ps2pdf14 -dEPSCrop Logo.eps

Quick and dirty RGB-CMYK using recent ghostscript? Hmm.

Diff PDFs? “(Scientific) Reviews: you reviewed version A of a paper, and receive version B, and wonder what the changes are.”

pdfdiff!

Programmatic editing and generation

pdfrw:

pdfrw is a Python library and utility that reads and writes PDF files:

Here is a gentle HOWTO. You can use it to put matplotlib plots in reportlab PDFs.

svglib provides a pure Python library that can convert SVG to PDF, and a command line utility for the same, svg2pdf. You can use it to add SVGs to PDFs.

reportlab is far more famous and even includes a modicum of typesetting. It doesn’t edit PDFs so much, but it generates them pretty well. Its integration with other things is often a little weak, so do not expect dropping in LaTeX equations, HTML snippets and so on to be simple. OTOH it includes its own chart generation and so on. Use it if this is a natural way to make two columns for you:

import random

from reportlab.lib.styles import getSampleStyleSheet
from reportlab.platypus import (
    BaseDocTemplate,
    Frame,
    Paragraph,
    PageTemplate,
)

words = (
    "lorem ipsum dolor sit amet consetetur "
    "sadipscing elitr sed diam nonumy eirmod "
    "tempor invidunt ut labore et").split()

styles = getSampleStyleSheet()
elements = []

# showBoundary=1 draws the frame outlines, which makes the columns visible
doc = BaseDocTemplate(
    'basedoc.pdf',
    showBoundary=1)

# Two columns side by side, with a 12 pt gutter between them
frame1 = Frame(
    doc.leftMargin,
    doc.bottomMargin,
    doc.width / 2 - 6,
    doc.height,
    id='col1')
frame2 = Frame(
    doc.leftMargin + doc.width / 2 + 6,
    doc.bottomMargin,
    doc.width / 2 - 6,
    doc.height,
    id='col2')

# One long paragraph of filler text; it flows from col1 into col2
elements.append(
    Paragraph(
        " ".join(random.choice(words) for _ in range(1000)),
        styles['Normal']))
doc.addPageTemplates([
    PageTemplate(id='TwoCol', frames=[frame1, frame2]),
])

# Start the construction of the PDF
doc.build(elements)

pypdf2 is another Python PDF library.

scribus is a reasonable open-source desktop publishing tool. If your content is not amenable to automatic layout, it is a good choice, e.g. for posters. It includes a Python API, albeit a reputedly quirky one, which is AFAICT Python 2.

For all that, it’s the cleanest way I have yet seen of generating PDFs, so might be a goer for you.

crop marks

There are at least two options.

Neither makes it clear which of TrimBox, BleedBox, CropBox or ArtBox is what you truly want. This might clarify it slightly but I lost focus on this point.

Method A

You can add crop marks to a PDF document with various PDF tools, e.g. pdftk:

  1. Export the first page with crop marks to a PDF file (your_cropmark.pdf)
  2. Join it with your PDF document (your_document.pdf) in the command line:
pdftk your_document.pdf multistamp your_cropmark.pdf output result.pdf

Method B

NOTE: you can set PDF cropping values with GhostScript for printing:

  1. Create a plain text file (pdfmark.txt) with the right cropping values (e.g. this is a 5mm crop of A4):
[/CropBox [14.17 14.17 581.1 827.72] /PAGES pdfmark
  2. Convert your_document.pdf using the previous file (pdfmark.txt):
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
    $OPTIONS \
    -c .setpdfwrite \
    -sOutputFile=result.pdf \
    -f your_document.pdf \
    pdfmark.txt

Alternatively, pass the same pdfmark inline with -c instead of keeping it in a file, along these lines:

gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
    -sOutputFile=result.pdf \
    -c "[/CropBox [14.17 14.17 581.1 827.72] /PAGES pdfmark" \
    -f your_document.pdf
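
The CropBox numbers in Method B are PDF points (1/72 inch), given as left, bottom, right, top. For other paper sizes or margins the arithmetic is easy to script. This is a small sketch under the assumption of a uniform inset on all four sides; crop_box and crop_box_pdfmark are names invented for this note.

```python
MM_TO_PT = 72 / 25.4  # PDF units default to 1/72 inch; 1 inch = 25.4 mm


def crop_box(width_mm, height_mm, margin_mm):
    """Return [left, bottom, right, top] in points for a uniform inset of margin_mm."""
    margin = margin_mm * MM_TO_PT
    return [
        round(margin, 2),
        round(margin, 2),
        round(width_mm * MM_TO_PT - margin, 2),
        round(height_mm * MM_TO_PT - margin, 2),
    ]


def crop_box_pdfmark(width_mm, height_mm, margin_mm):
    """Format the box as the pdfmark line that goes into pdfmark.txt."""
    values = " ".join(str(v) for v in crop_box(width_mm, height_mm, margin_mm))
    return f"[/CropBox [{values}] /PAGES pdfmark"


if __name__ == "__main__":
    # A4 is 210 mm x 297 mm; a 5 mm inset reproduces the values used above.
    print(crop_box_pdfmark(210, 297, 5))
```

For A4 with a 5 mm inset this gives 14.17 14.17 581.1 827.72, matching the file contents in step 1.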

Color conversion

Nightmares. Colour management is generally complicated. Ghostscript colour management specifically is complicated, and has many moving parts plus rapid changes; e.g. the -dUseCIEColor option was removed in ghostscript 9, because it is apparently a noob feature which has broken functionality. Its replacement is broken documentation.

CMYK

NOTE II: optional color conversion of RGB PDF with GhostScript:

PDF to TIFF example.

gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
    -sColorConversionStrategy=CMYK \
    -sColorConversionStrategyForImages=CMYK \
    -dProcessColorModel=/DeviceCMYK \
    -dCompatibilityLevel=1.5 \
    -sOutputFile=result_cmyk.pdf \
    your_document.pdf

grayscale

gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
    -sColorConversionStrategy=Gray \
    -dProcessColorModel=/DeviceGray \
    -dCompatibilityLevel=1.5 \
    -sOutputFile=result_gray.pdf \
    your_document.pdf

Extracting data from

Tabula

Tabula is a tool for liberating data tables locked inside PDF files.

pdfplumber also exists but I have not used it.

Camelot is an OpenCV-backed table extractor. It has a browser-based GUI, Excalibur.

The creator explains:

There are both open (Tabula, pdfplumber) and closed-source tools that are widely used to extract data tables from PDFs. They either give a nice output or fail miserably. There is no in between. This is not helpful since everything in the real world, including PDF table extraction, is fuzzy.

tl;dr Tabula if you want the easiest possible experience at the cost of some power, otherwise Camelot/Excalibur.

camelot -o table.csv -f csv lattice file.pdf

Conversion

pdf2svg extracts editable vector diagrams from the PDF.