Processing PDFs in Go, with implementation code

I often run into problems with PDF file processing at work. There are a thousand ways to deal with PDFs, and every time I rack my brains fighting them to the end.

I am a gopher, so this article lists the PDF processing scenarios I have run into, from a gopher's perspective, such as:

pdf rendering
pdf verification
pdf with watermark
pdf get page count
pdf merge
pdf split
Fix damaged pdf
pdf to png
Identify fonts in pdf
pdf decryption
...

Most of this article is a scenario-by-scenario list of problems. You can jump to the parts you are interested in based on the titles.

I am not an expert on many PDF issues. If you have any questions, please feel free to get in touch with me.

1. Rendering an HTML page to PDF

To render an HTML page to PDF, I have used the following two approaches:

wkhtmltopdf
chromedp

1. Render pdf using wkhtmltopdf

wkhtmltopdf is a command-line tool for rendering HTML pages into PDFs, based on the Qt WebKit rendering engine.

Usage is simple:

## Print a static html page into pdf
$ wkhtmltopdf

## Print a web page into pdf
$ wkhtmltopdf

wkhtmltopdf has a lot of parameters, such as:

Supports sending http post requests, suitable for rendering custom developed web pages into pdf files:

$ wkhtmltopdf --help
...
--post <name> <value>      Add an additional post field (repeatable)
...

It can also run JavaScript to modify the HTML before rendering the PDF:

$ wkhtmltopdf --run-script "javascript:(function(){document.getElementsByClassName('dom_class_name')[0].style.display = 'none'}())" page

More detailed parameters can be found in the official website documentation.

If you use Go, there is also a third-party package that wraps wkhtmltopdf: go-wkhtmltopdf
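If a wrapper package is more than you need, the CLI can also be driven directly from Go with os/exec. The following is a minimal sketch, not the go-wkhtmltopdf API: it only uses the --post option shown above, and the helper name buildArgs and the form field order_id are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"sort"
)

// buildArgs assembles the wkhtmltopdf argument list: one "--post name value"
// triple per field (sorted by name for a stable order), then input and output.
func buildArgs(postFields map[string]string, input, output string) []string {
	names := make([]string, 0, len(postFields))
	for name := range postFields {
		names = append(names, name)
	}
	sort.Strings(names)

	var args []string
	for _, name := range names {
		args = append(args, "--post", name, postFields[name])
	}
	return append(args, input, output)
}

func main() {
	if len(os.Args) != 3 {
		fmt.Println("usage: render <page-url> <output.pdf>")
		return
	}
	// Hypothetical POST field, for illustration only.
	fields := map[string]string{"order_id": "42"}
	cmd := exec.Command("wkhtmltopdf", buildArgs(fields, os.Args[1], os.Args[2])...)
	// CombinedOutput captures wkhtmltopdf's progress and error messages.
	if out, err := cmd.CombinedOutput(); err != nil {
		fmt.Printf("wkhtmltopdf failed: %v\n%s", err, out)
		os.Exit(1)
	}
}
```

Because exec.Command passes each argument separately, no shell quoting is needed for values that contain spaces.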

2. Render pdf using chromedp

chromedp is a Go package that drives browsers supporting the Chrome DevTools Protocol in a faster and simpler way, without external dependencies (such as Selenium or PhantomJS).

How to use:

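package main

import (
  "context"
  "fmt"
  "os"

  "github.com/chromedp/cdproto/page"
  "github.com/chromedp/chromedp"
)

func main() {
  // The page URL and the output file name are elided in the source;
  // fill in both before running.
  err := ChromedpPrintPdf("", "/path/to/")
  if err != nil {
    fmt.Println(err)
    return
  }
}

// ChromedpPrintPdf loads url in headless Chrome and writes the printed PDF to the file named by to.
func ChromedpPrintPdf(url string, to string) error {
  ctx, cancel := chromedp.NewContext(context.Background())
  defer cancel()

  var buf []byte
  err := chromedp.Run(ctx, chromedp.Tasks{
    chromedp.Navigate(url),
    // Wait until the body element is ready before printing.
    chromedp.WaitReady("body"),
    chromedp.ActionFunc(func(ctx context.Context) error {
      var err error
      buf, _, err = page.PrintToPDF().
        Do(ctx)
      return err
    }),
  })
  if err != nil {
    return fmt.Errorf("chromedp Run failed, err: %w", err)
  }

  if err := os.WriteFile(to, buf, 0644); err != nil {
    return fmt.Errorf("write to file failed, err: %w", err)
  }

  return nil
}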

2. Add watermark to PDF

The tools I know of that support PDF watermarking are:

unidoc/unipdf
pdfcpu

unidoc/unipdf

unipdf, developed by the UniDoc platform, is a PDF library written in Go. It can be used through an API or a CLI, and supports the following functions:

$ unipdf -h
...
Available Commands:
 decrypt   Decrypt PDF files
 encrypt   Encrypt PDF files
 explode   Explodes the input file into separate single page PDF files
 extract   Extract PDF resources
 form    PDF form operations
 grayscale  Convert PDF to grayscale
 help    Help about any command
 info    Output PDF information
 merge    Merge PDF files
 optimize  Optimize PDF files
 passwd   Change PDF passwords
 rotate   Rotate PDF file pages
 search   Search text in PDF files
 split    Split PDF files
 version   Output version information and exit
 watermark  Add watermark to PDF files
...

Add a watermark in CLI mode:

$ unipdf watermark   -o 

Watermark successfully applied to 
Output file saved to

To add watermarks through the API, refer to the unipdf GitHub examples.

Note: UniDoc products require a paid license.

pdfcpu is a PDF processing library written in Go that can be used through either its API or its CLI.

Supports the following functions:

$ pdfcpu help
...
The commands are:

  attachments list, add, remove, extract embedded file attachments
  changeopw  change owner password
  changeupw  change user password
  decrypt   remove password protection
  encrypt   set password protection
  extract   extract images, fonts, content, pages, metadata
  fonts    install, list supported fonts
  grid    rearrange pages or images for enhanced browsing experience
  import   import/convert images to PDF
  info    print file info
  merge    concatenate 2 or more PDFs
  nup     rearrange pages or images for reduced number of pages
  optimize  optimize PDF by getting rid of redundant page resources
  pages    insert, remove selected pages
  paper    print list of supported paper sizes
  permissions list, set user access permissions
  rotate   rotate pages
  split    split multi-page PDF into several PDFs according to split span
  stamp    add, remove, update text, image or PDF stamps for selected pages
  trim    create trimmed version of selected pages
  validate  validate PDF against PDF 32000-1:2008 (PDF 1.7)
  version   print version
  watermark  add, remove, update text, image or PDF watermarks for selected pages
...

Use the CLI tool to add watermarks as images:

$ pdfcpu watermark add -mode image 'voucher_watermark.png' 's:1 abs, rot:0'

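Add a watermark through the API:

package main

import (
  "log"

  "github.com/pdfcpu/pdfcpu/pkg/api"
  "github.com/pdfcpu/pdfcpu/pkg/pdfcpu"
)

func main() {
  // The function names below follow the pdfcpu v0.3.x API (an assumption, since
  // the source lost them); later releases moved and renamed these functions.
  // The watermark image and the input/output file names are elided in the source.
  onTop := false
  wm, err := pdfcpu.ParseImageWatermarkDetails("", "s:1 abs, rot:0", onTop)
  if err != nil {
    log.Fatal(err)
  }
  if err := api.AddWatermarksFile("", "", nil, wm, nil); err != nil {
    log.Fatal(err)
  }
}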

3. Merge PDFs

cpdf
unipdf
pdfcpu

1. Use cpdf to merge pdf

cpdf is a free, open source PDF command-line tool with rich functions, such as:

Merge PDF files together, or split them apart
Encrypt and decrypt
Scale, crop and rotate pages
Read and set document info and metadata
Copy, add or remove bookmarks
Stamp logos, text, dates, page numbers
Add or remove attachments
Losslessly compress PDF files

Merge pdf:

$ cpdf -merge   -o

2. Use unipdf to merge pdf

$ unipdf merge

To merge PDFs through the API, refer to the unipdf GitHub examples.

3. Use pdfcpu to merge pdf

$ pdfcpu merge

Note: pdfcpu only supports PDF files up to version 1.7
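The merge commands differ mainly in where the output file goes: cpdf takes it after -o, while pdfcpu expects it before the input files. A small Go sketch of that difference, calling the CLIs through os/exec, might look like this. The helper name mergeArgs is hypothetical, and unipdf is left out because its argument order is not shown above.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// mergeArgs returns the argument list for merging inputs into output with the
// given tool. cpdf takes the output after -o; pdfcpu takes it first.
func mergeArgs(tool string, inputs []string, output string) ([]string, error) {
	switch tool {
	case "cpdf":
		args := append([]string{"-merge"}, inputs...)
		return append(args, "-o", output), nil
	case "pdfcpu":
		return append([]string{"merge", output}, inputs...), nil
	default:
		return nil, fmt.Errorf("unsupported tool %q", tool)
	}
}

func main() {
	if len(os.Args) < 5 {
		fmt.Println("usage: merge <cpdf|pdfcpu> <output.pdf> <input.pdf> <input.pdf>...")
		return
	}
	tool := os.Args[1]
	args, err := mergeArgs(tool, os.Args[3:], os.Args[2])
	if err != nil {
		fmt.Println(err)
		os.Exit(1)
	}
	if out, err := exec.Command(tool, args...).CombinedOutput(); err != nil {
		fmt.Printf("%s failed: %v\n%s", tool, err, out)
		os.Exit(1)
	}
}
```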

4. Split PDF

cpdf
unipdf
pdfcpu

1. Use cpdf to split pdf

## Split into single-page PDFs, page by page
$ cpdf -split  1 even -chunk 1 -o ./out%%%.pdf

2. Use unipdf to split pdf

## Split off the first page
$ unipdf split   1-1

To split PDFs through the API, refer to the unipdf GitHub examples.

3. Use pdfcpu to split pdf

$ pdfcpu split  .

5. PDF to pictures

mupdf
xpdf

1. Use mupdf to convert pdf to pictures

MuPDF is a lightweight PDF, XPS, and E-book viewer.
MuPDF consists of a software library, command line tools, and viewers for various platforms.

After downloading mupdf, you can get some tools, such as:

mupdf
pdfdraw
pdfinfo
pdfclean
pdfextract
pdfshow
xpsdraw

Of these, pdfdraw can be used to convert a PDF to images:

$ pdfdraw -o out%

Note: mupdf does not support mac OS

2. Use xpdf to convert pdf to pictures

xpdf is a free PDF toolkit that covers text parsing, image conversion, HTML conversion, and more.

After downloading the software package, you can get a series of tools:

pdfdetach
pdffonts
pdfimages
pdfinfo
pdftohtml
pdftopng
pdftoppm
pdftops
pdftotext

The names give a rough idea of what each tool does.

## Use pdftopng to convert pdf to png
$ pdftopng  out-prefix
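When calling pdftopng from Go, it helps to know where the images end up. As far as I know from the xpdf manual, pdftopng writes one file per page named <root>-NNNNNN.png, where NNNNNN is the zero-padded page number. The sketch below relies on that naming; the helper name pngName is hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// pngName returns the file pdftopng is expected to write for a page:
// <root>-NNNNNN.png with a six-digit, zero-padded page number.
func pngName(root string, page int) string {
	return fmt.Sprintf("%s-%06d.png", root, page)
}

func main() {
	if len(os.Args) != 3 {
		fmt.Println("usage: topng <input.pdf> <png-root>")
		return
	}
	pdf, root := os.Args[1], os.Args[2]
	if out, err := exec.Command("pdftopng", pdf, root).CombinedOutput(); err != nil {
		fmt.Printf("pdftopng failed: %v\n%s", err, out)
		os.Exit(1)
	}
	// Check page 1 explicitly, then count everything that was written.
	if _, err := os.Stat(pngName(root, 1)); err != nil {
		fmt.Println("first page image not found:", err)
	}
	pages, err := filepath.Glob(root + "-*.png")
	if err != nil {
		fmt.Println("bad output pattern:", err)
		os.Exit(1)
	}
	fmt.Printf("found %d page image(s)\n", len(pages))
}
```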

6. PDF decryption

A common scenario is that reading a PDF file fails with an error: the file is encrypted.

But how do you solve this without a password?

Decryption using qpdf

qpdf is a PDF tool with a command-line interface. Forcing decryption with qpdf succeeds in some cases but fails in others.

$ qpdf --decrypt

Decryption using pdfcpu

$ pdfcpu decrypt

When you do have the password, you can decrypt with it:

Decrypt pdf using unipdf

$ unipdf decrypt -p pass -o

7. PDF recognition

Recognition scenarios come up often: identifying whether a file is a PDF file, identifying the text in a PDF, identifying the pictures in a PDF, and so on.
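For the first case, checking whether a file is a PDF at all, one cheap approach is to look for the "%PDF-" header at the start of the file. This is only a sketch under that assumption: it does not validate the rest of the file, and it rejects files that carry junk bytes before the header. The helper name hasPDFHeader is hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
)

// hasPDFHeader reports whether head starts with the "%PDF-" marker.
// It checks the leading bytes only and says nothing about the rest of the file.
func hasPDFHeader(head []byte) bool {
	return bytes.HasPrefix(head, []byte("%PDF-"))
}

func main() {
	if len(os.Args) != 2 {
		fmt.Println("usage: ispdf <file>")
		return
	}
	f, err := os.Open(os.Args[1])
	if err != nil {
		fmt.Println("open failed:", err)
		os.Exit(1)
	}
	defer f.Close()

	head := make([]byte, 5)
	// io.ReadFull fails on files shorter than 5 bytes, which cannot carry the header.
	if _, err := io.ReadFull(f, head); err != nil {
		fmt.Println("not a PDF (too short or unreadable):", err)
		return
	}
	fmt.Println("looks like a PDF:", hasPDFHeader(head))
}
```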

1. Identify the text in pdf

Here, xpdf is used to extract the text from the PDF, and string operations or regular expressions are then applied for business analysis.

Use xpdf/pdftotext to parse text in pdf

$ pdftotext
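The "extract the text, then post-process it" flow could be sketched in Go as follows. It assumes pdftotext accepts "-" as the text file to write to standard output, and the pattern "Invoice No: <digits>" is a hypothetical business rule, as are the names invoiceRe and findInvoiceNo.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"regexp"
)

// invoiceRe is a hypothetical business pattern: "Invoice No:" followed by digits.
var invoiceRe = regexp.MustCompile(`Invoice No:\s*(\d+)`)

// findInvoiceNo returns the first invoice number found in text, or "" if none.
func findInvoiceNo(text string) string {
	m := invoiceRe.FindStringSubmatch(text)
	if m == nil {
		return ""
	}
	return m[1]
}

func main() {
	if len(os.Args) != 2 {
		fmt.Println("usage: invoice <input.pdf>")
		return
	}
	// "-" as the text file asks pdftotext to write to standard output.
	out, err := exec.Command("pdftotext", os.Args[1], "-").Output()
	if err != nil {
		fmt.Println("pdftotext failed:", err)
		os.Exit(1)
	}
	fmt.Println("invoice number:", findInvoiceNo(string(out)))
}
```

Keeping the matching logic in its own function makes it easy to test against saved text without running pdftotext each time.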

Use unipdf to parse text in pdf

$ unipdf extract text

To parse PDF text through the API, refer to the unipdf GitHub examples.

Use coordinate information to parse pdf data

The approaches above first extract the PDF text and then process it according to business needs.

Another way is to parse the PDF by coordinate position, using pdflib/tet. This method is more flexible and general.

## Enter a set of coordinates to parse the data in pdf according to the coordinates
$ tet --pageopt "includebox={{38 707.93 243.91 716.93}}"

The coordinates can be obtained by having tet generate a tetml file that contains coordinate information:

$ tet --tetml

Of course, you can also obtain the coordinate information of the data in a PDF in other ways, for example with nodejs.

Note: pdflib/tet is paid software, but according to the official documentation, its basic functions work without a license for PDF files of no more than 10 pages and 1 MB.

pdflib/tet provides command-line tools and SDKs for multiple languages, such as C/C++/Java/.NET/Perl/PHP/Python/Ruby/Swift, but Go is not supported at present, so a gopher has only two options: the CLI or CGO.
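For the CLI route, the main work on the Go side is building the --pageopt value from a rectangle. The sketch below reproduces the includebox string shown earlier; passing the input file as the last argument is an assumption about the tet command line, and the helper name includeBox is hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strconv"
)

// includeBox formats a rectangle (lower-left x/y, upper-right x/y) as the
// includebox page option used in the command above.
func includeBox(llx, lly, urx, ury float64) string {
	f := func(v float64) string { return strconv.FormatFloat(v, 'f', -1, 64) }
	return "includebox={{" + f(llx) + " " + f(lly) + " " + f(urx) + " " + f(ury) + "}}"
}

func main() {
	if len(os.Args) != 2 {
		fmt.Println("usage: tetbox <input.pdf>")
		return
	}
	opt := includeBox(38, 707.93, 243.91, 716.93)
	// Assumption: tet takes the input file as its last argument.
	out, err := exec.Command("tet", "--pageopt", opt, os.Args[1]).CombinedOutput()
	if err != nil {
		fmt.Printf("tet failed: %v\n%s", err, out)
		os.Exit(1)
	}
	fmt.Printf("%s", out)
}
```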

8. Repair damaged PDF files

Some PDF files display normally when opened on a computer, but fail when checked in code. For example, in Go, try to parse such a (corrupted) PDF with a third-party library:

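package main

import (
  "fmt"

  // The import path is truncated in the source and is left as found. The
  // two-value Open call and the error text below match rsc.io/pdf, which is
  // assumed here.
  "//pdf"
)

func main() {
  filePath := "path/to/your/"
  _, err := pdf.Open(filePath)
  if err != nil {
    fmt.Println("open pdf failed,err:", err.Error())
    return
  }
}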

Running it produces a result like this:

open pdf failed,err: malformed PDF: cross-reference table not found: {5 0 obj}<</Contents 6 0 R /Group <</CS /DeviceRGB /S /Transparency /Type /Group>> /MediaBox [0 0 595.27600098 841.89001465] /Parent 3 0 R /Type /Page>>

The file opens normally on the computer, but the program fails to read it!

If you then open the PDF on your computer, save it as a new PDF file, and run the code against the new file, you will find that it has been repaired!

Great, the problem is solved!

But wait: if I have 1000 PDF files, do I have to open and re-save them one by one? That is unbearable, so a batch repair function would be great.

After searching online for a long time, I found about three solutions:

Use the Acrobat SDK and call its Save As function, which has the same effect as opening the file on the computer and saving it as a new one
Use ghostscript to repair the PDF
Use mupdf to repair the PDF

Of these, I only verified the third method, and it works. I used the mupdf-0.9-linux-amd64 release.

The downloaded package contains an executable named pdfclean:

$ pdfclean  

+ pdf/pdf_xref.c:160: pdf_read_trailer(): cannot recognize xref format: '%'
| pdf/pdf_xref.c:481: pdf_load_xref(): cannot read trailer
\ pdf/pdf_xref.c:537: pdf_open_xref_with_stream(): trying to repair

Judging from the output, mupdf tried to repair the file.

The new PDF file then opens normally with the previous Go code.

All that is left is to write a bash script for batch repair, and the goal is achieved!
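The batch step does not have to be bash. A rough Go equivalent is sketched below, assuming pdfclean takes the input file followed by the output file and that the repaired copies go into a separate directory. The helper name repairedPath and the directory layout are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// repairedPath maps an input PDF to a file of the same name inside outDir.
func repairedPath(in, outDir string) string {
	return filepath.Join(outDir, filepath.Base(in))
}

func main() {
	if len(os.Args) != 3 {
		fmt.Println("usage: repair <input-dir> <output-dir>")
		return
	}
	inDir, outDir := os.Args[1], os.Args[2]
	files, err := filepath.Glob(filepath.Join(inDir, "*.pdf"))
	if err != nil {
		fmt.Println("bad input pattern:", err)
		os.Exit(1)
	}
	if err := os.MkdirAll(outDir, 0755); err != nil {
		fmt.Println("cannot create output dir:", err)
		os.Exit(1)
	}
	for _, in := range files {
		out := repairedPath(in, outDir)
		// Assumption: pdfclean <input> <output> writes the repaired copy to <output>.
		if msg, err := exec.Command("pdfclean", in, out).CombinedOutput(); err != nil {
			fmt.Printf("repair failed for %s: %v\n%s", in, err, msg)
			continue // keep going so one bad file does not stop the batch
		}
		fmt.Println("repaired:", out)
	}
}
```

Writing to a separate directory keeps the original files untouched in case a repair makes things worse.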

9. Identify the font information of a PDF file

Sometimes, to keep the text fonts consistent across multiple PDFs, you need to analyze which fonts a PDF uses. For this you can use xpdf/pdffonts:

$ pdffonts 
name                 type       encoding     emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
NimbusSanL-Regu           CID TrueType   Identity-H    yes no yes   10 0
NimbusSanL-Bold           CID TrueType   Identity-H    yes no yes   20 0
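To compare fonts across many files, the table above can be parsed in Go. The sketch below takes the first column of every row after the dashed separator line. It assumes font names contain no spaces, and the helper name fontNames is hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// fontNames extracts the first column (the font name) from pdffonts output,
// skipping everything up to and including the dashed separator line.
func fontNames(output string) []string {
	var names []string
	started := false
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		line := sc.Text()
		if !started {
			started = strings.HasPrefix(line, "---")
			continue
		}
		if fields := strings.Fields(line); len(fields) > 0 {
			names = append(names, fields[0])
		}
	}
	return names
}

func main() {
	if len(os.Args) != 2 {
		fmt.Println("usage: fonts <input.pdf>")
		return
	}
	out, err := exec.Command("pdffonts", os.Args[1]).Output()
	if err != nil {
		fmt.Println("pdffonts failed:", err)
		os.Exit(1)
	}
	for _, name := range fontNames(string(out)) {
		fmt.Println(name)
	}
}
```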

Other libraries:

PDF-Writer
An open source C++ library that supports creating PDFs, merging PDFs, image, watermark, and text operations, and more.

For a gopher, using this library means wrapping it in a layer of CGO code.

rsc/pdf
A PDF library implemented in Go that can read PDF information such as content, page count, and fonts. For details, refer to its documentation.

These third-party libraries each have their own strengths, and many functions are duplicated across them. Which problems you run into depends on your actual situation.

I hope this summary is helpful to readers.

References:

wkhtmltopdf
xpdf
cpdf
qpdf
unidoc
pdflib/tet
pdfwriter
mupdf
pdfcpu
