I often run into problems processing PDF files at work. There are a thousand ways to deal with PDFs, and every time I rack my brains fighting them to the end.
I am a gopher, so this article lists the PDF processing scenarios I have encountered, from a gopher's perspective, such as:
Rendering PDFs
Validating PDFs
Watermarking PDFs
Getting the page count of a PDF
Merging PDFs
Splitting PDFs
Repairing damaged PDFs
Converting PDF to PNG
Identifying the fonts in a PDF
Decrypting PDFs
...
Most of this article is a list of problems grouped by scenario. You can jump to the parts you are interested in by title.
I am no expert on many PDF topics. If you have any questions, please feel free to get in touch.
1. Rendering an HTML page to PDF
To render an HTML page to PDF, I have used the following two tools:
- wkhtmltopdf
- chromedp
1. Render PDF using wkhtmltopdf
wkhtmltopdf is a command line tool for rendering HTML pages into PDFs, based on the Qt WebKit rendering engine.
Usage is fairly simple:
## Print a static HTML page to PDF
$ wkhtmltopdf
## Print a web page to PDF
$ wkhtmltopdf
wkhtmltopdf has many options, for example:
It can send HTTP POST requests, which suits rendering custom-built web pages into PDF files:
$ wkhtmltopdf --help
...
--post <name> <value>    Add an additional post field (repeatable)
...
It can run JavaScript to modify the HTML before rendering the PDF:
$ wkhtmltopdf --run-script "javascript:(function(){document.getElementsByClassName('dom_class_name')[0].style.display = 'none'}())" page
More options are described in the official website documentation.
If you use Go, there is also a third-party package that wraps wkhtmltopdf: go-wkhtmltopdf
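If you would rather not pull in a wrapper package, wkhtmltopdf can also be called through os/exec. The following is only a minimal sketch: the helper name buildArgs, the example URL, and the output file name are made up for illustration, and it assumes the wkhtmltopdf binary is on PATH and accepts options before the input and output arguments.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// buildArgs puts any extra options first, followed by the input
// (a local HTML file or a URL) and the output PDF path.
func buildArgs(input, output string, opts ...string) []string {
	args := append([]string{}, opts...)
	return append(args, input, output)
}

func main() {
	// Hypothetical input and output, for illustration only.
	args := buildArgs("https://example.com", "out.pdf")

	bin, err := exec.LookPath("wkhtmltopdf")
	if err != nil {
		fmt.Println("wkhtmltopdf not found in PATH; would run: wkhtmltopdf", args)
		return
	}

	cmd := exec.Command(bin, args...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Println("wkhtmltopdf failed:", err)
	}
}
```

Keeping the argument building separate from the process call makes it easy to check the command line without having wkhtmltopdf installed.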
2. Render PDF using chromedp
chromedp is a Go package that drives browsers supporting the Chrome DevTools Protocol in a faster, simpler way, without external dependencies (such as Selenium or PhantomJS).
Usage:
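package main

import (
	"context"
	"fmt"
	"os"

	"github.com/chromedp/cdproto/page"
	"github.com/chromedp/chromedp"
)

func main() {
	// The URL and the output file name are left as in the source.
	if err := ChromedpPrintPdf("", "/path/to/"); err != nil {
		fmt.Println(err)
	}
}

// ChromedpPrintPdf loads url in headless Chrome and writes the printed PDF to the file named by to.
func ChromedpPrintPdf(url string, to string) error {
	ctx, cancel := chromedp.NewContext(context.Background())
	defer cancel()

	var buf []byte
	err := chromedp.Run(ctx, chromedp.Tasks{
		chromedp.Navigate(url),
		// Wait for <body> before printing (WaitReady is assumed here; another chromedp wait action may fit better).
		chromedp.WaitReady("body"),
		chromedp.ActionFunc(func(ctx context.Context) error {
			var err error
			buf, _, err = page.PrintToPDF().Do(ctx)
			return err
		}),
	})
	if err != nil {
		return fmt.Errorf("chromedp Run failed, err: %w", err)
	}
	if err := os.WriteFile(to, buf, 0644); err != nil {
		return fmt.Errorf("write to file failed, err: %w", err)
	}
	return nil
}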
2. Adding a watermark to a PDF
The tools I know of that support PDF watermarking are:
- unidoc/unipdf
- pdfcpu
unidoc/unipdf
unipdf, developed by the unidoc platform, is a PDF library written in Go. It provides both an API and a CLI, and supports the following functions:
$ unipdf -h
...
Available Commands:
  decrypt     Decrypt PDF files
  encrypt     Encrypt PDF files
  explode     Explodes the input file into separate single page PDF files
  extract     Extract PDF resources
  form        PDF form operations
  grayscale   Convert PDF to grayscale
  help        Help about any command
  info        Output PDF information
  merge       Merge PDF files
  optimize    Optimize PDF files
  passwd      Change PDF passwords
  rotate      Rotate PDF file pages
  search      Search text in PDF files
  split       Split PDF files
  version     Output version information and exit
  watermark   Add watermark to PDF files
...
Add a watermark in CLI mode:
$ unipdf watermark -o
Watermark successfully applied to
Output file saved to
To add a watermark through the API, refer to the unipdf GitHub examples.
Note: unidoc products require a paid license.
pdfcpu
pdfcpu is a PDF processing library written in Go that provides both an API and a CLI.
It supports the following commands:
$ pdfcpu help
...
The commands are:
   attachments list, add, remove, extract embedded file attachments
   changeopw   change owner password
   changeupw   change user password
   decrypt     remove password protection
   encrypt     set password protection
   extract     extract images, fonts, content, pages, metadata
   fonts       install, list supported fonts
   grid        rearrange pages or images for enhanced browsing experience
   import      import/convert images to PDF
   info        print file info
   merge       concatenate 2 or more PDFs
   nup         rearrange pages or images for reduced number of pages
   optimize    optimize PDF by getting rid of redundant page resources
   pages       insert, remove selected pages
   paper       print list of supported paper sizes
   permissions list, set user access permissions
   rotate      rotate pages
   split       split multi-page PDF into several PDFs according to split span
   stamp       add, remove, update text, image or PDF stamps for selected pages
   trim        create trimmed version of selected pages
   validate    validate PDF against PDF 32000-1:2008 (PDF 1.7)
   version     print version
   watermark   add, remove, update text, image or PDF watermarks for selected pages
...
Use the CLI tool to add an image watermark:
$ pdfcpu watermark add -mode image 'voucher_watermark.png' 's:1 abs, rot:0'
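Call the API to add a watermark:
package main

import (
	"log"

	"github.com/pdfcpu/pdfcpu/pkg/api"
	"github.com/pdfcpu/pdfcpu/pkg/pdfcpu"
)

func main() {
	onTop := false // false: watermark below the page content; true: stamp on top of it
	// The function names are assumed from the older pdfcpu API that matches these imports.
	// The image, input, and output file names are left empty as in the source.
	wm, err := pdfcpu.ParseImageWatermarkDetails("", "s:1 abs, rot:0", onTop)
	if err != nil {
		log.Fatal(err)
	}
	if err := api.AddWatermarksFile("", "", nil, wm, nil); err != nil {
		log.Fatal(err)
	}
}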
3. Merging PDFs
- cpdf
- unipdf
- pdfcpu
1. Use cpdf to merge PDFs
cpdf is a free, open source PDF command line tool with rich functionality, such as:
- Merge PDF files together, or split them apart
- Encrypt and decrypt
- Scale, crop and rotate pages
- Read and set document info and metadata
- Copy, add or remove bookmarks
- Stamp logos, text, dates, page numbers
- Add or remove attachments
- Losslessly compress PDF files
Merge pdf:
$ cpdf -merge -o
2. Use unipdf to merge pdf
$ unipdf merge
To merge PDFs through the API, refer to the unipdf GitHub examples.
3. Use pdfcpu to merge pdf
$ pdfcpu merge
Note: pdfcpu only supports PDF files up to version 1.7 (its validate command checks against PDF 32000-1:2008).
4. Split PDF
- cpdf
- unipdf
- pdfcpu
1. Use cpdf to split a PDF
## Split into single-page PDFs, page by page
$ cpdf -split 1 even -chunk 1 -o ./out%%%.pdf
2. Use unipdf to split a PDF
## Split off the first page
$ unipdf split 1-1
To split PDFs through the API, refer to the unipdf GitHub examples.
3. Use pdfcpu to split pdf
$ pdfcpu split .
5. Converting PDF to images
- mupdf
- xpdf
1. Use mupdf to convert PDF to images
MuPDF is a lightweight PDF, XPS, and E-book viewer.
MuPDF consists of a software library, command line tools, and viewers for various platforms.
After downloading mupdf, you can get some tools, such as:
mupdf
pdfdraw
pdfinfo
pdfclean
pdfextract
pdfshow
xpsdraw
Of these, pdfdraw can be used to convert a PDF to images:
$ pdfdraw -o out%
Note: mupdf does not support macOS.
2. Use xpdf to convert PDF to images
xpdf is a free PDF toolkit that includes text extraction, image conversion, HTML conversion, and more.
After downloading the software package, you can get a series of tools:
pdfdetach
pdffonts
pdfimages
pdfinfo
pdftohtml
pdftopng
pdftoppm
pdftops
pdftotext
The names roughly indicate what each tool does.
## Use pdftopng to convert PDF to PNG
$ pdftopng out-prefix
6. PDF decryption
A common scenario: reading a PDF file fails with an error saying the file is encrypted.
How can this be solved without the password?
- Decryption using qpdf
qpdf is a PDF tool with a command line interface. Forcing decryption with qpdf succeeds in some cases but fails in others.
$ qpdf --decrypt
- Decryption using pdfcpu
$ pdfcpu decrypt
When there is a password, you can use it to decrypt the file:
Decrypt PDF using unipdf
$ unipdf decrypt -p pass -o
7. PDF recognition
Common scenarios include identifying whether a file is a PDF, extracting the text in a PDF, extracting the images in a PDF, and so on.
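For the first scenario, a cheap first check is to look at the file header. The sketch below only tests for the "%PDF-" marker at the very start of the file. It says nothing about whether the rest of the file is valid, and it will reject files that carry junk bytes before the marker. The function names are my own.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
)

// looksLikePDF reports whether data starts with the "%PDF-" header marker.
func looksLikePDF(data []byte) bool {
	return bytes.HasPrefix(data, []byte("%PDF-"))
}

// isPDFFile reads only the first five bytes of the file at path.
func isPDFFile(path string) (bool, error) {
	f, err := os.Open(path)
	if err != nil {
		return false, err
	}
	defer f.Close()

	head := make([]byte, 5)
	n, err := io.ReadFull(f, head)
	// Short or empty files are not an error here, they are simply not PDFs.
	if err != nil && err != io.EOF && err != io.ErrUnexpectedEOF {
		return false, err
	}
	return looksLikePDF(head[:n]), nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: ispdf <file>")
		return
	}
	ok, err := isPDFFile(os.Args[1])
	if err != nil {
		fmt.Println("check failed:", err)
		return
	}
	fmt.Println("looks like a PDF:", ok)
}
```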
1. Identify the text in pdf
Here, xpdf is used to extract the text in the PDF; string operations or regular expressions then handle the business analysis.
Use xpdf/pdftotext to parse text in pdf
$ pdftotext
Use unipdf to parse text in pdf
$ unipdf extract text
To extract PDF text through the API, refer to the unipdf GitHub examples.
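Once the text is extracted, the "string operations or regular expressions" step might look like the sketch below. The invoice-style text and the "Invoice No:" label are hypothetical. In practice the input would be whatever pdftotext or unipdf produced for your documents.

```go
package main

import (
	"fmt"
	"regexp"
)

// invoiceNoRe matches a line such as "Invoice No: 12345678".
// The label and the digits-only format are assumptions for this example.
var invoiceNoRe = regexp.MustCompile(`(?m)^Invoice No:\s*(\d+)\s*$`)

// extractInvoiceNo returns the first invoice number found in text.
func extractInvoiceNo(text string) (string, bool) {
	m := invoiceNoRe.FindStringSubmatch(text)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	// Stand-in for text extracted from a PDF.
	text := "ACME Ltd.\nInvoice No: 12345678\nTotal: 99.00\n"
	if no, ok := extractInvoiceNo(text); ok {
		fmt.Println("invoice number:", no)
	} else {
		fmt.Println("no invoice number found")
	}
}
```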
Use coordinate information to parse pdf data
The approach above first extracts the PDF text and then processes it according to business needs.
Another approach is to parse the PDF by coordinate position. This is more flexible and general, and uses pdflib/tet.
## Pass a set of coordinates to extract the data inside that box
$ tet --pageopt "includebox={{38 707.93 243.91 716.93}}"
The coordinates themselves can be obtained by running tet to produce a TETML file that contains coordinate information:
$ tet --tetml
Of course, you can also obtain the coordinates of data in a PDF by other means, such as Node.js tools.
Note: pdflib/tet is paid software, but according to the official documentation, its basic functions require no license for PDF files of no more than 10 pages and under 1 MB.
pdflib/tet provides command line tools and SDKs for many languages, such as C/C++/Java/.NET/Perl/PHP/Python/Ruby/Swift, but Go is not currently supported, so a gopher has only two options: the CLI or CGO.
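For the CLI route, the includebox value can be built from coordinates in Go before calling tet. This is a small sketch: the Box type and pageOpt helper are my own names, and treating the four numbers as lower-left x/y followed by upper-right x/y is an assumption, so check it against the tet documentation.

```go
package main

import (
	"fmt"
	"strconv"
)

// Box is a rectangle in PDF coordinates.
// Assumption: (X1, Y1) is the lower-left corner and (X2, Y2) the upper-right.
type Box struct{ X1, Y1, X2, Y2 float64 }

// num formats a coordinate with the shortest exact decimal representation.
func num(v float64) string {
	return strconv.FormatFloat(v, 'f', -1, 64)
}

// pageOpt builds the value passed to tet's --pageopt option.
func pageOpt(b Box) string {
	return fmt.Sprintf("includebox={{%s %s %s %s}}",
		num(b.X1), num(b.Y1), num(b.X2), num(b.Y2))
}

func main() {
	opt := pageOpt(Box{38, 707.93, 243.91, 716.93})
	// Print the command line instead of running it, since tet may not be installed.
	fmt.Println("tet --pageopt", strconv.Quote(opt))
}
```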
8. Repair damaged PDF files
Some PDF files display normally when opened on a computer, yet fail when code tries to parse them. For example, in Go, try parsing such a (corrupted) PDF with a third-party library:
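package main

import (
	"fmt"

	"rsc.io/pdf" // import path assumed from the error message below; the path is garbled in the source
)

func main() {
	filePath := "path/to/your/"
	_, err := pdf.Open(filePath)
	if err != nil {
		fmt.Println("open pdf failed,err:", err.Error())
		return
	}
}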
Running it produces the following result:
open pdf failed,err: malformed PDF: cross-reference table not found: {5 0 obj}<</Contents 6 0 R /Group <</CS /DeviceRGB /S /Transparency /Type /Group>> /MediaBox [0 0 595.27600098 841.89001465] /Parent 3 0 R /Type /Page>>
The file opens normally on the computer, but the program fails to read it!
If you now open the PDF on your computer, save it as a new PDF file, and run the code against the new file, you will find it has been repaired!
Great, problem solved!
But wait: if I have 1000 PDF files, do I have to open and re-save them one by one? Who could bear that? A batch repair function would be great.
After a long search online, I found roughly three solutions:
- Use the Acrobat SDK and call its Save As function, which has the same effect as opening the file and saving it as a new one on the computer
- Use ghostscript to repair the PDF
- Use mupdf to repair the PDF
I have only verified that the third method works, using the mupdf-0.9-linux-amd64 release.
The downloaded package includes an executable named pdfclean:
$ pdfclean
+ pdf/pdf_xref.c:160: pdf_read_trailer(): cannot recognize xref format: '%'
| pdf/pdf_xref.c:481: pdf_load_xref(): cannot read trailer
\ pdf/pdf_xref.c:537: pdf_open_xref_with_stream(): trying to repair
Judging from the output, mupdf attempted to repair the file.
Opening the new PDF file with the earlier Go code now works normally.
All that remains is to write a bash script for batch repair, and the goal is achieved!
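The batch step does not have to be bash. Since this article is written from a gopher's perspective, here is the same loop sketched in Go instead. It assumes pdfclean is on PATH and takes the input file followed by the output file, as in the mupdf release used above. The program name, directory arguments, and helper name are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// repairedPath maps an input such as "in/a.pdf" to "<outDir>/a.pdf".
func repairedPath(in, outDir string) string {
	return filepath.Join(outDir, filepath.Base(in))
}

func main() {
	if len(os.Args) != 3 {
		fmt.Println("usage: batchrepair <inDir> <outDir>")
		return
	}
	inDir, outDir := os.Args[1], os.Args[2]

	if err := os.MkdirAll(outDir, 0o755); err != nil {
		fmt.Println("create output dir failed:", err)
		return
	}
	files, err := filepath.Glob(filepath.Join(inDir, "*.pdf"))
	if err != nil {
		fmt.Println("list input files failed:", err)
		return
	}

	for _, in := range files {
		out := repairedPath(in, outDir)
		// Run pdfclean on one file; keep going if a single file fails.
		if msg, err := exec.Command("pdfclean", in, out).CombinedOutput(); err != nil {
			fmt.Printf("repair %s failed: %v\n%s", in, err, msg)
			continue
		}
		fmt.Println("repaired:", out)
	}
}
```

Writing the repaired files to a separate directory keeps the originals untouched in case a repair makes things worse.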
9. Identify the font information of a PDF file
Sometimes, to keep the text fonts consistent across multiple PDFs, you need to find out which fonts a PDF uses. xpdf/pdffonts can do this analysis.
$ pdffonts
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
NimbusSanL-Regu                      CID TrueType      Identity-H       yes no  yes     10  0
NimbusSanL-Bold                      CID TrueType      Identity-H       yes no  yes     20  0
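To compare fonts across many files, the pdffonts output can be parsed in code. The sketch below takes the first column of each data row as the font name. It assumes the two-line header shown above and font names without spaces, which will not hold for every PDF.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// fontNames returns the first column of each data row in pdffonts output,
// skipping the header line and the dashed separator line.
func fontNames(output string) []string {
	var names []string
	sc := bufio.NewScanner(strings.NewReader(output))
	for line := 0; sc.Scan(); line++ {
		if line < 2 {
			continue
		}
		if fields := strings.Fields(sc.Text()); len(fields) > 0 {
			names = append(names, fields[0])
		}
	}
	return names
}

func main() {
	// Sample taken from the pdffonts output above, with column padding shortened.
	out := "name type encoding emb sub uni object ID\n" +
		"---- ---- -------- --- --- --- ---------\n" +
		"NimbusSanL-Regu CID TrueType Identity-H yes no yes 10 0\n" +
		"NimbusSanL-Bold CID TrueType Identity-H yes no yes 20 0\n"
	for _, name := range fontNames(out) {
		fmt.Println(name)
	}
}
```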
Other libraries:
PDF-Writer
This is an open source C++ library that supports creating PDFs, merging PDFs, image watermarks, text operations, and more.
For a gopher, using this library requires wrapping it in a layer of CGO code.
rsc/pdf
This is a PDF library implemented in Go that can read PDF information such as content, page count, and fonts. For details, refer to its documentation.
With so many third-party libraries introduced, there are clearly many ways to get things done, each with its own strengths. Some functions are duplicated across most of the libraries, and which problems you hit in practice depends on your situation.
I hope this summary is helpful to readers.
References:
wkhtmltopdf
xpdf
cpdf
qpdf
unidoc
pdflib/tet
pdfwriter
mupdf
pdfcpu