SoFunction
Updated on 2024-11-16

How to read PDF documents with Python

Real-world scenarios

Python engineers often have to parse and process PDF documents in their daily work. The actual requirements mainly fall into the following cases:

  • Extract text from PDF
  • Convert each page in a PDF to an image
  • Convert Word documents to PDF
  • PDF generation, editing, import and export
  • PDF online rendering

Except for the last item, which requires cooperation from the front end, all of these can be implemented directly on the Python side.

This walkthrough uses the pdfplumber library, which you can install in advance. Note that the library is designed for reading PDFs; it cannot write or edit them. In other words, this article focuses on a library for extracting PDF content.

> pip install pdfplumber

The pdfplumber library has the following features:

  • You can access the details of any element in the PDF object;
  • Text and tables can be extracted and the usage is simple;
  • Integrated visual debugging.

Python PDF Practical Coding

The basic code for reading a PDF is shown below.

import pdfplumber

with pdfplumber.open('./') as pdf:  # supply the path to a local PDF file here
    for page in pdf.pages:
        print(page.extract_text())

        # Print a separator line after each page
        print('---------- pagination ----------')

After importing the pdfplumber module, use pdfplumber.open('./') to open a local PDF file, iterate over all of its pages through pdf.pages, and call the extract_text() method of each page object to extract its text.
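A slightly more defensive variant of the loop above can be sketched as follows. It assumes that extract_text() may return None or an empty string for a page without a text layer (for example a scanned page). The helper names join_page_texts and read_pdf_text are hypothetical and are not part of pdfplumber.

```python
def join_page_texts(pages, separator="---------- pagination ----------"):
    """Join the text of page-like objects, with one separator line between pages.

    `pages` is any iterable whose items expose extract_text(), such as pdf.pages.
    """
    texts = []
    for page in pages:
        # Assumption: pages without a text layer yield None or an empty string
        texts.append(page.extract_text() or "")
    return f"\n{separator}\n".join(texts)


def read_pdf_text(path):
    """Open a PDF with pdfplumber and return the text of all its pages."""
    import pdfplumber  # imported lazily so join_page_texts has no dependency

    with pdfplumber.open(path) as pdf:
        return join_page_texts(pdf.pages)
```

Calling read_pdf_text() with the path of a local PDF would return one string for the whole document, which is easier to save or search than text printed page by page.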

The signature of the pdfplumber.open() method is shown below:

pdfplumber.open("file_name", password="password", laparams={"line_overlap": 0.7})

where each parameter is described as follows:

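  • file_name: the file name (path) of the PDF, a mandatory parameter;
  • password: the password for the PDF, if it is protected;
  • laparams: layout parameters, passed on to the layout analysis of the underlying pdfminer.six library.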

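In addition, older pdfplumber releases let you read a PDF with the pdfplumber.load() method, which also returns an instance of the pdfplumber.PDF class. Newer releases may no longer provide it, so pdfplumber.open() is the safer choice.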

A pdfplumber.PDF instance has two main properties:

  • .metadata: a dictionary of metadata key/value pairs taken from the PDF's Info trailer. It usually includes "CreationDate", "ModDate", "Producer", etc.;
  • .pages: a list of pdfplumber.Page instances, each representing one page of the PDF.
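Date entries such as "CreationDate" are normally stored as strings in the PDF date format, for example D:20241116093000+08'00'. The sketch below shows one way to turn such a value into a datetime. The helper names parse_pdf_date and creation_date are hypothetical, and the time-zone suffix is deliberately ignored to keep the example short.

```python
import re
from datetime import datetime

# PDF dates look like "D:YYYYMMDDHHmmSS" plus an optional time-zone suffix;
# every field after the year is optional.
_PDF_DATE = re.compile(r"^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?")


def parse_pdf_date(value):
    """Convert a PDF date string to a naive datetime, or None if it is invalid."""
    match = _PDF_DATE.match(value or "")
    if match is None:
        return None
    year, month, day, hour, minute, second = match.groups()
    try:
        return datetime(
            int(year),
            int(month or 1),
            int(day or 1),
            int(hour or 0),
            int(minute or 0),
            int(second or 0),
        )
    except ValueError:
        # Digits were present but do not form a real date (e.g. month 13)
        return None


def creation_date(metadata):
    """Look up "CreationDate" in a metadata dict such as pdf.metadata."""
    return parse_pdf_date(metadata.get("CreationDate"))
```

With an open document, creation_date(pdf.metadata) returns None when the key is missing or malformed, so callers do not need to check for the key first.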

The pdfplumber.Page instances mentioned above are the core of pdfplumber: most subsequent PDF operations revolve around the properties and methods of this class. Its important properties are shown below:

  • page_number: the sequential page number, starting at 1 for the first page;
  • width: the page width;
  • height: the page height;
  • .objects/.chars/.lines/.rects/.curves/.figures/.images: the objects of each kind found on the page.
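As a small illustration of these properties, the following sketch summarizes a single page. It assumes that .chars and .images are lists, and the function name describe_page is hypothetical. The function works on any object exposing the same attributes, which keeps it easy to try without a real PDF.

```python
def describe_page(page):
    """Summarize a page-like object that exposes pdfplumber's Page properties."""
    return {
        "page_number": page.page_number,  # 1-based
        "size": (page.width, page.height),
        "orientation": "landscape" if page.width > page.height else "portrait",
        "chars": len(page.chars),    # number of character objects
        "images": len(page.images),  # number of image objects
    }
```

With a document opened through pdfplumber.open(), describe_page(pdf.pages[0]) would describe the first page.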

The core methods are shown below:

  • extract_text(): extracts the text of a page;
  • extract_words(): extracts all words together with their related information;
  • extract_tables(): extracts the tables on the page.
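The return values of the last two methods are plain Python structures, so they are easy to post-process. The sketch below assumes that extract_words() yields dictionaries with at least the keys "text", "x0" and "top", and that each table from extract_tables() is a list of rows whose first row is a header. The helper names and the tolerance value are hypothetical.

```python
def group_words_into_lines(words, tolerance=3):
    """Group word dicts into lines of text, reading top to bottom, left to right.

    A word joins the current line when its "top" coordinate is within
    `tolerance` points of the first word of that line.
    """
    lines = []  # list of (top, [word, ...]) pairs
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and abs(word["top"] - lines[-1][0]) <= tolerance:
            lines[-1][1].append(word)
        else:
            lines.append((word["top"], [word]))
    return [
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["x0"]))
        for _, line in lines
    ]


def table_to_records(table):
    """Turn one table (a list of rows) into dicts keyed by the header row."""
    if not table:
        return []
    header, *rows = table
    return [dict(zip(header, row)) for row in rows]
```

For a page object, group_words_into_lines(page.extract_words()) gives a list of text lines, and [table_to_records(t) for t in page.extract_tables()] gives one list of records per table.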

[Figure: result of extract_text()]

[Figure: result of extract_words()]

[Figure: result of extract_tables(). Since the sample PDF contains no tables, the result is empty for every page.]

Supplement

Besides reading PDF files, Python can also perform other PDF operations, such as encrypting a PDF or rotating and overlaying pages. Sample code for these operations follows.

Rotate and overlay pages

import PyPDF2
from PyPDF2.pdf import PageObject

# Note: this listing uses the legacy PyPDF2 1.x method names

# Create a Reader object to read the PDF file
reader = PyPDF2.PdfFileReader('resources/')

# Create a Writer object to write the PDF file
writer = PyPDF2.PdfFileWriter()

# Loop over all pages of the PDF file
for page_num in range(reader.numPages):
    # Get the Page object for the given page index
    current_page = reader.getPage(page_num)  # type: PageObject
    if page_num % 2 == 0:
        # Odd-numbered pages (even zero-based index): rotate 90 degrees clockwise
        current_page.rotateClockwise(90)
    else:
        # Even-numbered pages: rotate 90 degrees counterclockwise
        current_page.rotateCounterClockwise(90)
    writer.addPage(current_page)

# Finally, add a blank page and rotate it 90 degrees
page = writer.addBlankPage()  # type: PageObject
page.rotateClockwise(90)

# Write the result to a PDF file with the Writer object's write method
with open('resources/', 'wb') as file:
    writer.write(file)
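The rotateClockwise() and rotateCounterClockwise() calls in the listing above are legacy PyPDF2 1.x names, which PyPDF2 3.x no longer accepts. The sketch below shows the same idea with the renamed API. It assumes PyPDF2 3.x, where PdfReader, PdfWriter, page.rotate(), add_page() and add_blank_page() are available. The function names rotation_for_index and rotate_alternating are hypothetical.

```python
def rotation_for_index(page_index):
    """Return the rotation in degrees for a zero-based page index.

    Even indexes (odd page numbers) turn clockwise, the others counterclockwise.
    """
    return 90 if page_index % 2 == 0 else -90


def rotate_alternating(src_path, dst_path):
    """Rotate pages in alternating directions and append a rotated blank page."""
    # Assumption: PyPDF2 3.x is installed
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    writer = PdfWriter()
    for index, page in enumerate(reader.pages):
        # rotate() turns clockwise; a negative angle turns counterclockwise
        page.rotate(rotation_for_index(index))
        writer.add_page(page)

    # Without an explicit size, the blank page takes the size of the last page
    blank = writer.add_blank_page()
    blank.rotate(90)

    with open(dst_path, "wb") as file:
        writer.write(file)
```

Keeping the angle decision in its own function makes the alternating rule easy to check without touching any PDF file.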

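Encrypt a PDF file

import PyPDF2

reader = PyPDF2.PdfFileReader('resources/')
writer = PyPDF2.PdfFileWriter()

# Copy every page from the reader to the writer
for page_num in range(reader.numPages):
    writer.addPage(reader.getPage(page_num))

# Encrypt the PDF with the encrypt method; its argument is the password to set
writer.encrypt('foobared')

with open('resources/', 'wb') as file:
    writer.write(file)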

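Add watermarks in batch

import PyPDF2
from PyPDF2.pdf import PageObject

# reader1 reads the document to be watermarked, reader2 the watermark file
reader1 = PyPDF2.PdfFileReader('resources/')
reader2 = PyPDF2.PdfFileReader('resources/')
writer = PyPDF2.PdfFileWriter()

# Get the watermark page
watermark_page = reader2.getPage(0)

for page_num in range(reader1.numPages):
    current_page = reader1.getPage(page_num)  # type: PageObject
    # Merge the original page with the watermark page
    current_page.mergePage(watermark_page)
    writer.addPage(current_page)

# Write the PDF to a file
with open('resources/', 'wb') as file:
    writer.write(file)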
