Real-world scenario
In their daily work, Python engineers often need to parse and process PDF documents. The actual requirements mainly fall into the following cases:
- Extract text from PDF
- Convert each page in a PDF to an image
- Convert Word documents to PDF
- PDF generation, editing, import and export
- PDF online rendering
Except for the last item, which requires cooperation from the front end, the rest can be implemented directly on the Python side.
This article uses the pdfplumber library. You can install it in advance, but note that pdfplumber is mainly used to read PDFs: it cannot write or edit them. In other words, this article focuses on a library for extracting PDF content.
> pip install pdfplumber -i /simple
The pdfplumber library has the following features:
- You can access the details of any element in the PDF object;
- Text and tables can be extracted and the usage is simple;
- Integrated visual debugging.
Python PDF Practical Coding
You can write the basic code for PDF manipulation below.
import pdfplumber

# './' stands for the path to a local PDF file
with pdfplumber.open('./') as pdf:
    for page in pdf.pages:
        print(page.extract_text())
        # Print a separator line after each page
        print('---------- pagination ----------')
After importing the pdfplumber module, use pdfplumber.open('./') to open the local PDF file, then iterate through all the pages via pdf.pages and call the .extract_text() method of each page object to extract its text.
The signature of the pdfplumber.open() method is shown below:
pdfplumber.open("file name", password="password", laparams={"line_overlap": 0.7})
where each parameter is described as follows:
- file_name: the file name (path) of the PDF, mandatory parameter;
- password: the password for the PDF;
- laparams: layout parameters.
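As a rough illustration of these parameters, the sketch below builds the optional keyword arguments and passes them to pdfplumber.open(). The helper names open_options and first_page_text are made up for this example and are not part of pdfplumber.

```python
def open_options(password=None, line_overlap=None):
    """Build optional keyword arguments for pdfplumber.open() (hypothetical helper)."""
    options = {}
    if password is not None:
        options["password"] = password
    if line_overlap is not None:
        # laparams is a dict of layout-analysis settings
        options["laparams"] = {"line_overlap": line_overlap}
    return options


def first_page_text(path, **options):
    """Open the PDF at `path` and return the text of its first page."""
    # Imported lazily so open_options() can be used without pdfplumber installed
    import pdfplumber

    with pdfplumber.open(path, **options) as pdf:
        return pdf.pages[0].extract_text()
```

A call might look like first_page_text(path, **open_options(password="secret", line_overlap=0.7)), where path and the password are placeholders for your own file.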
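In addition, older pdfplumber releases also provide a pdfplumber.load() method for reading PDFs, which likewise returns an instance of the pdfplumber.PDF class. Newer releases have dropped it in favor of pdfplumber.open().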
A pdfplumber.PDF instance has two main properties:
- .metadata: a dictionary of metadata key/value pairs taken from the PDF's Info trailer. It usually includes "CreationDate", "ModDate", "Producer", etc.;
- .pages: a list of pdfplumber.Page instances, each holding the information for one page of the PDF.
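The two properties can be read as in the minimal sketch below. The function name describe_pdf is hypothetical; it only assumes that the object passed in has .metadata and .pages, as a pdfplumber.PDF instance does.

```python
def describe_pdf(pdf):
    """Summarize an opened PDF from its .metadata and .pages properties."""
    metadata = pdf.metadata or {}  # tolerate PDFs without an Info dictionary
    return {
        "producer": metadata.get("Producer"),
        "created": metadata.get("CreationDate"),
        "page_count": len(pdf.pages),
    }
```

With a real file this would be called inside a `with pdfplumber.open(path) as pdf:` block, for example print(describe_pdf(pdf)).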
The pdfplumber.Page instances mentioned above are the core of pdfplumber: most subsequent operations on the PDF are implemented as properties and methods of this class. The important properties are shown below:
- page_number: the page number, starting at 1 for the first page;
- width: the page width;
- height: the page height;
- .objects/.chars/.lines/.rects/.curves/.figures/.images: the objects found on the page, grouped by type.
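To make these properties concrete, here is a small sketch that gathers a few of them into a dictionary. The helper name page_summary is invented for this example; it reads only page_number, width, height, .chars, .lines, .rects and .images.

```python
def page_summary(page):
    """Collect basic facts about one page (hypothetical helper)."""
    return {
        "page_number": page.page_number,
        "size": (page.width, page.height),
        # Each of these properties is a list of objects found on the page
        "chars": len(page.chars),
        "lines": len(page.lines),
        "rects": len(page.rects),
        "images": len(page.images),
    }
```

It could be applied to every page of an opened PDF with [page_summary(p) for p in pdf.pages].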
The core methods are shown below:
- extract_text(): extracts the text of a page;
- extract_words(): extracts all words together with their related information;
- extract_tables(): extracts the tables on a page.
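extract_words() returns a list of dictionaries, each carrying the word text and its position (keys such as "text", "x0" and "top"). As a sketch of how that information can be used, the hypothetical helper below regroups words into lines by their vertical position; the tolerance value is an arbitrary assumption.

```python
def words_to_lines(words, tolerance=3.0):
    """Group word dicts into text lines by their 'top' coordinate."""
    lines = []
    current, current_top = [], None
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        # Start a new line when the word sits clearly lower on the page
        if current and abs(word["top"] - current_top) > tolerance:
            lines.append(current)
            current = []
        if not current:
            current_top = word["top"]
        current.append(word)
    if current:
        lines.append(current)
    # Within each line, order the words from left to right
    return [
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["x0"]))
        for line in lines
    ]
```

With a real page, the input would be page.extract_words().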
Running extract_text() and extract_words() on the sample PDF prints the page text and the list of words respectively. extract_tables() returns an empty result for every page, since there are no tables in this PDF.
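When a page does contain tables, extract_tables() returns a list of tables, each being a list of rows, and each row a list of cell values (strings or None). The sketch below turns such a result into dictionaries keyed by the header row and copes with the empty case. The helper names are hypothetical, and treating the first row as a header is an assumption about the table.

```python
def table_to_records(table):
    """Convert one table (list of rows) into a list of dicts."""
    if not table:
        return []
    header, *rows = table
    # Fall back to col_0, col_1, ... when a header cell is empty or None
    keys = [(cell or "").strip() or f"col_{i}" for i, cell in enumerate(header)]
    return [dict(zip(keys, row)) for row in rows]


def tables_to_records(tables):
    """Convert every non-empty table from extract_tables()."""
    return [table_to_records(table) for table in tables if table]
```

For a page without tables, tables_to_records(page.extract_tables()) simply yields an empty list.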
Supplement
Of course, besides reading PDF files, Python can do other things with them, such as encrypting a PDF or rotating and overlaying its pages. Sample code using the PyPDF2 library follows.
Rotate and overlay pages
import PyPDF2
from PyPDF2.pdf import PageObject

# Create a Reader object to read the PDF file
reader = PyPDF2.PdfFileReader('resources/')
# Create a Writer object to write the PDF file
writer = PyPDF2.PdfFileWriter()
# Loop over all pages of the PDF file
for page_num in range(reader.numPages):
    # Get the Page object for the given page index
    current_page = reader.getPage(page_num)  # type: PageObject
    if page_num % 2 == 0:
        # Odd pages (index 0, 2, ...) are rotated 90 degrees clockwise
        current_page.rotateClockwise(90)
    else:
        # Even pages are rotated 90 degrees counterclockwise
        current_page.rotateCounterClockwise(90)
    writer.addPage(current_page)
# Finally add a blank page and rotate it 90 degrees
page = writer.addBlankPage()  # type: PageObject
page.rotateClockwise(90)
# Write the PDF to a file through the Writer object's write method
with open('resources/', 'wb') as file:
    writer.write(file)
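Method names such as rotateClockwise belong to the legacy PyPDF2 1.x API and were removed in PyPDF2 3.0. As a hedged sketch, the same rotation could be written against the successor package pypdf, assuming it is installed. The function names rotation_for and rotate_pages are invented for this example, and the source and destination paths are left for the caller to supply.

```python
def rotation_for(page_index):
    """Return the clockwise rotation in degrees for a zero-based page index."""
    # 90 = clockwise for the 1st, 3rd, ... page; 270 = 90 degrees counterclockwise
    return 90 if page_index % 2 == 0 else 270


def rotate_pages(src_path, dst_path):
    """Rotate every page of src_path and append a rotated blank page."""
    # Assumption: the pypdf package is available; imported lazily so that
    # rotation_for() can be used without it.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    writer = PdfWriter()
    for index, page in enumerate(reader.pages):
        page.rotate(rotation_for(index))
        writer.add_page(page)
    # The blank page takes the size of the last page added, so the
    # source PDF needs at least one page.
    writer.add_blank_page().rotate(90)
    with open(dst_path, "wb") as file:
        writer.write(file)
```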
Encrypt a PDF file
import PyPDF2

reader = PyPDF2.PdfFileReader('resources/')
writer = PyPDF2.PdfFileWriter()
for page_num in range(reader.numPages):
    writer.addPage(reader.getPage(page_num))
# Encrypt the PDF with the encrypt method; its argument is the password to set
writer.encrypt('foobared')
with open('resources/', 'wb') as file:
    writer.write(file)
Add watermark in batch
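import PyPDF2
from PyPDF2.pdf import PageObject

# reader1 reads the original PDF, reader2 reads the PDF holding the watermark
reader1 = PyPDF2.PdfFileReader('resources/')
reader2 = PyPDF2.PdfFileReader('resources/')
writer = PyPDF2.PdfFileWriter()
# Get the watermark page
watermark_page = reader2.getPage(0)
for page_num in range(reader1.numPages):
    current_page = reader1.getPage(page_num)  # type: PageObject
    # Merge the original page with the watermark page
    current_page.mergePage(watermark_page)
    writer.addPage(current_page)
# Write the PDF to a file
with open('resources/', 'wb') as file:
    writer.write(file)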