Preface
pypdf is a pure-Python library for processing PDF files. It supports reading, modifying, merging, splitting, and encrypting PDFs, as well as extracting their text, metadata, and page content. pypdf is the successor to PyPDF2 (the project was renamed and restructured in 2022). It provides a more modern API and better performance, and it is well suited to straightforward PDF processing tasks.
The following is a detailed description of the pypdf library and its common usage.
1. The role of the pypdf library
- Read PDF: Extract text, metadata, number of pages, etc.
- Modify PDF: Merge, split, rotate, and crop pages.
- Create PDF: Generate new PDF or add content (such as text, watermark).
- Encryption/decryption: Set a password for the PDF or unlock a protected PDF.
- Cross-platform: Pure Python implementation, no external dependencies (such as Adobe Acrobat).
2. Installation and environment requirements
- Python version: Python 3.8+ is recommended. Older pypdf releases supported Python 3.6+, but recent releases (including the 5.x series shown below) have dropped the oldest Python versions.
- Dependencies: no mandatory external dependencies. Optional dependencies:
  - Pillow: processes images in PDFs.
  - pycryptodome (or another supported crypto backend such as cryptography): supports AES encryption/decryption.
- Installation command:
    pip install pypdf
- Optional extras:
    pip install pypdf[image]   # includes Pillow
    pip install pypdf[crypto]  # includes a crypto backend for AES encryption/decryption
- Verify the installation:
    import pypdf
    print(pypdf.__version__)  # Sample output: 5.0.1
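3. Core functions and usage
The core classes of pypdf are PdfReader (read PDFs) and PdfWriter (modify/create PDFs). Older releases also provide PdfMerger (merge PDFs), which is deprecated and was removed in pypdf 5.0 in favor of PdfWriter. The main features are described below with examples.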
3.1 Read PDF
Use PdfReader to read a PDF file and extract its metadata, page count, and text.
    from pypdf import PdfReader

    # Open the PDF file ("example.pdf" is a placeholder; the original file name was lost)
    reader = PdfReader("example.pdf")

    # Get the metadata
    metadata = reader.metadata
    print(metadata)  # Output: {'/Title': 'Example PDF', '/Author': 'John Doe', ...}

    # Get the number of pages
    print(len(reader.pages))  # Output: page count

    # Extract the text of the first page
    page = reader.pages[0]
    print(page.extract_text())
Notes:
- reader.metadata returns the PDF metadata (such as title and author).
- reader.pages is a list of pages; reader.pages[i] returns page i (starting from 0).
- page.extract_text() extracts the page text (the result depends on the PDF structure and may be incomplete).
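3.2 Merge PDF
Use PdfMerger or PdfWriter to merge multiple PDF files. PdfMerger is only available in pypdf releases before 5.0; on newer releases use PdfWriter.
    from pypdf import PdfMerger  # deprecated; removed in pypdf 5.0

    # Create a merger
    merger = PdfMerger()

    # Add PDF files (placeholder names; the original file names were lost)
    merger.append("file1.pdf")
    merger.append("file2.pdf")

    # Save the merged result
    merger.write("merged.pdf")
    merger.close()
Alternative (using PdfWriter):
    from pypdf import PdfReader, PdfWriter

    writer = PdfWriter()
    # Placeholder names; the original file names were lost
    for pdf in ["file1.pdf", "file2.pdf"]:
        reader = PdfReader(pdf)
        for page in reader.pages:
            writer.add_page(page)
    with open("merged.pdf", "wb") as f:
        writer.write(f)
Notes:
- PdfMerger is suited to simple merge tasks on older pypdf releases.
- PdfWriter provides more flexible control and is the supported choice on current releases.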
3.3 Split PDF
Split a PDF into single pages or a specified page range.
    from pypdf import PdfReader, PdfWriter

    # "example.pdf" is a placeholder; the original file name was lost
    reader = PdfReader("example.pdf")

    # Split each page into a separate PDF
    for i, page in enumerate(reader.pages):
        writer = PdfWriter()
        writer.add_page(page)
        with open(f"page_{i+1}.pdf", "wb") as f:
            writer.write(f)
Notes:
- Each page is saved as a separate file.
- A specific page range can be selected by index.
3.4 Rotate the page
Rotate the pages of a PDF.
    from pypdf import PdfReader, PdfWriter

    # File names are placeholders; the originals were lost
    reader = PdfReader("example.pdf")
    writer = PdfWriter()

    # Rotate the first page by 90 degrees
    page = reader.pages[0]
    page.rotate(90)
    writer.add_page(page)

    # Save the result
    with open("rotated.pdf", "wb") as f:
        writer.write(f)
Notes:
- page.rotate(angle) accepts a clockwise angle that must be a multiple of 90.
- The rotated page is added to a new PDF.
3.5 Encrypt/decrypt PDF
Set a password for a PDF or unlock a protected PDF.
    from pypdf import PdfReader, PdfWriter

    # Encrypt a PDF (file names are placeholders; the originals were lost)
    reader = PdfReader("example.pdf")
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)
    writer.encrypt(user_password="my_password", algorithm="AES-256")
    with open("encrypted.pdf", "wb") as f:
        writer.write(f)

    # Decrypt a PDF
    reader = PdfReader("encrypted.pdf")
    if reader.is_encrypted:
        reader.decrypt("my_password")
    print(reader.pages[0].extract_text())
Notes:
- writer.encrypt supports algorithms such as RC4-128 and AES-256 (AES requires an optional crypto backend).
- reader.decrypt requires the correct password.
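3.6 Extract images
Extract images from a PDF (Pillow must be installed).
    from pypdf import PdfReader

    # "example.pdf" is a placeholder; the original file name was lost
    reader = PdfReader("example.pdf")
    page = reader.pages[0]
    for img in page.images:
        with open(f"image_{img.name}", "wb") as f:
            f.write(img.data)
Notes:
- page.images returns the image objects on the page.
- img.data is the binary data of the image.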
4. Performance and Features
- Efficiency: Pure Python implementation, fast startup, no external tools required.
- Memory efficiency: Processed page by page, suitable for large PDFs.
- Flexibility: Supports page-level operations and metadata modification.
- Limitations:
  - Text extraction quality depends on the PDF structure, and complex formats (such as scanned documents) may fail.
  - Direct editing of PDF content (such as modifying text) is not supported; combine pypdf with other libraries (such as reportlab) for that.
5. Practical application scenarios
- Document processing: Merge reports, split chapters, and extract metadata.
- Automated workflow: Batch processing of PDFs (such as adding watermarks, encryption).
- Data Extraction: Extract text or images from PDF for analysis.
- E-book management: Adjust page order or crop margins.
- Security Management: Set password for sensitive documents.
Example (extract all page text):
    from pypdf import PdfReader

    # "example.pdf" is a placeholder; the original file name was lost
    reader = PdfReader("example.pdf")
    text = ""
    for page in reader.pages:
        text += page.extract_text() or ""
    print(text[:200])  # Print the first 200 characters
6. Things to note
- Text extraction:
  - Scanned or image-only PDFs must first be processed with an OCR tool (such as pytesseract).
  - Complex layouts can cause text to be extracted in the wrong order.
- Encryption restrictions:
  - Some high-strength encryption may require pycryptodome.
  - The correct password is required for decryption; without it the document cannot be read, and accessing its content raises an exception.
- File paths:
  - Make sure the file path is correct; using pathlib or absolute paths is recommended.
- Version compatibility:
  - pypdf (≥3.0.0) is not fully compatible with PyPDF2, so old code needs to be adjusted.
  - Recent releases (such as 5.0.1, the version used in this article) optimize performance and the API.
- Error handling:
  - Handle FileNotFoundError (the file does not exist).
  - Handle PdfReadError (the file is corrupted or encrypted).
7. Comprehensive example
Here is a comprehensive example showing reading, merging, encrypting, and extracting text:
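    from pypdf import PdfReader, PdfWriter

    # Read PDF metadata and text (file names below are placeholders; the originals were lost)
    reader = PdfReader("example.pdf")
    print("Metadata:", reader.metadata)
    print("Page count:", len(reader.pages))
    print("First page text:", reader.pages[0].extract_text()[:100])

    # Merge multiple PDFs (PdfWriter is used here because PdfMerger was removed in pypdf 5.0)
    merger = PdfWriter()
    merger.append("file1.pdf")
    merger.append("file2.pdf")
    merger.write("merged.pdf")
    merger.close()

    # Rotate and encrypt the merged PDF
    reader = PdfReader("merged.pdf")
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page.rotate(90))  # Rotate the page
    writer.encrypt(user_password="secret", algorithm="AES-256")
    with open("encrypted_rotated.pdf", "wb") as f:
        writer.write(f)
Notes: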
- Read metadata and text.
- Merge two PDFs.
- Rotate the page and encrypt the output.
8. Resources and Documents
- Official Documentation:/
- GitHub repository:/py-pdf/pypdf
- PyPI page:/project/pypdf/
- Tutorial:/en/stable/user/
- Migration Guide (from PyPDF2):/en/stable/user/