SoFunction
Updated on 2025-05-19

A detailed description of the pypdf library in Python and its common uses

Preface

pypdf is a pure-Python library for working with PDF files. It can read, modify, merge, split, and encrypt PDFs, and extract their text, metadata, and page content. pypdf is the successor to PyPDF2 (renamed and restructured in 2022). It offers a more modern API and better performance than its predecessor, which makes it a good fit for straightforward PDF tasks.

The sections below describe the pypdf library and its common uses in detail.

1. The role of the pypdf library

  • Read PDF: Extract text, metadata, number of pages, etc.
  • Modify PDF: Merge, split, rotate, and crop pages.
  • Create PDF: Assemble new PDFs from existing pages or overlay content (such as a watermark page).
  • Encryption/decryption: Set a password for the PDF or unlock a protected PDF.
  • Cross-platform: Pure Python implementation, no external dependencies (such as Adobe Acrobat).

2. Installation and environment requirements

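  • Python version: Python 3.6+ is supported by older pypdf releases; recent releases require a newer interpreter, so 3.8+ is recommended.
  • Dependencies: No required external dependencies. Optional dependencies:
    • Pillow: Process images in PDFs.
    • cryptography or pycryptodome: Support AES encryption/decryption.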
  • Installation command
    pip install pypdf
    
  • Optional extras
    pip install pypdf[image]   # Includes Pillow
    pip install pypdf[crypto]  # Includes an AES backend (cryptography on current releases)
  • Verify installation
    import pypdf
    print(pypdf.__version__)  # Sample output: 5.0.1

3. Core functions and usage

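The core classes of pypdf are PdfReader (reads PDFs) and PdfWriter (modifies, creates, and merges PDFs). Older releases also shipped PdfMerger for merging; it was deprecated in favor of PdfWriter and removed in pypdf 5.0. The main features are shown below with examples.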

3.1 Read PDF

Use PdfReader to read a PDF file and extract its metadata, page count, and text.

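from pypdf import PdfReader

# Open the PDF file ("example.pdf" is a placeholder path)
reader = PdfReader("example.pdf")

# Get the metadata (None if the file carries no document information)
metadata = reader.metadata
print(metadata)  # Output: {'/Title': 'Example PDF', '/Author': 'John Doe', ...}

# Get the number of pages
print(len(reader.pages))  # Output: page count

# Extract the text of the first page
page = reader.pages[0]
print(page.extract_text())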

Notes

  • reader.metadata returns the PDF metadata (such as title and author).
  • reader.pages is the list of pages; reader.pages[i] returns page i (counting from 0).
  • page.extract_text() extracts the page text (the result depends on the PDF structure and may be incomplete).

3.2 Merge PDF

Use PdfWriter to merge multiple PDF files (PdfMerger offered the same interface in releases before pypdf 5.0).

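from pypdf import PdfWriter

# Create the object that collects the pages.
# PdfMerger had the same append/write/close interface but was removed in pypdf 5.0.
merger = PdfWriter()

# Add the PDF files (placeholder paths)
merger.append("file1.pdf")
merger.append("file2.pdf")

# Save the merged result
merger.write("merged.pdf")
merger.close()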

Alternative (adding pages one by one with PdfWriter)

from pypdf import PdfReader, PdfWriter

writer = PdfWriter()
for pdf in ["file1.pdf", "file2.pdf"]:  # placeholder paths
    reader = PdfReader(pdf)
    for page in reader.pages:
        writer.add_page(page)

with open("merged.pdf", "wb") as f:
    writer.write(f)

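Notes

  • PdfWriter.append() is the shortest route for simple merges; it replaces PdfMerger, which was removed in pypdf 5.0.
  • Adding pages one by one with PdfWriter.add_page() gives finer control, for example to skip or reorder pages.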

3.3 Split PDF

Split a PDF into single pages or a specified page range.

from pypdf import PdfReader, PdfWriter

reader = PdfReader("example.pdf")  # placeholder path

# Split each page into a separate PDF
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as f:
        writer.write(f)

Notes

  • Each page is saved as a separate file.
  • A specific page range can be selected by indexing or slicing reader.pages.

3.4 Rotate the page

Rotate the pages of a PDF.

from pypdf import PdfReader, PdfWriter

reader = PdfReader("example.pdf")  # placeholder path
writer = PdfWriter()

# Rotate the first page by 90 degrees
page = reader.pages[0]
page.rotate(90)
writer.add_page(page)

# Save the result
with open("rotated.pdf", "wb") as f:
    writer.write(f)

Notes

  • page.rotate(angle) takes the angle in degrees (clockwise, a multiple of 90).
  • Add the rotated page to a new PDF.

3.5 Encrypt/decrypt PDF

Set a password for the PDF or unlock a protected PDF.

from pypdf import PdfReader, PdfWriter

# Encrypt a PDF (file names are placeholders)
reader = PdfReader("example.pdf")
writer = PdfWriter()

for page in reader.pages:
    writer.add_page(page)

writer.encrypt(user_password="my_password", algorithm="AES-256")
with open("encrypted.pdf", "wb") as f:
    writer.write(f)

# Decrypt a PDF
reader = PdfReader("encrypted.pdf")
if reader.is_encrypted:
    reader.decrypt("my_password")
print(reader.pages[0].extract_text())

Notes

  • encrypt supports algorithms such as RC4-128 and AES-256; the AES variants need an optional backend (cryptography or pycryptodome).
  • decrypt requires the correct password.

3.6 Extract images

Extract images from a PDF (Pillow must be installed).

from pypdf import PdfReader

reader = PdfReader("example.pdf")  # placeholder path
page = reader.pages[0]
for img in page.images:
    with open(f"image_{img.name}", "wb") as f:
        f.write(img.data)

Notes

  • page.images returns the image objects on the page.
  • img.data is the binary data of the image.

4. Performance and Features

  • Efficiency: Pure-Python implementation that starts quickly and needs no external tools.
  • Memory efficiency: Pages are processed one at a time, which helps with large PDFs.
  • Flexibility: Supports page-level operations and metadata modification.
  • Limitations
    • Text extraction quality depends on the PDF structure; complex layouts and scanned documents may yield little or no text.
    • Editing existing page content directly (such as changing text) is not supported; generate the new content with another library (such as reportlab) and combine it with pypdf.

5. Practical application scenarios

  • Document processing: Merge reports, split chapters, and extract metadata.
  • Automated workflow: Batch processing of PDFs (such as adding watermarks, encryption).
  • Data Extraction: Extract text or images from PDF for analysis.
  • E-book management: Adjust page order or crop margins.
  • Security Management: Set password for sensitive documents.

Example (extract all page text)

from pypdf import PdfReader

reader = PdfReader("example.pdf")  # placeholder path
text = ""
for page in reader.pages:
    text += page.extract_text() or ""
print(text[:200])  # Print the first 200 characters

6. Things to note

  • Text extraction
    • Scanned or image-only PDFs must first be processed with an OCR tool (such as pytesseract).
    • Complex layouts can cause text to come out in the wrong order.
  • Encryption restrictions
    • AES encryption requires an optional backend such as pycryptodome.
    • Decryption requires the correct password; with a wrong password the document stays locked and reading its pages raises an error.
  • File paths
    • Make sure the file path is correct; pathlib or absolute paths are recommended.
  • Version compatibility
    • pypdf (≥3.0.0) is not fully compatible with PyPDF2, so old code needs adjusting.
    • The 5.x series (5.0.1 is the version shown in the installation check above) further refines performance and the API, and removes PdfMerger.
  • Error handling
    • Handle FileNotFoundError (the file does not exist).
    • Handle PdfReadError (the file is corrupted or still encrypted).

7. Comprehensive example

Here is a comprehensive example showing reading, merging, encrypting, and extracting text:

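from pypdf import PdfReader, PdfWriter

# Read PDF metadata and text (input file names are placeholders)
reader = PdfReader("example.pdf")
print("Metadata:", reader.metadata)
print("Page count:", len(reader.pages))
print("First page text:", reader.pages[0].extract_text()[:100])

# Merge multiple PDFs (PdfWriter.append replaces PdfMerger, removed in pypdf 5.0)
merger = PdfWriter()
merger.append("file1.pdf")
merger.append("file2.pdf")
merger.write("merged.pdf")
merger.close()

# Rotate and encrypt the merged PDF
reader = PdfReader("merged.pdf")
writer = PdfWriter()
for page in reader.pages:
    writer.add_page(page.rotate(90))  # Rotate the page
writer.encrypt(user_password="secret", algorithm="AES-256")
with open("encrypted_rotated.pdf", "wb") as f:
    writer.write(f)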

Notes

  • Read metadata and text.
  • Merge two PDFs.
  • Rotate the page and encrypt the output.

8. Resources and Documents

  • Official Documentation: /
  • GitHub repository: /py-pdf/pypdf
  • PyPI page: /project/pypdf/
  • Tutorial: /en/stable/user/
  • Migration Guide (from PyPDF2): /en/stable/user/
