Python code examples for manipulating PDF files with PyMuPDF

Install PyMuPDF

First, you need to install the PyMuPDF library. You can use pip to install it:

pip install pymupdf

Read PDF files

Open a PDF file and print its page count:

import fitz  # PyMuPDF

# Use PyMuPDF to read a PDF file
if __name__ == '__main__':
    # Open the PDF file ('example.pdf' is a placeholder path)
    doc = fitz.open('example.pdf')

    print(doc.page_count)

Extract text

Extract the text of the first page of a PDF file:

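import fitz  # PyMuPDF

# Use PyMuPDF to read the text of a PDF page
if __name__ == '__main__':
    # Open the PDF file ('example.pdf' is a placeholder path)
    doc = fitz.open('example.pdf')

    # Page numbers are zero-based, so load_page(0) is the first page
    print(doc.load_page(0).get_text())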
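
The example above reads only the first page. A minimal sketch for collecting the text of every page follows. It assumes only that the document can be iterated page by page and that each page offers a get_text() method, as PyMuPDF documents do. The helper name extract_all_text and the stand-in FakePage class are hypothetical; they are used here so the snippet runs without a PDF file.

```python
def extract_all_text(doc, separator="\n"):
    """Join the text of every page of a page-iterable document."""
    return separator.join(page.get_text() for page in doc)


class FakePage:
    """Stand-in for a PDF page (hypothetical, for demonstration only)."""

    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text


# Two fake pages play the role of an opened document
pages = [FakePage("first page"), FakePage("second page")]
all_text = extract_all_text(pages)
print(all_text)
```

With PyMuPDF installed, the same helper should accept an opened document, for example extract_all_text(fitz.open(path)).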

Split PDF files

Split the PDF file into two files, one with odd pages and one with even pages:

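import fitz  # PyMuPDF

# Use PyMuPDF to split a PDF file into two files:
# one contains the odd pages, the other contains the even pages
if __name__ == '__main__':
    # Create two empty PDF documents to receive the pages
    odd_writer = fitz.open()
    even_writer = fitz.open()

    # Open the source file ('example.pdf' is a placeholder path)
    doc = fitz.open('example.pdf')

    for page_num in range(doc.page_count):
        # page_num is zero-based, so index 0 is page 1 (an odd page)
        if page_num % 2 == 0:
            odd_writer.insert_pdf(doc, from_page=page_num, to_page=page_num)
        else:
            even_writer.insert_pdf(doc, from_page=page_num, to_page=page_num)

    # A document with zero pages cannot be saved, so skip empty outputs
    # (the output file names are placeholders)
    if odd_writer.page_count:
        odd_writer.save('odd_pages.pdf')
    if even_writer.page_count:
        even_writer.save('even_pages.pdf')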
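
The odd/even decision in the loop is easy to get wrong because page indices are zero-based while page numbers start at 1. The following sketch isolates that logic in plain Python so it can be checked without a PDF file. The function name split_by_parity is hypothetical.

```python
def split_by_parity(page_count):
    """Return (odd_pages, even_pages) as lists of zero-based page indices."""
    indices = range(page_count)
    odd_pages = [i for i in indices if i % 2 == 0]   # pages 1, 3, 5, ...
    even_pages = [i for i in indices if i % 2 == 1]  # pages 2, 4, 6, ...
    return odd_pages, even_pages


# A five-page document: pages 1, 3, 5 are odd; pages 2, 4 are even
odd_pages, even_pages = split_by_parity(5)
print(odd_pages)   # [0, 2, 4]
print(even_pages)  # [1, 3]
```

Index lists like these could also be handed to PyMuPDF's Document.select(), which keeps only the listed pages, as an alternative to inserting pages one at a time.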

Merge PDF files

You can merge multiple PDF files into one:

import fitz  # PyMuPDF

# Merge two PDF files using PyMuPDF
if __name__ == '__main__':
    # Paths of the PDF files to merge (placeholder names)
    pdf_files = ['first.pdf', 'second.pdf']

    # Create a new, empty PDF document
    merged_doc = fitz.open()

    # Traverse each PDF file to be merged
    for pdf_file in pdf_files:
        # Open the current PDF file
        temp_doc = fitz.open(pdf_file)
        # Append all pages of the current file to the merged document
        # (insert_pdf copies the whole file when no page range is given)
        merged_doc.insert_pdf(temp_doc)
        # Close the current file (no need to save it, because it was only read)
        temp_doc.close()

    # Save the merged PDF file (placeholder output name)
    merged_doc.save('merged.pdf')

Crop PDF pages

PyMuPDF can crop a page by setting its crop box with page.set_cropbox(). Here is a simple example that crops a page to its upper half:

import fitz  # PyMuPDF

# Use PyMuPDF to crop PDF pages
if __name__ == '__main__':

    # Open the PDF file ('example.pdf' is a placeholder path)
    doc = fitz.open('example.pdf')

    # Select the page to crop (for example, the first page)
    page = doc.load_page(0)

    # Define the crop area (rectangle, format [x0, y0, x1, y1])
    # Here we keep the upper half of the page
    rect = fitz.Rect(
        page.rect.x0,
        page.rect.y0,
        page.rect.x1,
        page.rect.y0 + (page.rect.height / 2),
    )

    # Crop the page (this modifies the page in the opened document, not the file on disk)
    page.set_cropbox(rect)
    page.clean_contents()  # Clean up page content (optional, but recommended)

    # Save the modified PDF file
    doc.save("cropped_example.pdf")
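
The crop rectangle is plain coordinate arithmetic, so it can be sketched without PyMuPDF. The example assumes a coordinate system whose y axis points downward from the top-left corner, which is how PyMuPDF reports page rectangles. The function name upper_half and the sample page size are illustrative only.

```python
def upper_half(x0, y0, x1, y1):
    """Return the upper half of a rectangle given as (x0, y0, x1, y1).

    Assumes the y axis points downward, so y0 is the top edge.
    """
    height = y1 - y0
    return (x0, y0, x1, y0 + height / 2)


# A 612 x 792 point page (US Letter), used here only as sample numbers
crop = upper_half(0, 0, 612, 792)
print(crop)
```

The returned tuple has the same [x0, y0, x1, y1] layout as the crop area in the example above.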

Encrypt PDF

import fitz  # PyMuPDF

# Encrypt a PDF using PyMuPDF
if __name__ == '__main__':
    # Open the PDF file ('example.pdf' is a placeholder path)
    doc = fitz.open("example.pdf")

    # Save a copy with the encryption parameters set
    doc.save(
        "encrypted_example.pdf",  # Output path (placeholder name)
        encryption=fitz.PDF_ENCRYPT_AES_256,  # Encryption algorithm
        user_pw="password123",  # User password (open password)
        owner_pw="password123",  # Owner password
        permissions=0b1111000000,  # Permission flags
        garbage=3,  # Clean up redundant data
        deflate=True,  # Compress content
    )
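
The permissions argument is a bit field, which is hard to read as a binary literal. The sketch below decodes such a value in plain Python. The bit values follow the user access permission table of the PDF specification; to my knowledge PyMuPDF exposes the same values as fitz.PDF_PERM_* constants, but treat that mapping as an assumption and check it against your installed version. The dictionary and function names are hypothetical.

```python
# Permission bit values as defined in the PDF specification
# (assumed to match PyMuPDF's fitz.PDF_PERM_* constants)
PERMISSION_BITS = {
    "print": 4,
    "modify": 8,
    "copy": 16,
    "annotate": 32,
    "form": 256,
    "accessibility": 512,
    "assemble": 1024,
    "print_hq": 2048,
}


def decode_permissions(flags):
    """List the names of the permissions that are set in flags."""
    return [name for name, bit in PERMISSION_BITS.items() if flags & bit]


def encode_permissions(*names):
    """Combine permission names into a single integer flag value."""
    value = 0
    for name in names:
        value |= PERMISSION_BITS[name]
    return value


# Decode the value used in the encryption example
granted = decode_permissions(0b1111000000)
print(granted)
```

Under this table, the value from the example grants form filling and accessibility extraction; its two remaining bits (64 and 128) do not correspond to a named permission. A value that allows printing and copying would be encode_permissions("print", "copy").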

Comparison with PyPDF2

PyPDF2:

Basic operations: focuses on basic functions such as merging, splitting, rotating pages, encrypting and decrypting, and adding watermarks.
Text extraction: supports simple text extraction, but handles complex layouts (such as two-column pages and tables) poorly and may scramble the text order.
Lightweight: suited to small tasks such as quickly merging several documents or adding password protection.
Large file processing: processing large files (for example, more than 7000 pages) is slow (hundreds of seconds) and uses a lot of memory.
Compatibility with complex documents: processing complex graphics, forms, or encrypted files may fail, and extracted text is prone to garbled characters.

PyMuPDF:

All-round processing: supports reading, editing, merging, and splitting PDFs; extracts text, images, and tables; and also supports OCR and rendering PDF pages to images.
Advanced features: parses tables (preserving their structure), handles annotations and forms, generates PDF/A documents, and performs OCR through Tesseract integration.
Multi-format support: works with PDF, XPS, CBZ, and other formats, so it applies to a wider range of scenarios.
Large file processing: built on the optimized MuPDF engine, it processes the same file in only a few seconds and renders images more efficiently. Work can also be spread across multiple processes; note that PyMuPDF itself does not support multi-threaded use.
Complex document compatibility: more stable when processing scanned and encrypted documents, and can preserve the original reading order of two-column text.
