Install PyMuPDF
First, install the PyMuPDF library with pip:
pip install pymupdf
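To confirm that the installation worked, a quick check along these lines can help. It is only a minimal sketch: the helper name `pymupdf_import_name` is made up for illustration, and it only tests whether a module named `pymupdf` (the import name used by recent releases) or `fitz` (the legacy name used in the rest of this article) can be found. It does not prove that the module found is a working PyMuPDF build.

```python
import importlib.util


def pymupdf_import_name():
    """Return the first importable PyMuPDF module name, or None if neither is found.

    Hypothetical helper: it only looks the module up, it does not import it.
    """
    for name in ("pymupdf", "fitz"):
        if importlib.util.find_spec(name) is not None:
            return name
    return None


if __name__ == "__main__":
    name = pymupdf_import_name()
    if name:
        print(f"PyMuPDF can be imported as: {name}")
    else:
        print("PyMuPDF was not found; try 'pip install pymupdf' again")
```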
Read PDF files
Open a PDF file and print its page count:
import fitz  # PyMuPDF is imported under the name "fitz"

if __name__ == '__main__':
    # Open the PDF file (fill in the path to your file)
    doc = fitz.open('')
    print(doc.page_count)
Extract text
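Extract the text of the first page of a PDF file:
import fitz  # PyMuPDF

if __name__ == '__main__':
    # Open the PDF file (fill in the path to your file)
    doc = fitz.open('')
    # Pages are 0-based, so load_page(0) is the first page
    print(doc.load_page(0).get_text())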
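The example above reads a single page. To collect the text of a whole document, one possible approach is to loop over the pages and join the results. This is a sketch under assumptions: `collect_text` and `extract_all_text` are illustrative names, not part of PyMuPDF, and the newline separator is an arbitrary choice.

```python
def collect_text(pages, separator="\n"):
    """Join the text of every page-like object (anything with a get_text() method)."""
    return separator.join(page.get_text() for page in pages)


def extract_all_text(path):
    """Return the text of all pages of the PDF at `path` as one string."""
    # Imported here so that collect_text can be used without PyMuPDF installed
    import fitz  # PyMuPDF

    # A document can be used as a context manager and iterated page by page
    with fitz.open(path) as doc:
        return collect_text(doc)
```

Calling `extract_all_text` with the path to your own file returns the text page after page, in document order.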
Split PDF files
Split a PDF file into two files, one with the odd pages and one with the even pages:
import fitz  # PyMuPDF

if __name__ == '__main__':
    # Create two new, empty PDF documents to receive the pages
    odd_writer = fitz.open()
    even_writer = fitz.open()
    doc = fitz.open('')  # fill in the path to the source PDF
    for page_num in range(doc.page_count):
        # page_num is 0-based, so an even index is an odd page (1, 3, 5, ...)
        if page_num % 2 == 0:
            odd_writer.insert_pdf(doc, from_page=page_num, to_page=page_num)
        else:
            even_writer.insert_pdf(doc, from_page=page_num, to_page=page_num)
    odd_writer.save('')   # fill in the output path for the odd pages
    even_writer.save('')  # fill in the output path for the even pages
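The odd/even logic is easy to get wrong by one because page indices are 0-based while page numbers start at 1. The following sketch separates the index arithmetic from the file handling so the former can be checked on its own. The function names and parameters are hypothetical, and skipping an empty output is an assumption made here because PyMuPDF refuses to save a document with zero pages (which happens for the even file when the source has a single page).

```python
def split_page_indices(page_count):
    """Return (odd_pages, even_pages) as lists of 0-based page indices."""
    odd_pages = list(range(0, page_count, 2))   # page numbers 1, 3, 5, ...
    even_pages = list(range(1, page_count, 2))  # page numbers 2, 4, 6, ...
    return odd_pages, even_pages


def split_odd_even(src_path, odd_path, even_path):
    """Write the odd and even pages of src_path to two separate PDF files."""
    import fitz  # PyMuPDF, imported lazily so the helper above has no dependency

    with fitz.open(src_path) as doc:
        groups = zip(split_page_indices(doc.page_count), (odd_path, even_path))
        for indices, out_path in groups:
            if not indices:
                continue  # nothing to write; an empty document cannot be saved
            with fitz.open() as out:
                for i in indices:
                    out.insert_pdf(doc, from_page=i, to_page=i)
                out.save(out_path)
```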
Merge PDF files
You can merge multiple PDF files into one:
import fitz  # PyMuPDF

if __name__ == '__main__':
    # PDF files to be merged (fill in the paths)
    pdf_files = ['', '']
    # Create a new, empty PDF document
    merged_doc = fitz.open()
    # Go through each PDF file to be merged
    for pdf_file in pdf_files:
        # Open the current PDF file
        temp_doc = fitz.open(pdf_file)
        # Append all pages of the current file to the merged document
        merged_doc.insert_pdf(temp_doc)
        # Close the current file (no need to save it, because we only read it)
        temp_doc.close()
    # Save the merged PDF file (fill in the output path)
    merged_doc.save("")
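For reuse, the merge can be wrapped in a function. The version below is a sketch, not the only way to do it: `merge_pdfs` and its parameters are names chosen for illustration, and rejecting an empty input list up front is a design choice made here so that the error is clearer than the one raised when saving a document with no pages.

```python
def merge_pdfs(input_paths, output_path):
    """Append every page of each input PDF, in order, to one new PDF file."""
    input_paths = list(input_paths)
    if not input_paths:
        raise ValueError("need at least one input PDF to merge")

    import fitz  # PyMuPDF, imported after the arguments have been validated

    with fitz.open() as merged:
        for path in input_paths:
            with fitz.open(path) as src:
                # With no page range given, insert_pdf appends all pages
                merged.insert_pdf(src)
        merged.save(output_path)
    return output_path
```

The `with` blocks close each source document automatically, which replaces the explicit `close()` call.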
Crop PDF pages
PyMuPDF crops a page by setting its crop box to the rectangle you want to keep. Here is a simple example that crops a page to its upper half:
import fitz  # PyMuPDF

if __name__ == '__main__':
    # Open the PDF file (fill in the path to your file)
    doc = fitz.open("")
    # Select the page to crop (for example, the first page)
    page = doc.load_page(0)
    # Define the crop area as a rectangle (x0, y0, x1, y1);
    # here we keep the upper half of the page
    r = page.rect
    rect = fitz.Rect(r.x0, r.y0, r.x1, r.y0 + r.height / 2)
    # Crop the page (this changes the page in the document)
    page.set_cropbox(rect)
    page.clean_contents()  # Clean up page content (optional, but recommended)
    # Save the modified PDF file
    doc.save("cropped_example.pdf")
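The crop rectangle is plain arithmetic on the page rectangle, so it can be worked out separately from the PDF handling. In the sketch below, `upper_half` and `crop_to_upper_half` are hypothetical helper names. It relies on PyMuPDF page coordinates growing downward from the top-left corner, so the upper half is the part with the smaller y values.

```python
def upper_half(x0, y0, x1, y1):
    """Return the (x0, y0, x1, y1) rectangle covering the top half of the input."""
    return (x0, y0, x1, y0 + (y1 - y0) / 2)


def crop_to_upper_half(src_path, out_path, page_number=0):
    """Crop one page of src_path to its upper half and save the result."""
    import fitz  # PyMuPDF, imported lazily so upper_half has no dependency

    with fitz.open(src_path) as doc:
        page = doc.load_page(page_number)
        # page.rect unpacks into its four coordinates (x0, y0, x1, y1)
        page.set_cropbox(fitz.Rect(*upper_half(*page.rect)))
        doc.save(out_path)
```

For a US Letter page of 612 x 792 points, `upper_half(0, 0, 612, 792)` gives `(0, 0, 612, 396.0)`.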
Encrypt PDF
import fitz  # PyMuPDF

if __name__ == '__main__':
    doc = fitz.open("")  # fill in the path to the source PDF
    # Save an encrypted copy of the document
    doc.save(
        "",  # fill in the output path
        encryption=fitz.PDF_ENCRYPT_AES_256,  # Encryption algorithm
        user_pw="password123",   # User password (needed to open the file)
        owner_pw="password123",  # Owner password
        permissions=0b1111000000,  # Permission flags
        garbage=3,     # Clean up redundant data
        deflate=True,  # Compress content
    )
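The `permissions` argument is a bit mask, which is hard to read as a raw binary literal. The sketch below decodes and builds such masks using the permission bits defined in the PDF specification. The table and both helper names are written here for illustration; PyMuPDF is expected to expose the same values as its `PDF_PERM_*` constants, but check them against your installed version before relying on this.

```python
# Permission bits from the PDF specification (assumed to match fitz.PDF_PERM_*)
PDF_PERMISSIONS = {
    "PRINT": 1 << 2,          # 4
    "MODIFY": 1 << 3,         # 8
    "COPY": 1 << 4,           # 16
    "ANNOTATE": 1 << 5,       # 32
    "FORM": 1 << 8,           # 256
    "ACCESSIBILITY": 1 << 9,  # 512
    "ASSEMBLE": 1 << 10,      # 1024
    "PRINT_HQ": 1 << 11,      # 2048
}


def describe_permissions(mask):
    """Return the names of the permissions that are set in the mask."""
    return [name for name, bit in PDF_PERMISSIONS.items() if mask & bit]


def build_permissions(*names):
    """Combine the named permissions into a single integer mask."""
    mask = 0
    for name in names:
        mask |= PDF_PERMISSIONS[name]
    return mask


if __name__ == "__main__":
    print(describe_permissions(0b1111000000))
    print(build_permissions("PRINT", "COPY"))
```

Under this table, the mask `0b1111000000` used above grants only form filling and accessibility extraction; the two remaining set bits (64 and 128) do not correspond to a named permission.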
Comparison with PyPDF2
PyPDF2:
Basic operations: focuses on merging, splitting, page rotation, encryption/decryption, watermarking, and other basic PDF functions.
Text extraction: supports simple text extraction, but handles complex layouts (such as two-column pages and tables) poorly and may scramble the text order.
Lightweight: well suited to small jobs such as quickly merging several documents or adding password protection.
Large file processing: large files (for example, more than 7000 pages) are slow to process (hundreds of seconds) and use a lot of memory.
Complex document compatibility: documents with complex graphics, forms, or encryption may fail to process, and extracted text is prone to garbling.
PyMuPDF:
All-round processing: supports reading, editing, merging, and splitting PDFs; extracts text, images, and tables; and also supports OCR and rendering pages to images.
Advanced features: parses tables (preserving their row and column structure), handles annotations and forms, generates PDF/A documents, and performs OCR through Tesseract integration.
Multi-format support: works with PDF, XPS, CBZ, and other formats, so it covers a wider range of scenarios.
Large file processing: built on the optimized MuPDF engine, it processes the same file in only a few seconds and renders images more efficiently. Page-level work can be spread across processes with multiprocessing (PyMuPDF does not support multi-threaded use).
Complex document compatibility: more stable when processing scanned and encrypted documents, and able to retain the original reading order of two-column text.
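The timings quoted above depend heavily on the files and the machine, so it is worth measuring on your own documents. Below is a minimal, library-independent timing sketch; `time_call` is a hypothetical helper, and taking the best of a few runs is just one reasonable convention.

```python
import time


def time_call(func, *args, repeat=3, **kwargs):
    """Call func `repeat` times and return (last_result, best_elapsed_seconds)."""
    best = float("inf")
    result = None
    for _ in range(repeat):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return result, best


if __name__ == "__main__":
    # Stand-in workload; replace it with your own PDF-processing function
    total, seconds = time_call(sum, range(1_000_000))
    print(f"result={total}, best of 3 runs: {seconds:.4f} s")
```

Passing the same PDF-processing function written once for each library gives a like-for-like comparison.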