Implementation Steps to Delete Redundant or Blank Pages in PDFs with Python

Introduction

PDF files often contain redundant or blank pages. These pages take up storage space and make the document less tidy and harder to read. This article shows how to do the following with Python:

Delete redundant pages from a PDF
Delete blank pages from a PDF (both completely blank and visually blank pages)

Why delete redundant or blank pages from a PDF?

Save storage space: Deleting useless pages reduces the file size.
Improve document readability: Removing blank pages or useless content improves the continuity and readability of the document.
Simplify printing and sharing: With useless pages removed, the document is more concise and easier to print and share.

Tools required

To remove redundant or blank pages from PDFs in Python, you need the following two libraries:

Spire.PDF for Python: A PDF processing library that supports loading, modifying and saving PDF documents.
Pillow (PIL): An image processing library, used here to help detect visually blank pages.

Environment setup

Before you begin, make sure that the above libraries are installed. You can run the following command in the terminal to install:

pip install spire.pdf pillow
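After installation, a quick import check can confirm that both libraries are available before running the longer scripts. This is a minimal sketch; `is_importable` is a helper name made up for this example, and `spire.pdf` is assumed to be the import name of the PDF library (`PIL` is the import name of Pillow).

```python
import importlib


def is_importable(module_name):
    """Return True if the named module can be imported in this environment."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


# "spire.pdf" is an assumed module name; "PIL" is the import name of Pillow.
for name in ("spire.pdf", "PIL"):
    print(f"{name}: {'installed' if is_importable(name) else 'missing'}")
```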

How to use Python to delete redundant pages in PDF

Implementation ideas

Specify a list of the page indexes to delete, then delete the corresponding pages.
To avoid index misalignment, traverse the list in reverse order when deleting, because removing a page shifts the index of every page after it.
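To see why the order matters, here is a minimal sketch that uses a plain Python list as a stand-in for the page collection. The list and both function names are illustrative only and are not part of any PDF library. Deleting in ascending order shifts the remaining items, so later indexes point at the wrong pages; deleting in descending order does not.

```python
def delete_ascending(pages, indexes):
    """Naive deletion: later indexes no longer match after each removal."""
    remaining = list(pages)
    for index in sorted(indexes):
        if 0 <= index < len(remaining):
            del remaining[index]
    return remaining


def delete_descending(pages, indexes):
    """Delete from the highest index down so earlier positions never shift."""
    remaining = list(pages)
    for index in sorted(set(indexes), reverse=True):
        if 0 <= index < len(remaining):
            del remaining[index]
    return remaining


pages = ["page 1", "page 2", "page 3", "page 4"]
print(delete_ascending(pages, [0, 2]))   # removes page 1 and page 4 (wrong)
print(delete_descending(pages, [0, 2]))  # removes page 1 and page 3 (intended)
```

Using `set()` in the descending version also guards against a duplicated index deleting two different pages.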

Detailed implementation steps

Create a PDF document object with the PdfDocument() class.
Use the PdfDocument.LoadFromFile() method to load the specified PDF file.
Traverse the list of page indexes in reverse order and use the PdfDocument.Pages.RemoveAt() method to delete each corresponding page.
Use the PdfDocument.SaveToFile() method to save the modified PDF to the specified path.

Implementation code

from spire.pdf import *

# Define a function that deletes the specified redundant pages
def delete_specific_pages(input_file, output_file, pages_to_delete):
    """
    Delete the specified redundant pages.
    Parameters:
        input_file (str): Path of the input PDF file.
        output_file (str): Path of the output PDF file (the PDF after the pages are deleted).
        pages_to_delete (list of int): Indexes of the pages to delete (indexes start at 0).
    """
    # Create a PDF document object
    pdf = PdfDocument()
    # Load the specified PDF file
    pdf.LoadFromFile(input_file)

    # Traverse the page indexes in reverse order to avoid index misalignment during deletion;
    # set() drops duplicate indexes so that no extra page is removed
    for index in sorted(set(pages_to_delete), reverse=True):
        # Make sure the index is within the valid range
        if 0 <= index < pdf.Pages.Count:
            # Delete the page at this index
            pdf.Pages.RemoveAt(index)
        else:
            print(f"Warning: index {index} is outside the page range; skipped.")

    # Save the modified PDF to the specified path
    pdf.SaveToFile(output_file)
    # Close the PDF document and release resources
    pdf.Close()

# Delete page 1 and page 3 of the PDF (indexes 0 and 2)
delete_specific_pages("Test.pdf", "Delete redundant pages.pdf", [0, 2])
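Because the function expects 0-based indexes while readers usually think in 1-based page numbers, a small conversion helper may be convenient. The sketch below is an illustration only: `parse_page_spec` and its "1,3,5-7" input format are hypothetical and not part of the library or the article's code.

```python
def parse_page_spec(spec, page_count):
    """Convert a 1-based spec such as "1,3,5-7" into sorted 0-based indexes.

    Hypothetical helper for illustration; page numbers outside
    1..page_count are ignored.
    """
    indexes = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A range such as "5-7" covers both end points
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        for page_number in range(start, end + 1):
            if 1 <= page_number <= page_count:
                indexes.add(page_number - 1)
    return sorted(indexes)


print(parse_page_spec("1,3,5-7", page_count=6))  # [0, 2, 4, 5]
```

The resulting list could then be passed as the `pages_to_delete` argument of `delete_specific_pages`.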

How to detect and delete blank pages in PDF using Python

Implementation ideas

Delete completely blank pages: Use the PdfPageBase.IsBlank() method to detect completely blank pages, that is, pages with no visible or invisible content, and delete them.
Delete visually blank pages: Some pages contain content that cannot be seen (such as white text or transparent layers) and look blank to the naked eye. Convert each such page to an image and analyze its pixel values with the Pillow library. If the image is blank, delete the corresponding PDF page.
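The pixel analysis can be as strict or as tolerant as the documents require. The article's code treats a page as blank only when every pixel is exactly white; rendered images may contain slightly off-white pixels, so a tolerant variant can be useful. The following is a minimal sketch under stated assumptions: it works on any iterable of (R, G, B) tuples rather than on a Pillow image, and the function names, `threshold` and `min_white_ratio` are illustrative values, not tuned or taken from the article.

```python
def is_near_white(pixel, threshold=250):
    """Return True if every channel of an (R, G, B) pixel is at least threshold."""
    return all(channel >= threshold for channel in pixel)


def is_visually_blank(pixels, threshold=250, min_white_ratio=0.999):
    """Treat a page image as blank when almost all of its pixels are near white.

    pixels: an iterable of (R, G, B) tuples.
    threshold and min_white_ratio are illustrative defaults, not tuned values.
    """
    total = 0
    white = 0
    for pixel in pixels:
        total += 1
        if is_near_white(pixel, threshold):
            white += 1
    if total == 0:
        return True  # no pixels at all: nothing visible
    return white / total >= min_white_ratio


blank_page = [(255, 255, 255)] * 9999 + [(252, 253, 251)]
text_page = [(255, 255, 255)] * 9000 + [(0, 0, 0)] * 1000
print(is_visually_blank(blank_page))  # True
print(is_visually_blank(text_page))   # False
```

A lower `min_white_ratio` tolerates more stray marks but risks discarding pages that carry a small amount of real content, so it is worth checking against sample documents.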

Detailed implementation steps

Create a PdfDocument instance and load the PDF file.
Iterate over all pages in the document in reverse order.
Detect blank pages and delete them:
- Use the PdfPageBase.IsBlank() method to detect a completely blank page and delete it with the PdfDocument.Pages.RemoveAt() method.
- Use the PdfDocument.SaveAsImage() method to convert each remaining page to an image, and analyze the pixel values of the image with the Pillow library to determine whether it is blank. If the image is blank, use the PdfDocument.Pages.RemoveAt() method to delete the corresponding page from the PDF.
Use the PdfDocument.SaveToFile() method to save the modified PDF to the specified path.

Implementation code

import io
from spire.pdf import PdfDocument, License
from PIL import Image

# Set the license key (you can get a free license key from this URL: /misc/)
# Without a license key, the converted images carry a watermark, which interferes with blank-page detection.
License.SetLicenseKey("License-Key")

# Custom function: detect whether an image is blank
def is_blank_image(image):
    """
    Check whether an image is blank.
    Parameters:
        image (PIL.Image.Image): The PIL image object to test.
    Returns:
        bool: True if the image is completely blank (all pixels are white); otherwise False.
    """
    # Convert the image to RGB mode
    img = image.convert("RGB")
    white_pixel = (255, 255, 255)
    # Check whether all pixels are white
    return all(pixel == white_pixel for pixel in img.getdata())

# Define a function that removes blank pages from a PDF
def remove_blank_pages(input_file, output_file):
    """
    Delete blank pages (completely blank or visually blank) from the specified PDF file.
    Parameters:
        input_file (str): Path of the input PDF file.
        output_file (str): Path of the output PDF file (the PDF after the blank pages are deleted).
    """
    # Create a PDF document object
    pdf = PdfDocument()
    # Load the specified PDF file
    pdf.LoadFromFile(input_file)

    # Traverse the pages in reverse order
    for i in range(pdf.Pages.Count - 1, -1, -1):
        page = pdf.Pages[i]

        # Detect completely blank pages and delete them
        if page.IsBlank():
            pdf.Pages.RemoveAt(i)
        else:
            # Convert other pages to images
            with pdf.SaveAsImage(i) as image_data:
                image_bytes = image_data.ToArray()
                pil_image = Image.open(io.BytesIO(image_bytes))

            # Detect whether the page is visually blank
            if is_blank_image(pil_image):
                pdf.Pages.RemoveAt(i)

    # Save the modified PDF to the specified path
    pdf.SaveToFile(output_file)
    # Close the PDF document and release resources
    pdf.Close()

# Remove the blank pages from the PDF
remove_blank_pages("Test.pdf", "Delete blank page.pdf")

That covers how to delete redundant pages and blank pages from PDFs with Python.
