SoFunction
Updated on 2024-12-13

Using PyMuPDF in Python to search and display PDF content

Summary

In daily work and study, we often need to find and extract specific content from PDF files. This article shows how to build a simple PDF content search tool in Python. We use the PyMuPDF module to read the PDF files and the wxPython GUI library to build a user-friendly interface.

Preparation

Before you start, make sure Python and the required modules are installed. Both wxPython and PyMuPDF can be installed with pip (pip install wxPython PyMuPDF); see the official documentation of each project for details.
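
As a quick sanity check before going further, the following sketch reports which of the two modules cannot be found in the current environment. The helper name missing_modules is hypothetical (it belongs to neither library); wx and fitz are the import names that the code in this article uses for wxPython and PyMuPDF.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "wx" is the import name of wxPython, "fitz" the one used here for PyMuPDF.
    missing = missing_modules(["wx", "fitz"])
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("wxPython and PyMuPDF are both available.")
```

If anything is reported as missing, install it with pip before running the rest of the code.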

Creating the GUI

First we need a GUI so that the user can select the PDF file to search and enter the text to find. We build it with the wxPython library.

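def __init__(self, parent, title):
    super(PDFSearchFrame, self).__init__(parent, title=title, size=(800, 600))
    # No file is selected until the user picks one
    self.pdf_path = None
    panel = wx.Panel(self)
    vbox = wx.BoxSizer(wx.VERTICAL)
    # Select File button
    file_picker = wx.FilePickerCtrl(panel, style=wx.FLP_OPEN|wx.FLP_FILE_MUST_EXIST)
    file_picker.Bind(wx.EVT_FILEPICKER_CHANGED, self.on_file_selected)
    vbox.Add(file_picker, 0, wx.EXPAND|wx.ALL, 10)
    # Input box and Search button, laid out side by side
    hbox = wx.BoxSizer(wx.HORIZONTAL)
    self.search_text = wx.TextCtrl(panel)
    search_button = wx.Button(panel, label='Search')
    search_button.Bind(wx.EVT_BUTTON, self.on_search)
    hbox.Add(self.search_text, 1, wx.EXPAND|wx.ALL, 5)
    hbox.Add(search_button, 0, wx.ALL, 5)
    vbox.Add(hbox, 0, wx.EXPAND|wx.ALL, 10)
    # Display box (read-only, multi-line) that takes the remaining space
    self.display_text = wx.TextCtrl(panel, style=wx.TE_MULTILINE|wx.TE_READONLY)
    vbox.Add(self.display_text, 1, wx.EXPAND|wx.ALL, 10)
    panel.SetSizer(vbox)
    self.Show()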

In the above code, we create a window class named PDFSearchFrame, which inherits from wxPython's wx.Frame class. In the constructor of this class, we create the components of the interface: the file picker, the search input box and Search button, and the display box.

PDF content search and extraction

Next, we add the PDF content search and extraction functionality to the code. We use the PyMuPDF module to work with PDF files.

# Import the required modules
import wx
import fitz  # PyMuPDF

def on_search(self, event):
    search_text = self.search_text.GetValue()
    if not search_text or not self.pdf_path:
        return
    # Open the selected PDF file
    doc = fitz.open(self.pdf_path)
    matches = []
    for page in doc:
        text = page.get_text()
        # Compare in lowercase so the search is case-insensitive
        if search_text.lower() in text.lower():
            # page.number is 0-based; add 1 for display
            matches.append((page.number + 1, text))
    self.display_text.SetValue('')
    if matches:
        for page_num, text in matches:
            self.display_text.AppendText(f"Page {page_num}:\n{text}\n\n")
    else:
        self.display_text.AppendText("No match found.")
    doc.close()

In the above code, the on_search method implements the PDF content search and extraction. First, we call fitz.open() to open the selected PDF file and iterate through its pages. For each page, we convert the text to lowercase and check whether the lowercase search text occurs in it, which makes the search case-insensitive. Each matching page is stored in the matches list. Finally, we show the matches in the display box; if there are none, we show a "No match found." message.
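
The matching step can also be written separately from the GUI, which makes it easy to try out on plain strings. The following is a minimal sketch under that assumption, not the article's own code: find_matching_pages and search_pdf are hypothetical helper names, and page numbers are reported starting from 1. PyMuPDF is imported only inside search_pdf, so the matching function itself runs without it.

```python
def find_matching_pages(page_texts, query):
    """Return (1-based page number, text) for every page whose text
    contains `query`, ignoring case. An empty query matches nothing."""
    needle = query.lower()
    if not needle:
        return []
    return [
        (number, text)
        for number, text in enumerate(page_texts, start=1)
        if needle in text.lower()
    ]


def search_pdf(path, query):
    """Apply find_matching_pages to the pages of the PDF at `path`."""
    import fitz  # PyMuPDF, imported lazily so the helper above has no dependency

    # The document is closed automatically when the with block ends.
    with fitz.open(path) as doc:
        return find_matching_pages((page.get_text() for page in doc), query)


if __name__ == "__main__":
    sample_pages = ["Hello World", "nothing here", "say HELLO again"]
    for number, text in find_matching_pages(sample_pages, "hello"):
        print(f"Page {number}: {text}")
```

With this split, on_search would only need to call search_pdf and write the returned pairs into the display box.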

Complete code

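import wx
import fitz  # PyMuPDF

class PDFSearchFrame(wx.Frame):
    def __init__(self, parent, title):
        super(PDFSearchFrame, self).__init__(parent, title=title, size=(800, 600))
        # No file is selected until the user picks one
        self.pdf_path = None
        panel = wx.Panel(self)
        vbox = wx.BoxSizer(wx.VERTICAL)
        # Select File button
        file_picker = wx.FilePickerCtrl(panel, style=wx.FLP_OPEN|wx.FLP_FILE_MUST_EXIST)
        file_picker.Bind(wx.EVT_FILEPICKER_CHANGED, self.on_file_selected)
        vbox.Add(file_picker, 0, wx.EXPAND|wx.ALL, 10)
        # Input box and Search button, laid out side by side
        hbox = wx.BoxSizer(wx.HORIZONTAL)
        self.search_text = wx.TextCtrl(panel)
        search_button = wx.Button(panel, label='Search')
        search_button.Bind(wx.EVT_BUTTON, self.on_search)
        hbox.Add(self.search_text, 1, wx.EXPAND|wx.ALL, 5)
        hbox.Add(search_button, 0, wx.ALL, 5)
        vbox.Add(hbox, 0, wx.EXPAND|wx.ALL, 10)
        # Display box (read-only, multi-line) that takes the remaining space
        self.display_text = wx.TextCtrl(panel, style=wx.TE_MULTILINE|wx.TE_READONLY)
        vbox.Add(self.display_text, 1, wx.EXPAND|wx.ALL, 10)
        panel.SetSizer(vbox)
        self.Show()
    def on_file_selected(self, event):
        # Remember the path chosen in the file picker
        self.pdf_path = event.GetPath()
    def on_search(self, event):
        search_text = self.search_text.GetValue()
        if not search_text or not self.pdf_path:
            return
        # Open the selected PDF file
        doc = fitz.open(self.pdf_path)
        matches = []
        for page in doc:
            text = page.get_text()
            # Compare in lowercase so the search is case-insensitive
            if search_text.lower() in text.lower():
                # page.number is 0-based; add 1 for display
                matches.append((page.number + 1, text))
        self.display_text.SetValue('')
        if matches:
            for page_num, text in matches:
                self.display_text.AppendText(f"Page {page_num}:\n{text}\n\n")
        else:
            self.display_text.AppendText("No match found.")
        doc.close()
if __name__ == '__main__':
    app = wx.App()
    PDFSearchFrame(None, title="PDF Search")
    app.MainLoop()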

Running the program

After completing the above steps, save and run the program. A window titled "PDF Search" will open. Select the PDF file to search, enter the text to find, and click the Search button. The program lists the matches in the display box, showing the page number and the text of each matching page.
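
Showing the full text of every matching page can get long for dense PDFs. As one possible refinement (not part of the original program), the sketch below returns only a short excerpt around the first match. The function name snippet and its width parameter are hypothetical, chosen for illustration.

```python
def snippet(text, query, width=40):
    """Return the text around the first case-insensitive match of `query`,
    with up to `width` characters of context on each side, or None if
    there is no match.

    Note: this assumes that lowercasing does not change the length of the
    text, which holds for ASCII but not for every Unicode character.
    """
    if not query:
        return None
    pos = text.lower().find(query.lower())
    if pos < 0:
        return None
    start = max(0, pos - width)
    end = min(len(text), pos + len(query) + width)
    # Mark the excerpt as truncated where text was cut off.
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end] + suffix


if __name__ == "__main__":
    page_text = "This report covers the annual budget and the hiring plan."
    print(snippet(page_text, "BUDGET", width=10))
```

In on_search, the excerpt could be appended to the display box in place of the full page text.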

Conclusion

This article described how to implement a simple PDF content search tool with Python, PyMuPDF, and wxPython. The wxPython interface lets the user select a PDF file and enter the text to find; PyMuPDF then searches each page, and the text of the matching pages is shown in the display box. The tool helps to quickly locate and extract specific content in PDF files.
