Converting PDF to Word with Python, explained

PDF documents are mostly read-only, so when a PDF needs editing it can usually be converted directly to a Word document and edited there.

I looked at articles online about converting PDF files to Word with Python and found them rather complicated; some also need special handling for charts.

This article explains how to implement the PDF-to-Word conversion process in Python, this time without a GUI application.

Because version conflicts are possible, the versions of the non-standard Python libraries used during development are listed here.

Python version: 3.6.8
PyMuPDF version: 1.18.17
pdf2docx version: 0.5.1

You can install the non-standard Python libraries with pip.

pip install PyMuPDF==1.18.17

pip install pdf2docx==0.5.1

After installing the dependencies above, import pdf2docx into the code.

# Importing the Converter class from the pdf2docx module.
from pdf2docx import Converter

Next, write the business function: a new pdfToWord function that handles the conversion logic. It is fairly simple and takes only a few lines of code.

def pdfToWord(pdf_file_path=None, word_file_path=None):
    """
    It takes a pdf file path and a word file path as input, and converts the pdf file to a word file.

    :param pdf_file_path: The path to the PDF file you want to convert
    :param word_file_path: The path to the word file that you want to create
    """
    # Creating a Converter object.
    converter_ = Converter(pdf_file_path)
    # The `convert` method takes the path to the word file that you want to create, and the start and end pages of the PDF
    # file that you want to convert.
    converter_.convert(word_file_path, start=0, end=None)
    converter_.close()
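One way to make the function above a little more defensive is to check the input path first and to close the converter even if the conversion fails. The following is a minimal sketch, not the article's original code: `docx_path_for` and `convert_safely` are hypothetical helper names, and the `Converter` import is deferred into the function so that the path helper can be used on its own.

```python
from pathlib import Path


def docx_path_for(pdf_file_path):
    """Return the .docx path next to the given PDF (hypothetical helper)."""
    return str(Path(pdf_file_path).with_suffix('.docx'))


def convert_safely(pdf_file_path, word_file_path=None):
    """Convert a PDF to Word, closing the converter even if an error occurs."""
    pdf_path = Path(pdf_file_path)
    if pdf_path.suffix.lower() != '.pdf' or not pdf_path.is_file():
        raise ValueError('Not an existing PDF file: {}'.format(pdf_file_path))
    word_file_path = word_file_path or docx_path_for(pdf_file_path)

    # Imported here so that docx_path_for works without pdf2docx installed.
    from pdf2docx import Converter

    converter_ = Converter(str(pdf_path))
    try:
        converter_.convert(word_file_path, start=0, end=None)
    finally:
        # Runs whether or not convert() raised an exception.
        converter_.close()
    return word_file_path
```

If no output path is given, the Word file is written next to the PDF with the same base name.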

Finally, call the pdfToWord function from the main block to complete the format conversion directly.

# A special variable in Python that evaluates to `True` if the module is being run directly by the Python interpreter, and
# `False` if it has been imported by another module.
if __name__ == '__main__':
    pdfToWord('D:/test-data-work/test_pdf.pdf', 'D:/test-data-work/test_pdf.docx')

# Console output from the run (excerpt):
# Parsing Page 2: 2/5...Ignore Line "∑" due to overlap
# Ignore Line "∑" due to overlap
# Ignore Line "ç" due to overlap
# Ignore Line "Ａ" due to overlap
# Ignore Line "ｉ ＝１" due to overlap
# Ignore Line "æ" due to overlap
# Parsing Page 5: 5/5...
# Creating Page 5: 5/5...
# --------------------------------------------------
# Terminated in 3.2503201s.
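The `start` and `end` arguments passed to `convert` in pdfToWord select which pages are processed. As a hedged sketch, the helper below (`page_window` is a hypothetical name) turns 1-based page numbers, as a reader would count them, into those keyword arguments. It assumes that pdf2docx treats `start` and `end` as zero-based indices with `end` excluded, like a Python slice; check this against the version you have installed.

```python
def page_window(first_page=1, last_page=None):
    """Map 1-based, inclusive page numbers to start/end keyword arguments.

    Assumption: convert() uses zero-based indices and excludes `end`.
    """
    if first_page < 1:
        raise ValueError('first_page must be 1 or greater')
    if last_page is not None and last_page < first_page:
        raise ValueError('last_page must not be before first_page')
    # A 1-based inclusive last page equals a zero-based exclusive end index.
    return {'start': first_page - 1, 'end': last_page}


# Pages 2 to 5 as a reader counts them.
kwargs = page_window(2, 5)
print(kwargs)
```

Under that assumption, the result could be passed on as `converter_.convert(word_file_path, **page_window(2, 5))`.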

Additional methods

In addition to the method above, here are some other methods for readers who need them.

Method I:

import os

from pdf2docx import Converter
import PySimpleGUI as sg


def pdf2word(file_path):
    # Swap the extension for .docx; splitext copes with dots elsewhere in the path
    file_name = os.path.splitext(file_path)[0]
    doc_file = f'{file_name}.docx'
    p2w = Converter(file_path)
    p2w.convert(doc_file, start=0, end=None)
    p2w.close()
    return doc_file


def main():
    # Select a theme
    sg.theme('DarkAmber')

    layout = [
        [sg.Text('pdfToword', font=('Microsoft YaHei', 12)),
         sg.Text('', key='filename', size=(50, 1), font=('Microsoft YaHei', 10))],
        [sg.Output(size=(80, 10), font=('Microsoft YaHei', 10))],
        [sg.FilesBrowse('Select file', key='file', target='filename'), sg.Button('Start conversion'), sg.Button('Exit')]]
    # Create the window
    window = sg.Window("Zhang Crouching Tiger", layout, font=("Microsoft YaHei", 15), default_element_size=(50, 1))
    # Event loop
    while True:
        # window.read() returns two values: the event and the values dict
        event, values = window.read()
        print(event, values)

        # The event name must match the button label defined in the layout
        if event == 'Start conversion':
            # Several selected files arrive as one string separated by ';'
            files = values['file'].split(';') if values['file'] else []
            if files and all(f.lower().endswith('.pdf') for f in files):
                print('Number of documents: {}'.format(len(files)))
                for f in files:
                    filename = pdf2word(f)
                    print('\n' + 'Conversion successful!' + '\n')
                    print('File save location:', filename)
            else:
                print('Please choose files in PDF format!')
        if event in (None, 'Exit'):
            break

    window.close()


if __name__ == '__main__':
    main()
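The file-selection check in the event loop can be pulled out into a small function that is easy to test without opening a window. This is a sketch under the assumption that the browse button returns several paths joined by `;`, as the code above expects; `split_pdf_selection` is a hypothetical name.

```python
import os


def split_pdf_selection(selection, delimiter=';'):
    """Split a browse-button value into (pdf_paths, rejected_paths)."""
    pdf_paths, rejected = [], []
    for path in (selection or '').split(delimiter):
        path = path.strip()
        if not path:
            continue  # skip empty pieces, e.g. when nothing was selected
        # Compare the real extension, so 'my.report.PDF' is handled correctly.
        ext = os.path.splitext(path)[1].lower()
        if ext == '.pdf':
            pdf_paths.append(path)
        else:
            rejected.append(path)
    return pdf_paths, rejected


pdfs, others = split_pdf_selection('C:/docs/a.pdf;C:/docs/my.report.PDF;C:/docs/notes.txt')
print(pdfs)    # ['C:/docs/a.pdf', 'C:/docs/my.report.PDF']
print(others)  # ['C:/docs/notes.txt']
```

Returning the rejected paths separately lets the GUI report exactly which files were skipped instead of refusing the whole selection.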

Method II:

Encrypted PDF to word

#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys
import importlib
importlib.reload(sys)
# The module paths below follow the older pdfminer3k package layout
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import *
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed
import os
# Set the working directory
os.chdir(r'c:/users/dicey/desktop/codes/pdf-docx')
# Parse a PDF file
def parse(pdf_path):
    fp = open(pdf_path, 'rb')  # Open in binary read mode
    # Use the file object to create a PDF parser
    parser = PDFParser(fp)
    # Create a PDF document object
    doc = PDFDocument()
    # Connect the parser and the document object to each other
    parser.set_document(doc)
    doc.set_parser(parser)
    # Supply the initial password
    # With no argument, an empty password is used
    doc.initialize()
    # Check whether the document allows text extraction; stop if it does not
    if not doc.is_extractable:
        raise PDFTextExtractionNotAllowed
    else:
        # Create a PDF resource manager to manage shared resources
        rsrcmgr = PDFResourceManager()
        # Create a PDF device object
        laparams = LAParams()
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        # Create a PDF interpreter object
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # Counters for pages, images, curves, figures and horizontal text boxes
        num_page, num_image, num_curve, num_figure, num_TextBoxHorizontal = 0, 0, 0, 0, 0
        # Loop over the pages, processing one page at a time
        for page in doc.get_pages():  # doc.get_pages() returns the pages
            num_page += 1  # Count the page
            interpreter.process_page(page)
            # Receive the LTPage object for this page
            layout = device.get_result()
            for x in layout:
                if isinstance(x, LTImage):  # Image objects
                    num_image += 1
                if isinstance(x, LTCurve):  # Curve objects
                    num_curve += 1
                if isinstance(x, LTFigure):  # Figure objects
                    num_figure += 1
                if isinstance(x, LTTextBoxHorizontal):  # Text content
                    num_TextBoxHorizontal += 1  # Count the horizontal text box
                    # Append the text content to the output file
                    with open(r'', 'a', encoding='utf-8') as f:  # Path and file name of the output doc file
                        results = x.get_text()
                        f.write(results)
                        f.write('\n')
        print('Number of objects: \n', 'Number of pages: %s\n' % num_page, 'Number of pictures: %s\n' % num_image,
              'Number of curves: %s\n' % num_curve, 'Horizontal text box: %s\n' % num_TextBoxHorizontal)

if __name__ == '__main__':
    pdf_path = r''  # PDF file path and file name
    parse(pdf_path)
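The separate per-type counters in the loop above can also be kept in a single `collections.Counter`. The sketch below shows only the tallying idea: it uses stand-in classes instead of the real pdfminer layout types so that it runs on its own, and `tally_layout_objects` is a hypothetical name.

```python
from collections import Counter


def tally_layout_objects(layout, kinds):
    """Count how many items in `layout` are instances of each class in `kinds`."""
    counts = Counter()
    for item in layout:
        for kind in kinds:
            # isinstance also matches subclasses, as in the loop above.
            if isinstance(item, kind):
                counts[kind.__name__] += 1
    return counts


# Stand-ins for layout classes, for illustration only.
class FakeImage:
    pass


class FakeCurve:
    pass


class FakeTextBox:
    pass


page = [FakeTextBox(), FakeImage(), FakeTextBox(), FakeCurve()]
counts = tally_layout_objects(page, (FakeImage, FakeCurve, FakeTextBox))
print(counts['FakeTextBox'])  # 2
```

With the real library, the same function would be called with the layout of each page and the layout classes of interest.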

That concludes this explanation of how to convert PDF to Word with Python.