PDF documents are mostly read-only, so when a PDF needs editing, a common approach is to convert it to a Word document and edit that instead.
The articles I found online about converting PDF to Word with Python seemed overly complex, and some required special handling for charts.
This article explains how to convert PDF to Word with Python, this time without a GUI application.
Because version conflicts are possible, the versions of the third-party Python libraries used during development are listed here.
- Python version: 3.6.8
- PyMuPDF version: 1.18.17
- pdf2docx version: 0.5.1
You can install the third-party libraries with pip.

    pip install PyMuPDF==1.18.17
    pip install pdf2docx==0.5.1
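Since the pinned versions matter here, it can help to confirm what is actually installed before running the conversion. The following is a minimal sketch and not part of the original workflow: the `PINNED` table and the `installed_version` and `check_pins` helpers are hypothetical names introduced for illustration. It uses `importlib.metadata` where available (Python 3.8+) and falls back to `pkg_resources` from setuptools on older interpreters such as 3.6.

```python
# Hypothetical helper: compare installed package versions against the pins
# listed in this article. All names below are illustrative only.
try:
    from importlib import metadata as _metadata  # available on Python 3.8+
except ImportError:  # older interpreters such as 3.6
    _metadata = None

PINNED = {"PyMuPDF": "1.18.17", "pdf2docx": "0.5.1"}


def installed_version(name):
    """Return the installed version string of a distribution, or None if absent."""
    if _metadata is not None:
        try:
            return _metadata.version(name)
        except _metadata.PackageNotFoundError:
            return None
    try:
        import pkg_resources  # ships with setuptools
        return pkg_resources.get_distribution(name).version
    except Exception:
        return None


def check_pins(pinned=PINNED):
    """Map each package name to a (pinned, installed) tuple."""
    return {name: (wanted, installed_version(name)) for name, wanted in pinned.items()}


if __name__ == "__main__":
    for name, (wanted, found) in check_pins().items():
        status = "OK" if wanted == found else "MISMATCH"
        print(f"{name}: pinned {wanted}, installed {found} -> {status}")
```

A `MISMATCH` line does not necessarily mean the conversion will fail; it only flags that the environment differs from the one used in this article.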
After installing the dependencies above, import pdf2docx into the code.

    # Importing the Converter class from the pdf2docx module.
    from pdf2docx import Converter
Next, write the business logic: a new pdfToWord function that handles the conversion. It takes only a few lines of code.

    def pdfToWord(pdf_file_path=None, word_file_path=None):
        """
        It takes a pdf file path and a word file path as input, and converts the pdf file to a word file.

        :param pdf_file_path: The path to the PDF file you want to convert
        :param word_file_path: The path to the word file that you want to create
        """
        # Creating a Converter object.
        converter_ = Converter(pdf_file_path)
        # The `convert` method takes the path to the word file that you want to create, and the start and end pages of the PDF
        # file that you want to convert.
        converter_.convert(word_file_path, start=0, end=None)
        converter_.close()
Finally, call pdfToWord from the main block to perform the format conversion.

    # `__name__ == '__main__'` is `True` when the module is run directly by the Python interpreter, and
    # `False` when it has been imported by another module.
    if __name__ == '__main__':
        pdfToWord('D:/test-data-work/test_pdf.pdf', 'D:/test-data-work/test_pdf.docx')

    # Parsing Page 2: 2/5...Ignore Line "∑" due to overlap
    # Ignore Line "∑" due to overlap
    # Ignore Line "ç" due to overlap
    # Ignore Line "A" due to overlap
    # Ignore Line "i =1" due to overlap
    # Ignore Line "æ" due to overlap
    # Parsing Page 5: 5/5...
    # Creating Page 5: 5/5...
    # --------------------------------------------------
    # Terminated in 3.2503201s.
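The function above leaves the converter open if `convert` raises, and the caller has to spell out the output path each time. A slightly more defensive variant might look like the sketch below. The names `default_docx_path` and `convert_pdf` are hypothetical; only `Converter`, `convert(..., start=, end=)` and `close()` are taken from the code above. pdf2docx is imported inside the function so that the path helper can be used on its own.

```python
from pathlib import Path


def default_docx_path(pdf_path):
    """Return the PDF path with its extension swapped for .docx."""
    path = Path(pdf_path)
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"not a PDF file: {pdf_path}")
    return str(path.with_suffix(".docx"))


def convert_pdf(pdf_path, docx_path=None, start=0, end=None):
    """Convert a PDF to Word and close the converter even if conversion fails."""
    # Imported lazily; assumes pdf2docx is installed as described in this article.
    from pdf2docx import Converter

    docx_path = docx_path or default_docx_path(pdf_path)
    converter = Converter(pdf_path)
    try:
        converter.convert(docx_path, start=start, end=end)
    finally:
        converter.close()
    return docx_path
```

With this sketch, `convert_pdf('D:/test-data-work/test_pdf.pdf')` would write the .docx file next to the source PDF and return its path.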
Additional methods
Besides the method above, here are two other approaches for readers who need them.
Method I:

    from pdf2docx import Converter
    import PySimpleGUI as sg


    def pdf2word(file_path):
        # Build the output name by swapping the extension for .docx
        file_name = file_path.split('.')[0]
        doc_file = f'{file_name}.docx'
        p2w = Converter(file_path)
        p2w.convert(doc_file, start=0, end=None)
        p2w.close()
        return doc_file


    def main():
        # Select a theme
        sg.theme('DarkAmber')
        layout = [
            [sg.Text('pdfToword', font=('Microsoft YaHei', 12)),
             sg.Text('', key='filename', size=(50, 1), font=('Microsoft YaHei', 10))],
            [sg.Output(size=(80, 10), font=('Microsoft YaHei', 10))],
            [sg.FilesBrowse('Select file', key='file', target='filename'),
             sg.Button('Start conversion'), sg.Button('Exit')]]
        # Create the window
        window = sg.Window("Zhang Crouching Tiger", layout, font=("Microsoft YaHei", 15), default_element_size=(50, 1))
        # Event loop
        while True:
            # window.read() returns two values: (1) the event, (2) the values
            event, values = window.read()
            print(event, values)
            if event == 'Start conversion':
                if values['file'] and values['file'].split('.')[1] == 'pdf':
                    # A single file was selected
                    filename = pdf2word(values['file'])
                    print('Number of documents: 1')
                    print('\n' + 'Conversion successful!' + '\n')
                    print('File save location:', filename)
                elif values['file'] and values['file'].split(';')[0].split('.')[1] == 'pdf':
                    # Several files were selected, separated by ';'
                    print('Number of documents: {}'.format(len(values['file'].split(';'))))
                    for f in values['file'].split(';'):
                        filename = pdf2word(f)
                        print('\n' + 'Conversion successful!' + '\n')
                        print('File save location:', filename)
                else:
                    print('Please choose a file in PDF format!')
            if event in (None, 'Exit'):
                break
        window.close()


    main()
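The event loop in Method I decides whether a selection is a PDF by splitting on the first dot, which misreads paths whose directory names contain a dot and is case-sensitive. The sketch below shows one way the same check could be isolated into small functions. The names `split_selection`, `is_pdf` and `validate_selection` are hypothetical, and the sketch assumes, as the code in Method I implies, that multiple selected paths arrive as one string joined by `;`.

```python
import os


def split_selection(raw):
    """Split a ';'-separated selection string into a list of non-empty paths."""
    return [p.strip() for p in (raw or "").split(";") if p.strip()]


def is_pdf(path):
    """True when the file extension is .pdf, ignoring case."""
    return os.path.splitext(path)[1].lower() == ".pdf"


def validate_selection(raw):
    """Return the selected paths if every one is a PDF, otherwise an empty list."""
    paths = split_selection(raw)
    return paths if paths and all(is_pdf(p) for p in paths) else []
```

Inside the event loop, `validate_selection(values['file'])` could then replace the two extension checks: a non-empty result is the list of files to convert, and an empty result means the selection should be rejected.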
Method II:
Encrypted PDF to Word

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    import os

    # The module names were lost in the source; the paths below assume the
    # legacy pdfminer3k-style API, which matches the calls used in parse().
    from pdfminer.pdfparser import PDFParser, PDFDocument
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import *
    from pdfminer.pdfinterp import PDFTextExtractionNotAllowed

    # Set the working directory
    os.chdir(r'c:/users/dicey/desktop/codes/pdf-docx')


    # Parse a PDF file
    def parse(pdf_path):
        fp = open(pdf_path, 'rb')  # Open in binary read mode
        # Use the file object to create a PDF parser
        parser = PDFParser(fp)
        # Create a PDF document
        doc = PDFDocument()
        # Connect the parser and the document object
        parser.set_document(doc)
        doc.set_parser(parser)
        # Provide the initialization password;
        # with no argument, an empty password is used
        doc.initialize()
        # Check whether the document allows text extraction; abort if it does not
        if not doc.is_extractable:
            raise PDFTextExtractionNotAllowed
        else:
            # Create a PDF resource manager to manage shared resources
            rsrcmgr = PDFResourceManager()
            # Create a PDF device object
            laparams = LAParams()
            device = PDFPageAggregator(rsrcmgr, laparams=laparams)
            # Create a PDF interpreter object
            interpreter = PDFPageInterpreter(rsrcmgr, device)
            # Counters for pages, images, curves, figures and horizontal text boxes
            num_page, num_image, num_curve, num_figure, num_TextBoxHorizontal = 0, 0, 0, 0, 0
            # Process one page at a time
            for page in doc.get_pages():  # doc.get_pages() returns the pages
                num_page += 1
                interpreter.process_page(page)
                # Receive the LTPage object for this page
                layout = device.get_result()
                for x in layout:
                    if isinstance(x, LTImage):  # Image object
                        num_image += 1
                    if isinstance(x, LTCurve):  # Curve object
                        num_curve += 1
                    if isinstance(x, LTFigure):  # Figure object
                        num_figure += 1
                    if isinstance(x, LTTextBoxHorizontal):  # Horizontal text box
                        num_TextBoxHorizontal += 1
                        # Append the text content to the output file
                        # (path of the generated doc file; left empty in the source)
                        with open(r'', 'a', encoding='utf-8') as f:
                            results = x.get_text()
                            f.write(results)
                            f.write('\n')
            print('Number of objects: \n', 'Number of pages: %s\n' % num_page,
                  'Number of pictures: %s\n' % num_image, 'Number of curves: %s\n' % num_curve,
                  'Horizontal text box: %s\n' % num_TextBoxHorizontal)


    if __name__ == '__main__':
        pdf_path = r''  # PDF file path and file name
        parse(pdf_path)
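Method II walks the page layout by hand and appends each text box to an output file. If only the text is needed, a shorter route is possible with a different library. The sketch below swaps in the `extract_text` function from `pdfminer.high_level`, which belongs to pdfminer.six and is not the API used in Method II. The function name `pdf_text_to_file` and its `extractor` parameter are hypothetical; the parameter only exists so the file-writing step can be exercised without a real PDF.

```python
def pdf_text_to_file(pdf_path, out_path, password="", extractor=None):
    """Extract text from a (possibly password-protected) PDF and write it to out_path.

    Returns the number of characters written.
    """
    if extractor is None:
        # Assumes pdfminer.six is installed; this is a different package
        # from the legacy API used in Method II.
        from pdfminer.high_level import extract_text
        extractor = extract_text
    text = extractor(pdf_path, password=password)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(text)
    return len(text)
```

Like Method II, this produces plain text only: layout, images and tables are not carried over, so the pdf2docx approach remains the better fit when the Word document should resemble the original.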
That concludes this walkthrough of converting PDF to Word with Python. For more on the topic, see my earlier articles.