This article shares an example of converting PDF files to Word/TXT with Python, for your reference. The details are as follows.
Dependency package: pdfminer3k
It can be installed via pip (pip install pdfminer3k). Alternatively, download it from the official website, unzip it, change into the extracted folder, and run the install command there.
Source Code:
#!/usr/bin/python
# -*- coding: utf-8 -*-

from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTImage, LTCurve, LTFigure, LTTextBoxHorizontal

# Parse a PDF file and collect the various objects it contains.


def parse(pdf_path):
    # Open in binary read mode; the file must stay open while pages are processed
    with open(pdf_path, 'rb') as fp:
        # Use the file object to create a PDF document parser
        parser = PDFParser(fp)
        # Create a PDF document object
        doc = PDFDocument()
        # Connect the parser and the document object to each other
        parser.set_document(doc)
        doc.set_parser(parser)

        # Provide the initial password; without one, an empty string is used
        doc.initialize()

        # Stop if the document does not allow text extraction
        if not doc.is_extractable:
            raise PDFTextExtractionNotAllowed

        # Create a PDF resource manager to manage shared resources
        rsrcmgr = PDFResourceManager()
        # Create a PDF device object
        laparams = LAParams()
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        # Create a PDF interpreter object
        interpreter = PDFPageInterpreter(rsrcmgr, device)

        # Counters for pages, images, curves, figures and horizontal text boxes
        num_page, num_image, num_curve, num_figure, num_TextBoxHorizontal = 0, 0, 0, 0, 0

        # Process one page at a time; doc.get_pages() yields the pages
        for page in doc.get_pages():
            num_page += 1
            interpreter.process_page(page)
            # Receive the LTPage object for this page
            layout = device.get_result()
            for x in layout:
                if isinstance(x, LTImage):  # image object
                    num_image += 1
                if isinstance(x, LTCurve):  # curve object
                    num_curve += 1
                if isinstance(x, LTFigure):  # figure object
                    num_figure += 1
                if isinstance(x, LTTextBoxHorizontal):  # horizontal text box
                    num_TextBoxHorizontal += 1
                    # Append the text content to the output file.
                    # r'' is the output file name and path; set it before running.
                    with open(r'', 'a', encoding='utf-8') as f:
                        results = x.get_text()
                        f.write(results)
                        f.write('\n')

        print('Number of objects: \n',
              'Number of pages: %s\n' % num_page,
              'Number of pictures: %s\n' % num_image,
              'Number of curves: %s\n' % num_curve,
              'Number of figures: %s\n' % num_figure,
              'Horizontal text box: %s\n' % num_TextBoxHorizontal)


if __name__ == '__main__':
    pdf_path = r''  # PDF file path and file name; set it before running
    parse(pdf_path)
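The per-type tallies in the parsing loop can also be kept in a single counter keyed by class name. The following is only a minimal sketch, not part of the original script: count_layout_objects is a hypothetical helper, and built-in types stand in for the pdfminer layout classes so that the example runs without a PDF. In the script itself one would presumably pass (LTImage, LTCurve, LTFigure, LTTextBoxHorizontal) as kinds.

```python
from collections import Counter


def count_layout_objects(layout, kinds):
    """Tally how many items in `layout` are instances of each class in `kinds`.

    An item is counted once for every class it is an instance of, which
    mirrors a chain of independent `if isinstance(...)` checks.
    (Hypothetical helper, not part of pdfminer3k.)
    """
    # Start every class at zero so that absent types still appear in the result
    counts = Counter({kind.__name__: 0 for kind in kinds})
    for item in layout:
        for kind in kinds:
            if isinstance(item, kind):
                counts[kind.__name__] += 1
    return counts


if __name__ == "__main__":
    # Built-in types stand in for the pdfminer layout classes here
    sample_page = ["a text box", 3, 4.5, "another text box"]
    for name, number in count_layout_objects(sample_page, (str, int, float)).items():
        print("%s: %s" % (name, number))
```

Adding another object type then only means extending the kinds tuple rather than introducing a new counter variable.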
This script can only convert PDF files to plain text without any formatting; the output is plain text even if it is saved with a .doc file name.
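Since the result is plain text, a .txt file name describes it accurately. The sketch below shows one possible way to derive the output path from the PDF path and to write all extracted text while opening the file only once. Both text_output_path and save_text are hypothetical helpers that do not appear in the original script, and the sample strings stand in for the values returned by get_text(). Unlike the script, save_text strips trailing newlines from each chunk and overwrites the output file instead of appending to it.

```python
import tempfile
from pathlib import Path


def text_output_path(pdf_path):
    """Return the PDF path with its extension replaced by .txt."""
    return Path(pdf_path).with_suffix(".txt")


def save_text(chunks, out_path):
    """Write each text chunk on its own line, opening the file only once.

    Trailing newlines are stripped from each chunk first, and any existing
    file at out_path is overwritten.
    """
    with open(out_path, "w", encoding="utf-8") as f:
        for chunk in chunks:
            f.write(chunk.rstrip("\n") + "\n")


if __name__ == "__main__":
    # Demo in a temporary folder; the strings stand in for extracted text boxes
    with tempfile.TemporaryDirectory() as folder:
        out_path = text_output_path(Path(folder) / "example.pdf")
        save_text(["First text box\n", "Second text box"], out_path)
        print(out_path.read_text(encoding="utf-8"))
```

Collecting the text first and writing it in one pass avoids reopening the output file for every text box.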