SoFunction
Updated on 2024-11-19

Converting PDF to Word/TXT plain-text files with Python

This article shares a Python example that converts a PDF file to a Word/TXT file, for your reference. The details are as follows.

Dependency package: pdfminer3k

It can be installed via pip. Alternatively, download it from the official website, unzip it, change into the folder, and run the install command.
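Before running the script, it may help to confirm that the dependency is importable. The following is a minimal sketch using only the standard library. `is_installed` is a hypothetical helper name introduced here, and the sketch assumes that pdfminer3k is imported under the top-level module name `pdfminer`.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == '__main__':
    # Assumption: pdfminer3k installs a top-level module named "pdfminer".
    if is_installed('pdfminer'):
        print('pdfminer is available')
    else:
        print('pdfminer is missing; install pdfminer3k first')
```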

Source Code:

#!/usr/bin/python 
# -*- coding: utf-8 -*- 
 
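import sys
import importlib
# Leftover from Python 2 encoding workarounds; not required on Python 3
importlib.reload(sys)

from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTImage, LTCurve, LTFigure, LTTextBoxHorizontal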
 
'''
Parse a PDF file and get the various objects it contains
'''
 
# Parse pdf file function
def parse(pdf_path): 
  fp = open(pdf_path, 'rb') # Open in binary read mode
  # Use the file object to create a pdf document analyzer
  parser = PDFParser(fp) 
  # Create a PDF document object
  doc = PDFDocument()
  # Connect the parser and the document object to each other
  parser.set_document(doc)
  doc.set_parser(parser)

  # Supply the password for initialization
  # If there is no password, the default empty string is used
  doc.initialize()
 
  # Check whether the document allows text extraction; if not, raise an error
  if not doc.is_extractable:
    raise PDFTextExtractionNotAllowed
  else:
    # Create a PDF resource manager to manage shared resources
    rsrcmgr = PDFResourceManager() 
    # Create a PDF device object
    laparams = LAParams() 
    device = PDFPageAggregator(rsrcmgr, laparams=laparams) 
    # Create a PDF interpreter object
    interpreter = PDFPageInterpreter(rsrcmgr, device) 
 
    # Used to count the number of objects such as pages, images, curves, figures, horizontal text boxes, etc.
    num_page, num_image, num_curve, num_figure, num_TextBoxHorizontal = 0, 0, 0, 0, 0 
 
    # Loop over the pages, processing the contents of one page at a time
    for page in doc.get_pages(): # doc.get_pages() returns the pages
      num_page += 1 # Increment the page count
      interpreter.process_page(page)
      # Get the LTPage object for this page
      layout = device.get_result()
      for x in layout: 
        if isinstance(x,LTImage): # Picture objects
          num_image += 1 
        if isinstance(x,LTCurve): # Curve objects
          num_curve += 1 
        if isinstance(x,LTFigure): # figure objects
          num_figure += 1 
        if isinstance(x, LTTextBoxHorizontal): # Horizontal text box objects
          num_TextBoxHorizontal += 1 # Increment the horizontal text box count
          # Append the text content to the output file
          with open(r'', 'a', encoding='utf-8') as f: # Path and file name of the output doc file
            results = x.get_text()
            f.write(results)
            f.write('\n')
    print('Number of objects:')
    print('Number of pages: %s' % num_page)
    print('Number of pictures: %s' % num_image)
    print('Number of curves: %s' % num_curve)
    print('Number of figures: %s' % num_figure)
    print('Horizontal text boxes: %s' % num_TextBoxHorizontal)
 
 
if __name__ == '__main__': 
  pdf_path = r''  # Path and file name of the PDF file
  parse(pdf_path) 

This script can only convert PDF files to plain-text files; no formatting from the PDF is preserved.
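Because the output is only text written box by box, the writing step can be separated from the parsing step and the output file opened once rather than once per text box. The following is a minimal sketch, not the article's method. `write_text_boxes` and `FakeBox` are hypothetical names introduced here, and the sketch assumes only that each box offers a `get_text()` method, as the `LTTextBoxHorizontal` objects in the script do. Unlike the script, it opens the file in `'w'` mode, so an existing file is overwritten instead of appended to.

```python
import os
import tempfile


def write_text_boxes(boxes, out_path):
    """Write the text of each box to out_path and return the box count.

    `boxes` is any iterable of objects with a get_text() method
    (assumed to behave like pdfminer's LTTextBoxHorizontal).
    """
    count = 0
    # Open the output file once instead of once per text box.
    with open(out_path, 'w', encoding='utf-8') as f:
        for box in boxes:
            f.write(box.get_text())
            f.write('\n')
            count += 1
    return count


class FakeBox:
    """Hypothetical stand-in for a text box, used only for this demo."""

    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text


if __name__ == '__main__':
    demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
    boxes = [FakeBox('first line\n'), FakeBox('second line\n')]
    n = write_text_boxes(boxes, demo_path)
    print(n, 'text boxes written to', demo_path)
```

In the script, the text boxes found on each page could be collected into a list and passed to such a helper after the page loop finishes.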

This is the whole content of this article.