SoFunction
Updated on 2024-11-15

Extracting PDF text in Python with the pdfminer library: code examples

Installation of the pdfminer library

Install pdfminer3k on Windows

pip install pdfminer3k

Install pdfminer on Linux

pip install pdfminer
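
After installing, a quick check that the package can be found may save some debugging later. The snippet below is only a minimal sketch: it assumes that both distributions install a top-level package named pdfminer, and the helper name pdfminer_installed is made up for illustration.

```python
import importlib.util


def pdfminer_installed():
    """Return True if a top-level package named `pdfminer` can be located."""
    # find_spec returns None when no such top-level package is found
    return importlib.util.find_spec("pdfminer") is not None


print("pdfminer available:", pdfminer_installed())
```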

Code

# Module paths follow the pdfminer3k package layout
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBoxHorizontal
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed, PDFResourceManager, PDFPageInterpreter


def pdfParse(path):
    """
    Extract text from a PDF file.
    :param path: file path
    :return: list of extracted text blocks, in page order
    """
    with open(path, 'rb') as fp:  # Open in binary read mode
        # Use the file object to create a PDF parser
        parser = PDFParser(fp)
        # Create a PDF document object
        doc = PDFDocument()
        # Connect the parser and the document object to each other
        parser.set_document(doc)
        doc.set_parser(parser)
        # Supply the initial password;
        # with no argument, an empty string is used
        doc.initialize()
        # Check whether the document allows text extraction; if not, stop
        if not doc.is_extractable:
            raise PDFTextExtractionNotAllowed
        # Create a PDF resource manager to manage shared resources
        rsrcmgr = PDFResourceManager()
        # Create a PDF device object (a page aggregator with layout parameters)
        laparams = LAParams()
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        # Create a PDF interpreter object
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # Text blocks collected from all pages
        results = []
        # Process the contents of one page at a time
        for page in doc.get_pages():  # doc.get_pages() iterates over the pages
            interpreter.process_page(page)
            # Get the LTPage object for this page
            layout = device.get_result()
            # layout is an LTPage object holding the objects parsed from this page,
            # including LTTextBox, LTFigure, LTImage, LTTextBoxHorizontal, etc.
            # To get the text, call get_text() on the text box objects.
            for x in layout:
                if isinstance(x, LTTextBoxHorizontal):
                    results.append(x.get_text())
    return results
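
The loop in pdfParse visits each page's layout in turn and picks out the horizontal text boxes. If one string per page is more convenient than individual text blocks, the inner loop could be factored into a small helper along these lines. This is only a sketch: text_of_page is a hypothetical name, and the box type is passed in as a parameter so the helper can be tried without a PDF at hand. In the real loop you would pass LTTextBoxHorizontal.

```python
def text_of_page(layout, box_type):
    """Join the text of every `box_type` element found in one page layout."""
    return "".join(x.get_text() for x in layout if isinstance(x, box_type))


# Stand-in classes used only to demonstrate the helper without pdfminer
class FakeTextBox:
    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text


class FakeFigure:
    pass


fake_layout = [FakeTextBox("Title\n"), FakeFigure(), FakeTextBox("Body text\n")]
print(repr(text_of_page(fake_layout, FakeTextBox)))
```

Inside pdfParse, the call would look like results.append(text_of_page(layout, LTTextBoxHorizontal)), which gives one entry per page.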

The library extracts text by iterating over the PDF one page at a time, so you can also add a page-number check to process only the pages you need.
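
One way to do such a page-number check is to count the pages while iterating. The following is a minimal sketch rather than part of pdfminer: select_pages is a hypothetical helper, page numbers are assumed to start at 1, and it works on any iterable of pages, such as the one returned by doc.get_pages().

```python
def select_pages(pages, wanted):
    """Yield (page_number, page) for the 1-based page numbers in `wanted`."""
    wanted = set(wanted)
    for number, page in enumerate(pages, start=1):
        if number in wanted:
            yield number, page


# Demonstration with plain strings standing in for page objects
demo_pages = ["first page", "second page", "third page"]
for number, page in select_pages(demo_pages, {1, 3}):
    print(number, page)
```

In the extraction loop, this would replace the plain iteration, for example: for number, page in select_pages(doc.get_pages(), {1, 3}): interpreter.process_page(page).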

The PyPDF2 library can also extract text from PDFs, but in my experience its results are not as accurate as pdfminer's.
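
For comparison, a PyPDF2 version might look like the sketch below. It assumes PyPDF2 2.0 or later, where PdfReader and extract_text are available (older releases used different names), and the function name extract_with_pypdf2 is made up for this example. The import sits inside the function so that the definition itself does not require PyPDF2 to be installed.

```python
def extract_with_pypdf2(path):
    """Return a list with the extracted text of each page, using PyPDF2."""
    # Imported here so that PyPDF2 is only needed when the function is called
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    # extract_text() returns the text PyPDF2 could recover from one page
    return [page.extract_text() for page in reader.pages]
```

Running both extractors on the same file and reading the output side by side is a simple way to judge which one handles your documents better.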

That is all for this article. I hope it helps with your learning.