Installation of the pdfminer library
Install pdfminer3k on Windows
pip install pdfminer3k
Install pdfminer on Linux
pip install pdfminer
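After installing, it can be handy to confirm that the package is importable before running the extraction code. The following is only a small sketch: the helper name pdfminer_installed is made up for illustration, and it relies on the fact that both distributions install a top-level package called pdfminer.

```python
import importlib.util


def pdfminer_installed():
    """Return True if a top-level `pdfminer` package can be found.

    Hypothetical helper for illustration; it only checks that the package
    is discoverable, not which distribution or version provides it.
    """
    return importlib.util.find_spec("pdfminer") is not None


if __name__ == "__main__":
    print("pdfminer available:", pdfminer_installed())
```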
Code
# These imports follow the pdfminer3k module layout
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBoxHorizontal
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed, PDFResourceManager, PDFPageInterpreter


def pdfParse(path):
    """
    pdf text extraction
    :param path: file path
    :return: list of text blocks, in page order
    """
    # Open in binary read mode; the file is closed when the block exits
    with open(path, 'rb') as fp:
        # Use the file object to create a PDF document parser
        parser = PDFParser(fp)
        # Create a PDF document
        doc = PDFDocument()
        # Connect the parser and the document object to each other
        parser.set_document(doc)
        doc.set_parser(parser)
        # Provide the initialization password
        # If there is no password, the default empty string is used
        doc.initialize()
        # Check whether the document allows text extraction; raise if it does not
        if not doc.is_extractable:
            raise PDFTextExtractionNotAllowed
        # Create a PDF resource manager to manage shared resources
        rsrcmgr = PDFResourceManager()
        # Create a PDF device object
        laparams = LAParams()
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        # Create a PDF interpreter object
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # Text blocks collected from every page
        results = []
        # Loop over the pages, processing the contents of one page at a time
        for page in doc.get_pages():  # doc.get_pages() returns the pages
            interpreter.process_page(page)
            # Get the LTPage object for this page
            layout = device.get_result()
            # layout is an LTPage object holding the objects parsed from this
            # page, including LTTextBox, LTFigure, LTImage, LTTextBoxHorizontal,
            # etc. To get the text, call get_text() on the text box objects.
            for x in layout:
                if isinstance(x, LTTextBoxHorizontal):
                    results.append(x.get_text())
        return results
The library extracts text by iterating over the PDF one page at a time, so the same loop can also be used to work out which page each piece of text came from.
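Since the pages are visited in order, a page number can be attached simply by counting iterations. Here is a minimal sketch of that idea. The function text_by_page and its is_text_box parameter are hypothetical names, not part of pdfminer, and it assumes the per-page layout objects (for example the values returned by device.get_result()) have been gathered into an iterable.

```python
def text_by_page(layouts, is_text_box=lambda obj: hasattr(obj, "get_text")):
    """Map 1-based page numbers to the text blocks found on each page.

    `layouts` is any iterable of per-page containers; each container is
    iterated and objects accepted by `is_text_box` contribute their text.
    Both names are illustrative assumptions, not pdfminer API.
    """
    pages = {}
    for page_number, layout in enumerate(layouts, start=1):
        # Keep only the objects that look like text boxes on this page
        pages[page_number] = [obj.get_text() for obj in layout if is_text_box(obj)]
    return pages
```

With pdfminer, the filter would probably be something like lambda obj: isinstance(obj, LTTextBoxHorizontal), to match the check used in the extraction loop.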
The PyPDF2 library can do this as well, but in my experience its results are not as accurate as pdfminer's.
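For comparison, a rough PyPDF2 equivalent might look like the sketch below. It assumes a recent PyPDF2 release that provides the PdfReader class (older releases exposed PdfFileReader instead), and the function name extract_with_pypdf2 is made up for this example.

```python
def extract_with_pypdf2(path):
    """Return (page_count, texts) with one extracted string per page.

    Illustrative sketch only; assumes a PyPDF2 release that provides
    PdfReader and page.extract_text().
    """
    # Imported lazily so this code can be loaded even where PyPDF2 is absent
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    # extract_text() may return an empty result for pages without text
    texts = [page.extract_text() or "" for page in reader.pages]
    return len(texts), texts
```

Running both extractors on the same file and comparing the output page by page is a simple way to judge which one suits a given document.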
That is all for this article. I hope it helps with your learning.