1 Introduction
MinerU is a tool developed in China that converts PDF files into machine-readable formats such as Markdown and JSON, from which the content can easily be extracted into any other format. It currently supports images (.jpg and .png), PDF, Word (.doc and .docx), and PowerPoint (.ppt and .pptx) files.
# Official website address
/en/latest/
# GitHub address
/opendatalab/mineru
# API interface address
/en/latest/user_guide/quick_start/convert_pdf.html
# Model download scripts
# Download the model from ModelScope: download_models.py
# Download the model from HuggingFace: download_models_hf.py
/opendatalab/MinerU/tree/master/scripts
2 Install MinerU
Install Python environment
# My version is: magic-pdf==1.1.0
pip install -U "magic-pdf[full]" -i /simple
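To confirm which version ended up in the environment, a small check with the standard library can help. This is a minimal sketch: the helper name installed_version is made up for illustration, and it only assumes that the package was installed under the distribution name magic-pdf used in the command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "magic-pdf" is the distribution name used in the pip command above
    print("magic-pdf:", installed_version("magic-pdf"))
```

If the function returns None, the package is not visible to the interpreter that ran the check, which usually means pip installed into a different environment.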
Download the weights
The official website provides two ways to download the weights: from HuggingFace and from ModelScope. This article downloads them from ModelScope.
# Official download instructions
/opendatalab/MinerU/blob/master/docs/how_to_download_models_zh_cn.md
Start downloading weights
⚠️ Note: After the models are downloaded, the script automatically generates a configuration file in the user directory and sets the default model path in it. You can find the file under [User Directory].
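Once the script has finished, it can be useful to check which model paths were written to the generated configuration file. The following is a minimal sketch, not part of MinerU: the helper read_model_paths is hypothetical, the file name magic-pdf.json is an assumption (use the path the script prints at the end of its run), and the two keys are the ones the download script writes.

```python
import json
from pathlib import Path


def read_model_paths(config_path):
    """Return the model paths stored in the generated configuration file."""
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    return {
        "models-dir": config.get("models-dir"),
        "layoutreader-model-dir": config.get("layoutreader-model-dir"),
    }


if __name__ == "__main__":
    # Assumption: the configuration file is named "magic-pdf.json";
    # replace it with the path printed by the download script.
    config_path = Path.home() / "magic-pdf.json"
    if config_path.exists():
        for key, value in read_model_paths(config_path).items():
            print(f"{key}: {value}")
    else:
        print(f"No configuration file found at {config_path}")
```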
# Install modelscope
pip install modelscope
# Download the script
wget /gh/opendatalab/MinerU@master/scripts/download_models.py -O download_models.py
# You can also find download_models.py at the address below
/opendatalab/MinerU/tree/master/scripts
# Run the script to download the models
# For convenience, I modified download_models.py so that it sets the location where the models are stored
python download_models.py
Modified download_models.py
⚠️ This step is optional.
In the file, local_dir is the model download location that I added. If it is not set, the models are downloaded under the user directory: "C:\Users\username" on Windows and "/home/username" on Linux.
import json
import os

import requests
from modelscope import snapshot_download


def download_json(url):
    # Download the JSON file
    response = requests.get(url)
    response.raise_for_status()  # Raise an error if the request failed
    return response.json()


def download_and_modify_json(url, local_filename, modifications):
    if os.path.exists(local_filename):
        with open(local_filename, encoding='utf-8') as f:
            data = json.load(f)
        config_version = data.get('config_version', '0.0.0')
        # The versions are compared as plain strings here
        if config_version < '1.1.1':
            data = download_json(url)
    else:
        data = download_json(url)

    # Apply the modifications
    for key, value in modifications.items():
        data[key] = value

    # Save the modified content
    with open(local_filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)


if __name__ == '__main__':
    mineru_patterns = [
        "models/Layout/LayoutLMv3/*",
        "models/Layout/YOLO/*",
        "models/MFD/YOLO/*",
        "models/MFR/unimernet_small_2501/*",
        "models/TabRec/TableMaster/*",
        "models/TabRec/StructEqTable/*",
    ]

    # Set the location of the model download
    local_dir = "E:/mineru"

    # Download the models
    model_dir = snapshot_download('opendatalab/PDF-Extract-Kit-1.0',
                                  allow_patterns=mineru_patterns,
                                  local_dir=local_dir)
    layoutreader_model_dir = snapshot_download('ppaanngggg/layoutreader',
                                               local_dir=local_dir)
    model_dir = model_dir + '/models'
    print(f'model_dir is: {model_dir}')
    print(f'layoutreader_model_dir is: {layoutreader_model_dir}')

    # Address of the configuration template
    json_url = '/gh/opendatalab/MinerU@master/'
    # Name of the configuration file in the user directory
    config_file_name = ''
    home_dir = os.path.expanduser('~')
    config_file = os.path.join(home_dir, config_file_name)

    json_mods = {
        'models-dir': model_dir,
        'layoutreader-model-dir': layoutreader_model_dir,
    }

    download_and_modify_json(json_url, config_file, json_mods)
    print(f'The configuration file has been configured successfully, the path is: {config_file}')
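One detail in the script is worth a closer look: config_version is compared with '1.1.1' as a string, and string comparison goes character by character, so '1.10.0' sorts before '1.9.0'. The sketch below shows a numeric comparison as a possible alternative. The helper names parse_version and is_older are made up for illustration, and the code assumes version strings made only of dot-separated integers.

```python
def parse_version(text):
    """Turn a dotted version such as '1.10.0' into a tuple of ints: (1, 10, 0)."""
    return tuple(int(part) for part in text.split("."))


def is_older(current, minimum):
    """Return True if `current` is a lower version than `minimum`."""
    return parse_version(current) < parse_version(minimum)


if __name__ == "__main__":
    # String comparison and numeric comparison disagree for multi-digit parts
    print("1.10.0" < "1.9.0")
    print(is_older("1.10.0", "1.9.0"))
    # The default used by the script when no version is stored
    print(is_older("0.0.0", "1.1.1"))
```

For the versions that appear in this article the string comparison gives the same answer, so this only matters once a version component reaches two digits.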
3 Using MinerU in Python
After installing MinerU, you can run the following code directly. On the first run, the weights and parameters of the PaddleOCR model are downloaded automatically into the .paddleocr directory under the user directory.
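To see whether that first-run download has already happened, you can look for the .paddleocr directory yourself. This is a minimal sketch that only relies on the location described above; the helper name paddleocr_cache_dir is made up for illustration.

```python
from pathlib import Path


def paddleocr_cache_dir(home=None):
    """Return the .paddleocr directory under the given (or current) user directory."""
    base = Path(home) if home is not None else Path.home()
    return base / ".paddleocr"


if __name__ == "__main__":
    cache_dir = paddleocr_cache_dir()
    if cache_dir.is_dir():
        print(f"PaddleOCR files found in {cache_dir}")
    else:
        print(f"{cache_dir} does not exist yet; it should appear after the first run")
```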
The Python code for parsing PDF files is as follows:
import os

from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
from magic_pdf.data.dataset import PymuDocDataset
from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
from magic_pdf.config.enums import SupportedPdfParseMethod

# Path of the PDF file
pdf_file_path = "E:/hello/"
# Path of the PDF file without the suffix
pdf_file_path_without_suff = os.path.splitext(pdf_file_path)[0]
print(pdf_file_path_without_suff)
# Directory that contains the PDF file
pdf_file_path_parent_dir = os.path.dirname(pdf_file_path)
image_dir = os.path.join(pdf_file_path_parent_dir, "images")
print(image_dir)

# Writer for the Markdown and JSON output
# markdown_dir = "./output/markdown"
# writer_markdown = FileBasedDataWriter(markdown_dir)
writer_markdown = FileBasedDataWriter()
# Writer for the extracted images
writer_image = FileBasedDataWriter(image_dir)
# Read the PDF file as bytes
reader_pdf = FileBasedDataReader("")
bytes_pdf = reader_pdf.read(pdf_file_path)
# Build the dataset
dataset_pdf = PymuDocDataset(bytes_pdf)

# Decide whether the PDF has to be parsed with OCR
if dataset_pdf.classify() == SupportedPdfParseMethod.OCR:
    # Parse with OCR
    infer_result = dataset_pdf.apply(doc_analyze, ocr=True)
    pipe_result = infer_result.pipe_ocr_mode(writer_image)
else:
    # Parse the embedded text without OCR
    infer_result = dataset_pdf.apply(doc_analyze, ocr=False)
    pipe_result = infer_result.pipe_txt_mode(writer_image)

# Draw the model results on every page; written to a separate file
# so that the source PDF is not overwritten
infer_result.draw_model(f"{pdf_file_path_without_suff}_model.pdf")
# Get the results after model processing
model_inference_result = infer_result.get_infer_res()
print(model_inference_result)

# Generate a PDF with the layout marked in color
pipe_result.draw_layout(f"{pdf_file_path_without_suff}_layout.pdf")
# Generate a PDF with the text spans marked in color
pipe_result.draw_span(f"{pdf_file_path_without_suff}_spans.pdf")

# Get the Markdown content
markdown_content = pipe_result.get_markdown(image_dir)
print(markdown_content)
# Save the Markdown file
pipe_result.dump_md(writer_markdown, f"{pdf_file_path_without_suff}.md", image_dir)

# Get the JSON content list
# Fields include type, text, text_level, page_idx, img_path, etc.
content_list_content = pipe_result.get_content_list(image_dir)
print(content_list_content)
# Save the JSON content list
pipe_result.dump_content_list(writer_markdown, f"{pdf_file_path_without_suff}_content_list.json", image_dir)

# Get the JSON text with position information
middle_json_content = pipe_result.get_middle_json()
# Save the JSON text with position information
pipe_result.dump_middle_json(writer_markdown, f'{pdf_file_path_without_suff}_middle.json')
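The content list is a convenient starting point for further processing. As a rough sketch, the snippet below groups the text entries by page. It assumes the content list is a list of dictionaries with the fields mentioned above (type, text, page_idx) and that text entries carry the type value "text". The helper name text_by_page and the sample entries are made up for illustration and are not MinerU output.

```python
from collections import defaultdict


def text_by_page(content_list):
    """Group the text of all "text" entries by their page index."""
    pages = defaultdict(list)
    for item in content_list:
        # Skip images, tables and other entries without text
        if item.get("type") == "text" and item.get("text"):
            pages[item.get("page_idx", 0)].append(item["text"])
    return dict(pages)


if __name__ == "__main__":
    # Hypothetical sample in the assumed shape of the content list
    sample = [
        {"type": "text", "text": "Title", "text_level": 1, "page_idx": 0},
        {"type": "text", "text": "First paragraph.", "page_idx": 0},
        {"type": "image", "img_path": "images/figure.jpg", "page_idx": 0},
        {"type": "text", "text": "Second page.", "page_idx": 1},
    ]
    for page, texts in text_by_page(sample).items():
        print(page, texts)
```

With real data, pass the value returned by pipe_result.get_content_list(image_dir) instead of the sample list.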