Extracting Text from Common Document Formats with Python

1. Introduction

In daily work and study, we often need to extract text from PDF and Word documents for tasks such as data analysis and text processing. Doing this by hand is time-consuming, labor-intensive, and error-prone, so it is worth automating. This article shows how to write a Python tool that extracts text from PDF and Word documents.

2. The principle of text content extraction

The core principle of text content extraction is to traverse all files in the specified directory, use the corresponding library to extract text according to the file type (PDF or Word), and then save the extracted text to the specified directory. In this process, we need to consider the following issues:

How to iterate through all files in a specified directory?

How to extract text based on file type?

How to save extracted text?

Next, we will introduce solutions to these three problems separately.
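
As a preview of the second question, one way to choose an extractor by file type is a lookup table keyed on the file extension. This is only a sketch: the names EXTRACTORS, pick_extractor, extract_pdf, and extract_docx are hypothetical, and the two extractors are placeholders rather than real implementations.

```python
import os


def extract_pdf(path):
    # Placeholder: a real implementation would call a PDF library here.
    return f"<text of {path}>"


def extract_docx(path):
    # Placeholder: a real implementation would call a Word library here.
    return f"<text of {path}>"


# Map each supported extension to the function that handles it.
EXTRACTORS = {
    ".pdf": extract_pdf,
    ".docx": extract_docx,
}


def pick_extractor(path):
    """Return the extractor for this file's extension, or None if unsupported."""
    extension = os.path.splitext(path)[1].lower()
    return EXTRACTORS.get(extension)
```

Supporting another format then means adding one entry to the table instead of another branch to an if/elif chain.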

3. Design of text content extraction

When designing the text extraction tool, we need to consider the following aspects:

User interface: To make the tool easy to use, we can provide a simple command-line interface that lets the user specify the input directory, the output directory, and other parameters.

File traversal: We need a function that walks through all files in the specified directory.

Text extraction: We need a function for each file type that extracts the text from a file of that type.

Text saving: We need a function that saves the extracted text to the specified output directory.

4. Implementation of text content extraction

Next, we will introduce in detail the implementation process of text content extraction. For convenience, we will write this tool in Python.

1. User interface

We can use Python's argparse module to build a simple command-line interface. The interface takes two parameters:

Input directory: the directory that contains the files to extract text from.

Output directory: the directory where the extracted text is saved.
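
A minimal sketch of such an interface is shown here. The argument names directory and output_directory match the complete example later in the article; the helper name build_parser and the sample paths are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Create the command-line parser with two positional arguments."""
    parser = argparse.ArgumentParser(description="Text content extraction")
    parser.add_argument("directory",
                        help="directory containing the files to process")
    parser.add_argument("output_directory",
                        help="directory where the extracted text is saved")
    return parser


# parse_args() accepts an explicit list, which is handy for trying the parser
# out without touching the real command line.
args = build_parser().parse_args(["./docs", "./out"])
print(args.directory, args.output_directory)
```

In the real tool, parse_args() is called without arguments so that it reads the values the user typed on the command line.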

2. File traversal

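We can use os.walk from Python's os module to iterate through all files in the specified directory, including its subdirectories. The specific implementation is as follows:

import os

def traverse_dir(dir_path):
    """Return the full path of every file under dir_path, including subdirectories."""
    file_list = []
    # os.walk yields (root, dirs, files) for dir_path and every directory below it
    for root, dirs, files in os.walk(dir_path):
        for file in files:
            # Join directory and file name so callers receive a usable path
            file_list.append(os.path.join(root, file))
    return file_list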
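
Since only PDF and Word files are of interest, the traversal can also filter by extension while it walks. The following is a sketch under that assumption; the names find_documents and SUPPORTED are hypothetical.

```python
import os

# Extensions the tool knows how to handle (lower case).
SUPPORTED = (".pdf", ".docx")


def find_documents(dir_path, extensions=SUPPORTED):
    """Return sorted full paths of files under dir_path with a matching extension."""
    matches = []
    for root, _dirs, files in os.walk(dir_path):
        for name in files:
            # str.endswith accepts a tuple, so one call checks every extension
            if name.lower().endswith(extensions):
                matches.append(os.path.join(root, name))
    return sorted(matches)
```

Lower-casing the file name before the comparison means that files such as REPORT.PDF are matched as well.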

3. Text extraction

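For PDF files, we can use the PyPDF2 library to extract text. The code below uses the PdfReader interface found in recent PyPDF2 releases; older releases exposed the same functionality under the names PdfFileReader, numPages, getPage, and extractText. The specific implementation is as follows:

import PyPDF2

def extract_text_from_pdf(pdf_path, output_path):
    """Extract the text of every page in pdf_path and write it to output_path."""
    with open(pdf_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        # Open the output file once instead of reopening it for every page
        with open(output_path, 'w', encoding='utf-8') as output_file:
            for page in pdf_reader.pages:
                # Pages without a text layer (e.g. scanned images) may yield nothing
                text = page.extract_text() or ''
                output_file.write(text + '\n')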

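For Word documents, we can use the python-docx library to extract text. Note that python-docx reads only .docx files; documents in the older binary .doc format must be converted to .docx first. The specific implementation is as follows:

from docx import Document

def extract_text_from_docx(docx_path, output_path):
    """Extract the text of every paragraph in docx_path and write it to output_path."""
    doc = Document(docx_path)
    text = []
    # doc.paragraphs holds the body paragraphs in document order
    for para in doc.paragraphs:
        text.append(para.text)
    with open(output_path, 'w', encoding='utf-8') as output_file:
        output_file.write('\n'.join(text))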
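
As background on what is being read here: a .docx file is a ZIP archive whose main body is stored as XML in word/document.xml. The sketch below is not how python-docx is implemented and is not a replacement for it; it is a standard-library-only illustration (the function name docx_paragraphs is hypothetical) that pulls paragraph text straight from that XML.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(docx_file):
    """Return the text of each <w:p> paragraph in a .docx (path or file object)."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        pieces = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(pieces))
    return paragraphs
```

Because root.iter() searches the whole tree, this sketch also returns paragraphs nested inside tables. It ignores headers, footers, and footnotes, which live in other parts of the archive.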

4. Text Save

We can use Python's built-in open() function to save the extracted text. The specific implementation is as follows:

def save_text(text, output_path):
    """Write text to output_path as UTF-8, replacing any existing file."""
    with open(output_path, 'w', encoding='utf-8') as output_file:
        output_file.write(text)
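
The user supplies an output directory rather than a file name, so each input file still needs an output path of its own. One possible convention, sketched here with the hypothetical helpers output_path_for and save_extracted, is to append .txt to the original file name and create the output directory on demand.

```python
import os


def output_path_for(input_path, output_dir):
    """Build <output_dir>/<original file name>.txt for a given input file."""
    file_name = os.path.basename(input_path)
    return os.path.join(output_dir, file_name + ".txt")


def save_extracted(text, input_path, output_dir):
    """Write text to the output file derived from input_path and return its path."""
    # Create the output directory (and any missing parents) if needed
    os.makedirs(output_dir, exist_ok=True)
    output_path = output_path_for(input_path, output_dir)
    with open(output_path, "w", encoding="utf-8") as output_file:
        output_file.write(text)
    return output_path
```

Keeping the original extension in the name means report.pdf and report.docx do not overwrite each other. Files with the same name in different subdirectories would still collide under this convention.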

5. Complete code example

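import argparse
import os

import PyPDF2
from docx import Document

def traverse_dir(dir_path):
    """Return the full path of every file under dir_path, including subdirectories."""
    file_list = []
    for root, dirs, files in os.walk(dir_path):
        for file in files:
            file_list.append(os.path.join(root, file))
    return file_list

def extract_text_from_pdf(pdf_path, output_path):
    """Extract the text of every page in pdf_path and write it to output_path."""
    with open(pdf_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        # Open the output file once instead of reopening it for every page
        with open(output_path, 'w', encoding='utf-8') as output_file:
            for page in pdf_reader.pages:
                # Pages without a text layer (e.g. scanned images) may yield nothing
                text = page.extract_text() or ''
                output_file.write(text + '\n')

def extract_text_from_docx(docx_path, output_path):
    """Extract the text of every paragraph in docx_path and write it to output_path."""
    doc = Document(docx_path)
    text = []
    for para in doc.paragraphs:
        text.append(para.text)
    with open(output_path, 'w', encoding='utf-8') as output_file:
        output_file.write('\n'.join(text))

def save_text(text, output_path):
    """Write text to output_path as UTF-8, replacing any existing file."""
    with open(output_path, 'w', encoding='utf-8') as output_file:
        output_file.write(text)

def main():
    parser = argparse.ArgumentParser(description="Text content extraction")
    parser.add_argument("directory", help="Directory containing the files to process")
    parser.add_argument("output_directory", help="Directory where the extracted text is saved")
    args = parser.parse_args()
    dir_path = args.directory
    output_dir = args.output_directory
    # Create the output directory if it does not exist yet
    os.makedirs(output_dir, exist_ok=True)
    file_list = traverse_dir(dir_path)
    for file_path in file_list:
        # Each document is written to <output_dir>/<original file name>.txt
        output_path = os.path.join(output_dir, os.path.basename(file_path) + '.txt')
        if file_path.lower().endswith('.pdf'):
            extract_text_from_pdf(file_path, output_path)
        elif file_path.lower().endswith('.docx'):
            # python-docx cannot read the legacy .doc format, so only .docx is handled
            extract_text_from_docx(file_path, output_path)

if __name__ == "__main__":
    main()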
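
When a directory holds many documents, a single corrupt or password-protected file should not stop the whole run. The sketch below shows one way to guard the loop; run_batch is a hypothetical helper, and extract stands for any function that takes a file path and raises an exception on failure.

```python
def run_batch(paths, extract):
    """Call extract(path) for each path; return (succeeded, failed) lists."""
    succeeded = []
    failed = []
    for path in paths:
        try:
            extract(path)
        except Exception as exc:
            # Broad on purpose: parsing libraries raise many different error
            # types, and the goal is to record the failure and keep going.
            failed.append((path, str(exc)))
        else:
            succeeded.append(path)
    return succeeded, failed
```

The failed list pairs each path with its error message, so a summary can be printed once the run has finished.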
