Batch extraction of tables from Word documents with Python

Tables are one of the most common elements in Word documents. When working with Word files, you sometimes need to extract the contents of several tables into a new document, and occasionally the captions above them as well.

Today I will share two methods for batch-extracting the tables in a document: a VBA method and a Python method.

I. VBA method

1. Code implementation

VBA (Visual Basic for Applications) can automate many tasks on Word files, including creating, opening, and saving documents and modifying text and formatting. Here we use VBA to batch-extract the tables in the current file into a new document, adding a blank line after each table. The code is as follows:

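Sub ExtractTablesAndPreviousRowToNewFile()
    Dim docSource As Document
    Dim docTarget As Document
    Dim tbl As Table
    Dim rng As Range
    Dim outputPath As String
    Dim fileName As String

    ' Set the output file name and path.
    ' Fill in fileName before running. Using the folder of the current
    ' document (ActiveDocument.Path) is an assumption.
    fileName = ""
    outputPath = ActiveDocument.Path & "\" & fileName

    ' Use the current document as the source document
    Set docSource = ActiveDocument
    ' Create a new document as the target document
    Set docTarget = Documents.Add

    For Each tbl In docSource.Tables

        ' Copy the table to the end of the target document
        tbl.Range.Copy
        Set rng = docTarget.Content
        rng.Collapse Direction:=wdCollapseEnd
        rng.Paste

        ' Add a blank line after the table
        docTarget.Content.InsertParagraphAfter

    Next tbl

    ' Delete the first paragraph of the target document, but only if it is
    ' an empty paragraph that is not part of a table
    If docTarget.Paragraphs.Count > 0 Then
        Set rng = docTarget.Paragraphs(1).Range
        If Len(rng.Text) <= 1 And Not rng.Information(wdWithInTable) Then rng.Delete
    End If

    ' Save the new document to the specified path
    docTarget.SaveAs2 fileName:=outputPath, FileFormat:=wdFormatXMLDocument

    MsgBox "The tables have been successfully extracted to " & outputPath, vbInformation
End Sub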

2. Code analysis

The code above takes the currently active document as the source document and creates a new document to hold the extracted tables. It then iterates over all the tables in the source document and copies each one into the target document.

After each table, a blank line is inserted to maintain a clear visual separation between multiple tables in the document.

3. How to use

First, open the document you want to extract tables from in Word, then press Alt + F11 to open the VBA editor. In the [Project] pane, select your document and insert a new module (right-click the document name and select [Insert] > [Module]). Copy and paste the VBA code above into the new module. Close the VBA editor, then run the macro (in Word, go to [View] > [Macros] > [View Macros], select this macro, and click [Run]).

II. Python method

Python is widely used in office automation. It has dedicated libraries for the various Office file types, and they are open source and free to use. To work with Word files, use the python-docx library. Before writing the program, install a recent version of Python, then run pip install python-docx in cmd to install the library. You can also install python-docx from within Thonny, a lightweight integrated development environment.
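
Before running the scripts below, it can help to confirm that the library is actually visible to the interpreter you are using. The following is a minimal sketch using only the standard library. The helper name is_installed is made up for this illustration and is not part of python-docx. Note that python-docx is imported under the module name docx.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# python-docx is installed as "python-docx" but imported as "docx"
if not is_installed("docx"):
    print("python-docx not found; install it with: pip install python-docx")
```

If the message is printed even though pip reported success, pip and the interpreter are probably attached to different Python installations.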

1. Code implementation

We first import the Document class from the docx package, then read the specified Word file, copy each table and its contents into a new file, and save it. The code is as follows:

from docx import Document

def extract_tables(doc_path, output_path):
    # Load the original document
    doc = Document(doc_path)
    new_doc = Document()

    # Extract each table and add it to the new document
    for i, table in enumerate(doc.tables):
        t = new_doc.add_table(rows=1, cols=len(table.columns))
        t.style = 'Table Grid'  # Built-in table style that adds borders automatically
        # Copy the header row
        for j, cell in enumerate(table.rows[0].cells):
            t.cell(0, j).text = cell.text
        # Copy the remaining rows
        for row in table.rows[1:]:
            new_row = t.add_row()
            for j, cell in enumerate(row.cells):
                new_row.cells[j].text = cell.text

        # Add a blank line (empty paragraph) after each table except the last one
        if i < len(doc.tables) - 1:
            new_doc.add_paragraph()

    # Save the new document
    new_doc.save(output_path)

# Example of use (replace the empty strings with the source and output .docx paths)
extract_tables('', '')

2. Code analysis

The code above extracts all the tables in the source document into a new document and applies the built-in 'Table Grid' style, so the new tables get borders automatically. This covers the text content of the tables, but the font color, font size, and border styles of the original tables are not carried over. The captions are not extracted either, so the code needs a further change to pick up the caption above each table.

3. Extracting captions together with table contents

This code treats centered text directly above a table as that table's caption and extracts it together with the table contents.

from docx import Document
from docx.enum.text import WD_ALIGN_PARAGRAPH

def extract_tables_with_titles(doc_path, output_path):
    # Load the original document
    doc = Document(doc_path)
    new_doc = Document()

    # Extract each table and add it to the new document
    for i, table in enumerate(doc.tables):
        # Try to locate and copy the centered text above the table
        # Find the element immediately before the table
        para = table._element.getprevious()
        # Only continue if that element is a paragraph (tag ends with '}p')
        if para is not None and para.tag.endswith('}p'):
            # Look up the matching Paragraph object and check whether it is centered
            para_obj = next((p for p in doc.paragraphs if p._element == para), None)
            if para_obj is not None and para_obj.alignment == WD_ALIGN_PARAGRAPH.CENTER:
                # Add a centered paragraph to the new document
                new_para = new_doc.add_paragraph(para_obj.text)
                new_para.alignment = WD_ALIGN_PARAGRAPH.CENTER

        # Add the table
        t = new_doc.add_table(rows=1, cols=len(table.columns))
        t.style = 'Table Grid'  # Built-in table style that adds borders automatically

        # Copy the header row
        for j, cell in enumerate(table.rows[0].cells):
            t.cell(0, j).text = cell.text
        # Copy the remaining rows
        for row in table.rows[1:]:
            new_row = t.add_row()
            for j, cell in enumerate(row.cells):
                new_row.cells[j].text = cell.text

        # Add a blank line (empty paragraph) after each table except the last one
        if i < len(doc.tables) - 1:
            new_doc.add_paragraph()

    # Save the new document
    new_doc.save(output_path)

# Example of use (replace the empty strings with the source and output .docx paths)
extract_tables_with_titles('', '')

Building on the original code, this version adds caption extraction. Calling the extract_tables_with_titles function extracts the tables in the file together with their captions, places each centered caption above its table, and leaves an empty line between tables.
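
One caveat about caption detection: in python-docx, a paragraph's alignment property is None when the paragraph has no alignment of its own and inherits it from its style, so a caption centered through a style would not be recognized by a direct comparison. The sketch below is one possible way to handle that, under the assumption that walking the style chain is acceptable for your documents. The helper name effective_alignment is made up for this illustration. It only reads the alignment, style, paragraph_format, and base_style attributes.

```python
def effective_alignment(paragraph):
    """Return the paragraph's alignment, falling back to its style chain.

    Direct formatting wins. Otherwise each style is checked in turn,
    following base_style links, and None is returned if nothing sets it.
    """
    if paragraph.alignment is not None:
        return paragraph.alignment

    style = paragraph.style
    while style is not None:
        alignment = style.paragraph_format.alignment
        if alignment is not None:
            return alignment
        style = style.base_style  # move up to the parent style, if any
    return None
```

With this helper, a caption check could read effective_alignment(para_obj) == WD_ALIGN_PARAGRAPH.CENTER instead of comparing para_obj.alignment directly.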

III. Closing thoughts

Both VBA and Python can extract table contents, but the code above does not fully preserve the styles of the text and the tables. Later, we will explore how to copy complete table contents and styles, although the special formatting involved (fonts and so on) makes styles harder to extract.

The advantage of the two methods above is that they extract table contents efficiently and in batch. However, they do not extract styles, and they may report errors when extracting complex tables.

By default, the code above writes the extracted tables to the current directory. The VBA code works on the currently open Word file, while the Python code needs the name of the file to extract from. To batch-extract the tables from multiple files, you also need to add a for loop that iterates over all the Word files.
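
The loop over multiple files could look like the sketch below. It assumes all source documents sit in one folder and that the extraction function takes a source path and an output path, as extract_tables does. The function name batch_extract, the "_tables" suffix, and the folder layout are made up for this illustration.

```python
from pathlib import Path


def batch_extract(src_dir, out_dir, extract_fn, suffix="_tables"):
    """Run extract_fn(source_path, output_path) for every .docx in src_dir.

    Returns the list of output paths that were requested.
    """
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    outputs = []
    for src in sorted(src_dir.glob("*.docx")):
        # Skip Word's temporary lock files such as "~$report.docx"
        if src.name.startswith("~$"):
            continue
        # Skip files produced by an earlier run into the same folder
        if src.stem.endswith(suffix):
            continue
        dst = out_dir / f"{src.stem}{suffix}.docx"
        extract_fn(str(src), str(dst))
        outputs.append(dst)
    return outputs
```

For example, batch_extract("docs", "docs/out", extract_tables) would process every Word file in a folder named docs, where both folder names are placeholders.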
