SoFunction
Updated on 2025-04-29

A detailed guide to parsing PPT files with Python and generating a JSON structure

Core code analysis

1. Color conversion function: rgb_to_hex

Converts an RGB color value (e.g. (255, 0, 0)) to a hexadecimal string (such as #FF0000).
Note that a version of this function that simply returns the RGB tuple, (rgb[0], rgb[1], rgb[2]), does not produce a hexadecimal string. The corrected implementation is:

def rgb_to_hex(rgb):
    """Convert RGB color to hexadecimal string (such as #FF0000)"""
    if rgb is None:
        return None
    return f"#{rgb[0]:02X}{rgb[1]:02X}{rgb[2]:02X}"
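
A quick check of the function above. This is a minimal sketch: python-pptx's RGBColor is a tuple subclass and can be indexed the same way, so plain tuples are used here as stand-ins (an assumption made to keep the example dependency-free).

```python
def rgb_to_hex(rgb):
    """Convert an (R, G, B) triple to a hexadecimal string such as #FF0000."""
    if rgb is None:
        return None
    return f"#{rgb[0]:02X}{rgb[1]:02X}{rgb[2]:02X}"


# Plain tuples stand in for pptx RGBColor values (assumption for this sketch).
print(rgb_to_hex((255, 0, 0)))    # #FF0000
print(rgb_to_hex((0, 128, 255)))  # #0080FF
print(rgb_to_hex(None))           # None
```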

2. Parse a single shape: parse_shape

Extracts a shape's type, location, style, text content and other information. Text boxes, tables, pictures and other shape types are supported.

Key steps:

  • Basic properties: shape type (e.g. MSO_SHAPE_TYPE.TEXT_BOX), location (left, top), size (width, height).
  • Fill style: fill type (solid color/gradient), color value (hexadecimal).
  • Border style: color, line width, dash style (such as MSO_LINE_DASH_STYLE.DASH).
  • Text style: paragraph alignment, font name, size, bold/italic, color.
  • Special handling:
    • Table: parse rows, columns and cell contents.
    • Picture: record size information.

Sample output (text box):

{
  "type": 17,  // MSO_SHAPE_TYPE.TEXT_BOX
  "name": "Text Box 1",
  "text": "Hello World",
  "fill": {
    "type": "MSO_FILL.SOLID",
    "color": "#FF0000"
  },
  "line": {
    "color": "#000000",
    "width": 12700,  // EMU units
    "dash_style": "MSO_LINE_DASH_STYLE.SOLID"
  },
  "text_style": {
    "paragraphs": [
      {
        "text": "Hello World",
        "runs": [
          {
            "text": "Hello World",
            "font": {
              "name": "Arial",
              "size": 24,
              "bold": false,
              "color": "#000000"
            }
          }
        ]
      }
    ]
  }
}
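
The line width in the sample is expressed in EMU, which is hard to read at a glance. The sketch below shows one way to convert EMU values into more familiar units. The helper names are hypothetical; the constants are the standard ones (914400 EMU per inch, 12700 EMU per point), and the pixel conversion assumes a 96 DPI display.

```python
EMU_PER_INCH = 914400   # standard definition of the EMU
EMU_PER_POINT = 12700   # 914400 / 72


def emu_to_points(emu):
    """Convert EMU to typographic points."""
    return emu / EMU_PER_POINT


def emu_to_inches(emu):
    """Convert EMU to inches."""
    return emu / EMU_PER_INCH


def emu_to_pixels(emu, dpi=96):
    """Convert EMU to pixels; at the assumed 96 DPI, 1 pixel is 9525 EMU."""
    return emu * dpi / EMU_PER_INCH


print(emu_to_points(12700))   # 1.0  (the line width from the sample is 1 pt)
print(emu_to_pixels(9525))    # 1.0
print(emu_to_inches(914400))  # 1.0
```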

3. Parse the entire PPT: parse_presentation

Iterates through every slide and every shape in the PPT and calls parse_shape on each to generate structured data.

Sample output (JSON snippet):

{
  "slides": [
    {
      "slide_number": 256,  // PPT internal slide ID
      "shapes": [
        {
          "type_name": "MSO_SHAPE_TYPE.TABLE",
          "table": [
            [
              {"text": "Header 1", "row_span": 1, "col_span": 1},
              {"text": "Header 2", "row_span": 1, "col_span": 1}
            ],
            [
              {"text": "Row 1", "row_span": 1, "col_span": 1},
              {"text": "Data", "row_span": 1, "col_span": 1}
            ]
          ]
        }
      ]
    }
  ]
}
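
Once the data has this shape, reading a table back out is a matter of walking the nested lists. The following is a minimal sketch, assuming the structure shown in the sample above; table_to_rows is a hypothetical helper, not part of the article's code.

```python
# Sample data in the structure shown above (comments removed).
sample = {
    "slides": [
        {
            "slide_number": 256,
            "shapes": [
                {
                    "type_name": "MSO_SHAPE_TYPE.TABLE",
                    "table": [
                        [
                            {"text": "Header 1", "row_span": 1, "col_span": 1},
                            {"text": "Header 2", "row_span": 1, "col_span": 1},
                        ],
                        [
                            {"text": "Row 1", "row_span": 1, "col_span": 1},
                            {"text": "Data", "row_span": 1, "col_span": 1},
                        ],
                    ],
                }
            ],
        }
    ]
}


def table_to_rows(shape):
    """Return the cell texts of a parsed table shape as a list of rows."""
    return [[cell["text"] for cell in row] for row in shape.get("table", [])]


for slide in sample["slides"]:
    for shape in slide["shapes"]:
        if "table" in shape:
            for row in table_to_rows(shape):
                print(" | ".join(row))
# Header 1 | Header 2
# Row 1 | Data
```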

4. Save as JSON: save_to_json

Save the parsed results as a file in a readable format.

def save_to_json(data, output_path):
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4, ensure_ascii=False)
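
A round-trip sketch of this step, with json.dump doing the write. It illustrates why ensure_ascii=False is useful: non-ASCII text such as Chinese slide titles is written as-is rather than as \uXXXX escapes. The sample data and the temporary output directory are assumptions made for the demonstration.

```python
import json
import os
import tempfile


def save_to_json(data, output_path):
    """Write data as indented, human-readable JSON in UTF-8."""
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4, ensure_ascii=False)


# Hypothetical parsed data containing non-ASCII text.
data = {"slides": [{"shapes": [{"text": "标题 Title"}]}]}
path = os.path.join(tempfile.mkdtemp(), "presentation_info.json")
save_to_json(data, path)

with open(path, encoding="utf-8") as f:
    raw = f.read()

print("标题" in raw)            # True: characters are stored unescaped
print(json.loads(raw) == data)  # True: the file round-trips losslessly
```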

Usage example

1. Install the dependency

pip install python-pptx

2. Run the code

if __name__ == "__main__":
    input_pptx = ""  # Enter the PPT path
    output_json = "presentation_info.json"  # Output JSON path
    parsed_data = parse_presentation(input_pptx)
    save_to_json(parsed_data, output_json)

3. Output JSON structure

{
  "slides": [
    {
      "slide_number": 256,
      "shapes": [
        {
          "type": 17,
          "name": "Text Box 1",
          "text": "Sample Text",
          "fill": {"type": "MSO_FILL.BACKGROUND", "color": null},
          "line": {"color": null, "width": 0, "dash_style": null},
          "text_style": {
            "paragraphs": [
              {
                "text": "Sample Text",
                "runs": [
                  {
                    "text": "Sample Text",
                    "font": {
                      "name": "Calibri",
                      "size": 11,
                      "bold": false,
                      "color": "#000000"
                    }
                  }
                ]
              }
            ]
          }
        }
      ]
    }
  ]
}

Core functions and applicable scenarios

1. Supported shape types

Type name                  Value                                  Description
MSO_SHAPE_TYPE.TEXT_BOX    17                                     Text box
MSO_SHAPE_TYPE.TABLE       19                                     Table
MSO_SHAPE_TYPE.PICTURE     13                                     Picture
Other shapes               Per the MSO_SHAPE_TYPE enumeration     Auto shapes, lines, etc.

2. Typical application scenarios

  • Style reuse: Extract template styles and batch generate PPTs that comply with specifications.
  • Data migration: Export PPT content to JSON for data analysis or content management.
  • Automated generation: Dynamically generate PPTs (this requires implementing a reverse apply function).

Things to note

  1. Unit conversion

    • Dimensions in PPT are expressed in EMU (English Metric Units). Divide by 9525 to convert to pixels at 96 DPI (1 pixel = 9525 EMU, 1 inch = 914400 EMU).
  2. Exception handling

    • Some shapes (such as pictures) have no text frame, so check shape.has_text_frame before reading text.
    • A color value may be None (for example, when the fill is transparent), so a default value should be set.
  3. Extension suggestions

    • Support more styles, such as shadows and 3D effects.
    • Reverse generation: reconstruct a PPT file from the JSON data.
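
The two exception-handling notes above can be sketched as small guard functions. This is an illustration only: SimpleNamespace objects stand in for python-pptx shapes so the example runs without a PPT file, and shape_text, color_or_default and DEFAULT_COLOR are hypothetical names.

```python
from types import SimpleNamespace

DEFAULT_COLOR = "#000000"  # hypothetical fallback for missing colors


def shape_text(shape):
    """Return the shape's text, or an empty string if it has no text frame."""
    if getattr(shape, "has_text_frame", False):
        return shape.text_frame.text
    return ""


def color_or_default(hex_color, default=DEFAULT_COLOR):
    """Replace a missing (None) color with a default value."""
    return default if hex_color is None else hex_color


# Stand-ins for real shapes (assumption for this sketch).
text_box = SimpleNamespace(
    has_text_frame=True,
    text_frame=SimpleNamespace(text="Hello World"),
)
picture = SimpleNamespace(has_text_frame=False)

print(shape_text(text_box))         # Hello World
print(repr(shape_text(picture)))    # ''
print(color_or_default(None))       # #000000
print(color_or_default("#FF0000"))  # #FF0000
```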

Summary

With the code and walkthrough in this article, developers can quickly implement automated parsing and data extraction for PPT files. The resulting JSON can serve as a foundation for academic research, enterprise report automation, or LLM-based content generation. As next steps, you can try:

  • Generating content with an LLM: after generating text with GPT, fill it into the text field of the JSON.
  • Visualizing styles: render the JSON data as a web page or chart to preview the PPT design.
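
As an illustration of the first idea, the sketch below fills generated text into the text field of shapes selected by name. The fill_text helper and the replacements dictionary are hypothetical, and the "generated" string is a hard-coded stand-in for real LLM output.

```python
import copy


def fill_text(presentation_data, replacements):
    """Return a copy of the parsed data with matching shapes' "text" replaced.

    replacements maps a shape name to its new text.
    """
    result = copy.deepcopy(presentation_data)
    for slide in result["slides"]:
        for shape in slide["shapes"]:
            if shape.get("name") in replacements:
                shape["text"] = replacements[shape["name"]]
    return result


parsed = {
    "slides": [
        {
            "slide_number": 256,
            "shapes": [{"name": "Text Box 1", "text": "Sample Text"}],
        }
    ]
}
generated = {"Text Box 1": "Quarterly summary"}  # stand-in for LLM output

updated = fill_text(parsed, generated)
print(updated["slides"][0]["shapes"][0]["text"])  # Quarterly summary
print(parsed["slides"][0]["shapes"][0]["text"])   # Sample Text (original untouched)
```

Only the top-level text field is updated here. A fuller version would also need to rewrite the paragraphs and runs under text_style so that the two stay consistent.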

With the combination of Python and JSON, automation of PPT has never been easier!

import json
from pptx import Presentation
from pptx.enum.shapes import MSO_SHAPE_TYPE
from pptx.enum.text import PP_ALIGN
from pptx.enum.dml import MSO_FILL, MSO_LINE_DASH_STYLE



def rgb_to_hex(rgb):
    """Convert RGB color to hexadecimal string (such as #FF0000)"""
    if rgb is None:
        return None
    return f"#{rgb[0]:02X}{rgb[1]:02X}{rgb[2]:02X}"


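def solid_color(fill):
    """Return the hex foreground color of a fill, or None if no explicit RGB color is set"""
    try:
        return rgb_to_hex(fill.fore_color.rgb)
    except (TypeError, AttributeError):
        # Non-solid fills have no fore_color; theme colors have no .rgb
        return None


def parse_shape(shape):
    """Parse the information of a single shape"""
    shape_type = shape.shape_type
    data = {
        "type": shape_type,
        "type_name": str(MSO_SHAPE_TYPE(shape_type)) if shape_type is not None else None,
        "name": shape.name,
        "left": shape.left,
        "top": shape.top,
        "width": shape.width,
        "height": shape.height,
        "rotation": shape.rotation,
        "text": "",
        "fill": {},
        "line": {},
        "text_style": {}
    }

    # Parse fill style (pictures, tables and connectors have no fill)
    fill = getattr(shape, "fill", None)
    if fill is not None:
        data["fill"] = {
            "type": str(MSO_FILL(fill.type)) if fill.type is not None else None,
            "color": solid_color(fill)
        }

    # Parse border style (not every shape type has a line)
    line = getattr(shape, "line", None)
    if line is not None:
        dash_style = line.dash_style
        data["line"] = {
            # line.fill is read instead of line.color, which would modify the line
            "color": solid_color(line.fill),
            "width": line.width,
            "dash_style": str(MSO_LINE_DASH_STYLE(dash_style)) if dash_style is not None else None
        }

    # Parse text style (if the shape has a text frame)
    if shape.has_text_frame:
        text_frame = shape.text_frame
        paragraphs = []
        for paragraph in text_frame.paragraphs:
            runs = []
            for run in paragraph.runs:
                font = run.font
                run_data = {
                    "text": run.text,
                    "font": {
                        "name": font.name,
                        # Font size in points; None when inherited
                        "size": font.size.pt if font.size is not None else None,
                        "bold": font.bold,
                        "italic": font.italic,
                        "color": solid_color(font.fill)
                    }
                }
                runs.append(run_data)
            alignment = paragraph.alignment
            paragraph_data = {
                "text": paragraph.text,
                "runs": runs,
                "level": paragraph.level,
                "alignment": str(PP_ALIGN(alignment)) if alignment is not None else None
            }
            paragraphs.append(paragraph_data)
        data["text_style"] = {
            "paragraphs": paragraphs
        }
        data["text"] = text_frame.text

    # Process tables
    if shape_type == MSO_SHAPE_TYPE.TABLE:
        table = shape.table
        rows = []
        for row in table.rows:
            cells = []
            for cell in row.cells:
                cell_data = {
                    "text": cell.text.strip(),
                    # python-pptx exposes merged-cell spans as span_height / span_width
                    "row_span": cell.span_height,
                    "col_span": cell.span_width
                }
                cells.append(cell_data)
            rows.append(cells)
        data["table"] = rows

    # Process pictures
    if shape_type == MSO_SHAPE_TYPE.PICTURE:
        data["image"] = {
            "width": shape.width,
            "height": shape.height
        }

    return data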


def parse_presentation(pptx_path):
    """Parse the entire PPT file and return the JSON structure"""
    prs = Presentation(pptx_path)
    presentation_data = {
        "slides": []
    }

    for slide in prs.slides:
        slide_data = {
            # slide_id is the PPT internal ID, not the slide's position
            "slide_number": slide.slide_id,
            "shapes": []
        }
        for shape in slide.shapes:
            shape_data = parse_shape(shape)
            slide_data["shapes"].append(shape_data)
        presentation_data["slides"].append(slide_data)

    return presentation_data


def save_to_json(data, output_path):
    """Save parsed data as a JSON file"""
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4, ensure_ascii=False)


# Usage example
if __name__ == "__main__":
    input_pptx = ""  # Enter the PPT file path
    output_json = "presentation_info.json"  # Output JSON file path

    # Parse the PPT
    parsed_data = parse_presentation(input_pptx)

    # Save as JSON
    save_to_json(parsed_data, output_json)
