Core code analysis
1. Color conversion function: rgb_to_hex
Converts an RGB color value (e.g. (255, 0, 0)) to a hexadecimal string (such as #FF0000).
The original function had a defect: it returned the RGB tuple directly rather than a hexadecimal string. The revised implementation is:
def rgb_to_hex(rgb):
    """Convert RGB color to hexadecimal string (such as #FF0000)"""
    if rgb is None:
        return None
    return f"#{rgb[0]:02X}{rgb[1]:02X}{rgb[2]:02X}"
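A quick check of the revised function, as a minimal sketch. Plain tuples stand in for the color objects here, on the assumption that the color can be indexed like a 3-tuple (python-pptx's RGBColor is a tuple subclass). The function is repeated so the snippet runs on its own.

```python
def rgb_to_hex(rgb):
    """Convert an RGB triple to a hexadecimal string such as #FF0000."""
    if rgb is None:
        return None
    return f"#{rgb[0]:02X}{rgb[1]:02X}{rgb[2]:02X}"


# Plain tuples stand in for the color objects returned by the library.
print(rgb_to_hex((255, 0, 0)))    # #FF0000
print(rgb_to_hex((0, 128, 255)))  # #0080FF
print(rgb_to_hex(None))           # None
```

The :02X format pads each channel to two uppercase hex digits, so small values such as 0 still produce a six-digit result.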
2. Parse a single shape: parse_shape
Extracts a shape's type, position, style, text content and other information, supporting text boxes, tables, pictures and other shape types.
Key steps:
- Basic properties: shape type (e.g. MSO_SHAPE_TYPE.TEXT_BOX), position (left, top), size (width, height).
- Fill style: fill type (solid color/gradient), color value (hexadecimal).
- Border style: color, line width, dash style (such as MSO_LINE_DASH_STYLE.DASH).
- Text style: paragraph alignment, font name, size, bold/italic, color.
- Special handling:
  - Table: parse rows, columns and cell contents.
  - Picture: record size information.
Sample output (text box):
{
    "type": 17,  // MSO_SHAPE_TYPE.TEXT_BOX
    "name": "Text Box 1",
    "text": "Hello World",
    "fill": {
        "type": "MSO_FILL.SOLID",
        "color": "#FF0000"
    },
    "line": {
        "color": "#000000",
        "width": 12700,  // EMU units
        "dash_style": "MSO_LINE_DASH_STYLE.SOLID"
    },
    "text_style": {
        "paragraphs": [
            {
                "text": "Hello World",
                "runs": [
                    {
                        "text": "Hello World",
                        "font": {
                            "name": "Arial",
                            "size": 24,
                            "bold": false,
                            "color": "#000000"
                        }
                    }
                ]
            }
        ]
    }
}
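One way to use this structure for style extraction is to walk the paragraphs and runs of a parsed shape and collect the fonts in use. The sketch below assumes the dictionary layout shown in the sample output; collect_fonts is a hypothetical helper name, not part of the article's code.

```python
def collect_fonts(shape_data):
    """Return the distinct (name, size) pairs used by the runs of one parsed shape."""
    fonts = set()
    paragraphs = shape_data.get("text_style", {}).get("paragraphs", [])
    for paragraph in paragraphs:
        for run in paragraph.get("runs", []):
            font = run.get("font", {})
            fonts.add((font.get("name"), font.get("size")))
    # key=str keeps the sort stable even when a name or size is None.
    return sorted(fonts, key=str)


sample_shape = {
    "name": "Text Box 1",
    "text_style": {
        "paragraphs": [
            {
                "text": "Hello World",
                "runs": [
                    {
                        "text": "Hello World",
                        "font": {"name": "Arial", "size": 24, "bold": False, "color": "#000000"},
                    }
                ]
            }
        ]
    },
}

print(collect_fonts(sample_shape))  # [('Arial', 24)]
```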
3. Parse the entire PPT: parse_presentation
Iterates over every slide and every shape in the PPT, calling parse_shape on each one to generate structured data.
Sample output (JSON snippet):
{
    "slides": [
        {
            "slide_number": 254,  // PPT internal ID
            "shapes": [
                {
                    "type_name": "MSO_SHAPE_TYPE.TABLE",
                    "table": [
                        [
                            {"text": "Header 1", "row_span": 1, "col_span": 1},
                            {"text": "Header 2", "row_span": 1, "col_span": 1}
                        ],
                        [
                            {"text": "Row 1", "row_span": 1, "col_span": 1},
                            {"text": "Data", "row_span": 1, "col_span": 1}
                        ]
                    ]
                }
            ]
        }
    ]
}
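Once the presentation is in this form, tables can be pulled out with ordinary dictionary access. The following is a minimal sketch that assumes the "slides" / "shapes" / "table" layout shown above; table_to_rows and iter_tables are hypothetical helper names.

```python
def table_to_rows(shape_data):
    """Return the table of a parsed shape as a list of rows of cell text."""
    return [[cell.get("text", "") for cell in row] for row in shape_data.get("table", [])]


def iter_tables(presentation_data):
    """Yield (slide_number, rows) for every parsed shape that carries a "table" key."""
    for slide in presentation_data.get("slides", []):
        for shape in slide.get("shapes", []):
            if "table" in shape:
                yield slide.get("slide_number"), table_to_rows(shape)


parsed = {
    "slides": [
        {
            "slide_number": 254,
            "shapes": [
                {
                    "type_name": "MSO_SHAPE_TYPE.TABLE",
                    "table": [
                        [{"text": "Header 1", "row_span": 1, "col_span": 1},
                         {"text": "Header 2", "row_span": 1, "col_span": 1}],
                        [{"text": "Row 1", "row_span": 1, "col_span": 1},
                         {"text": "Data", "row_span": 1, "col_span": 1}],
                    ],
                }
            ],
        }
    ]
}

for slide_number, rows in iter_tables(parsed):
    print(slide_number, rows)
# 254 [['Header 1', 'Header 2'], ['Row 1', 'Data']]
```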
4. Save as JSON: save_to_json
Save the parsed results as a file in a readable format.
def save_to_json(data, output_path):
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4, ensure_ascii=False)
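A small round-trip check can confirm what the two keyword arguments do. This is a sketch using a temporary directory and made-up sample data; the function is repeated so the snippet runs on its own.

```python
import json
import os
import tempfile


def save_to_json(data, output_path):
    """Write data as indented UTF-8 JSON, keeping non-ASCII characters readable."""
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4, ensure_ascii=False)


# Made-up sample data with non-ASCII text in one shape.
sample = {"slides": [{"slide_number": 254, "shapes": [{"text": "标题 Title"}]}]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "presentation_info.json")
    save_to_json(sample, path)
    with open(path, encoding="utf-8") as f:
        raw = f.read()
    loaded = json.loads(raw)

print("标题" in raw)     # True
print(loaded == sample)  # True
```

With ensure_ascii=False the Chinese characters are written as-is instead of as \uXXXX escapes, which is why the file must be opened with an explicit UTF-8 encoding.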
Example of usage
1. Install dependencies
pip install python-pptx
2. Run the code
if __name__ == "__main__":
    input_pptx = ""  # Enter the PPT path
    output_json = "presentation_info.json"  # Output JSON path
    parsed_data = parse_presentation(input_pptx)
    save_to_json(parsed_data, output_json)
3. Output JSON structure
{
    "slides": [
        {
            "slide_number": 254,
            "shapes": [
                {
                    "type": 17,
                    "name": "Text Box 1",
                    "text": "Sample Text",
                    "fill": {"type": "MSO_FILL.BACKGROUND", "color": null},
                    "line": {"color": null, "width": 0, "dash_style": null},
                    "text_style": {
                        "paragraphs": [
                            {
                                "text": "Sample Text",
                                "runs": [
                                    {
                                        "text": "Sample Text",
                                        "font": {
                                            "name": "Calibri",
                                            "size": 11,
                                            "bold": false,
                                            "color": "#000000"
                                        }
                                    }
                                ]
                            }
                        ]
                    }
                }
            ]
        }
    ]
}
Core functions and applicable scenarios
1. Supported shape types
Type Name | Corresponding value | Description |
---|---|---|
MSO_SHAPE_TYPE.TEXT_BOX | 17 | Text box |
MSO_SHAPE_TYPE.TABLE | 19 | Table |
MSO_SHAPE_TYPE.PICTURE | 13 | Picture |
Other shapes | Per the MSO_SHAPE_TYPE enumeration | Auto shapes, lines, etc. |
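To see which shape types a given deck actually contains, the parsed data can be tallied. The sketch below assumes each parsed shape carries a "type_name" or "type" key as in the sample outputs; count_shape_types is a hypothetical helper name.

```python
from collections import Counter


def count_shape_types(presentation_data):
    """Count shapes per type across all slides of a parsed presentation."""
    counts = Counter()
    for slide in presentation_data.get("slides", []):
        for shape in slide.get("shapes", []):
            # Prefer the readable name; fall back to the numeric code.
            counts[shape.get("type_name", shape.get("type"))] += 1
    return counts


parsed = {
    "slides": [
        {
            "slide_number": 254,
            "shapes": [
                {"type_name": "MSO_SHAPE_TYPE.TABLE"},
                {"type_name": "MSO_SHAPE_TYPE.TEXT_BOX"},
                {"type_name": "MSO_SHAPE_TYPE.TEXT_BOX"},
            ],
        }
    ]
}

print(count_shape_types(parsed).most_common())
# [('MSO_SHAPE_TYPE.TEXT_BOX', 2), ('MSO_SHAPE_TYPE.TABLE', 1)]
```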
2. Typical application scenarios
- Style reuse: Extract template styles and batch generate PPTs that comply with specifications.
- Data migration: Export PPT content to JSON for data analysis or content management.
- Automated generation: dynamically generate PPTs (requires implementing a reverse apply function).
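The reverse apply function itself is not shown in this article. One small building block it would likely need is the inverse of rgb_to_hex. The sketch below uses a hypothetical helper name, hex_to_rgb, and assumes colors are stored as six-digit strings such as #FF0000.

```python
def hex_to_rgb(hex_color):
    """Convert a string such as "#FF0000" back to an (r, g, b) tuple; None passes through."""
    if hex_color is None:
        return None
    value = hex_color.lstrip("#")
    if len(value) != 6:
        raise ValueError(f"expected a 6-digit hex color, got {hex_color!r}")
    # Parse each pair of hex digits as one channel.
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


print(hex_to_rgb("#FF0000"))  # (255, 0, 0)
print(hex_to_rgb("#0080ff"))  # (0, 128, 255)
print(hex_to_rgb(None))       # None
```

When rebuilding a slide, the resulting tuple could be unpacked into python-pptx's RGBColor(r, g, b) before being assigned to a color property.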
Things to note
- Unit conversion:
  - Dimensions in PPT are measured in EMU (English Metric Units). Divide by 9525 to convert to pixels at 96 DPI (1 pixel = 9525 EMU, so 1 EMU ≈ 0.0001 pixels).
- Exception handling:
  - Some shapes (such as pictures) have no text frame, so check shape.has_text_frame before reading text.
  - The color value may be None (for example when the fill is transparent), so a default value must be set.
- Extension suggestions:
  - Support more styles, such as shadows and 3D effects.
  - Reverse generation: reconstruct a PPT file from the JSON data.
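The unit conversion and defensive checks in the notes above can be sketched as follows. The helper names (emu_to_pixels, emu_to_points, safe_hex_color, shape_text) are hypothetical, and SimpleNamespace objects stand in for real python-pptx shapes and colors so the snippet runs without a presentation file.

```python
from types import SimpleNamespace

EMU_PER_INCH = 914400
EMU_PER_POINT = 12700
EMU_PER_PIXEL = 9525  # valid at 96 DPI: 914400 / 96


def emu_to_pixels(emu):
    """Convert EMU to pixels, assuming 96 DPI."""
    return emu / EMU_PER_PIXEL


def emu_to_points(emu):
    """Convert EMU to typographic points."""
    return emu / EMU_PER_POINT


def safe_hex_color(color, default=None):
    """Return the color as #RRGGBB, or default when no RGB value is available."""
    try:
        rgb = color.rgb
    except (AttributeError, TypeError):
        return default
    if rgb is None:
        return default
    return f"#{rgb[0]:02X}{rgb[1]:02X}{rgb[2]:02X}"


def shape_text(shape):
    """Return the text of a shape, or "" for shapes without a text frame."""
    if getattr(shape, "has_text_frame", False):
        return shape.text_frame.text
    return ""


# Stand-in objects (assumptions for illustration, not real python-pptx shapes).
picture = SimpleNamespace(has_text_frame=False)
text_box = SimpleNamespace(has_text_frame=True, text_frame=SimpleNamespace(text="Hello World"))

print(emu_to_pixels(914400))                                         # 96.0
print(emu_to_points(12700))                                          # 1.0
print(safe_hex_color(SimpleNamespace(rgb=(255, 0, 0))))              # #FF0000
print(safe_hex_color(SimpleNamespace(rgb=None), default="#000000"))  # #000000
print(safe_hex_color(SimpleNamespace(), default="#000000"))          # #000000
print(repr(shape_text(picture)))                                     # ''
print(shape_text(text_box))                                          # Hello World
```

The pixel figure depends on the assumed screen resolution; points and inches are resolution-independent, so they are usually the safer units to store.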
Summary
With the code and walkthrough in this article, developers can quickly implement automatic parsing and data extraction for PPT files. Whether for academic research, enterprise report automation, or generating content with an LLM, this toolchain provides a solid foundation. Next, you can try:
- Generate content in combination with an LLM: after generating text with GPT, fill it into the JSON text field.
- Visualize styles: render the JSON data as a web page or chart to preview the PPT design.
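Filling generated text into the JSON could look roughly like this. It is only a sketch: fill_text is a hypothetical helper, shapes are matched by their "name" field (an assumption), and the replacement text is hard-coded where LLM output would normally go.

```python
import copy


def fill_text(presentation_data, new_text_by_shape_name):
    """Return a copy of the parsed data with "text" replaced for the named shapes."""
    updated = copy.deepcopy(presentation_data)
    for slide in updated.get("slides", []):
        for shape in slide.get("shapes", []):
            name = shape.get("name")
            if name in new_text_by_shape_name:
                shape["text"] = new_text_by_shape_name[name]
    return updated


parsed = {
    "slides": [
        {"slide_number": 254, "shapes": [{"name": "Text Box 1", "text": "Sample Text"}]}
    ]
}
generated = {"Text Box 1": "Quarterly summary"}  # placeholder for LLM output

result = fill_text(parsed, generated)
print(result["slides"][0]["shapes"][0]["text"])  # Quarterly summary
print(parsed["slides"][0]["shapes"][0]["text"])  # Sample Text
```

Working on a deep copy leaves the original parse result untouched, so the same template data can be filled several times.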
With the combination of Python and JSON, automation of PPT has never been easier!
import json

from pptx import Presentation
from pptx.enum.dml import MSO_FILL, MSO_LINE_DASH_STYLE
from pptx.enum.shapes import MSO_SHAPE_TYPE
from pptx.enum.text import PP_ALIGN


def rgb_to_hex(rgb):
    """Convert RGB color to hexadecimal string (such as #FF0000)"""
    if rgb is None:
        return None
    return f"#{rgb[0]:02X}{rgb[1]:02X}{rgb[2]:02X}"


def parse_shape(shape):
    """Parse the information of a single shape"""
    data = {
        "type": shape.shape_type,
        "type_name": str(MSO_SHAPE_TYPE(shape.shape_type)),
        "name": shape.name,
        "left": shape.left,
        "top": shape.top,
        "width": shape.width,
        "height": shape.height,
        "rotation": shape.rotation,
        "text": "",
        "fill": {},
        "line": {},
        "text_style": {},
    }

    # Parse fill style. Not every shape type exposes .fill (e.g. tables),
    # and fore_color is only readable for solid or patterned fills.
    fill = getattr(shape, "fill", None)
    if fill is not None:
        fill_type = str(MSO_FILL(fill.type)) if fill.type is not None else None
        try:
            fill_color = rgb_to_hex(fill.fore_color.rgb)
        except (AttributeError, TypeError):
            fill_color = None
        data["fill"] = {"type": fill_type, "color": fill_color}

    # Parse border style. Not every shape type exposes .line (e.g. tables).
    line = getattr(shape, "line", None)
    if line is not None:
        try:
            line_color = rgb_to_hex(line.color.rgb)
        except (AttributeError, TypeError):
            line_color = None
        data["line"] = {
            "color": line_color,
            "width": line.width,
            "dash_style": str(MSO_LINE_DASH_STYLE(line.dash_style)) if line.dash_style else None,
        }

    # Parse text styles (if the shape has a text frame)
    if shape.has_text_frame:
        text_frame = shape.text_frame
        paragraphs = []
        for paragraph in text_frame.paragraphs:
            runs = []
            for run in paragraph.runs:
                # Colors without an explicit RGB value (e.g. theme colors) become None
                try:
                    font_color = rgb_to_hex(run.font.color.rgb)
                except (AttributeError, TypeError):
                    font_color = None
                run_data = {
                    "text": run.text,
                    "font": {
                        "name": run.font.name,
                        "size": run.font.size,
                        "bold": run.font.bold,
                        "italic": run.font.italic,
                        "color": font_color,
                    },
                }
                runs.append(run_data)
            paragraph_data = {
                "text": paragraph.text,
                "runs": runs,
                "level": paragraph.level,
                "alignment": str(PP_ALIGN(paragraph.alignment)) if paragraph.alignment else None,
            }
            paragraphs.append(paragraph_data)
        data["text_style"] = {"paragraphs": paragraphs}
        data["text"] = text_frame.text

    # Process tables
    if shape.shape_type == MSO_SHAPE_TYPE.TABLE:
        table = shape.table
        rows = []
        for row in table.rows:
            cells = []
            for cell in row.cells:
                cell_data = {
                    "text": cell.text.strip(),
                    # python-pptx exposes cell spans as span_height / span_width
                    "row_span": cell.span_height,
                    "col_span": cell.span_width,
                }
                cells.append(cell_data)
            rows.append(cells)
        data["table"] = rows

    # Process pictures
    if shape.shape_type == MSO_SHAPE_TYPE.PICTURE:
        data["image"] = {"width": shape.width, "height": shape.height}

    return data


def parse_presentation(pptx_path):
    """Parse the entire PPT file and return the JSON structure"""
    prs = Presentation(pptx_path)
    presentation_data = {"slides": []}
    for slide in prs.slides:
        slide_data = {"slide_number": slide.slide_id, "shapes": []}
        for shape in slide.shapes:
            shape_data = parse_shape(shape)
            slide_data["shapes"].append(shape_data)
        presentation_data["slides"].append(slide_data)
    return presentation_data


def save_to_json(data, output_path):
    """Save parsed data as a JSON file"""
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4, ensure_ascii=False)


# Usage example
if __name__ == "__main__":
    input_pptx = ""  # Enter the PPT file path
    output_json = "presentation_info.json"  # Output JSON file path
    # Parse PPT
    parsed_data = parse_presentation(input_pptx)
    # Save as JSON
    save_to_json(parsed_data, output_json)