SoFunction
Updated on 2025-05-11

Using Camelot in Python to accurately extract table data from PDFs

Preface - Why PDF table data extraction is so important

In data analysis and business intelligence, the tables inside PDF documents are a gold mine, yet the format's closed nature makes them a nightmare for data practitioners. From corporate financial reports to government statistics, from research papers to market research reports, key information is often locked inside PDF tables and cannot be analyzed directly. Manual copy and paste is slow and error-prone, and general-purpose PDF parsers often fail on complex tables. Camelot is a Python library designed specifically for PDF table extraction; its accurate table recognition and flexible configuration options have made it a go-to tool for data professionals. This article covers Camelot from basic installation to advanced applications to help you extract table data from PDFs reliably.

1. Getting started with Camelot basics

1.1 Installation and Environment Configuration

Installing Camelot is straightforward, but a few dependencies need attention:

# Basic installation
pip install "camelot-py[cv]"
# If the PDF conversion function is required
pip install ghostscript

For full functionality, make sure to install the following dependencies:

  • Ghostscript: for PDF file processing
  • OpenCV: for image processing and table detection
  • Tkinter: for visualization functions (optional)

On Windows systems, you also need to install Ghostscript separately and add it to the system path.
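
Before running the examples, it can help to confirm that these dependencies are visible to Python. The following is a minimal sketch using only the standard library. The helper name `check_camelot_dependencies` is hypothetical, and the Ghostscript executable names (`gs`, `gswin64c`, `gswin32c`) are common defaults assumed here, not guaranteed for every installation.

```python
import importlib.util
import shutil


def check_camelot_dependencies():
    """Report which of Camelot's dependencies appear to be available.

    Hypothetical helper: it only checks that modules can be located and
    that a Ghostscript executable is on PATH; it does not verify versions.
    """
    modules = {
        "camelot": "camelot",   # installed by camelot-py
        "opencv": "cv2",        # installed by the [cv] extra
        "tkinter": "tkinter",   # optional, used for visualization
    }
    report = {
        name: importlib.util.find_spec(module) is not None
        for name, module in modules.items()
    }
    # Assumed executable names: `gs` on Linux/macOS,
    # `gswin64c` / `gswin32c` on Windows.
    report["ghostscript"] = any(
        shutil.which(exe) is not None for exe in ("gs", "gswin64c", "gswin32c")
    )
    return report


for name, found in check_camelot_dependencies().items():
    print(f"{name}: {'found' if found else 'missing'}")
```

A "missing" entry for Ghostscript on Windows usually means its `bin` directory has not been added to the system path yet.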

Basic import:

import camelot
import pandas as pd
import matplotlib.pyplot as plt
import cv2

1.2 Basic table extraction

def extract_basic_tables(pdf_path, pages='1'):
    """Extract basic tables from a PDF."""
    # Use stream mode to extract tables
    tables = camelot.read_pdf(pdf_path, pages=pages, flavor='stream')
    print(f"Detected {len(tables)} table(s)")
    # Basic information about each table
    for i, table in enumerate(tables):
        print(f"\nTable #{i+1}:")
        print(f"Page number: {table.page}")
        # _bbox is a private Camelot attribute (assumed here); it may be absent in some versions
        print(f"Table area: {getattr(table, '_bbox', None)}")
        print(f"Dimensions: {table.shape}")
        print(f"Accuracy score: {table.accuracy}")
        print(f"Whitespace rate: {table.whitespace}")
        # Show the first few rows of the table
        print("\nTable preview:")
        print(table.df.head())
    return tables

# Usage example
tables = extract_basic_tables("financial_report.pdf", pages='1-3')

1.3 Comparing extraction methods: Stream vs Lattice

def compare_extraction_methods(pdf_path, page='1'):
    """Compare the Stream and Lattice extraction methods."""
    # Stream method: infers columns from the whitespace between text
    stream_tables = camelot.read_pdf(pdf_path, pages=page, flavor='stream')
    # Lattice method: relies on the ruling lines drawn around cells
    lattice_tables = camelot.read_pdf(pdf_path, pages=page, flavor='lattice')
    # Compare results
    print(f"Stream method: detected {len(stream_tables)} table(s)")
    print(f"Lattice method: detected {len(lattice_tables)} table(s)")
    # If both methods found a table, compare the first table from each
    if len(stream_tables) > 0 and len(lattice_tables) > 0:
        stream_table = stream_tables[0]
        lattice_table = lattice_tables[0]
        # Compare accuracy and whitespace rate
        print("\nComparison of accuracy and whitespace rate:")
        print(f"Stream - Accuracy: {stream_table.accuracy}, Whitespace rate: {stream_table.whitespace}")
        print(f"Lattice - Accuracy: {lattice_table.accuracy}, Whitespace rate: {lattice_table.whitespace}")
        # Compare table shapes
        print("\nTable dimension comparison:")
        print(f"Stream: {stream_table.shape}")
        print(f"Lattice: {lattice_table.shape}")
        # Return the tables from both methods
        return stream_tables, lattice_tables
    return None, None

# Usage example
stream_tables, lattice_tables = compare_extraction_methods("report_with_tables.pdf")

2. Advanced table extraction techniques

2.1 Accurately locate table areas

def extract_table_with_area(pdf_path, page='1', table_area=None):
    """Extract tables using precise area coordinates."""
    kwargs = {}
    if table_area is not None:
        # table_areas takes "x1,y1,x2,y2" strings in PDF points, not percentages:
        # (x1, y1) is the top-left corner, (x2, y2) the bottom-right corner,
        # and the origin is the bottom-left corner of the page.
        kwargs['table_areas'] = [','.join(str(v) for v in table_area)]
    # Use the Stream method to extract the table in the specified area
    # (without table_area, the whole page is searched)
    tables = camelot.read_pdf(pdf_path, pages=page, flavor='stream', **kwargs)
    print(f"Detected {len(tables)} table(s) in the specified area")
    # Show the first table
    if len(tables) > 0:
        print("\nTable preview:")
        print(tables[0].df.head())
    return tables

# Usage example: extract a table from roughly the middle of a US Letter page (612 x 792 pt)
tables = extract_table_with_area("financial_report.pdf", table_area=[61, 554, 551, 238])
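
Camelot's `table_areas` option expects coordinates in PDF points, with the origin at the bottom-left corner of the page, so an area described as percentages of the page has to be converted first. The sketch below shows one way to do that. The helper name `percent_area_to_pdf_coords` is hypothetical, percentages are assumed to be measured from the top-left corner, and the page size must be supplied by the caller.

```python
def percent_area_to_pdf_coords(area_pct, page_width, page_height):
    """Convert [left, top, right, bottom] percentages to a Camelot area string.

    Percentages are measured from the top-left corner of the page (an
    assumption of this sketch). The result is "x1,y1,x2,y2" in PDF points,
    where (x1, y1) is the top-left corner and (x2, y2) the bottom-right
    corner; because the PDF origin is at the bottom-left, the vertical
    axis is flipped.
    """
    left, top, right, bottom = area_pct
    x1 = page_width * left / 100
    x2 = page_width * right / 100
    y1 = page_height * (1 - top / 100)
    y2 = page_height * (1 - bottom / 100)
    return f"{x1:.1f},{y1:.1f},{x2:.1f},{y2:.1f}"


# US Letter is 612 x 792 points; other page sizes need their own values.
area = percent_area_to_pdf_coords([10, 30, 90, 70], 612, 792)
print(area)  # 61.2,554.4,550.8,237.6
```

The resulting string can be passed as a single element of the `table_areas` list.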

2.2 Handling complex tables

def extract_complex_tables(pdf_path, page='1'):
    """Advanced configuration for handling complex tables."""
    # Use the Lattice method for complex tables with borders
    lattice_tables = camelot.read_pdf(
        pdf_path,
        pages=page,
        flavor='lattice',
        line_scale=40,  # Line detection sensitivity (larger values detect shorter lines)
        process_background=True,  # Also process background lines
        line_tol=2  # Tolerance for merging close line segments
    )
    # Use the Stream method for complex tables without borders
    stream_tables = camelot.read_pdf(
        pdf_path,
        pages=page,
        flavor='stream',
        edge_tol=500,  # Edge tolerance
        row_tol=10,    # Row tolerance
        column_tol=10  # Column tolerance
    )
    print(f"Lattice method: detected {len(lattice_tables)} table(s)")
    print(f"Stream method: detected {len(stream_tables)} table(s)")
    # Choose the better result, guarding against an empty result from either method
    if len(lattice_tables) == 0:
        return stream_tables
    if len(stream_tables) == 0:
        return lattice_tables
    best_tables = lattice_tables if lattice_tables[0].accuracy > stream_tables[0].accuracy else stream_tables
    return best_tables

# Usage example
complex_tables = extract_complex_tables("complex_financial_report.pdf")

2.3 Table visualization and debugging

def visualize_table_extraction(pdf_path, page='1'):
    """Visualize the table extraction process to help debug and optimize."""
    # Extract tables
    tables = camelot.read_pdf(pdf_path, pages=page)
    # Check whether any table was extracted
    if len(tables) == 0:
        print("No tables detected")
        return None
    # Get the first table
    table = tables[0]
    # Show basic table information
    print(f"Table shape: {table.shape}")
    print(f"Accuracy: {table.accuracy}")
    # Draw the table structure
    fig = camelot.plot(table, kind='grid')
    plt.title(f"Table grid structure - Accuracy: {table.accuracy}")
    plt.tight_layout()
    fig.savefig('table_grid.png')
    plt.close(fig)
    # Draw the table contour
    fig = camelot.plot(table, kind='contour')
    plt.title(f"Table cell structure - Whitespace rate: {table.whitespace}")
    plt.tight_layout()
    fig.savefig('table_contour.png')
    plt.close(fig)
    # Draw the detected lines (only available for the lattice method)
    if table.flavor == 'lattice':
        fig = camelot.plot(table, kind='line')
        plt.title("Table line detection")
        plt.tight_layout()
        fig.savefig('table_lines.png')
        plt.close(fig)
    print("Visualizations saved")
    return tables

# Usage example
visualized_tables = visualize_table_extraction("quarterly_report.pdf")

3. Table data processing and cleaning

3.1 Table data cleaning

def clean_table_data(table):
    """Clean table data extracted from a PDF."""
    # Get a copy of the DataFrame
    df = table.df.copy()
    # 1. Replace blank cells with missing values
    df = df.replace('', float('nan'))
    # 2. Strip surrounding whitespace
    for col in df.columns:
        if df[col].dtype == object:  # Process only string columns
            df[col] = df[col].str.strip() if df[col].notna().any() else df[col]
    # 3. Handle merged cells (fill down)
    df = df.ffill()
    # 4. Detect and remove a page header or footer row (usually the first or last row)
    if df.shape[0] > 2:
        # Check whether the first row is a page header
        if df.iloc[0].astype(str).str.contains('Page|Date').any():
            df = df.iloc[1:]
        # Check whether the last row is a footer or totals row
        if df.iloc[-1].astype(str).str.contains('Total').any():
            df = df.iloc[:-1]
    # 5. Reset the index
    df = df.reset_index(drop=True)
    # 6. Use the first row as column names (optional)
    # df.columns = df.iloc[0]
    # df = df.iloc[1:].reset_index(drop=True)
    return df

# Usage example
tables = camelot.read_pdf("financial_data.pdf")
if tables:
    cleaned_df = clean_table_data(tables[0])
    print(cleaned_df.head())

3.2 Multi-table merge

def merge_tables(tables, merge_method='vertical'):
    """Merge multiple tables."""
    if not tables or len(tables) == 0:
        return None
    dfs = [table.df for table in tables]
    if merge_method == 'vertical':
        # Vertical merge (for a table that continues across pages)
        merged_df = pd.concat(dfs, ignore_index=True)
    elif merge_method == 'horizontal':
        # Horizontal merge (for a table split into column groups)
        merged_df = pd.concat(dfs, axis=1)
    else:
        raise ValueError("The merge method must be 'vertical' or 'horizontal'")
    # Clean the merged data: drop exact duplicate rows (usually repeated table headers)
    merged_df = merged_df.drop_duplicates()
    return merged_df

# Usage example: merge a table that spans several pages
tables = camelot.read_pdf("multipage_report.pdf", pages='1-3')
if tables:
    merged_table = merge_tables(tables, merge_method='vertical')
    print(f"Merged table size: {merged_table.shape}")
    print(merged_table.head())

3.3 Table data type conversion

def convert_table_datatypes(df):
    """Convert table data to appropriate data types."""
    # Work on a copy of the DataFrame
    df = df.copy()
    for col in df.columns:
        # Try to convert the column to a numeric type
        try:
            # Check whether the column contains currency symbols or thousands separators
            if df[col].str.contains(r'[$¥€£]|\d,\d', na=False).any():
                # Remove currency symbols and thousands separators
                df[col] = df[col].replace(r'[$¥€£,]', '', regex=True)
            # Try to convert to numeric
            df[col] = pd.to_numeric(df[col])
            print(f"Column '{col}' converted to numeric")
        except (ValueError, AttributeError):
            # Try to convert to a date type
            try:
                df[col] = pd.to_datetime(df[col])
                print(f"Column '{col}' converted to date type")
            except (ValueError, AttributeError):
                # Keep the column as strings
                pass
    return df

# Usage example
tables = camelot.read_pdf("sales_report.pdf")
if tables:
    df = clean_table_data(tables[0])
    typed_df = convert_table_datatypes(df)
    print(typed_df.dtypes)
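
To see what the currency-stripping step does to individual cell values, the same idea can be tried on plain strings without pandas. This is only an illustrative sketch: `parse_amount` is a hypothetical helper, and the sample strings are made up.

```python
import re

# Same character class as the pandas version, plus whitespace.
_CURRENCY_CHARS = re.compile(r"[$¥€£,\s]")


def parse_amount(text):
    """Return a float for strings such as "$1,234.50", or None if not numeric."""
    cleaned = _CURRENCY_CHARS.sub("", text)
    try:
        return float(cleaned)
    except ValueError:
        return None


samples = ["$1,234.50", "¥980", "€12,000", "N/A"]
print([parse_amount(s) for s in samples])  # [1234.5, 980.0, 12000.0, None]
```

Values that cannot be parsed come back as `None`, which mirrors how the DataFrame version leaves non-numeric columns as strings.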

4. Practical application scenarios

4.1 Extract financial statement data

def extract_financial_statements(pdf_path, pages='all'):
    """Extract financial statements from an annual report."""
    # Extract all tables
    tables = camelot.read_pdf(
        pdf_path,
        pages=pages,
        flavor='stream',
        edge_tol=500,
        row_tol=10
    )
    print(f"Extracted {len(tables)} table(s) in total")
    # Find the financial statements (by keyword)
    balance_sheet = None
    income_statement = None
    cash_flow = None
    for table in tables:
        df = table.df
        # Check whether the table contains specific keywords
        text = ' '.join([' '.join(map(str, row)) for row in df.values.tolist()])
        if any(term in text for term in ['Balance Sheet', 'Statement of Financial Position']):
            balance_sheet = clean_table_data(table)
            print("Found the balance sheet")
        elif any(term in text for term in ['Income Statement', 'Profit and Loss Statement']):
            income_statement = clean_table_data(table)
            print("Found the income statement")
        elif any(term in text for term in ['Cash Flow Statement', 'Cash Flow']):
            cash_flow = clean_table_data(table)
            print("Found the cash flow statement")
    return {
        'balance_sheet': balance_sheet,
        'income_statement': income_statement,
        'cash_flow': cash_flow
    }

# Usage example
financial_data = extract_financial_statements("annual_report_2022.pdf", pages='10-30')
for statement_name, df in financial_data.items():
    if df is not None:
        print(f"\n{statement_name}:")
        print(df.head())

4.2 Batch processing of multiple PDFs

def batch_process_pdfs(pdf_folder, output_folder='extracted_tables'):
    """Process multiple PDF files in a batch and extract all tables."""
    import os
    from pathlib import Path
    # Create the output folder
    Path(output_folder).mkdir(parents=True, exist_ok=True)
    # Get all PDF files
    pdf_files = [f for f in os.listdir(pdf_folder) if f.lower().endswith('.pdf')]
    results = {}
    for pdf_file in pdf_files:
        pdf_path = os.path.join(pdf_folder, pdf_file)
        pdf_name = os.path.splitext(pdf_file)[0]
        print(f"\nProcessing: {pdf_file}")
        # Create a dedicated output folder for this PDF
        pdf_output_folder = os.path.join(output_folder, pdf_name)
        Path(pdf_output_folder).mkdir(exist_ok=True)
        try:
            # Extract tables
            tables = camelot.read_pdf(pdf_path, pages='all')
            print(f"Extracted {len(tables)} table(s) from {pdf_file}")
            # Save each table as a CSV file
            for i, table in enumerate(tables):
                df = clean_table_data(table)
                output_path = os.path.join(pdf_output_folder, f"table_{i+1}.csv")
                df.to_csv(output_path, index=False, encoding='utf-8-sig')
            # Record the result
            results[pdf_file] = {
                'status': 'success',
                'tables_count': len(tables),
                'output_folder': pdf_output_folder
            }
        except Exception as e:
            print(f"An error occurred while processing {pdf_file}: {str(e)}")
            results[pdf_file] = {
                'status': 'error',
                'error_message': str(e)
            }
    # Summary report
    success_count = sum(1 for result in results.values() if result['status'] == 'success')
    print(f"\nBatch processing complete. Succeeded: {success_count}/{len(pdf_files)}")
    return results

# Usage example
batch_results = batch_process_pdfs("reports_folder", "extracted_data")

4.3 Create an interactive data dashboard

def create_dashboard_from_tables(tables, output_html='table_dashboard.html'):
    """Create a simple interactive dashboard from extracted tables."""
    import plotly.express as px
    # Make sure there is at least one table
    if not tables or len(tables) == 0:
        print("No table data is available for creating a dashboard")
        return
    # For simplicity, use the first table
    df = clean_table_data(tables[0])
    # Columns are extracted as strings, so try to convert some of them to numeric values
    df = convert_table_datatypes(df)
    # Create the dashboard HTML
    with open(output_html, 'w', encoding='utf-8') as f:
        f.write("<html><head>")
        f.write("<title>PDF Table Data Dashboard</title>")
        f.write("<style>body {font-family: Arial; margin: 20px;} .chart {margin: 20px 0; padding: 20px; border: 1px solid #ddd;}</style>")
        f.write("</head><body>")
        f.write("<h1>PDF Table Data Dashboard</h1>")
        # Add the table
        f.write("<div class='chart'>")
        f.write("<h2>Extracted Table Data</h2>")
        f.write(df.to_html(classes='dataframe', index=False))
        f.write("</div>")
        # If there are numeric columns, create charts
        numeric_cols = df.select_dtypes(include=['number']).columns
        if len(numeric_cols) > 0:
            # Use the first numeric column for the charts
            value_col = numeric_cols[0]
            # Find a likely category column (few distinct string values)
            category_col = None
            for col in df.columns:
                if col != value_col and df[col].dtype == object and df[col].nunique() < len(df) * 0.5:
                    category_col = col
                    break
            if category_col is not None:
                # Create a bar chart
                fig = px.bar(df, x=category_col, y=value_col, title=f"{category_col} vs {value_col}")
                f.write("<div class='chart'>")
                f.write(f"<h2>{category_col} vs {value_col}</h2>")
                f.write(fig.to_html(full_html=False))
                f.write("</div>")
                # Create a pie chart
                fig = px.pie(df, names=category_col, values=value_col, title=f"{value_col} by {category_col}")
                f.write("<div class='chart'>")
                f.write(f"<h2>{value_col} by {category_col} (pie chart)</h2>")
                f.write(fig.to_html(full_html=False))
                f.write("</div>")
        f.write("</body></html>")
    print(f"Dashboard created: {output_html}")
    return output_html

# Usage example
tables = camelot.read_pdf("sales_by_region.pdf")
if tables:
    dashboard_path = create_dashboard_from_tables(tables)

5. Advanced configuration and optimization

5.1 Optimize table detection parameters

def optimize_table_detection(pdf_path, page='1'):
    """Optimize table detection parameters, try different settings and evaluate results"""
    # Define different parameter combinations
    stream_configs = [
        {'edge_tol': 50, 'row_tol': 5, 'column_tol': 5},
        {'edge_tol': 100, 'row_tol': 10, 'column_tol': 10},
        {'edge_tol': 500, 'row_tol': 15, 'column_tol': 15}
    ]
    lattice_configs = [
        {'process_background': True, 'line_scale': 15},
        {'process_background': True, 'line_scale': 40},
        {'process_background': True, 'line_scale': 60, 'iterations': 1}
    ]
    results = []
    # Test different configurations of the Stream method
    print("Testing the Stream method...")
    for config in stream_configs:
        try:
            tables = camelot.read_pdf(
                pdf_path,
                pages=page,
                flavor='stream',
                **config
            )
            # Evaluate the results
            if len(tables) > 0:
                accuracy = tables[0].accuracy
                whitespace = tables[0].whitespace
                print(f"Configuration {config}: Accuracy={accuracy:.2f}, Whitespace rate={whitespace:.2f}")
                results.append({
                    'flavor': 'stream',
                    'config': config,
                    'tables_found': len(tables),
                    'accuracy': accuracy,
                    'whitespace': whitespace,
                    'tables': tables
                })
        except Exception as e:
            print(f"Configuration {config} An error occurred: {str(e)}")
    # Test different configurations of the Lattice method
    print("\nTesting the Lattice method...")
    for config in lattice_configs:
        try:
            tables = camelot.read_pdf(
                pdf_path,
                pages=page,
                flavor='lattice',
                **config
            )
            # Evaluate the results
            if len(tables) > 0:
                accuracy = tables[0].accuracy
                whitespace = tables[0].whitespace
                print(f"Configuration {config}: Accuracy={accuracy:.2f}, Whitespace rate={whitespace:.2f}")
                results.append({
                    'flavor': 'lattice',
                    'config': config,
                    'tables_found': len(tables),
                    'accuracy': accuracy,
                    'whitespace': whitespace,
                    'tables': tables
                })
        except Exception as e:
            print(f"Configuration {config} An error occurred: {str(e)}")
    # Find the best configuration
    if results:
        # Sort by accuracy, highest first
        best_result = sorted(results, key=lambda x: x['accuracy'], reverse=True)[0]
        print(f"\nBest configuration: {best_result['flavor']} method, parameters: {best_result['config']}")
        print(f"Accuracy: {best_result['accuracy']:.2f}, Blank rate: {best_result['whitespace']:.2f}")
        return best_result['tables']
    return None

# Usage example
optimized_tables = optimize_table_detection("complex_report.pdf")
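
Sorting by accuracy alone leaves ties unresolved. One possible refinement, sketched here under the assumption that a lower whitespace rate is preferable between equally accurate results, is to rank on both values. The helper `pick_best_result` and its `min_accuracy` cut-off are hypothetical, and the sample numbers are invented for illustration.

```python
def pick_best_result(results, min_accuracy=0.0):
    """Pick the most accurate result, preferring lower whitespace on ties.

    `results` is a list of dicts with 'accuracy' and 'whitespace' keys,
    shaped like the entries collected by optimize_table_detection.
    `min_accuracy` is a hypothetical cut-off added for this sketch.
    """
    candidates = [r for r in results if r["accuracy"] >= min_accuracy]
    if not candidates:
        return None
    # Negating accuracy makes min() pick the highest accuracy first,
    # then the lowest whitespace among equally accurate results.
    return min(candidates, key=lambda r: (-r["accuracy"], r["whitespace"]))


# Invented sample data for illustration only.
example_results = [
    {"flavor": "stream", "accuracy": 99.0, "whitespace": 35.0},
    {"flavor": "lattice", "accuracy": 99.0, "whitespace": 12.0},
    {"flavor": "stream", "accuracy": 80.0, "whitespace": 5.0},
]
best = pick_best_result(example_results)
print(best["flavor"])  # lattice
```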

5.2 Processing scanned PDFs

def extract_tables_from_scanned_pdf(pdf_path, page='1'):
    """Extract tables from scanned PDFs (preprocessing required)."""
    import os
    import cv2
    import numpy as np
    import tempfile
    from pdf2image import convert_from_path
    # Convert the PDF page to an image
    images = convert_from_path(pdf_path, first_page=int(page), last_page=int(page))
    if not images:
        print("Cannot convert the PDF page to an image")
        return None
    # Get the first page image (pdf2image returns PIL images in RGB order)
    image = np.array(images[0])
    # Image preprocessing: grayscale, then binary threshold
    gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
    thresh = cv2.threshold(gray, 150, 255, cv2.THRESH_BINARY)[1]
    # Save the processed image to a temporary file
    with tempfile.NamedTemporaryFile(suffix='.png', delete=False) as tmp:
        temp_image_path = tmp.name
    cv2.imwrite(temp_image_path, thresh)
    print(f"The scanned page has been preprocessed and saved as a temporary image: {temp_image_path}")
    # Note: Camelot works on text-based PDFs. The call below reads the original
    # PDF, not the preprocessed image, so it only finds tables if the file has a
    # text layer. A purely scanned page needs OCR first; the saved image can be
    # used for inspection or as OCR input.
    tables = camelot.read_pdf(
        pdf_path,
        pages=page,
        flavor='lattice',
        process_background=True,
        line_scale=150
    )
    print(f"Extracted {len(tables)} table(s) from the scanned PDF")
    # Optional: remove the temporary file
    os.remove(temp_image_path)
    return tables

# Usage example
scanned_tables = extract_tables_from_scanned_pdf("scanned_report.pdf")

5.3 Processing merged cells

def handle_merged_cells(table):
    """Handle merged cells in a table."""
    # Get a copy of the DataFrame
    df = table.df.copy()
    # Detect and process vertically merged cells
    for col in df.columns:
        # Fill values downward into consecutive empty cells
        mask = df[col].eq('')
        if mask.any():
            prev_value = None
            fill_values = []
            for idx, is_empty in enumerate(mask):
                if not is_empty:
                    prev_value = df.loc[idx, col]
                elif prev_value is not None:
                    fill_values.append((idx, prev_value))
            # Fill in the detected merged cells
            for idx, value in fill_values:
                df.loc[idx, col] = value
    # Detect and process horizontally merged cells
    for idx, row in df.iterrows():
        empty_cols = row.index[row == ''].tolist()
        if empty_cols and idx > 0:
            # Look for empty cells that are followed by a non-empty cell in this row
            for col in empty_cols:
                pos = df.columns.get_loc(col)
                if pos + 1 < len(row) and row.iloc[pos + 1] != '':
                    # Probably a horizontal merge: fill from the cell on the left
                    left_col_idx = pos - 1
                    if left_col_idx >= 0 and row.iloc[left_col_idx] != '':
                        df.loc[idx, col] = row.iloc[left_col_idx]
    return df

# Usage example
tables = camelot.read_pdf("report_with_merged_cells.pdf")
if tables:
    cleaned_df = handle_merged_cells(tables[0])
    print(cleaned_df.head())

6. Integrate with other tools

6.1 Deep integration with pandas

def analyze_extracted_table(table):
    """Analyze extracted table data using pandas."""
    # Clean the data
    df = clean_table_data(table)
    # Convert data types
    df = convert_table_datatypes(df)
    # Basic statistical analysis
    print("\n==== Basic statistical analysis ====")
    numeric_cols = df.select_dtypes(include=['number']).columns
    if len(numeric_cols) > 0:
        print(df[numeric_cols].describe())
    # Check for missing values
    print("\n==== Missing value analysis ====")
    missing = df.isnull().sum()
    print(missing[missing > 0])
    # Categorical variable analysis
    print("\n==== Categorical variable analysis ====")
    categorical_cols = df.select_dtypes(include=['object']).columns
    for col in categorical_cols[:3]:  # Show only the first three categorical columns
        value_counts = df[col].value_counts()
        print(f"\n{col}:")
        print(value_counts.head())
    # Correlation analysis
    if len(numeric_cols) >= 2:
        print("\n==== Correlation analysis ====")
        correlation = df[numeric_cols].corr()
        print(correlation)
    return df

# Usage example
tables = camelot.read_pdf("sales_data.pdf")
if tables:
    analyzed_df = analyze_extracted_table(tables[0])

6.2 Visualization with matplotlib and seaborn

def visualize_table_data(table, output_prefix='table_viz'):
    """Visualize table data using matplotlib and seaborn."""
    import matplotlib.pyplot as plt
    import numpy as np
    import seaborn as sns
    # Set the style
    sns.set(style="whitegrid")
    # Clean and convert the data
    df = clean_table_data(table)
    df = convert_table_datatypes(df)
    # Get the numeric columns
    numeric_cols = df.select_dtypes(include=['number']).columns
    if len(numeric_cols) == 0:
        print("No numeric columns are available for visualization")
        return
    # 1. Heat map: correlation between numeric columns
    if len(numeric_cols) >= 2:
        plt.figure(figsize=(10, 8))
        corr = df[numeric_cols].corr()
        # Hide the upper triangle, which mirrors the lower one
        mask = np.triu(np.ones_like(corr, dtype=bool))
        sns.heatmap(corr, mask=mask, annot=True, fmt=".2f", cmap='coolwarm',
                    square=True, linewidths=.5)
        plt.title('Correlation heat map', fontsize=15)
        plt.tight_layout()
        plt.savefig(f'{output_prefix}_correlation.png')
        plt.close()
    # 2. Bar chart: values per row
    for col in numeric_cols[:3]:  # Only process the first three numeric columns
        plt.figure(figsize=(12, 6))
        sns.barplot(x=df.index, y=df[col])
        plt.title(f'Distribution of {col}', fontsize=15)
        plt.xticks(rotation=45)
        plt.tight_layout()
        plt.savefig(f'{output_prefix}_{col}_barplot.png')
        plt.close()
    # 3. Box plot: spot outliers
    plt.figure(figsize=(12, 6))
    sns.boxplot(data=df[numeric_cols])
    plt.title('Box plot of numeric columns', fontsize=15)
    plt.tight_layout()
    plt.savefig(f'{output_prefix}_boxplot.png')
    plt.close()
    # 4. Scatter plot matrix: relationships between variables
    if len(numeric_cols) >= 2 and len(df) > 5:
        sns.pairplot(df[numeric_cols])
        plt.suptitle('Scatter plot matrix', y=1.02, fontsize=15)
        plt.tight_layout()
        plt.savefig(f'{output_prefix}_pairplot.png')
        plt.close()
    print(f"Charts saved with prefix: {output_prefix}")
    return df

# Usage example
tables = camelot.read_pdf("product_sales.pdf")
if tables:
    df = visualize_table_data(tables[0], output_prefix='sales_viz')

6.3 Integration with Excel

def export_tables_to_excel(tables, output_path='extracted_tables.xlsx'):
    """Export extracted tables to separate worksheets in an Excel workbook."""
    import pandas as pd
    # Check whether there are any tables
    if not tables or len(tables) == 0:
        print("No tables to export")
        return None
    # Create the Excel writer object
    with pd.ExcelWriter(output_path, engine='xlsxwriter') as writer:
        workbook = writer.book
        # Create table styles
        header_format = workbook.add_format({
            'bold': True,
            'text_wrap': True,
            'valign': 'top',
            'fg_color': '#D7E4BC',
            'border': 1
        })
        cell_format = workbook.add_format({
            'border': 1
        })
        # Create one worksheet per table
        for i, table in enumerate(tables):
            # Clean the data
            df = clean_table_data(table)
            # Write the data
            sheet_name = f'Table_{i+1}'
            df.to_excel(writer, sheet_name=sheet_name, index=False)
            # Get the worksheet object
            worksheet = writer.sheets[sheet_name]
            # Set column widths (column labels may be integers, so convert to str)
            for j, col in enumerate(df.columns):
                column_width = max(
                    df[col].astype(str).map(len).max(),
                    len(str(col))
                ) + 2
                worksheet.set_column(j, j, column_width)
            # Format the range as an Excel table
            worksheet.add_table(0, 0, df.shape[0], df.shape[1] - 1, {
                'columns': [{'header': str(col)} for col in df.columns],
                'style': 'Table Style Medium 9',
                'header_row': True
            })
            # Add table metadata below the data
            worksheet.write(df.shape[0] + 2, 0, f"Page number: {table.page}")
            # _bbox is a private Camelot attribute (assumed here); it may be absent in some versions
            worksheet.write(df.shape[0] + 3, 0, f"Table area: {getattr(table, '_bbox', None)}")
            worksheet.write(df.shape[0] + 4, 0, f"Accuracy score: {table.accuracy}")
    print(f"Tables exported to Excel: {output_path}")
    return output_path

# Usage example
tables = camelot.read_pdf("quarterly_report.pdf", pages='all')
if tables:
    excel_path = export_tables_to_excel(tables, "quarterly_report_tables.xlsx")

7. Performance optimization and best practices

7.1 Handling large PDF documents

def process_large_pdf(pdf_path, batch_size=5, output_folder='large_pdf_tables'):
    """Process a large PDF document in page batches to save memory."""
    import gc
    import os
    from pathlib import Path
    # Assumption: pdfplumber is used only to count pages; the original library
    # name was lost, and any PDF library that reports the page count works here.
    import pdfplumber
    # Create the output folder
    Path(output_folder).mkdir(parents=True, exist_ok=True)
    # Get the number of PDF pages first
    with pdfplumber.open(pdf_path) as pdf:
        total_pages = len(pdf.pages)
    print(f"The PDF has {total_pages} page(s) in total")
    # Process the pages in batches
    all_tables_count = 0
    for start_page in range(1, total_pages + 1, batch_size):
        end_page = min(start_page + batch_size - 1, total_pages)
        page_range = f"{start_page}-{end_page}"
        print(f"Processing page range: {page_range}")
        try:
            # Extract the tables of the current batch
            tables = camelot.read_pdf(
                pdf_path,
                pages=page_range,
                flavor='stream',
                edge_tol=500
            )
            batch_tables_count = len(tables)
            all_tables_count += batch_tables_count
            print(f"Extracted {batch_tables_count} table(s) from pages {page_range}")
            # Save this batch of tables
            for i, table in enumerate(tables):
                table_index = all_tables_count - batch_tables_count + i + 1
                df = clean_table_data(table)
                output_path = os.path.join(output_folder, f"table_{table_index}_p{table.page}.csv")
                df.to_csv(output_path, index=False, encoding='utf-8-sig')
            # Explicitly release memory
            tables = None
            gc.collect()
        except Exception as e:
            print(f"An error occurred while processing pages {page_range}: {str(e)}")
    print(f"Processing complete: extracted {all_tables_count} table(s), saved to {output_folder}")
    return all_tables_count

# Usage example
table_count = process_large_pdf("very_large_report.pdf", batch_size=10)
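
The page-range arithmetic used for batching can be isolated and checked on its own. The generator below is a small sketch of that step; `page_batches` is a hypothetical name, and it writes a single-page batch as one number, not as a range.

```python
def page_batches(total_pages, batch_size):
    """Yield Camelot-style page ranges ("1-5", "6-10", ...) covering all pages."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(1, total_pages + 1, batch_size):
        end = min(start + batch_size - 1, total_pages)
        # A single-page batch is written as "11" rather than "11-11".
        yield str(start) if start == end else f"{start}-{end}"


print(list(page_batches(12, 5)))  # ['1-5', '6-10', '11-12']
```

Each yielded string can be passed directly as the `pages` argument of `camelot.read_pdf`.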

7.2 Camelot performance tuning

def optimize_camelot_performance(pdf_path, page='1'):
    """Tuning Camelot's performance parameters"""
    import time
    import psutil
    import os
    def measure_performance(func, *args, **kwargs):
        """Measure the execution time and memory usage of a function"""
        process = psutil.Process(os.getpid())
        mem_before = process.memory_info().rss / 1024 / 1024  # MB
        start_time = time.perf_counter()
        result = func(*args, **kwargs)
        end_time = time.perf_counter()
        mem_after = process.memory_info().rss / 1024 / 1024  # MB
        execution_time = end_time - start_time
        memory_used = mem_after - mem_before
        return result, execution_time, memory_used
    # Test different parameter combinations
    configs = [
        {
            'name': 'Default configuration',
            'params': {}
        },
        {
            'name': 'Enable background processing',
            'params': {'process_background': True}
        },
        {
            'name': 'Disable line detection',
            'params': {'line_scale': 0}
        },
        {
            'name': 'Improving line detection sensitivity',
            'params': {'line_scale': 80}
        }
    ]
    results = []
    for config in configs:
        print(f"\nTest configuration: {config['name']}")
        # Test the Lattice method
        try:
            lattice_func = lambda: camelot.read_pdf(
                pdf_path, 
                pages=page, 
                flavor='lattice',
                **config['params']
            )
            lattice_tables, lattice_time, lattice_mem = measure_performance(lattice_func)
            results.append({
                'config_name': config['name'],
                'method': 'Lattice',
                'time': lattice_time,
                'memory': lattice_mem,
                'tables_count': len(lattice_tables),
                'accuracy': lattice_tables[0].accuracy if len(lattice_tables) > 0 else 0
            })
            print(f"  Lattice - time: {lattice_time:.2f} s, memory: {lattice_mem:.2f} MB, accuracy: {lattice_tables[0].accuracy if len(lattice_tables) > 0 else 0}")
        except Exception as e:
            print(f"  Lattice method failed: {str(e)}")
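        # Test the Stream method (lattice-only options such as line_scale are
        # rejected for this flavor and end up in the except branch)
        try: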
            stream_func = lambda: camelot.read_pdf(
                pdf_path, 
                pages=page, 
                flavor='stream',
                **config['params']
            )
            stream_tables, stream_time, stream_mem = measure_performance(stream_func)
            results.append({
                'config_name': config['name'],
                'method': 'Stream',
                'time': stream_time,
                'memory': stream_mem,
                'tables_count': len(stream_tables),
                'accuracy': stream_tables[0].accuracy if len(stream_tables) > 0 else 0
            })
            print(f"  Stream - time: {stream_time:.2f} s, memory: {stream_mem:.2f} MB, accuracy: {stream_tables[0].accuracy if len(stream_tables) > 0 else 0}")
        except Exception as e:
            print(f"  Stream method failed: {str(e)}")
    # Find the best-performing configuration
    if results:
        # Sort by accuracy
        accuracy_best = sorted(results, key=lambda x: x['accuracy'], reverse=True)[0]
        print(f"\nHighest-accuracy configuration: {accuracy_best['config_name']} / {accuracy_best['method']}")
        print(f"  Accuracy: {accuracy_best['accuracy']:.2f}, time: {accuracy_best['time']:.2f} s")
        # Sort by time
        time_best = sorted(results, key=lambda x: x['time'])[0]
        print(f"\nFastest configuration: {time_best['config_name']} / {time_best['method']}")
        print(f"  Time: {time_best['time']:.2f} s, accuracy: {time_best['accuracy']:.2f}")
        # Sort by memory usage
        memory_best = sorted(results, key=lambda x: x['memory'])[0]
        print(f"\nLowest-memory configuration: {memory_best['config_name']} / {memory_best['method']}")
        print(f"  Memory: {memory_best['memory']:.2f} MB, accuracy: {memory_best['accuracy']:.2f}")
        # Best trade-off between speed and accuracy (lower score is better);
        # max() guards against division by zero when no table was found
        balanced = sorted(results, key=lambda x: x['time'] / max(x['accuracy'], 1e-6))[0]
        print(f"\nBalanced configuration: {balanced['config_name']} / {balanced['method']}")
        print(f"  Accuracy: {balanced['accuracy']:.2f}, time: {balanced['time']:.2f} s")
        return balanced
    return None
# Usage example
best_config = optimize_camelot_performance("sample_report.pdf")
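
The balanced score is time divided by accuracy, so runs that found no table (accuracy 0) need special handling. The following is a minimal sketch that simply excludes such runs before ranking. The `pick_balanced` helper and the sample numbers are hypothetical; they only mirror the shape of the `results` dictionaries collected in the function above.

```python
def pick_balanced(results):
    """Return the result with the lowest time-per-accuracy score.

    Runs with an accuracy of 0 (no table found) are excluded, so the
    score never divides by zero. Returns None if nothing is usable.
    """
    usable = [r for r in results if r["accuracy"] > 0]
    if not usable:
        return None
    return min(usable, key=lambda r: r["time"] / r["accuracy"])


# Hypothetical measurements, shaped like the dictionaries collected above
sample_results = [
    {"config_name": "Default configuration", "method": "Lattice", "time": 2.0, "accuracy": 95.0},
    {"config_name": "Default configuration", "method": "Stream", "time": 0.5, "accuracy": 80.0},
    {"config_name": "Disable line detection", "method": "Lattice", "time": 0.1, "accuracy": 0},
]

best = pick_balanced(sample_results)
print(best["config_name"], "/", best["method"])
```

Excluding failed runs keeps a fast but useless configuration from winning the ranking just because it returned quickly.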

8. Comparison with other tools

8.1 Camelot vs. PyPDF2/PyPDF4

def compare_with_pypdf(pdf_path, page=0):
    """Compare the extraction capabilities of Camelot with PyPDF2"""
    import PyPDF2
    print("\n===== PyPDF2 extraction results =====")
    try:
        # Extract text using PyPDF2
        with open(pdf_path, 'rb') as file:
            reader = PyPDF2.PdfReader(file)
            if page < len(reader.pages):
                text = reader.pages[page].extract_text()
                print(f"Extracted text ({len(text)} characters):")
                print(text[:500] + "..." if len(text) > 500 else text)
                print("\nPyPDF2 cannot recognize table structure; it extracts plain text only")
            else:
                print(f"Page number {page} is out of range")
    except Exception as e:
        print(f"PyPDF2 extraction error: {str(e)}")
    print("\n===== Camelot extraction results =====")
    try:
        # Use Camelot to extract tables
        tables = camelot.read_pdf(pdf_path, pages=str(page+1))  # Camelot page numbers start at 1
        print(f"Detected {len(tables)} table(s)")
        if len(tables) > 0:
            table = tables[0]
            print(f"Table dimensions: {table.shape}")
            print(f"Accuracy: {table.accuracy}")
            print("\nTable preview:")
            print(table.df.head().to_string())
            print("\nCamelot can identify table structures and preserve row-column relationships")
    except Exception as e:
        print(f"Camelot extraction error: {str(e)}")
    return None
# Usage example
compare_with_pypdf("financial_data.pdf")

8.2 Camelot vs. Tabula

def compare_with_tabula(pdf_path, page='1'):
    """Compare the table extraction capabilities of Camelot and Tabula"""
    try:
        import tabula
    except ImportError:
        print("Please install tabula-py: pip install tabula-py")
        return
    print("\n===== Tabula extraction results =====")
    try:
        # Extract tables using Tabula
        tabula_tables = tabula.read_pdf(pdf_path, pages=page)
        print(f"Detected {len(tabula_tables)} table(s)")
        if len(tabula_tables) > 0:
            tabula_df = tabula_tables[0]
            print(f"Table dimensions: {tabula_df.shape}")
            print("\nTable preview:")
            print(tabula_df.head().to_string())
    except Exception as e:
        print(f"Tabula extraction error: {str(e)}")
    print("\n===== Camelot extraction results =====")
    try:
        # Use Camelot to extract tables
        camelot_tables = camelot.read_pdf(pdf_path, pages=page)
        print(f"Detected {len(camelot_tables)} table(s)")
        if len(camelot_tables) > 0:
            camelot_df = camelot_tables[0].df
            print(f"Table dimensions: {camelot_df.shape}")
            print(f"Accuracy: {camelot_tables[0].accuracy}")
            print("\nTable preview:")
            print(camelot_df.head().to_string())
    except Exception as e:
        print(f"Camelot extraction error: {str(e)}")
    # Compare results (only if both extractions ran without raising)
    if 'tabula_tables' in locals() and 'camelot_tables' in locals():
        if len(tabula_tables) > 0 and len(camelot_tables) > 0:
            tabula_df = tabula_tables[0]
            camelot_df = camelot_tables[0].df
            print("\n===== Comparison results =====")
            print(f"Tabula table size: {tabula_df.shape}")
            print(f"Camelot table size: {camelot_df.shape}")
            # Check whether the same number of columns was extracted
            if tabula_df.shape[1] != camelot_df.shape[1]:
                print(f"Different column counts: Tabula={tabula_df.shape[1]}, Camelot={camelot_df.shape[1]}")
                print("This may indicate that one of the tools recognizes the table structure better")
            # Check whether the same number of rows was extracted
            if tabula_df.shape[0] != camelot_df.shape[0]:
                print(f"Different row counts: Tabula={tabula_df.shape[0]}, Camelot={camelot_df.shape[0]}")
                print("This may indicate that one of the tools recognizes the table boundaries better")
    return None
# Usage example
compare_with_tabula("complex_table.pdf")

8.3 Camelot vs. pdfplumber

def compare_with_pdfplumber(pdf_path, page=0):
    """Compare the table extraction capabilities of Camelot and pdfplumber"""
    try:
        import pdfplumber
    except ImportError:
        print("Please install pdfplumber: pip install pdfplumber")
        return
    print("\n===== pdfplumber extraction results =====")
    try:
        # Use pdfplumber to extract tables
        with pdfplumber.open(pdf_path) as pdf:
            if page < len(pdf.pages):
                plumber_page = pdf.pages[page]
                plumber_tables = plumber_page.extract_tables()
                print(f"Detected {len(plumber_tables)} table(s)")
                if len(plumber_tables) > 0:
                    plumber_table = plumber_tables[0]
                    # Treat the first extracted row as the header
                    plumber_df = pd.DataFrame(plumber_table[1:], columns=plumber_table[0])
                    print(f"Table dimensions: {plumber_df.shape}")
                    print("\nTable preview:")
                    print(plumber_df.head().to_string())
            else:
                print(f"Page number {page} is out of range")
    except Exception as e:
        print(f"pdfplumber extraction error: {str(e)}")
    print("\n===== Camelot extraction results =====")
    try:
        # Use Camelot to extract tables
        camelot_tables = camelot.read_pdf(pdf_path, pages=str(page+1))  # Camelot page numbers start at 1
        print(f"Detected {len(camelot_tables)} table(s)")
        if len(camelot_tables) > 0:
            camelot_df = camelot_tables[0].df
            print(f"Table dimensions: {camelot_df.shape}")
            print(f"Accuracy: {camelot_tables[0].accuracy}")
            print("\nTable preview:")
            print(camelot_df.head().to_string())
    except Exception as e:
        print(f"Camelot extraction error: {str(e)}")
    return None
# Usage example
compare_with_pdfplumber("annual_report.pdf")
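
Comparing shapes only shows that two tools disagree, not where. As a rough sketch, the hypothetical `diff_tables` helper below compares two tables cell by cell. It assumes both tables are plain lists of rows, which is what pdfplumber's `extract_tables()` returns and what `camelot_df.values.tolist()` yields for a Camelot DataFrame. The sample data is made up.

```python
def normalize(cell):
    """Collapse whitespace so purely cosmetic differences are ignored."""
    text = "" if cell is None else str(cell)
    return " ".join(text.split())


def diff_tables(rows_a, rows_b):
    """Return (row, col, value_a, value_b) for every cell that differs.

    Cells missing from one table are reported with the value None.
    """
    differences = []
    for r in range(max(len(rows_a), len(rows_b))):
        row_a = rows_a[r] if r < len(rows_a) else []
        row_b = rows_b[r] if r < len(rows_b) else []
        for c in range(max(len(row_a), len(row_b))):
            a = normalize(row_a[c]) if c < len(row_a) else None
            b = normalize(row_b[c]) if c < len(row_b) else None
            if a != b:
                differences.append((r, c, a, b))
    return differences


# Hypothetical output of two extraction tools for the same table
table_a = [["Year", "Revenue"], ["2023", "1,200"], ["2024", "1 500"]]
table_b = [["Year", "Revenue"], ["2023", "1,200 "], ["2024", "1,500"]]

for row, col, a, b in diff_tables(table_a, table_b):
    print(f"row {row}, col {col}: {a!r} != {b!r}")
```

Normalizing whitespace first keeps trailing spaces and line breaks inside cells from drowning out real differences.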

9. Troubleshooting and FAQs

9.1 Diagnosing extraction problems

def diagnose_extraction_issues(pdf_path, page='1'):
    """Diagnose and resolve table extraction problems"""
    # Check whether the PDF is accessible
    try:
        with open(pdf_path, 'rb') as f:
            pass
    except Exception as e:
        print(f"Unable to access the PDF file: {str(e)}")
        return
    # Check whether it is a scanned PDF
    import fitz  # PyMuPDF
    try:
        doc = fitz.open(pdf_path)
        page_obj = doc[int(page) - 1]
        text = page_obj.get_text()
        if len(text.strip()) < 50:
            print("This may be a scanned or image-only PDF")
            print("Suggestion: use OCR software to convert it to a searchable PDF first")
        # Check page rotation
        rotation = page_obj.rotation
        if rotation != 0:
            print(f"The page is rotated by {rotation} degrees")
            print("Suggestion: use PyMuPDF or another tool to rotate the page upright first")
    except Exception as e:
        print(f"Error while inspecting the PDF: {str(e)}")
    # Try different extraction methods
    print("\nTrying different Camelot configurations...")
    # Try the Lattice method
    try:
        print("\nUsing the Lattice method:")
        lattice_tables = camelot.read_pdf(
            pdf_path,
            pages=page,
            flavor='lattice'
        )
        if len(lattice_tables) > 0:
            print(f"Successfully extracted {len(lattice_tables)} table(s)")
            print(f"Accuracy: {lattice_tables[0].accuracy}")
        else:
            print("No tables detected")
            print("Suggestion: try adjusting the line_scale parameter and the table areas")
    except Exception as e:
        print(f"Lattice method failed: {str(e)}")
    # Try the Stream method
    try:
        print("\nUsing the Stream method:")
        stream_tables = camelot.read_pdf(
            pdf_path,
            pages=page,
            flavor='stream'
        )
        if len(stream_tables) > 0:
            print(f"Successfully extracted {len(stream_tables)} table(s)")
            print(f"Accuracy: {stream_tables[0].accuracy}")
        else:
            print("No tables detected")
            print("Suggestion: try specifying a table area")
    except Exception as e:
        print(f"Stream method failed: {str(e)}")
    # Recommendations
    print("\n==== General recommendations ====")
    print("1. If both methods fail, try specifying the table area")
    print("2. For PDFs with visible table lines, try the Lattice method first and adjust line_scale")
    print("3. For PDFs without table lines, try the Stream method first and adjust edge_tol")
    print("4. Try converting PDF pages to images and preprocessing them with OpenCV before extraction")
    print("5. For scanned PDFs, consider running OCR software first")
    return None
# Usage example
diagnose_extraction_issues("problematic_report.pdf")
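
The diagnostic function converts `page` with `int(page)`, which only works for a single page number, whereas Camelot's `pages` argument also accepts comma-separated lists and ranges such as '1,3-5'. The sketch below shows one way to expand such a string into zero-based indices for PyMuPDF. The `parse_pages` helper is hypothetical, and the open-ended forms 'all' and 'N-end' are resolved against a page count that the caller supplies.

```python
def parse_pages(pages, page_count):
    """Expand a Camelot-style pages string into sorted zero-based indices."""
    if pages.strip() == "all":
        return list(range(page_count))
    indices = set()
    for part in pages.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-", 1)
            first = int(start)
            last = page_count if end.strip() == "end" else int(end)
        else:
            first = last = int(part)
        if first < 1 or last > page_count or first > last:
            raise ValueError(f"Invalid page range: {part!r}")
        # Camelot pages are 1-based; PyMuPDF indices are 0-based
        indices.update(range(first - 1, last))
    return sorted(indices)


print(parse_pages("1,3-5", page_count=10))
print(parse_pages("8-end", page_count=10))
```

Each returned index could then be checked in turn, for example with `doc[index]`, instead of inspecting only the first page.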

9.2 Common Errors and Solutions

def common_errors_guide():
    """Providing a guide to solving common errors in Camelot"""
    errors = {
        "ImportError: No module named 'cv2'": {
            "reason": "Missing OpenCV dependencies",
            "Solution": "Run pip install opencv-python"
        },
        "File does not exist": {
            "reason": "File path error",
            "Solution": "Check that the file path is correct, including case and spaces"
        },
        "OCR engine not reachable": {
            "reason": "OCR was requested but Tesseract is not installed",
            "Solution": "Install Tesseract OCR and make sure it is on the system path"
        },
        "Invalid page range specified": {
            "reason": "The specified page number is outside the PDF's page range",
            "Solution": "Make sure the page number is within the document's page count; Camelot page numbers start at 1"
        },
        "Unable to process background": {
            "reason": "Background processing failed, usually because of a Ghostscript problem",
            "Solution": "Check that Ghostscript is installed correctly, or disable background processing (process_background=False)"
        },
        "No tables found on page": {
            "reason": "Camelot cannot detect a table on the specified page",
            "Solution": [
                "1. Try the other extraction method (lattice or stream)",
                "2. Manually specify the table area",
                "3. Adjust detection parameters (line_scale, edge_tol, etc.)",
                "4. Check whether the PDF is a scanned document; if so, run OCR on it first"
            ]
        }
    }
    print("==== Common Camelot Errors and Solutions ====\n")
    for error, info in errors.items():
        print(f"Error: {error}")
        print(f"Cause: {info['reason']}")
        if isinstance(info['Solution'], list):
            print("Solution:")
            for solution in info['Solution']:
                print(f"  {solution}")
        else:
            print(f"Solution: {info['Solution']}")
        print()
    print("==== General advice ====")
    print("1. Always use the latest version of Camelot and its dependencies")
    print("2. For complex tables, try to analyze the table structure and specify the area manually")
    print("3. Verify table boundary detection using visualization tools")
    print("4. For large PDFs, consider processing pages by batch")
    print("5. If one extraction method fails, try another method")
    return None
# Usage example
common_errors_guide()
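
The advice to process large PDFs in batches can be sketched as follows. The `page_batches` helper is hypothetical: it only builds page-range strings in the '1-10' form that `camelot.read_pdf` accepts for its `pages` argument. The total page count is assumed to be known already, for example from PyMuPDF or PyPDF2.

```python
def page_batches(total_pages, batch_size=10):
    """Yield page-range strings such as '1-10', '11-20' covering all pages."""
    if total_pages < 0 or batch_size < 1:
        raise ValueError("total_pages must be >= 0 and batch_size >= 1")
    for start in range(1, total_pages + 1, batch_size):
        end = min(start + batch_size - 1, total_pages)
        yield str(start) if start == end else f"{start}-{end}"


for pages in page_batches(25, batch_size=10):
    # Each string could be passed on as camelot.read_pdf(pdf_path, pages=pages)
    print(pages)
```

Processing one batch at a time and saving its tables before moving on means that only a few pages' worth of results are held in memory at once.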

10. Summary and Outlook

As a specialized PDF table extraction tool, Camelot gives data analysts and developers a powerful solution. With the techniques described in this article, you can:

  • Extract table data accurately from text-based PDF documents, including complex tables (scanned documents must first be converted to searchable PDFs with OCR)
  • Choose the most suitable extraction method (Lattice or Stream) for each table type
  • Clean and process extracted table data to solve common problems such as merged cells
  • Integrate extraction into the data analysis workflow alongside pandas, matplotlib and other tools
  • Optimize extraction performance and process large PDF documents
  • Create automated data extraction pipelines to batch process multiple PDF files

As demand for data analysis grows, extracting data from PDF tables is becoming increasingly important. Looking ahead, we can expect the following trends:

  • Better table detection and structure understanding through deep learning
  • Stronger handling of complex layouts and multilingual tables
  • Smarter data type recognition and semantic understanding
  • Deeper integration with automation workflow platforms
  • Wider availability of cloud services and APIs that make table extraction more convenient

Mastering PDF table extraction not only improves work efficiency but also unlocks business value from data that used to be "locked" in PDF files. I hope this article helps you use Camelot to obtain table data from PDF documents efficiently and accurately.

Appendix: Table extraction parameter reference

# Lattice method parameter reference
lattice_params = {
    'line_scale': 15,             # Line size scaling factor; the larger the value, the smaller the lines that are detected
    'copy_text': None,            # Directions ('h', 'v') in which to copy text across spanning cells
    'shift_text': ['l', 't'],     # Directions ('l', 'r', 't', 'b') in which text in a spanning cell flows
    'line_tol': 2,                # Tolerance for merging close line segments
    'joint_tol': 2,               # Connection point tolerance
    'threshold_blocksize': 15,    # Block size of adaptive threshold
    'threshold_constant': -2,     # Constant of adaptive threshold
    'iterations': 0,              # Number of iterations of morphological operations
    'resolution': 300,            # DPI used for the PDF-to-PNG conversion
    'process_background': False,  # Whether to process background lines
    'table_areas': None,          # Table areas, each given as a string "x1,y1,x2,y2"
    'table_regions': None         # Regions to search for tables, same format as table_areas
}
# Stream method parameter reference
stream_params = {
    'table_areas': None,          # Table areas, each given as a string "x1,y1,x2,y2"
    'columns': None,              # x-coordinates of the column separators
    'row_tol': 2,                 # Row tolerance
    'column_tol': 0,              # Column tolerance
    'edge_tol': 50,               # Edge tolerance
    'split_text': False,          # Whether to split text that spans multiple cells
    'flag_size': False,           # Whether to flag text by size
    'strip_text': ''              # Characters to strip from cell text
}
# The image-processing options (process_background, joint_tol, threshold_blocksize,
# threshold_constant, iterations, resolution) belong to Lattice only; Camelot raises
# an error if they are passed with flavor='stream'.
# edge_segment_counts, min_columns, max_columns and split_columns do not appear among
# Camelot's documented read_pdf options; confirm they exist in your version before use.
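
Because Camelot rejects options that belong to the other flavor, it can help to filter a shared settings dictionary before calling `camelot.read_pdf`. The sketch below is a hypothetical helper, not part of Camelot. The option names are the flavor-specific ones documented for `read_pdf` as far as we know, so verify them against your installed version.

```python
# Options assumed to be specific to one flavor (check your Camelot version)
LATTICE_ONLY = {
    "process_background", "line_scale", "copy_text", "shift_text", "line_tol",
    "joint_tol", "threshold_blocksize", "threshold_constant", "iterations",
    "resolution",
}
STREAM_ONLY = {"columns", "edge_tol", "row_tol", "column_tol"}


def params_for(flavor, params):
    """Return a copy of params without options meant for the other flavor."""
    if flavor not in ("lattice", "stream"):
        raise ValueError(f"Unknown flavor: {flavor!r}")
    excluded = STREAM_ONLY if flavor == "lattice" else LATTICE_ONLY
    return {key: value for key, value in params.items() if key not in excluded}


shared = {"line_scale": 40, "row_tol": 5, "strip_text": "\n"}
print(params_for("lattice", shared))
print(params_for("stream", shared))
```

The filtered dictionary can then be unpacked into the call, for example `camelot.read_pdf(pdf_path, flavor='stream', **params_for('stream', shared))`.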

By mastering Camelot, you will be able to extract table data efficiently from a wide range of PDF documents and give your data analysis and automation workflows solid support.
