Preface - Why PDF table data extraction is so important
In data analysis and business intelligence, the tables inside PDF documents are a gold mine, yet the format makes them a nightmare to work with. From corporate financial reports to government statistics, from research papers to market research reports, key figures are often locked inside PDF tables and cannot be fed directly into an analysis. Copying and pasting by hand is slow and error-prone, and general-purpose PDF parsers often cannot handle complex tables. Camelot is a Python library built specifically for PDF table extraction, and its table detection and flexible configuration options have made it a dependable tool for data professionals. This article walks through Camelot from basic installation to advanced applications.
1. Getting started with Camelot basics
1.1 Installation and Environment Configuration
Installing Camelot is straightforward, but a few dependencies need attention:

    # Basic installation
    pip install "camelot-py[cv]"
    # If the PDF conversion function is required
    pip install ghostscript
For full functionality, make sure to install the following dependencies:
- Ghostscript: for PDF file processing
- OpenCV: for image processing and table detection
- Tkinter: for visualization functions (optional)
On Windows systems, you also need to install Ghostscript separately and add it to the system path.
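Before running the examples, it can help to confirm that the dependencies listed above are actually visible to Python. The following is a minimal sketch using only the standard library. The helper name `check_camelot_dependencies` and the Ghostscript executable names are assumptions about a typical installation, not part of Camelot.

```python
import importlib.util
import shutil


def check_camelot_dependencies():
    """Report which of the dependencies listed above can be found locally."""
    # Assumed executable names: "gs" on Linux/macOS, "gswin64c"/"gswin32c" on Windows.
    gs_names = ("gs", "gswin64c", "gswin32c")
    return {
        "ghostscript": any(shutil.which(name) for name in gs_names),
        "opencv": importlib.util.find_spec("cv2") is not None,
        "tkinter": importlib.util.find_spec("tkinter") is not None,
        "camelot": importlib.util.find_spec("camelot") is not None,
    }


if __name__ == "__main__":
    for name, found in check_camelot_dependencies().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

If Ghostscript is reported as missing on Windows, the usual cause is that its `bin` directory has not been added to the system path.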
Basic import:
    import camelot
    import pandas as pd
    import matplotlib.pyplot as plt
    import cv2
1.2 Basic table extraction
    def extract_basic_tables(pdf_path, pages='1'):
        """Extract basic tables from a PDF."""
        # Use stream mode to extract tables
        tables = camelot.read_pdf(pdf_path, pages=pages, flavor='stream')
        print(f"Detected {len(tables)} table(s)")

        # Basic information about each table
        for i, table in enumerate(tables):
            print(f"\nTable #{i+1}:")
            print(f"Page number: {table.page}")
            # _bbox is a private attribute and may change between versions
            print(f"Table area: {table._bbox}")
            print(f"Dimensions: {table.shape}")
            print(f"Accuracy score: {table.accuracy}")
            print(f"Whitespace rate: {table.whitespace}")

            # Show the first few rows of the table
            print("\nTable preview:")
            print(table.df.head())

        return tables

    # Usage example
    tables = extract_basic_tables("financial_report.pdf", pages='1-3')
1.3 Comparison of Extraction Methods Stream vs Lattice
    def compare_extraction_methods(pdf_path, page='1'):
        """Compare the Stream and Lattice extraction methods."""
        # Stream: infers columns from whitespace between text
        stream_tables = camelot.read_pdf(pdf_path, pages=page, flavor='stream')
        # Lattice: relies on ruling lines drawn around cells
        lattice_tables = camelot.read_pdf(pdf_path, pages=page, flavor='lattice')

        # Compare results
        print(f"Stream method: detected {len(stream_tables)} table(s)")
        print(f"Lattice method: detected {len(lattice_tables)} table(s)")

        # If both methods found a table, compare the first one from each
        if len(stream_tables) > 0 and len(lattice_tables) > 0:
            stream_table = stream_tables[0]
            lattice_table = lattice_tables[0]

            print("\nComparison of accuracy and whitespace rate:")
            print(f"Stream - Accuracy: {stream_table.accuracy}, Whitespace: {stream_table.whitespace}")
            print(f"Lattice - Accuracy: {lattice_table.accuracy}, Whitespace: {lattice_table.whitespace}")

            print("\nTable dimension comparison:")
            print(f"Stream: {stream_table.shape}")
            print(f"Lattice: {lattice_table.shape}")

            # Return the tables from both methods
            return stream_tables, lattice_tables

        return None, None

    # Usage example
    stream_tables, lattice_tables = compare_extraction_methods("report_with_tables.pdf")
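The comparison above prints the scores but leaves the choice to the reader. One way to turn the two scores into a decision is sketched below. The function `pick_flavor`, the plain-dict report format and the 50% whitespace threshold are illustrative assumptions, not Camelot behavior. The dicts could be built from each table's accuracy and whitespace values.

```python
def pick_flavor(stream_report, lattice_report, max_whitespace=50.0):
    """Pick the flavor whose first table looks more trustworthy.

    Each report is a dict such as {"accuracy": 99.0, "whitespace": 12.0},
    or None when that flavor found no table. Both values are percentages.
    """
    candidates = []
    # Lattice is listed first so that it wins ties.
    for flavor, report in (("lattice", lattice_report), ("stream", stream_report)):
        if report is None:
            continue
        # Penalize tables that consist mostly of empty cells.
        penalty = max(0.0, report["whitespace"] - max_whitespace)
        candidates.append((report["accuracy"] - penalty, flavor))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]


print(pick_flavor({"accuracy": 99, "whitespace": 90}, {"accuracy": 80, "whitespace": 10}))
```

A high accuracy score with very high whitespace often means the text was placed confidently into a grid that is mostly empty, which is why the sketch discounts such results.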
2. Advanced table extraction techniques
2.1 Accurately locate table areas
    def extract_table_with_area(pdf_path, page='1', table_area=None):
        """Extract tables using precise area coordinates."""
        kwargs = {}
        if table_area is not None:
            # Camelot expects "x1,y1,x2,y2" in PDF points, with the origin at the
            # bottom-left of the page: (x1, y1) is the top-left corner of the
            # table and (x2, y2) the bottom-right corner. These are not percentages.
            kwargs['table_areas'] = [','.join(str(v) for v in table_area)]

        # Without table_areas, Camelot searches the whole page
        tables = camelot.read_pdf(pdf_path, pages=page, flavor='stream', **kwargs)
        print(f"Detected {len(tables)} table(s) in the specified area")

        # Show the first table
        if len(tables) > 0:
            print("\nTable preview:")
            print(tables[0].df.head())

        return tables

    # Usage example - a table in roughly the middle of the page
    # (coordinates assume a US Letter page of 612 x 792 points)
    tables = extract_table_with_area("financial_report.pdf", table_area=[60, 550, 550, 240])
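Camelot's `table_areas` strings are written in PDF points with the origin at the bottom-left corner of the page, which is easy to get wrong when thinking in "percent of the page from the top". The helper below is a small sketch that converts such percentages into the expected string. The function name and the percentage convention are assumptions made for illustration, and the page size must be supplied by the caller.

```python
def percent_area_to_pdf_points(area_pct, page_width, page_height):
    """Convert (left, top, right, bottom) percentages, measured from the
    top-left corner of the page, into an 'x1,y1,x2,y2' string in PDF points."""
    left, top, right, bottom = area_pct
    if not (0 <= left < right <= 100 and 0 <= top < bottom <= 100):
        raise ValueError("expected 0 <= left < right <= 100 and 0 <= top < bottom <= 100")
    x1 = page_width * left / 100
    x2 = page_width * right / 100
    # The PDF y axis grows upward, so distances from the top are flipped.
    y1 = page_height * (100 - top) / 100
    y2 = page_height * (100 - bottom) / 100
    return f"{x1:.0f},{y1:.0f},{x2:.0f},{y2:.0f}"


# Middle band of a hypothetical 600 x 800 point page
print(percent_area_to_pdf_points((10, 30, 90, 70), 600, 800))
```

The resulting string can be passed as one element of the `table_areas` list.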
2.2 Handling complex tables
    def extract_complex_tables(pdf_path, page='1'):
        """Advanced configuration for handling complex tables."""
        # Lattice: complex tables with borders
        lattice_tables = camelot.read_pdf(
            pdf_path,
            pages=page,
            flavor='lattice',
            line_scale=40,            # Line detection sensitivity
            process_background=True,  # Also detect lines drawn in the background
            line_tol=2                # Tolerance for merging close lines
        )

        # Stream: complex tables without borders
        stream_tables = camelot.read_pdf(
            pdf_path,
            pages=page,
            flavor='stream',
            edge_tol=500,   # Edge tolerance
            row_tol=10,     # Row tolerance
            column_tol=10   # Column tolerance
        )

        print(f"Lattice method: detected {len(lattice_tables)} table(s)")
        print(f"Stream method: detected {len(stream_tables)} table(s)")

        # Choose the best result, guarding against an empty result from either method
        if len(lattice_tables) == 0:
            return stream_tables
        if len(stream_tables) == 0:
            return lattice_tables
        if lattice_tables[0].accuracy > stream_tables[0].accuracy:
            return lattice_tables
        return stream_tables

    # Usage example
    complex_tables = extract_complex_tables("complex_financial_report.pdf")
2.3 Table visualization and debugging
    def visualize_table_extraction(pdf_path, page='1'):
        """Visualize the table extraction process to help debug and optimize."""
        # Extract tables (the default flavor is lattice)
        tables = camelot.read_pdf(pdf_path, pages=page)

        # Check whether any table was extracted
        if len(tables) == 0:
            print("No table detected")
            return None

        # Get the first table
        table = tables[0]
        print(f"Table shape: {table.shape}")
        print(f"Accuracy: {table.accuracy}")

        # Draw the table structure
        fig = camelot.plot(table, kind='grid')
        fig.suptitle(f"Table grid structure - Accuracy: {table.accuracy}")
        fig.savefig('table_grid.png')
        plt.close(fig)

        # Draw the table contour
        fig = camelot.plot(table, kind='contour')
        fig.suptitle(f"Table cell structure - Whitespace: {table.whitespace}")
        fig.savefig('table_contour.png')
        plt.close(fig)

        # Draw detected lines (only available for the lattice method)
        if table.flavor == 'lattice':
            fig = camelot.plot(table, kind='line')
            fig.suptitle("Table line detection")
            fig.savefig('table_lines.png')
            plt.close(fig)

        print("Visualizations saved")
        return tables

    # Usage example
    visualized_tables = visualize_table_extraction("quarterly_report.pdf")
3. Table data processing and cleaning
3.1 Table data cleaning
    import numpy as np

    def clean_table_data(table):
        """Clean table data extracted from a PDF."""
        # Get the DataFrame
        df = table.df.copy()

        # 1. Strip excess whitespace (string columns only)
        for col in df.columns:
            if df[col].dtype == object:
                df[col] = df[col].str.strip()

        # 2. Replace blank cells with NaN
        df = df.replace('', np.nan)

        # 3. Handle merged cells by filling downward
        df = df.ffill()

        # 4. Detect and remove page header or footer rows
        #    (usually the first or last row)
        if df.shape[0] > 2:
            # Check whether the first row is a page header
            if df.iloc[0].astype(str).str.contains('Page|Date').any():
                df = df.iloc[1:]
            # Check whether the last row is a footer
            if df.iloc[-1].astype(str).str.contains('Total').any():
                df = df.iloc[:-1]

        # 5. Reset the index
        df = df.reset_index(drop=True)

        # 6. Use the first row as column names (optional)
        # df.columns = df.iloc[0]
        # df = df.iloc[1:].reset_index(drop=True)

        return df

    # Usage example
    tables = camelot.read_pdf("financial_data.pdf")
    if tables:
        cleaned_df = clean_table_data(tables[0])
        print(cleaned_df.head())
3.2 Multi-table merge
    def merge_tables(tables, merge_method='vertical'):
        """Merge multiple tables."""
        if not tables or len(tables) == 0:
            return None

        dfs = [table.df for table in tables]

        if merge_method == 'vertical':
            # Vertical merge (for tables that continue across pages)
            merged_df = pd.concat(dfs, ignore_index=True)
        elif merge_method == 'horizontal':
            # Horizontal merge (for tables split into column groups)
            merged_df = pd.concat(dfs, axis=1)
        else:
            raise ValueError("The merge method must be 'vertical' or 'horizontal'")

        # Clean the merged data:
        # drop exact duplicate rows (most likely repeated table headers)
        merged_df = merged_df.drop_duplicates()

        return merged_df

    # Usage example - merge a table that spans several pages
    tables = camelot.read_pdf("multipage_report.pdf", pages='1-3')
    if tables:
        merged_table = merge_tables(tables, merge_method='vertical')
        print(f"Merged table size: {merged_table.shape}")
        print(merged_table.head())
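Dropping all duplicate rows, as `merge_tables` does, also removes genuine data rows that happen to be identical. A more conservative variant is sketched below on plain lists of rows so that it runs without any PDF: it removes only rows equal to the first page's header. The name `stack_pages` and the assumption that the first row of the first page is the header are illustrative.

```python
def stack_pages(pages):
    """Stack row lists from consecutive pages, keeping the header row once.

    `pages` is a list of tables; each table is a list of rows (lists of strings).
    The first row of the first non-empty page is assumed to be the header.
    """
    merged = []
    header = None
    for rows in pages:
        if not rows:
            continue
        if header is None:
            header = rows[0]
            merged.append(header)
            body = rows[1:]
        else:
            # Skip the first row only when it repeats the header.
            body = rows[1:] if rows[0] == header else rows
        merged.extend(body)
    return merged


pages = [
    [["Region", "Sales"], ["North", "10"]],
    [["Region", "Sales"], ["South", "20"]],
    [["North", "10"]],  # a legitimate repeated data row is kept
]
for row in stack_pages(pages):
    print(row)
```

With Camelot output, each page's rows could be obtained from the table's DataFrame via `df.values.tolist()`.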
3.3 Table data type conversion
    def convert_table_datatypes(df):
        """Convert tabular data to the appropriate data types."""
        # Work on a copy of the DataFrame
        df = df.copy()

        for col in df.columns:
            if df[col].dtype != object:
                continue
            values = df[col].astype(str)

            # Remove currency symbols and thousands separators if present
            if values.str.contains(r'[$¥€£]|\d,\d').any():
                values = values.replace(r'[$¥€£,]', '', regex=True)

            # Try to convert the column to a numeric type
            try:
                df[col] = pd.to_numeric(values)
                print(f"Column '{col}' converted to numeric")
                continue
            except (ValueError, TypeError):
                pass

            # Otherwise try to convert to a date type
            try:
                df[col] = pd.to_datetime(df[col])
                print(f"Column '{col}' converted to date type")
            except (ValueError, TypeError):
                # Leave the column as strings
                pass

        return df

    # Usage example
    tables = camelot.read_pdf("sales_report.pdf")
    if tables:
        df = clean_table_data(tables[0])
        typed_df = convert_table_datatypes(df)
        print(typed_df.dtypes)
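The column-level conversion above succeeds or fails for a whole column at once. When only some cells are malformed, a per-cell parser can be easier to reason about. The sketch below uses only the standard library. The name `parse_amount` and the treatment of parentheses as negative values (an accounting convention) are assumptions, not something the extraction step guarantees.

```python
import re

# Currency symbols, thousands separators and whitespace to strip before parsing
_NOISE = re.compile(r"[$¥€£,\s]")


def parse_amount(text):
    """Return a float for strings like '$1,234.50' or '(300)', else None."""
    cleaned = _NOISE.sub("", text)
    # Assumed accounting convention: "(300)" means -300.
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1]
    try:
        value = float(cleaned)
    except ValueError:
        return None
    return -value if negative else value


for cell in ["$1,234.50", "(300)", "€ 2,000", "n/a"]:
    print(repr(cell), "->", parse_amount(cell))
```

Cells that return `None` can then be reviewed separately instead of forcing the whole column to stay as text.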
4. Practical application scenarios
4.1 Extract financial statement data
    def extract_financial_statements(pdf_path, pages='all'):
        """Extract financial statements from an annual report."""
        # Extract all tables
        tables = camelot.read_pdf(
            pdf_path,
            pages=pages,
            flavor='stream',
            edge_tol=500,
            row_tol=10
        )
        print(f"Extracted {len(tables)} table(s) in total")

        # Find the financial statements by keyword
        balance_sheet = None
        income_statement = None
        cash_flow = None

        for table in tables:
            df = table.df
            # Join all cell text so it can be searched for keywords
            text = ' '.join(' '.join(map(str, row)) for row in df.values.tolist()).lower()

            if any(term in text for term in ['balance sheet', 'statement of financial position']):
                balance_sheet = clean_table_data(table)
                print("Found the balance sheet")
            elif any(term in text for term in ['income statement', 'profit and loss statement']):
                income_statement = clean_table_data(table)
                print("Found the income statement")
            elif any(term in text for term in ['cash flow statement', 'cash flow']):
                cash_flow = clean_table_data(table)
                print("Found the cash flow statement")

        return {
            'balance_sheet': balance_sheet,
            'income_statement': income_statement,
            'cash_flow': cash_flow
        }

    # Usage example
    financial_data = extract_financial_statements("annual_report_2022.pdf", pages='10-30')
    for statement_name, df in financial_data.items():
        if df is not None:
            print(f"\n{statement_name}:")
            print(df.head())
4.2 Batch processing of multiple PDFs
    def batch_process_pdfs(pdf_folder, output_folder='extracted_tables'):
        """Process multiple PDF files in a batch and extract all tables."""
        import os
        from pathlib import Path

        # Create the output folder
        Path(output_folder).mkdir(parents=True, exist_ok=True)

        # Get all PDF files
        pdf_files = [f for f in os.listdir(pdf_folder) if f.lower().endswith('.pdf')]

        results = {}
        for pdf_file in pdf_files:
            pdf_path = os.path.join(pdf_folder, pdf_file)
            pdf_name = os.path.splitext(pdf_file)[0]
            print(f"\nProcessing: {pdf_file}")

            # Create a dedicated output folder for this PDF
            pdf_output_folder = os.path.join(output_folder, pdf_name)
            Path(pdf_output_folder).mkdir(exist_ok=True)

            try:
                # Extract tables
                tables = camelot.read_pdf(pdf_path, pages='all')
                print(f"Extracted {len(tables)} table(s) from {pdf_file}")

                # Save each table as a CSV file
                for i, table in enumerate(tables):
                    df = clean_table_data(table)
                    output_path = os.path.join(pdf_output_folder, f"table_{i+1}.csv")
                    df.to_csv(output_path, index=False, encoding='utf-8-sig')

                # Record the result
                results[pdf_file] = {
                    'status': 'success',
                    'tables_count': len(tables),
                    'output_folder': pdf_output_folder
                }
            except Exception as e:
                print(f"An error occurred while processing {pdf_file}: {str(e)}")
                results[pdf_file] = {
                    'status': 'error',
                    'error_message': str(e)
                }

        # Summary report
        success_count = sum(1 for result in results.values() if result['status'] == 'success')
        print(f"\nBatch processing complete. Success: {success_count}/{len(pdf_files)}")

        return results

    # Usage example
    batch_results = batch_process_pdfs("reports_folder", "extracted_data")
4.3 Create an interactive data dashboard
    def create_dashboard_from_tables(tables, output_html='table_dashboard.html'):
        """Create a simple interactive dashboard from extracted tables."""
        import plotly.express as px

        # Make sure we have tables
        if not tables or len(tables) == 0:
            print("No table data is available for creating a dashboard")
            return None

        # For simplicity, use the first table
        df = clean_table_data(tables[0])
        # Camelot returns strings, so try to convert columns to numeric values
        df = convert_table_datatypes(df)

        # Write the dashboard HTML
        with open(output_html, 'w', encoding='utf-8') as f:
            f.write("<html><head>")
            f.write("<title>PDF Table Data Dashboard</title>")
            f.write("<style>body {font-family: Arial; margin: 20px;} .chart {margin: 20px 0; padding: 20px; border: 1px solid #ddd;}</style>")
            f.write("</head><body>")
            f.write("<h1>PDF Table Data Dashboard</h1>")

            # Add the table itself
            f.write("<div class='chart'>")
            f.write("<h2>Extracted Table Data</h2>")
            f.write(df.to_html(classes='dataframe', index=False))
            f.write("</div>")

            # If there are numeric columns, create charts
            numeric_cols = df.select_dtypes(include=['number']).columns
            if len(numeric_cols) > 0:
                # Use the first numeric column as the value
                value_col = numeric_cols[0]

                # Find a plausible category column
                category_col = None
                for col in df.columns:
                    if col != value_col and df[col].dtype == object and df[col].nunique() < len(df) * 0.5:
                        category_col = col
                        break

                if category_col is not None:
                    # Bar chart
                    fig = px.bar(df, x=category_col, y=value_col, title=f"{category_col} vs {value_col}")
                    f.write("<div class='chart'>")
                    f.write(f"<h2>{category_col} vs {value_col}</h2>")
                    f.write(fig.to_html(full_html=False))
                    f.write("</div>")

                    # Pie chart
                    fig = px.pie(df, names=category_col, values=value_col, title=f"{value_col} by {category_col}")
                    f.write("<div class='chart'>")
                    f.write(f"<h2>{value_col} by {category_col} (pie chart)</h2>")
                    f.write(fig.to_html(full_html=False))
                    f.write("</div>")

            f.write("</body></html>")

        print(f"Dashboard created: {output_html}")
        return output_html

    # Usage example
    tables = camelot.read_pdf("sales_by_region.pdf")
    if tables:
        dashboard_path = create_dashboard_from_tables(tables)
5. Advanced configuration and optimization
5.1 Optimize table detection parameters
    def optimize_table_detection(pdf_path, page='1'):
        """Try different detection parameters and evaluate the results."""
        # Parameter combinations to test for each flavor
        configs = {
            'stream': [
                {'edge_tol': 50, 'row_tol': 5, 'column_tol': 5},
                {'edge_tol': 100, 'row_tol': 10, 'column_tol': 10},
                {'edge_tol': 500, 'row_tol': 15, 'column_tol': 15}
            ],
            'lattice': [
                {'process_background': True, 'line_scale': 15},
                {'process_background': True, 'line_scale': 40},
                {'process_background': True, 'line_scale': 60, 'iterations': 1}
            ]
        }

        results = []
        for flavor, flavor_configs in configs.items():
            print(f"\nTesting the {flavor} method...")
            for config in flavor_configs:
                try:
                    tables = camelot.read_pdf(pdf_path, pages=page, flavor=flavor, **config)
                except Exception as e:
                    print(f"Configuration {config} raised an error: {str(e)}")
                    continue

                # Evaluate the result using the first table
                if len(tables) > 0:
                    accuracy = tables[0].accuracy
                    whitespace = tables[0].whitespace
                    print(f"Configuration {config}: Accuracy={accuracy:.2f}, Whitespace={whitespace:.2f}")
                    results.append({
                        'flavor': flavor,
                        'config': config,
                        'tables_found': len(tables),
                        'accuracy': accuracy,
                        'whitespace': whitespace,
                        'tables': tables
                    })

        # Find the best configuration by accuracy
        if results:
            best_result = max(results, key=lambda x: x['accuracy'])
            print(f"\nBest configuration: {best_result['flavor']} method, parameters: {best_result['config']}")
            print(f"Accuracy: {best_result['accuracy']:.2f}, Whitespace: {best_result['whitespace']:.2f}")
            return best_result['tables']

        return None

    # Usage example
    optimized_tables = optimize_table_detection("complex_report.pdf")
5.2 Processing scanned PDFs
    def extract_tables_from_scanned_pdf(pdf_path, page='1'):
        """Extract tables from scanned PDFs (preprocessing required)."""
        import os
        import tempfile
        import cv2
        import numpy as np
        from pdf2image import convert_from_path

        # Convert the PDF page to an image
        images = convert_from_path(pdf_path, first_page=int(page), last_page=int(page))
        if not images:
            print("Cannot convert PDF page to image")
            return None

        # pdf2image returns PIL images in RGB channel order
        image = np.array(images[0])

        # Image preprocessing: grayscale, then binary threshold
        gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
        thresh = cv2.threshold(gray, 150, 255, cv2.THRESH_BINARY)[1]

        # Save the processed image
        with tempfile.NamedTemporaryFile(suffix='.png', delete=False) as tmp:
            temp_image_path = tmp.name
        cv2.imwrite(temp_image_path, thresh)
        print(f"The scanned page has been preprocessed and saved as a temporary image: {temp_image_path}")

        # Note: Camelot reads the original PDF, not the preprocessed image, and it
        # only extracts text embedded in the PDF. A purely scanned page needs OCR
        # first; the image above is only useful for inspecting line quality.
        tables = camelot.read_pdf(
            pdf_path,
            pages=page,
            flavor='lattice',
            process_background=True,
            line_scale=150
        )
        print(f"Extracted {len(tables)} table(s) from the scanned PDF")

        # Optional: remove the temporary file
        os.remove(temp_image_path)

        return tables

    # Usage example
    scanned_tables = extract_tables_from_scanned_pdf("scanned_report.pdf")
5.3 Processing merged cells
    def handle_merged_cells(table):
        """Fill in cells that were merged in the original table."""
        # Get the DataFrame
        df = table.df.copy()

        # Vertically merged cells: fill each empty cell with the value above it
        for col in df.columns:
            prev_value = None
            for idx in df.index:
                if df.at[idx, col] != '':
                    prev_value = df.at[idx, col]
                elif prev_value is not None:
                    df.at[idx, col] = prev_value

        # Horizontally merged cells: fill remaining empty cells from the left
        # neighbor (the first row is skipped because it usually holds headers)
        for r in range(1, df.shape[0]):
            for c in range(1, df.shape[1]):
                if df.iat[r, c] == '' and df.iat[r, c - 1] != '':
                    df.iat[r, c] = df.iat[r, c - 1]

        return df

    # Usage example
    # (for bordered tables, the lattice option copy_text=['v'] or ['h'] is an alternative)
    tables = camelot.read_pdf("report_with_merged_cells.pdf")
    if tables:
        cleaned_df = handle_merged_cells(tables[0])
        print(cleaned_df.head())
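The vertical fill step is easier to verify on a small, hand-written table than on real PDF output. The sketch below applies the same idea to plain lists. The function name `fill_down` is illustrative, and empty strings are assumed to mark cells that belonged to a merged region.

```python
def fill_down(rows, column):
    """Copy the last non-empty value of `column` into the empty cells below it."""
    filled = []
    last = None
    for row in rows:
        row = list(row)  # copy so the input is left untouched
        if row[column] != "":
            last = row[column]
        elif last is not None:
            row[column] = last
        filled.append(row)
    return filled


rows = [["Q1", "Jan"], ["", "Feb"], ["Q2", "Apr"], ["", "May"]]
for row in fill_down(rows, 0):
    print(row)
```

Empty cells above the first value are left alone, since there is nothing to copy into them.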
6. Integrate with other tools
6.1 Deep integration with pandas
    def analyze_extracted_table(table):
        """Analyze extracted tabular data using pandas."""
        # Clean the data and convert data types
        df = clean_table_data(table)
        df = convert_table_datatypes(df)

        # Basic statistical analysis
        print("\n==== Basic statistical analysis ====")
        numeric_cols = df.select_dtypes(include=['number']).columns
        if len(numeric_cols) > 0:
            print(df[numeric_cols].describe())

        # Check for missing values
        print("\n==== Missing value analysis ====")
        missing = df.isnull().sum()
        print(missing[missing > 0])

        # Categorical variable analysis
        print("\n==== Categorical variable analysis ====")
        categorical_cols = df.select_dtypes(include=['object']).columns
        for col in categorical_cols[:3]:  # Show only the first three columns
            value_counts = df[col].value_counts()
            print(f"\n{col}:")
            print(value_counts.head())

        # Correlation analysis
        if len(numeric_cols) >= 2:
            print("\n==== Correlation analysis ====")
            correlation = df[numeric_cols].corr()
            print(correlation)

        return df

    # Usage example
    tables = camelot.read_pdf("sales_data.pdf")
    if tables:
        analyzed_df = analyze_extracted_table(tables[0])
6.2 Visualization with matplotlib and seaborn
    def visualize_table_data(table, output_prefix='table_viz'):
        """Visualize tabular data using matplotlib and seaborn."""
        import matplotlib.pyplot as plt
        import numpy as np
        import seaborn as sns

        # Set the style
        sns.set_theme(style="whitegrid")

        # Clean and convert the data
        df = clean_table_data(table)
        df = convert_table_datatypes(df)

        # Get the numeric columns
        numeric_cols = df.select_dtypes(include=['number']).columns
        if len(numeric_cols) == 0:
            print("No numeric columns are available for visualization")
            return None

        # 1. Heat map - correlation
        if len(numeric_cols) >= 2:
            plt.figure(figsize=(10, 8))
            corr = df[numeric_cols].corr()
            mask = np.triu(np.ones_like(corr, dtype=bool))  # hide the upper triangle
            sns.heatmap(corr, mask=mask, annot=True, fmt=".2f", cmap='coolwarm',
                        square=True, linewidths=.5)
            plt.title('Correlation heat map', fontsize=15)
            plt.tight_layout()
            plt.savefig(f'{output_prefix}_correlation.png')
            plt.close()

        # 2. Bar chart - value distribution
        for col in numeric_cols[:3]:  # Only the first three numeric columns
            plt.figure(figsize=(12, 6))
            sns.barplot(x=df.index, y=df[col])
            plt.title(f'{col} distribution', fontsize=15)
            plt.xticks(rotation=45)
            plt.tight_layout()
            plt.savefig(f'{output_prefix}_{col}_barplot.png')
            plt.close()

        # 3. Box plot - find outliers
        plt.figure(figsize=(12, 6))
        sns.boxplot(data=df[numeric_cols])
        plt.title('Box plot of numeric columns', fontsize=15)
        plt.tight_layout()
        plt.savefig(f'{output_prefix}_boxplot.png')
        plt.close()

        # 4. Scatter plot matrix - relationships between variables
        if len(numeric_cols) >= 2 and len(df) > 5:
            sns.pairplot(df[numeric_cols])
            plt.suptitle('Scatter plot matrix', y=1.02, fontsize=15)
            plt.savefig(f'{output_prefix}_pairplot.png')
            plt.close()

        print(f"Charts saved with prefix: {output_prefix}")
        return df

    # Usage example
    tables = camelot.read_pdf("product_sales.pdf")
    if tables:
        df = visualize_table_data(tables[0], output_prefix='sales_viz')
6.3 Integration with Excel
    def export_tables_to_excel(tables, output_path='extracted_tables.xlsx'):
        """Export extracted tables to separate worksheets of an Excel workbook."""
        import pandas as pd

        # Check whether there are tables
        if not tables or len(tables) == 0:
            print("No tables to export")
            return None

        # Create the Excel writer (requires the xlsxwriter package)
        with pd.ExcelWriter(output_path, engine='xlsxwriter') as writer:
            # Create one worksheet per table
            for i, table in enumerate(tables):
                # Clean the data; Camelot column labels are integers, so use strings
                df = clean_table_data(table)
                df.columns = [str(col) for col in df.columns]

                # Write the data
                sheet_name = f'Table_{i+1}'
                df.to_excel(writer, sheet_name=sheet_name, index=False)

                # Get the worksheet object
                worksheet = writer.sheets[sheet_name]

                # Set column widths from the longest value in each column
                for j, col in enumerate(df.columns):
                    column_width = max(df[col].astype(str).map(len).max(), len(col)) + 2
                    worksheet.set_column(j, j, column_width)

                # Format the range as an Excel table (needs at least one data row)
                if not df.empty:
                    worksheet.add_table(0, 0, df.shape[0], df.shape[1] - 1, {
                        'columns': [{'header': col} for col in df.columns],
                        'style': 'Table Style Medium 9',
                        'header_row': True
                    })

                # Add table metadata below the data
                worksheet.write(df.shape[0] + 2, 0, f"Page number: {table.page}")
                worksheet.write(df.shape[0] + 3, 0, f"Table area: {table._bbox}")
                worksheet.write(df.shape[0] + 4, 0, f"Accuracy score: {table.accuracy}")

        print(f"Tables exported to Excel: {output_path}")
        return output_path

    # Usage example
    tables = camelot.read_pdf("quarterly_report.pdf", pages='all')
    if tables:
        excel_path = export_tables_to_excel(tables, "quarterly_report_tables.xlsx")
7. Performance optimization and best practices
7.1 Handling large PDF documents
    def process_large_pdf(pdf_path, batch_size=5, output_folder='large_pdf_tables'):
        """Process a large PDF in page batches to save memory."""
        import gc
        import os
        from pathlib import Path
        # Assumption: pdfplumber is used here only to count pages;
        # any PDF library that exposes the page count would do.
        import pdfplumber

        # Create the output folder
        Path(output_folder).mkdir(parents=True, exist_ok=True)

        # Get the number of PDF pages first
        with pdfplumber.open(pdf_path) as pdf:
            total_pages = len(pdf.pages)
        print(f"The PDF has {total_pages} page(s) in total")

        # Process the pages in batches
        all_tables_count = 0
        for start_page in range(1, total_pages + 1, batch_size):
            end_page = min(start_page + batch_size - 1, total_pages)
            page_range = f"{start_page}-{end_page}"
            print(f"Processing page range: {page_range}")

            try:
                # Extract the tables of the current batch
                tables = camelot.read_pdf(
                    pdf_path,
                    pages=page_range,
                    flavor='stream',
                    edge_tol=500
                )
                batch_tables_count = len(tables)
                all_tables_count += batch_tables_count
                print(f"Extracted {batch_tables_count} table(s) from pages {page_range}")

                # Save this batch of tables
                for i, table in enumerate(tables):
                    table_index = all_tables_count - batch_tables_count + i + 1
                    df = clean_table_data(table)
                    output_path = os.path.join(output_folder, f"table_{table_index}_p{table.page}.csv")
                    df.to_csv(output_path, index=False, encoding='utf-8-sig')

                # Explicitly release memory
                tables = None
                gc.collect()
            except Exception as e:
                print(f"An error occurred while processing pages {page_range}: {str(e)}")

        print(f"Processing complete: extracted {all_tables_count} table(s), saved to {output_folder}")
        return all_tables_count

    # Usage example
    table_count = process_large_pdf("very_large_report.pdf", batch_size=10)
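The batching arithmetic in `process_large_pdf` can be isolated into a small generator, which makes the page ranges easy to check before running a long extraction. This is a sketch using only the standard library. The name `page_batches` is illustrative.

```python
def page_batches(total_pages, batch_size):
    """Yield page range strings such as '1-5', '6-10', '11' (pages are 1-based)."""
    if total_pages < 0 or batch_size < 1:
        raise ValueError("total_pages must be >= 0 and batch_size >= 1")
    for start in range(1, total_pages + 1, batch_size):
        end = min(start + batch_size - 1, total_pages)
        # A single trailing page is written as "11" rather than "11-11".
        yield f"{start}-{end}" if end > start else str(start)


print(list(page_batches(12, 5)))
```

Each yielded string can be passed as the `pages` argument of `camelot.read_pdf`.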
7.2 Camelot performance tuning
    def optimize_camelot_performance(pdf_path, page='1'):
        """Benchmark Camelot configurations for time, memory and accuracy."""
        import os
        import time
        import psutil

        def measure_performance(func, *args, **kwargs):
            """Measure the execution time and memory usage of a function."""
            process = psutil.Process(os.getpid())
            mem_before = process.memory_info().rss / 1024 / 1024  # MB
            start_time = time.time()
            result = func(*args, **kwargs)
            end_time = time.time()
            mem_after = process.memory_info().rss / 1024 / 1024  # MB
            return result, end_time - start_time, mem_after - mem_before

        # Parameter combinations to test
        configs = [
            {'name': 'Default configuration', 'params': {}},
            {'name': 'Enable background processing', 'params': {'process_background': True}},
            {'name': 'Disable line detection', 'params': {'line_scale': 0}},
            {'name': 'Higher line detection sensitivity', 'params': {'line_scale': 80}}
        ]

        results = []
        for config in configs:
            print(f"\nTesting configuration: {config['name']}")
            # Lattice-only parameters are rejected by the stream flavor;
            # those runs are reported as errors and skipped.
            for flavor in ('lattice', 'stream'):
                try:
                    tables, elapsed, mem = measure_performance(
                        camelot.read_pdf, pdf_path, pages=page, flavor=flavor, **config['params']
                    )
                    accuracy = tables[0].accuracy if len(tables) > 0 else 0
                    results.append({
                        'config_name': config['name'],
                        'method': flavor.capitalize(),
                        'time': elapsed,
                        'memory': mem,
                        'tables_count': len(tables),
                        'accuracy': accuracy
                    })
                    print(f"  {flavor.capitalize()} - time: {elapsed:.2f} s, memory: {mem:.2f} MB, accuracy: {accuracy}")
                except Exception as e:
                    print(f"  An error occurred in the {flavor} method: {str(e)}")

        if not results:
            return None

        # Highest accuracy
        accuracy_best = max(results, key=lambda x: x['accuracy'])
        print(f"\nHighest accuracy configuration: {accuracy_best['config_name']} / {accuracy_best['method']}")
        print(f"  Accuracy: {accuracy_best['accuracy']:.2f}, time: {accuracy_best['time']:.2f} s")

        # Fastest
        time_best = min(results, key=lambda x: x['time'])
        print(f"\nFastest configuration: {time_best['config_name']} / {time_best['method']}")
        print(f"  Time: {time_best['time']:.2f} s, accuracy: {time_best['accuracy']:.2f}")

        # Lowest memory usage
        memory_best = min(results, key=lambda x: x['memory'])
        print(f"\nLowest memory configuration: {memory_best['config_name']} / {memory_best['method']}")
        print(f"  Memory: {memory_best['memory']:.2f} MB, accuracy: {memory_best['accuracy']:.2f}")

        # Balance speed and accuracy; skip runs with zero accuracy to avoid dividing by zero
        usable = [r for r in results if r['accuracy'] > 0]
        if not usable:
            return None
        balanced = min(usable, key=lambda x: x['time'] / x['accuracy'])
        print(f"\nBalanced configuration: {balanced['config_name']} / {balanced['method']}")
        print(f"  Accuracy: {balanced['accuracy']:.2f}, time: {balanced['time']:.2f} s")
        return balanced

    # Usage example
    best_config = optimize_camelot_performance("sample_report.pdf")
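The "balanced" ranking in the benchmark divides time by accuracy, which fails when a run found no table and its accuracy was recorded as 0. The selection step can be tested on made-up measurements, as in the sketch below. The function name `balanced_choice` and the sample numbers are purely illustrative and are not measured results.

```python
def balanced_choice(results):
    """Return the run with the lowest time per accuracy point, or None.

    Runs with zero accuracy (no table found) are ignored, which also
    avoids dividing by zero.
    """
    usable = [run for run in results if run["accuracy"] > 0]
    if not usable:
        return None
    return min(usable, key=lambda run: run["time"] / run["accuracy"])


# Hypothetical measurements for illustration only
runs = [
    {"name": "lattice-default", "time": 2.0, "accuracy": 98.0},
    {"name": "stream-default", "time": 0.5, "accuracy": 80.0},
    {"name": "lattice-no-lines", "time": 0.1, "accuracy": 0},
]
print(balanced_choice(runs)["name"])
```

Other trade-offs, such as a minimum accuracy threshold followed by fastest time, can be expressed by changing the filter and the key function.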
8. Comparison with other tools
8.1 Camelot vs. PyPDF2/PyPDF4
    def compare_with_pypdf(pdf_path, page=0):
        """Compare the extraction capabilities of Camelot with PyPDF2."""
        import PyPDF2

        print("\n===== PyPDF2 extraction results =====")
        try:
            # Extract text using PyPDF2 (page index starts at 0)
            with open(pdf_path, 'rb') as file:
                reader = PyPDF2.PdfReader(file)
                if page < len(reader.pages):
                    text = reader.pages[page].extract_text()
                    print(f"Extracted text ({len(text)} characters):")
                    print(text[:500] + "..." if len(text) > 500 else text)
                    print("\nPyPDF2 cannot recognize the table structure; only plain text can be extracted")
                else:
                    print(f"Page number {page} is out of range")
        except Exception as e:
            print(f"PyPDF2 extraction error: {str(e)}")

        print("\n===== Camelot extraction results =====")
        try:
            # Use Camelot to extract tables (Camelot page numbers start at 1)
            tables = camelot.read_pdf(pdf_path, pages=str(page + 1))
            print(f"Detected {len(tables)} table(s)")

            if len(tables) > 0:
                table = tables[0]
                print(f"Table dimensions: {table.shape}")
                print(f"Accuracy: {table.accuracy}")
                print("\nTable preview:")
                print(table.df.head().to_string())
                print("\nCamelot can identify table structures and preserve row-column relationships")
        except Exception as e:
            print(f"Camelot extraction error: {str(e)}")

        return None

    # Usage example
    compare_with_pypdf("financial_data.pdf")
8.2 Camelot vs. Tabula
    def compare_with_tabula(pdf_path, page='1'):
        """Compare the table extraction capabilities of Camelot and Tabula."""
        try:
            import tabula
        except ImportError:
            print("Please install tabula-py: pip install tabula-py")
            return None

        tabula_tables = None
        camelot_tables = None

        print("\n===== Tabula extraction results =====")
        try:
            # Extract tables using Tabula (returns a list of DataFrames)
            tabula_tables = tabula.read_pdf(pdf_path, pages=page)
            print(f"Detected {len(tabula_tables)} table(s)")
            if len(tabula_tables) > 0:
                tabula_df = tabula_tables[0]
                print(f"Table dimensions: {tabula_df.shape}")
                print("\nTable preview:")
                print(tabula_df.head().to_string())
        except Exception as e:
            print(f"Tabula extraction error: {str(e)}")

        print("\n===== Camelot extraction results =====")
        try:
            # Use Camelot to extract tables
            camelot_tables = camelot.read_pdf(pdf_path, pages=page)
            print(f"Detected {len(camelot_tables)} table(s)")
            if len(camelot_tables) > 0:
                camelot_df = camelot_tables[0].df
                print(f"Table dimensions: {camelot_df.shape}")
                print(f"Accuracy: {camelot_tables[0].accuracy}")
                print("\nTable preview:")
                print(camelot_df.head().to_string())
        except Exception as e:
            print(f"Camelot extraction error: {str(e)}")

        # Compare results only if both tools returned at least one table
        if tabula_tables is not None and camelot_tables is not None:
            if len(tabula_tables) > 0 and len(camelot_tables) > 0:
                tabula_df = tabula_tables[0]
                camelot_df = camelot_tables[0].df

                print("\n===== Comparison results =====")
                print(f"Tabula table size: {tabula_df.shape}")
                print(f"Camelot table size: {camelot_df.shape}")

                # Check whether the same number of columns was extracted
                if tabula_df.shape[1] != camelot_df.shape[1]:
                    print(f"Different column counts: Tabula={tabula_df.shape[1]}, Camelot={camelot_df.shape[1]}")
                    print("This may indicate that one of the tools recognizes the table structure better")

                # Check whether the same number of rows was extracted
                if tabula_df.shape[0] != camelot_df.shape[0]:
                    print(f"Different row counts: Tabula={tabula_df.shape[0]}, Camelot={camelot_df.shape[0]}")
                    print("This may indicate that one of the tools recognizes table boundaries better")

        return None

    # Usage example
    compare_with_tabula("complex_table.pdf")
8.3 Camelot vs. pdfplumber
    def compare_with_pdfplumber(pdf_path, page=0):
        """Compare the table extraction capabilities of Camelot and pdfplumber."""
        try:
            import pdfplumber
        except ImportError:
            print("Please install pdfplumber: pip install pdfplumber")
            return None

        print("\n===== pdfplumber extraction results =====")
        try:
            # Use pdfplumber to extract tables (page index starts at 0)
            with pdfplumber.open(pdf_path) as pdf:
                if page < len(pdf.pages):
                    plumber_page = pdf.pages[page]
                    plumber_tables = plumber_page.extract_tables()
                    print(f"Detected {len(plumber_tables)} table(s)")

                    if len(plumber_tables) > 0:
                        plumber_table = plumber_tables[0]
                        # Treat the first row as the header
                        plumber_df = pd.DataFrame(plumber_table[1:], columns=plumber_table[0])
                        print(f"Table dimensions: {plumber_df.shape}")
                        print("\nTable preview:")
                        print(plumber_df.head().to_string())
                else:
                    print(f"Page number {page} is out of range")
        except Exception as e:
            print(f"pdfplumber extraction error: {str(e)}")

        print("\n===== Camelot extraction results =====")
        try:
            # Use Camelot to extract tables (Camelot page numbers start at 1)
            camelot_tables = camelot.read_pdf(pdf_path, pages=str(page + 1))
            print(f"Detected {len(camelot_tables)} table(s)")

            if len(camelot_tables) > 0:
                camelot_df = camelot_tables[0].df
                print(f"Table dimensions: {camelot_df.shape}")
                print(f"Accuracy: {camelot_tables[0].accuracy}")
                print("\nTable preview:")
                print(camelot_df.head().to_string())
        except Exception as e:
            print(f"Camelot extraction error: {str(e)}")

        return None

    # Usage example
    compare_with_pdfplumber("annual_report.pdf")
9. Troubleshooting and FAQs
9.1 Diagnosing extraction problems
def diagnose_extraction_issues(pdf_path, page='1'):
    """Diagnose and resolve table extraction problems"""
    # Check whether the PDF file is accessible
    try:
        with open(pdf_path, 'rb') as f:
            pass
    except Exception as e:
        print(f"Unable to access the PDF file: {str(e)}")
        return

    # Check whether it is a scanned PDF
    import fitz  # PyMuPDF
    try:
        doc = fitz.open(pdf_path)
        page_obj = doc[int(page) - 1]  # PyMuPDF page indexes start at 0
        text = page_obj.get_text()
        if len(text.strip()) < 50:
            print("This may be a scanned or image-only PDF")
            print("Suggestion: Use OCR software to convert it to a searchable PDF first")
        # Check page rotation
        rotation = page_obj.rotation
        if rotation != 0:
            print(f"The page is rotated by {rotation} degrees")
            print("Suggestion: Use PyMuPDF or another tool to rotate the page to the normal orientation first")
        doc.close()
    except Exception as e:
        print(f"An error occurred while checking the PDF format: {str(e)}")

    # Try different extraction methods
    print("\nTrying different Camelot configurations...")

    # Try the Lattice method
    try:
        print("\nUsing the Lattice method:")
        lattice_tables = camelot.read_pdf(
            pdf_path, pages=page, flavor='lattice'
        )
        if len(lattice_tables) > 0:
            print(f"Successfully extracted {len(lattice_tables)} table(s)")
            print(f"Accuracy: {lattice_tables[0].accuracy}")
        else:
            print("No tables detected")
            print("Suggestion: Try adjusting the line_scale parameter and the table areas")
    except Exception as e:
        print(f"An error occurred in the Lattice method: {str(e)}")

    # Try the Stream method
    try:
        print("\nUsing the Stream method:")
        stream_tables = camelot.read_pdf(
            pdf_path, pages=page, flavor='stream'
        )
        if len(stream_tables) > 0:
            print(f"Successfully extracted {len(stream_tables)} table(s)")
            print(f"Accuracy: {stream_tables[0].accuracy}")
        else:
            print("No tables detected")
            print("Suggestion: Try specifying a table area")
    except Exception as e:
        print(f"An error occurred in the Stream method: {str(e)}")

    # General recommendations
    print("\n==== General recommendations ====")
    print("1. If both methods fail, try specifying the table area")
    print("2. For PDFs with visible table lines, use the Lattice method first and adjust line_scale")
    print("3. For PDFs without table lines, use the Stream method first and adjust the edge tolerance (edge_tol)")
    print("4. Try converting the PDF pages to images and preprocessing them with OpenCV before extracting")
    print("5. If the PDF is a scan, consider processing it with OCR software first")
    return None

# Usage example
diagnose_extraction_issues("problematic_report.pdf")
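The diagnostic function prints the accuracy of each method and leaves the choice to the reader. As a rough sketch of how that choice could be automated, the function below picks the flavor with the higher mean accuracy, ignoring tables that are mostly whitespace. It assumes each table is summarised by a dictionary with 'accuracy' and 'whitespace' keys, as in Camelot's table.parsing_report. The function name pick_flavor and the threshold of 50 are arbitrary choices for this example, and the accuracy scores of Lattice and Stream are not guaranteed to be directly comparable.

```python
def pick_flavor(reports_by_flavor, max_whitespace=50.0):
    """Return the flavor with the best mean accuracy, or None if nothing is usable.

    reports_by_flavor maps a flavor name to a list of parsing-report dicts.
    Tables whose whitespace percentage exceeds max_whitespace are ignored.
    Hypothetical heuristic, not part of Camelot.
    """
    best_flavor, best_score = None, -1.0
    for flavor, reports in reports_by_flavor.items():
        usable = [r for r in reports if r["whitespace"] <= max_whitespace]
        if not usable:
            continue
        score = sum(r["accuracy"] for r in usable) / len(usable)
        if score > best_score:
            best_flavor, best_score = flavor, score
    return best_flavor


# Example with made-up reports instead of a real PDF
reports = {
    "lattice": [],
    "stream": [{"accuracy": 92.0, "whitespace": 10.0}],
}
print(pick_flavor(reports))
```

With real data, the input could be built as {f: [t.parsing_report for t in camelot.read_pdf(path, pages="1", flavor=f)] for f in ("lattice", "stream")}. The result is only a starting point; a visual check of the extracted table is still advisable.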
9.2 Common Errors and Solutions
def common_errors_guide():
    """Print a guide to resolving common Camelot errors"""
    errors = {
        "ImportError: No module named 'cv2'": {
            "reason": "The OpenCV dependency is missing",
            "solution": "Run pip install opencv-python"
        },
        "File does not exist": {
            "reason": "The file path is wrong",
            "solution": "Check that the file path is correct, including case and spaces"
        },
        "OCR engine not reachable": {
            "reason": "OCR was attempted but Tesseract is not installed",
            "solution": "Install Tesseract OCR and make sure it is on the system path"
        },
        "Invalid page range specified": {
            "reason": "The specified page number is outside the PDF's page range",
            "solution": "Make sure the page number is within the document's page count; Camelot page numbers start at 1"
        },
        "Unable to process background": {
            "reason": "Background processing failed, usually because of a Ghostscript problem",
            "solution": "Check that Ghostscript is installed correctly, or try disabling background processing (process_background=False)"
        },
        "No tables found on page": {
            "reason": "Camelot cannot detect a table on the specified page",
            "solution": [
                "1. Try the other extraction method (lattice or stream)",
                "2. Manually specify the table area",
                "3. Adjust detection parameters (line_scale, edge_tol, etc.)",
                "4. Check whether the PDF is a scan; if so, process it with OCR first"
            ]
        }
    }

    print("==== Common Camelot errors and solutions ====\n")
    for error, info in errors.items():
        print(f"Error: {error}")
        print(f"Reason: {info['reason']}")
        # A solution is either a single string or a list of steps
        if isinstance(info['solution'], list):
            print("Solution:")
            for solution in info['solution']:
                print(f"  {solution}")
        else:
            print(f"Solution: {info['solution']}")
        print()

    print("==== General advice ====")
    print("1. Always use the latest version of Camelot and its dependencies")
    print("2. For complex tables, analyze the table structure and specify the area manually")
    print("3. Verify table boundary detection using visualization tools")
    print("4. For large PDFs, consider processing pages in batches")
    print("5. If one extraction method fails, try the other")
    return None

# Usage example
common_errors_guide()
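The "Invalid page range specified" entry can often be avoided by validating the page string before calling Camelot. The sketch below is a minimal, standard-library-only check. It assumes the page string uses the forms described in Camelot's documentation (comma-separated numbers, ranges such as '3-5' or '4-end', or 'all') and that the page count has already been obtained elsewhere, for example from PyMuPDF. The function name expand_pages is made up for this illustration.

```python
def expand_pages(spec, page_count):
    """Expand a Camelot-style page string into a sorted list of 1-based page numbers.

    Raises ValueError if any page falls outside 1..page_count.
    Hypothetical helper, not part of Camelot.
    """
    if spec == "all":
        return list(range(1, page_count + 1))

    pages = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start = int(start_text)
            end = page_count if end_text == "end" else int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end > page_count or start > end:
            raise ValueError(f"Page range {part!r} is outside 1-{page_count}")
        pages.extend(range(start, end + 1))
    return sorted(set(pages))


# Example: a 5-page document
print(expand_pages("1,3-4", 5))
```

Calling this first turns a vague extraction failure into an immediate and specific error message, which is particularly helpful when processing many files in a batch.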
10. Summary and Outlook
As a specialized PDF table extraction tool, Camelot provides a powerful solution for data analysts and developers. With the techniques described in this article, you can:
- Extract table data accurately from PDF documents, including complex tables (scanned documents must be converted with OCR first, since Camelot works on text-based PDFs)
- Choose the most suitable extraction method (Lattice or Stream) for each type of table
- Clean and process extracted table data to solve common problems such as merged cells
- Integrate extraction into the data analysis workflow alongside pandas, matplotlib and other tools
- Optimize extraction performance and process large PDF documents
- Create automated data extraction pipelines to batch process multiple PDF files
As the demand for data analysis grows, extracting table data from PDFs is becoming increasingly important. In the future, we can expect the following trends:
- Better table detection and structure understanding through deep learning
- Better handling of complex layouts and multilingual tables
- Smarter data type recognition and semantic understanding
- Deeper integration with automation workflow platforms
- Wider availability of cloud services and APIs that make table extraction more convenient
Mastering PDF table extraction not only improves work efficiency, but also unlocks business value from data that used to be "locked" in PDF files. I hope this article helps you use Camelot to obtain table data from PDF documents efficiently and accurately.
Appendix: Table extraction parameter reference
# Lattice method parameter reference
lattice_params = {
    'line_scale': 15,             # Line size scaling factor; larger values detect shorter lines (too large and text may be detected as lines)
    'copy_text': None,            # Directions ('h', 'v') in which text in a spanning cell is copied
    'shift_text': ['l', 't'],     # Directions ('l', 'r', 't', 'b') in which text in a spanning cell flows
    'line_tol': 2,                # Tolerance for merging close line segments
    'joint_tol': 2,               # Tolerance for matching detected lines and joints
    'threshold_blocksize': 15,    # Block size of the adaptive threshold
    'threshold_constant': -2,     # Constant of the adaptive threshold
    'iterations': 0,              # Number of iterations of the morphological operation
    'resolution': 300,            # DPI of the PDF-to-PNG conversion
    'process_background': False,  # Whether to process background lines
    'table_areas': [],            # List of table areas, each a string "x1,y1,x2,y2"
    'table_regions': []           # List of regions to search for tables, same format
}

# Stream method parameter reference
stream_params = {
    'table_areas': [],            # List of table areas, each a string "x1,y1,x2,y2"
    'columns': [],                # List of strings with comma-separated column x-coordinates
    'row_tol': 2,                 # Row tolerance
    'column_tol': 0,              # Column tolerance
    'edge_tol': 50,               # Edge tolerance
    'split_text': False,          # Whether to split text that spans cells (experimental)
    'flag_size': False,           # Whether to flag text by font size
    'strip_text': '',             # Characters to strip from the text
    # The keys below are not among the Stream parameters listed in Camelot's
    # API reference; confirm that your installed version accepts them before
    # passing them to read_pdf
    'edge_segment_counts': 50,    # Number of line segments used to detect table edges
    'min_columns': 1,             # Minimum number of columns
    'max_columns': 0,             # Maximum number of columns, 0 means no limit
    'split_columns': False,       # Whether to split columns (experimental)
    'process_background': False,  # Whether to process background lines
    'line_margin': 2,             # Line detection interval tolerance
    'joint_tol': 2,               # Joint tolerance
    'threshold_blocksize': 15,    # Block size of the adaptive threshold
    'threshold_constant': -2,     # Constant of the adaptive threshold
    'iterations': 0,              # Number of iterations of the morphological operation
    'resolution': 300             # DPI of the PDF-to-PNG conversion
}
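Reference dictionaries like the ones above are usually passed to camelot.read_pdf as keyword arguments, but an unsupported key or an empty placeholder value can cause an error or change the result. The sketch below shows one cautious way to prepare such a dictionary. The allow-lists are an assumption based on the Lattice and Stream parameters in the camelot-py API reference and should be checked against the installed version. The function name split_params is hypothetical.

```python
# Assumed allow-lists; verify against the API reference of your Camelot version
ALLOWED_KEYS = {
    "lattice": {
        "table_areas", "table_regions", "process_background", "line_scale",
        "copy_text", "shift_text", "split_text", "flag_size", "strip_text",
        "line_tol", "joint_tol", "threshold_blocksize", "threshold_constant",
        "iterations", "resolution",
    },
    "stream": {
        "table_areas", "table_regions", "columns", "split_text", "flag_size",
        "strip_text", "edge_tol", "row_tol", "column_tol",
    },
}


def split_params(params, flavor):
    """Return (accepted, rejected) for a parameter dictionary and a flavor.

    accepted holds allowed keys whose value is not an empty-list placeholder;
    rejected is a sorted list of keys that are not in the allow-list.
    """
    if flavor not in ALLOWED_KEYS:
        raise ValueError(f"Unknown flavor: {flavor!r}")
    allowed = ALLOWED_KEYS[flavor]
    accepted = {
        key: value for key, value in params.items()
        if key in allowed and value != []
    }
    rejected = sorted(set(params) - allowed)
    return accepted, rejected


# Example with a small hand-written dictionary
accepted, rejected = split_params(
    {"row_tol": 2, "columns": [], "max_columns": 0, "edge_tol": 50}, "stream"
)
print(accepted, rejected)
```

The accepted dictionary can then be used as camelot.read_pdf(path, flavor="stream", **accepted), while the rejected list shows which keys were left out and may need a second look.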
By mastering these Camelot techniques, you will be able to extract table data efficiently from a wide range of PDF documents, providing strong support for data analysis and automation workflows.