In the data-driven era, Excel remains the tool many enterprises use to store core data, but manual spreadsheet workflows slow dramatically once datasets exceed roughly 10,000 rows. Python's Pandas library has become the first choice for automated Excel data analysis thanks to its vectorized computation, memory optimizations, and rich data-processing interfaces. Through technical explanation and practical cases, this article shows how roughly 50 lines of code can complete work that takes hours of traditional Excel operation.
1. Environment setup and data reading
1.1 Basic environment configuration
```bash
# Recommended environment: the Anaconda distribution (Pandas/OpenPyXL already included)
# or install via pip:
pip install pandas openpyxl xlrd
```
Key dependencies:
- openpyxl: reads and writes the .xlsx format
- xlrd: reads the legacy .xls format (versions 2.0+ no longer support .xlsx)
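Because xlrd 2.0+ dropped .xlsx support, it can help to pin the parsing engine explicitly when both formats appear in one workflow. A minimal sketch (file names are placeholders):

```python
import pandas as pd

# Choose the engine to match the file extension (paths are placeholders)
df_new = pd.read_excel('data.xlsx', engine='openpyxl')  # modern .xlsx files
df_old = pd.read_excel('data.xls', engine='xlrd')       # legacy .xls files
```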
1.2 Efficient data loading techniques
```python
import pandas as pd

# Basic reading
df = pd.read_excel('sales_data.xlsx', sheet_name='Sheet1')

# Advanced parameter example
df = pd.read_excel(
    'large_file.xlsx',
    nrows=10000,                 # read only the first 10,000 rows
    usecols='C:F',               # read columns C through F
    dtype={'Order number': str}  # force a column's data type
)
```
Performance note: when reading 100,000 rows, Pandas is typically 8-12x faster than equivalent Excel VBA, with roughly 60% lower memory usage.
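Figures like these depend heavily on hardware and file layout, so it is worth timing your own files; a minimal sketch (the file name is a placeholder):

```python
import time
import pandas as pd

start = time.perf_counter()
df = pd.read_excel('sales_data.xlsx')  # substitute your own file
elapsed = time.perf_counter() - start
print(f"Loaded {len(df):,} rows in {elapsed:.2f}s")
```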
2. Core techniques for data cleaning
2.1 Missing value processing matrix
| Scenario | Solution | Pandas implementation |
|---|---|---|
| Missing numeric values | Fill with mean/median | `df['col'].fillna(df['col'].mean())` |
| Missing categorical values | Fill with mode | `df['col'].fillna(df['col'].mode().iloc[0])` |
| Missing key fields | Drop the whole row | `df.dropna(subset=['Order Amount'])` |
| Missing time-series values | Forward fill | `df.fillna(method='ffill')` |
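A small self-contained demonstration of these four strategies (the column names are illustrative, not from the original dataset):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Order Amount': [100.0, np.nan, 250.0, 80.0],
    'Region': ['North', None, 'North', 'South'],
    'Stock': [5, np.nan, np.nan, 9],
})

df['Order Amount'] = df['Order Amount'].fillna(df['Order Amount'].mean())  # mean fill
df['Region'] = df['Region'].fillna(df['Region'].mode().iloc[0])            # mode fill
df['Stock'] = df['Stock'].ffill()            # forward fill (same as fillna(method='ffill'))
df = df.dropna(subset=['Order Amount'])      # drop rows missing a key field
```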
Advanced tip: conditional filling with `where`
```python
# Floor negative inventory values at zero
df['Inventory'] = df['Inventory'].where(df['Inventory'] > 0, 0)
```
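An equivalent and arguably more readable idiom for this particular case is `df['Inventory'].clip(lower=0)`, which floors values at zero directly; `where` remains the more general tool when the replacement condition is arbitrary.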
2.2 Handling duplicate values
```python
# Detect duplicates
duplicates = df[df.duplicated(subset=['Order number', 'Product ID'])]

# Smart deduplication (keep the most recent record)
df.sort_values('Order time', inplace=True)
df.drop_duplicates(subset=['Order number'], keep='last', inplace=True)
```
2.3 Data type conversion
```python
# String to date (handles Excel's messy date formats)
df['Order Date'] = pd.to_datetime(
    df['Order Date'],
    format='%Y/%m/%d',  # specify the format explicitly
    errors='coerce'     # invalid values become NaT
)

# Numeric ID normalization (avoids scientific-notation mangling);
# the original call was truncated, so zero-padding with str.zfill is assumed here
df['Product ID'] = df['Product ID'].astype('str').str.zfill(10)
```
3. Practical data processing cases
3.1 Sales pivot table analysis
Requirement: compute total sales, order count, and average order value for each region and product category
```python
pivot = df.pivot_table(
    index='Sales Area',
    columns='Product Category',
    values=['Order Amount', 'Order number'],  # list both columns when aggfunc is a dict
    aggfunc={'Order Amount': 'sum', 'Order number': 'count'},
    fill_value=0
)

# Average order value = sales / order count
pivot['Average order value'] = pivot['Order Amount'] / pivot['Order number']
```
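Because both `columns` and a dict `aggfunc` are used, the result carries MultiIndex columns; flattening them simplifies later export. A minimal sketch (the output path is a placeholder):

```python
# Flatten ('metric', 'category') column pairs into single strings for export
pivot.columns = ['_'.join(map(str, col)).strip('_') for col in pivot.columns]
pivot.to_excel('pivot_report.xlsx')  # output path is a placeholder
```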
3.2 Outlier detection
Methodology:
- Numeric columns: standard deviation method (values beyond 3σ flagged as anomalies)
- Categorical variables: chi-square test
```python
# Numeric outlier detection via z-scores
z_scores = (df['Order Amount'] - df['Order Amount'].mean()) / df['Order Amount'].std()
outliers = df[z_scores.abs() > 3]

# Categorical anomaly profiling (requires pandas-profiling)
# pip install pandas-profiling
import pandas_profiling

profile = pandas_profiling.ProfileReport(df)
profile.to_file("profile_report.html")  # output path is a placeholder
```
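The chi-square test mentioned above is not shown in the original snippet; a minimal sketch using `scipy.stats.chi2_contingency` (the column names are illustrative):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Test whether two categorical columns are independent
contingency = pd.crosstab(df['Customer Level'], df['Sales Area'])
chi2, p_value, dof, expected = chi2_contingency(contingency)
if p_value < 0.05:
    print("Significant association detected; inspect the cells driving it")
```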
3.3 Cross-table join analysis
Scenario: merge the order details sheet with the customer information sheet
```python
orders = pd.read_excel('orders.xlsx')        # file names are placeholders
customers = pd.read_excel('customers.xlsx')

# Left join (keep all orders)
merged = pd.merge(
    orders,
    customers[['Customer ID', 'Customer Level', 'Region']],
    on='Customer ID',
    how='left'
)
```
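One optional safeguard worth knowing: `pd.merge` accepts a `validate` argument that raises if the join keys are not as unique as expected, catching duplicated customer rows before they silently inflate the result:

```python
# Raise MergeError if any Customer ID appears more than once in `customers`
merged = pd.merge(
    orders,
    customers[['Customer ID', 'Customer Level', 'Region']],
    on='Customer ID',
    how='left',
    validate='m:1'  # many orders may map to one customer, never the reverse
)
```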
4. Performance optimization tips
4.1 Large file processing solution
```python
# Chunked processing (for 500MB+ files)
# Note: pd.read_excel does not accept a chunksize argument, so for very
# large data it is usually best to convert the sheet to CSV first and
# stream it with pd.read_csv
chunk_size = 50000
chunks = []
for chunk in pd.read_csv('huge_data.csv', chunksize=chunk_size):
    chunk = clean_data(chunk)  # clean_data: user-defined cleaning function
    chunks.append(chunk)
df = pd.concat(chunks, ignore_index=True)
```
4.2 Memory optimization techniques
```python
# Downcast data types to save memory
df['Order number'] = df['Order number'].astype('category')  # category dtype (best for low-cardinality columns)
df['Order Amount'] = df['Order Amount'].astype('float32')   # lower-precision float

# Drop intermediate variables and force garbage collection
del chunks
import gc
gc.collect()
```
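To verify the savings, measure the DataFrame's footprint before and after the conversion (a minimal check):

```python
# Report the DataFrame's true memory footprint in megabytes
print(f"{df.memory_usage(deep=True).sum() / 1_048_576:.1f} MB")
```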
5. Automated report generation
5.1 Basic report output
```python
# Generate an analysis summary
report = f"""
=== Sales Data Overview ===
Total order count: {len(df):,}
Total sales: {df['Order Amount'].sum():,.2f}
Average order value: {df['Order Amount'].mean():,.2f}
"""
with open('report.txt', 'w') as f:  # output path is a placeholder
    f.write(report)

# Export the cleaned data
df.to_excel('cleaned_data.xlsx', index=False)
```
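When a report spans several result tables, `pd.ExcelWriter` bundles them into one workbook; a sketch assuming the `pivot` table from section 3.1 is available (the output path is a placeholder):

```python
# Write several result tables into a single workbook
with pd.ExcelWriter('analysis_report.xlsx') as writer:
    df.to_excel(writer, sheet_name='Cleaned data', index=False)
    pivot.to_excel(writer, sheet_name='Regional pivot')
```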
5.2 Visualization integration (Matplotlib example)
```python
import matplotlib.pyplot as plt

# Monthly sales trend
monthly_sales = df.resample('M', on='Order Date')['Order Amount'].sum()

plt.figure(figsize=(12, 6))
monthly_sales.plot(kind='bar', color='skyblue')
plt.title('Monthly Sales Trend')
plt.xlabel('Month')
plt.ylabel('Sales (10,000 yuan)')
plt.savefig('sales_trend.png', dpi=300, bbox_inches='tight')
```
6. Typical application scenarios
6.1 Financial reconciliation automation
Process:
- Read the bank statement Excel file
- Normalize the date formats
- Match against the company's internal transaction records
- Generate a discrepancy report
Code snippet:
```python
bank_df = pd.read_excel('bank_statement.xlsx')
internal_df = pd.read_excel('internal_records.xlsx')

merged = pd.merge(
    bank_df,
    internal_df,
    left_on=['Transaction time', 'Amount'],
    right_on=['Posting time', 'Posted amount'],
    how='outer',
    indicator=True  # adds a '_merge' column marking each row's source
)
unmatched = merged[merged['_merge'] != 'both']
```
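The final discrepancy-report step is not shown in the original snippet; one straightforward approach (output layout and path are illustrative) labels each unmatched row by source before exporting:

```python
# Label which side each unmatched record came from, then export
unmatched = unmatched.copy()
unmatched['Source'] = unmatched['_merge'].map({
    'left_only': 'Bank only',
    'right_only': 'Internal only',
})
unmatched.to_excel('reconciliation_differences.xlsx', index=False)
```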
6.2 Inventory warning system
Logic:
- Set a safety stock threshold
- Calculate the turnover rate (see the sketch after the code below)
- Generate a replenishment list
```python
import numpy as np

inventory = pd.read_excel('inventory.xlsx')  # file name is a placeholder

# Safety stock calculation (assumes a 7-day procurement lead time)
inventory['Safety stock'] = inventory['Average daily sales'] * 7
inventory['Inventory status'] = np.where(
    inventory['Current inventory'] < inventory['Safety stock'],
    'Replenishment required',
    'Normal'
)
alert = inventory[inventory['Inventory status'] == 'Replenishment required']
```
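The turnover-rate step from the logic above is not in the original snippet; a minimal sketch under the common definition of annualized sales relative to stock on hand (column names follow the code above):

```python
# Annualized turnover: units sold per year relative to units on hand
# (rows with zero stock will produce inf and may need separate handling)
inventory['Turnover rate'] = (
    inventory['Average daily sales'] * 365 / inventory['Current inventory']
)
```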
Conclusion: from tool to mindset upgrade
Pandas is not merely an Excel substitute; it carries a distinct way of thinking about data analysis. By mastering core concepts such as vectorized operations, data alignment, and hierarchical indexing, analysts can:
- Reclaim 80% of the time spent on repetitive operations
- Comfortably process datasets with millions of rows
- Build automated analysis pipelines
Looking ahead, as libraries such as Dask and Modin mature, the Pandas ecosystem will keep pushing past single-machine performance limits, truly opening a new era of data analysis in which Excel skills advance and Python empowers.