In the data-driven era, Excel remains the tool many enterprises use to store core data, but manual spreadsheet workflows slow dramatically once datasets exceed roughly 10,000 rows. Python's Pandas library has become the first choice for automated Excel data analysis thanks to its vectorized computation, memory optimizations, and rich data-processing interfaces. Through technical explanation and practical cases, this article shows how roughly 50 lines of code can complete work that takes hours of traditional Excel operation.
1. Environment setup and data reading
1.1 Basic environment configuration
```bash
# Recommended environment: the Anaconda distribution (Pandas/OpenPyXL already included)
# or install via pip:
pip install pandas openpyxl xlrd
```
Key dependencies:
- openpyxl: reads and writes the .xlsx format
- xlrd: reads the legacy .xls format (versions 2.0+ no longer support .xlsx)
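Because xlrd 2.0+ dropped .xlsx support, it can help to pin the parsing engine explicitly when both formats appear in one workflow. A minimal sketch (file names are placeholders):

```python
import pandas as pd

# Choose the engine to match the file extension (paths are placeholders)
df_new = pd.read_excel('data.xlsx', engine='openpyxl')  # modern .xlsx files
df_old = pd.read_excel('data.xls', engine='xlrd')       # legacy .xls files
```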
1.2 Efficient data loading techniques
```python
import pandas as pd

# Basic reading
df = pd.read_excel('sales_data.xlsx', sheet_name='Sheet1')

# Advanced parameter example
df = pd.read_excel(
    'large_file.xlsx',
    nrows=10000,                 # read only the first 10,000 rows
    usecols='C:F',               # read columns C through F
    dtype={'Order number': str}  # force a column's data type
)
```
Performance note: when reading 100,000 rows, Pandas is typically 8-12x faster than equivalent Excel VBA, with roughly 60% lower memory usage.
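Figures like these depend heavily on hardware and file layout, so it is worth timing your own files; a minimal sketch (the file name is a placeholder):

```python
import time
import pandas as pd

start = time.perf_counter()
df = pd.read_excel('sales_data.xlsx')  # substitute your own file
elapsed = time.perf_counter() - start
print(f"Loaded {len(df):,} rows in {elapsed:.2f}s")
```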
2. Core techniques for data cleaning
2.1 Missing value processing matrix
| Scenario | Solution | Pandas implementation |
|---|---|---|
| Missing numeric values | Fill with mean/median | `df['col'].fillna(df['col'].mean())` |
| Missing categorical values | Fill with mode | `df['col'].fillna(df['col'].mode().iloc[0])` |
| Missing key fields | Drop the whole row | `df.dropna(subset=['Order Amount'])` |
| Missing time-series values | Forward fill | `df.fillna(method='ffill')` |
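A small self-contained demonstration of these four strategies (the column names are illustrative, not from the original dataset):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Order Amount': [100.0, np.nan, 250.0, 80.0],
    'Region': ['North', None, 'North', 'South'],
    'Stock': [5, np.nan, np.nan, 9],
})

df['Order Amount'] = df['Order Amount'].fillna(df['Order Amount'].mean())  # mean fill
df['Region'] = df['Region'].fillna(df['Region'].mode().iloc[0])            # mode fill
df['Stock'] = df['Stock'].ffill()            # forward fill (same as fillna(method='ffill'))
df = df.dropna(subset=['Order Amount'])      # drop rows missing a key field
```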
Advanced tip: conditional filling with `where`
```python
# Floor negative inventory values at zero
df['Inventory'] = df['Inventory'].where(df['Inventory'] > 0, 0)
```
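An equivalent and arguably more readable idiom for this particular case is `df['Inventory'].clip(lower=0)`, which floors values at zero directly; `where` remains the more general tool when the replacement condition is arbitrary.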
2.2 Handling duplicate values
```python
# Detect duplicates
duplicates = df[df.duplicated(subset=['Order number', 'Product ID'])]

# Smart deduplication (keep the most recent record)
df.sort_values('Order time', inplace=True)
df.drop_duplicates(subset=['Order number'], keep='last', inplace=True)
```
2.3 Data type conversion
```python
# String to date (handles Excel's messy date formats)
df['Order Date'] = pd.to_datetime(
    df['Order Date'],
    format='%Y/%m/%d',  # specify the format explicitly
    errors='coerce'     # invalid values become NaT
)

# Numeric ID normalization (avoids scientific-notation mangling);
# the original call was truncated, so zero-padding with str.zfill is assumed here
df['Product ID'] = df['Product ID'].astype('str').str.zfill(10)
```
3. Practical data processing cases
3.1 Sales pivot table analysis
Requirement: compute total sales, order count, and average order value for each region and product category
```python
pivot = df.pivot_table(
    index='Sales Area',
    columns='Product Category',
    values=['Order Amount', 'Order number'],  # list both columns when aggfunc is a dict
    aggfunc={'Order Amount': 'sum', 'Order number': 'count'},
    fill_value=0
)

# Average order value = sales / order count
pivot['Average order value'] = pivot['Order Amount'] / pivot['Order number']
```
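Because both `columns` and a dict `aggfunc` are used, the result carries MultiIndex columns; flattening them simplifies later export. A minimal sketch (the output path is a placeholder):

```python
# Flatten ('metric', 'category') column pairs into single strings for export
pivot.columns = ['_'.join(map(str, col)).strip('_') for col in pivot.columns]
pivot.to_excel('pivot_report.xlsx')  # output path is a placeholder
```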
3.2 Outlier detection
Methodology:
- Numeric columns: standard deviation method (values beyond 3σ flagged as anomalies)
- Categorical variables: chi-square test
```python
# Numeric outlier detection via z-scores
z_scores = (df['Order Amount'] - df['Order Amount'].mean()) / df['Order Amount'].std()
outliers = df[z_scores.abs() > 3]

# Categorical anomaly profiling (requires pandas-profiling)
# pip install pandas-profiling
import pandas_profiling

profile = pandas_profiling.ProfileReport(df)
profile.to_file("profile_report.html")  # output path is a placeholder
```
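The chi-square test mentioned above is not shown in the original snippet; a minimal sketch using `scipy.stats.chi2_contingency` (the column names are illustrative):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Test whether two categorical columns are independent
contingency = pd.crosstab(df['Customer Level'], df['Sales Area'])
chi2, p_value, dof, expected = chi2_contingency(contingency)
if p_value < 0.05:
    print("Significant association detected; inspect the cells driving it")
```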
3.3 Cross-table join analysis
Scenario: merge the order details sheet with the customer information sheet
```python
orders = pd.read_excel('orders.xlsx')        # file names are placeholders
customers = pd.read_excel('customers.xlsx')

# Left join (keep all orders)
merged = pd.merge(
    orders,
    customers[['Customer ID', 'Customer Level', 'Region']],
    on='Customer ID',
    how='left'
)
```
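One optional safeguard worth knowing: `pd.merge` accepts a `validate` argument that raises if the join keys are not as unique as expected, catching duplicated customer rows before they silently inflate the result:

```python
# Raise MergeError if any Customer ID appears more than once in `customers`
merged = pd.merge(
    orders,
    customers[['Customer ID', 'Customer Level', 'Region']],
    on='Customer ID',
    how='left',
    validate='m:1'  # many orders may map to one customer, never the reverse
)
```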
4. Performance optimization tips
4.1 Large file processing solution
```python
# Chunked processing (for 500MB+ files)
# Note: pd.read_excel does not accept a chunksize argument, so for very
# large data it is usually best to convert the sheet to CSV first and
# stream it with pd.read_csv
chunk_size = 50000
chunks = []
for chunk in pd.read_csv('huge_data.csv', chunksize=chunk_size):
    chunk = clean_data(chunk)  # clean_data: user-defined cleaning function
    chunks.append(chunk)
df = pd.concat(chunks, ignore_index=True)
```
4.2 Memory optimization techniques
```python
# Downcast data types to save memory
df['Order number'] = df['Order number'].astype('category')  # category dtype (best for low-cardinality columns)
df['Order Amount'] = df['Order Amount'].astype('float32')   # lower-precision float

# Drop intermediate variables and force garbage collection
del chunks
import gc
gc.collect()
```
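To verify the savings, measure the DataFrame's footprint before and after the conversion (a minimal check):

```python
# Report the DataFrame's true memory footprint in megabytes
print(f"{df.memory_usage(deep=True).sum() / 1_048_576:.1f} MB")
```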
5. Automated report generation
5.1 Basic report output
```python
# Generate an analysis summary
report = f"""
=== Sales Data Overview ===
Total order count: {len(df):,}
Total sales: {df['Order Amount'].sum():,.2f}
Average order value: {df['Order Amount'].mean():,.2f}
"""
with open('report.txt', 'w') as f:  # output path is a placeholder
    f.write(report)

# Export the cleaned data
df.to_excel('cleaned_data.xlsx', index=False)
```
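When a report spans several result tables, `pd.ExcelWriter` bundles them into one workbook; a sketch assuming the `pivot` table from section 3.1 is available (the output path is a placeholder):

```python
# Write several result tables into a single workbook
with pd.ExcelWriter('analysis_report.xlsx') as writer:
    df.to_excel(writer, sheet_name='Cleaned data', index=False)
    pivot.to_excel(writer, sheet_name='Regional pivot')
```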
5.2 Visualization integration (Matplotlib example)
```python
import matplotlib.pyplot as plt

# Monthly sales trend
monthly_sales = df.resample('M', on='Order Date')['Order Amount'].sum()

plt.figure(figsize=(12, 6))
monthly_sales.plot(kind='bar', color='skyblue')
plt.title('Monthly Sales Trend')
plt.xlabel('Month')
plt.ylabel('Sales (10,000 yuan)')
plt.savefig('sales_trend.png', dpi=300, bbox_inches='tight')
```
6. Typical application scenarios
6.1 Financial reconciliation automation
Process:
- Read the bank statement Excel file
- Normalize the date formats
- Match against the company's internal transaction records
- Generate a discrepancy report
Code snippet:
```python
bank_df = pd.read_excel('bank_statement.xlsx')
internal_df = pd.read_excel('internal_records.xlsx')

merged = pd.merge(
    bank_df,
    internal_df,
    left_on=['Transaction time', 'Amount'],
    right_on=['Posting time', 'Posted amount'],
    how='outer',
    indicator=True  # adds a '_merge' column marking each row's source
)
unmatched = merged[merged['_merge'] != 'both']
```
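The final discrepancy-report step is not shown in the original snippet; one straightforward approach (output layout and path are illustrative) labels each unmatched row by source before exporting:

```python
# Label which side each unmatched record came from, then export
unmatched = unmatched.copy()
unmatched['Source'] = unmatched['_merge'].map({
    'left_only': 'Bank only',
    'right_only': 'Internal only',
})
unmatched.to_excel('reconciliation_differences.xlsx', index=False)
```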
6.2 Inventory warning system
Logic:
- Set a safety stock threshold
- Calculate the turnover rate (see the sketch after the code below)
- Generate a replenishment list
```python
import numpy as np

inventory = pd.read_excel('inventory.xlsx')  # file name is a placeholder

# Safety stock calculation (assumes a 7-day procurement lead time)
inventory['Safety stock'] = inventory['Average daily sales'] * 7
inventory['Inventory status'] = np.where(
    inventory['Current inventory'] < inventory['Safety stock'],
    'Replenishment required',
    'Normal'
)
alert = inventory[inventory['Inventory status'] == 'Replenishment required']
```
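The turnover-rate step from the logic above is not in the original snippet; a minimal sketch under the common definition of annualized sales relative to stock on hand (column names follow the code above):

```python
# Annualized turnover: units sold per year relative to units on hand
# (rows with zero stock will produce inf and may need separate handling)
inventory['Turnover rate'] = (
    inventory['Average daily sales'] * 365 / inventory['Current inventory']
)
```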
Conclusion: from tool to mindset upgrade
Pandas is not merely an Excel substitute; it carries a distinct way of thinking about data analysis. By mastering core concepts such as vectorized operations, data alignment, and hierarchical indexing, analysts can:
- Reclaim 80% of the time spent on repetitive operations
- Comfortably process datasets with millions of rows
- Build automated analysis pipelines
Looking ahead, as libraries such as Dask and Modin mature, the Pandas ecosystem will keep pushing past single-machine performance limits, truly opening a new era of data analysis in which Excel skills advance and Python empowers.