1. Preparation
1. Install the necessary libraries
First, install the data processing and Excel libraries:

```shell
pip install pandas openpyxl xlrd
```
Note:
- pandas is the core data processing library
- openpyxl handles Excel files in .xlsx format
- xlrd handles the older .xls format (since xlrd 2.0.0 it no longer supports .xlsx)
2. Prepare Excel files
Suppose we have an Excel file called sales_data.xlsx containing the following data (note that quantity and sales are distinct columns: sales = quantity × unit price):

| date | product | quantity | unit price | sales |
|---|---|---|---|---|
| 2023-01-01 | Product A | 10 | 100 | 1000 |
| 2023-01-01 | Product B | 5 | 200 | 1000 |
| 2023-01-02 | Product A | 8 | 100 | 800 |
| 2023-01-02 | Product C | 12 | 150 | 1800 |
| ... | ... | ... | ... | ... |
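If you want to follow along without an existing spreadsheet, a small script like the following can generate the sample file first (a sketch; writing .xlsx with to_excel requires openpyxl):

```python
import pandas as pd

# Build the sample table shown above and write it out as an Excel file
df = pd.DataFrame({
    'date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
    'product': ['Product A', 'Product B', 'Product A', 'Product C'],
    'quantity': [10, 5, 8, 12],
    'unit price': [100, 200, 100, 150],
})
df['sales'] = df['quantity'] * df['unit price']  # sales = quantity x unit price
df.to_excel('sales_data.xlsx', index=False)
print(df['sales'].tolist())  # → [1000, 1000, 800, 1800]
```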
2. Read Excel files
1. Read with pandas
```python
import pandas as pd

# Read the entire worksheet
df = pd.read_excel('sales_data.xlsx')

# Show the first 5 rows of data
print(df.head())

# Read a specific worksheet (if there are multiple worksheets)
# df = pd.read_excel('sales_data.xlsx', sheet_name='Sheet1')

# Read specific columns only
# df = pd.read_excel('sales_data.xlsx', usecols=['date', 'product', 'quantity'])
```
2. Use openpyxl to read
```python
from openpyxl import load_workbook

# Load the workbook
wb = load_workbook('sales_data.xlsx')

# Get the active worksheet, or select one by name
sheet = wb.active  # or wb['Sheet1']

# Read the data row by row
data = []
for row in sheet.iter_rows(values_only=True):
    data.append(row)

# Convert to a DataFrame (optional)
import pandas as pd
df = pd.DataFrame(data[1:], columns=data[0])  # Assume the first row is the header
```
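To try this pattern without an existing file, here is a minimal round trip, a sketch that writes a tiny workbook (the file name demo.xlsx is made up for the example) and reads it back with the same iter_rows pattern:

```python
from openpyxl import Workbook, load_workbook

# Build a small workbook in memory and save it
wb = Workbook()
ws = wb.active
ws.append(['product', 'sales'])   # header row
ws.append(['Product A', 1000])    # one data row
wb.save('demo.xlsx')

# Read it back row by row
rows = [row for row in load_workbook('demo.xlsx').active.iter_rows(values_only=True)]
print(rows)  # → [('product', 'sales'), ('Product A', 1000)]
```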
3. Basic data operations
1. View data information
```python
# View basic information about the data
print(df.info())

# View a statistical summary
print(df.describe())

# View the column names
print(df.columns)
```
2. Data filtering
```python
# Filter data for a specific date
jan_data = df[df['date'] == '2023-01-01']

# Filter products with quantity greater than 5
high_sales = df[df['quantity'] > 5]

# Filter on multiple criteria
filtered_data = df[(df['date'] >= '2023-01-01') & (df['product'] == 'Product A')]
```
3. Data grouping and aggregation
```python
# Total quantity and total sales grouped by product
product_stats = df.groupby('product').agg({
    'quantity': 'sum',
    'sales': 'sum'
}).reset_index()
print(product_stats)

# Calculate total sales per day
daily_sales = df.groupby('date')['sales'].sum().reset_index()
```
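As a self-contained check of this grouping logic, the sample rows from section 1 can be inlined directly (no Excel file needed):

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['Product A', 'Product B', 'Product A', 'Product C'],
    'quantity': [10, 5, 8, 12],
    'sales': [1000, 1000, 800, 1800],
})

# Product A appears on two dates, so its two rows are combined into one
product_stats = df.groupby('product').agg({'quantity': 'sum', 'sales': 'sum'}).reset_index()
print(product_stats)
```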
4. Data sorting
```python
# Sort by sales in descending order
sorted_data = df.sort_values('sales', ascending=False)

# Sort by date (ascending), then sales (descending)
sorted_data = df.sort_values(['date', 'sales'], ascending=[True, False])
```
4. Data visualization
1. Use matplotlib to draw a chart
```python
import matplotlib.pyplot as plt

# Set a Chinese font (avoids problems displaying Chinese labels)
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

# Bar chart - total sales of each product
product_stats.plot(kind='bar', x='product', y='sales', title='Total sales of each product')
plt.ylabel('sales')
plt.show()

# Line chart - daily sales trend
daily_sales.plot(kind='line', x='date', y='sales', title='Daily Sales Trend')
plt.xlabel('date')
plt.ylabel('sales')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
```
2. Use seaborn for advanced visualization
```shell
pip install seaborn
```

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Set the style
sns.set(style="whitegrid")

# Box plot - sales distribution of each product
plt.figure(figsize=(10, 6))
sns.boxplot(x='product', y='sales', data=df)
plt.title('Sales distribution of each product')
plt.show()

# Heat map - correlation analysis
corr_matrix = df[['quantity', 'unit price', 'sales']].corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Variable correlation heat map')
plt.show()
```
5. Data processing and cleaning
1. Handle missing values
```python
# Check for missing values
print(df.isnull().sum())

# Fill missing values
df_filled = df.fillna({'quantity': 0, 'unit price': df['unit price'].mean()})

# Drop rows containing missing values
df_dropped = df.dropna()
```
2. Data type conversion
```python
# Convert the date column to datetime
df['date'] = pd.to_datetime(df['date'])

# Convert numeric columns
df['quantity'] = pd.to_numeric(df['quantity'], errors='coerce')
df['unit price'] = pd.to_numeric(df['unit price'], errors='coerce')
df['sales'] = pd.to_numeric(df['sales'], errors='coerce')
```
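The effect of errors='coerce' is easy to see on a small standalone Series: unparseable entries become NaN instead of raising an exception:

```python
import pandas as pd

s = pd.to_numeric(pd.Series(['10', 'N/A', '8']), errors='coerce')
print(s.tolist())  # → [10.0, nan, 8.0]
print(s.sum())     # NaN values are skipped in the sum → 18.0
```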
3. Data standardization
```python
from sklearn.preprocessing import StandardScaler

# Select the columns that need to be standardized
features = df[['quantity', 'unit price', 'sales']]

# Standardize (zero mean, unit variance)
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)

# Convert back to a DataFrame
scaled_df = pd.DataFrame(scaled_features, columns=features.columns)
```
6. Advanced analysis techniques
1. Time series analysis
```python
# Make sure the date column is datetime
df['date'] = pd.to_datetime(df['date'])

# Set the date as the index
df.set_index('date', inplace=True)

# Aggregate sales by week
weekly_sales = df.resample('W')['sales'].sum()

# Moving average
df['7-day moving average sales'] = df['sales'].rolling(window=7).mean()
```
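resample can be tried on a small synthetic daily series, which makes the weekly buckets easy to verify by hand:

```python
import pandas as pd

# 14 days of sales of 100 per day
idx = pd.date_range('2023-01-01', periods=14, freq='D')
s = pd.Series([100] * 14, index=idx)

# 'W' buckets end on Sundays, so the 14 days span three weekly labels
weekly = s.resample('W').sum()
print(weekly.tolist())
```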
2. Correlation analysis
```python
# Calculate the correlation matrix
corr_matrix = df[['quantity', 'unit price', 'sales']].corr()

# Visualize the correlations
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Variable correlation heat map')
plt.show()
```
3. Group aggregation and pivot tables

```python
# Group aggregation using groupby
grouped = df.groupby(['product', 'date']).agg({
    'quantity': 'sum',
    'sales': 'sum'
}).reset_index()

# Create a pivot table
pivot_table = df.pivot_table(
    values='sales',
    index='date',
    columns='product',
    aggfunc='sum',
    fill_value=0
)
print(pivot_table)
```
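A pivot table is easy to verify on inlined sample data; missing product/date combinations get fill_value:

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2023-01-01', '2023-01-01', '2023-01-02'],
    'product': ['Product A', 'Product B', 'Product A'],
    'sales': [1000, 1000, 800],
})
pivot = df.pivot_table(values='sales', index='date',
                       columns='product', aggfunc='sum', fill_value=0)
print(pivot)
# Product B has no row on 2023-01-02, so that cell is filled with 0
```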
7. Complete example
Here is a complete analysis process example:
```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Read the data
df = pd.read_excel('sales_data.xlsx')

# 2. Data cleaning
df['date'] = pd.to_datetime(df['date'])
df['quantity'] = pd.to_numeric(df['quantity'], errors='coerce').fillna(0)
df['unit price'] = pd.to_numeric(df['unit price'], errors='coerce').fillna(df['unit price'].mean())
df['sales'] = pd.to_numeric(df['sales'], errors='coerce').fillna(0)

# 3. Basic statistics
print("Basic statistics:")
print(df.describe())

# 4. Group statistics by product
product_stats = df.groupby('product').agg({
    'quantity': 'sum',
    'sales': 'sum',
    'unit price': 'mean'
}).sort_values('sales', ascending=False)
print("\nSales of each product:")
print(product_stats)

# 5. Time series analysis
df.set_index('date', inplace=True)
daily_sales = df.resample('D')['sales'].sum()

# 6. Visualization
plt.figure(figsize=(15, 10))

# Daily sales trend
plt.subplot(2, 2, 1)
daily_sales.plot(title='Daily Sales Trend')
plt.ylabel('sales')

# Sales comparison of each product
plt.subplot(2, 2, 2)
product_stats['sales'].plot(kind='bar', title='Total sales of each product')
plt.ylabel('sales')

# Relationship between quantity and unit price
plt.subplot(2, 2, 3)
sns.scatterplot(data=df, x='unit price', y='quantity', hue='product')
plt.title('Quantity vs. unit price')
plt.xlabel('unit price')
plt.ylabel('quantity')

# Product sales share
plt.subplot(2, 2, 4)
product_stats['sales'].plot(kind='pie', autopct='%1.1f%%', startangle=90)
plt.title('Product sales share')
plt.ylabel('')  # Remove the default ylabel

plt.tight_layout()
plt.show()
```
8. Performance optimization skills
For large Excel files, consider the following optimization methods:
Read only the required columns:

```python
df = pd.read_excel('large_file.xlsx', usecols=['date', 'product', 'sales'])
```
Chunked reading (note that pd.read_excel does not support chunksize, so convert the file to CSV first and chunk with read_csv):

```python
chunk_size = 10000
for chunk in pd.read_csv('very_large_file.csv', chunksize=chunk_size):
    process(chunk)  # Process each chunk of data
```
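Chunked reading can be demonstrated without a large file by using an in-memory CSV buffer; each chunk is an ordinary DataFrame:

```python
import io
import pandas as pd

# A small in-memory CSV stands in for the converted large file
csv = io.StringIO('sales\n' + '\n'.join(str(i) for i in range(10)))

total = 0
for chunk in pd.read_csv(csv, chunksize=4):
    total += chunk['sales'].sum()  # chunks of 4, 4, and 2 rows
print(total)  # → 45
```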
Use more efficient file formats:
- Convert Excel to CSV before processing (usually faster)
- Store intermediate data in Parquet or Feather format
Parallel processing:

```python
import dask.dataframe as dd

# Dask has no read_excel, so convert the Excel file to CSV (or Parquet)
# first, then process the large dataset in parallel
ddf = dd.read_csv('large_file.csv')
result = ddf.groupby('product')['sales'].sum().compute()
```
9. Frequently Asked Questions
Chinese text not displaying correctly:

```python
import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif'] = ['SimHei']  # Set a Chinese font
plt.rcParams['axes.unicode_minus'] = False    # Fix minus-sign display
```
Inconsistent date formats:

```python
# Try multiple date formats: parse with the first format, then fill the
# failures by parsing the remaining strings with the second format
dates = pd.to_datetime(df['date'], format='%Y-%m-%d', errors='coerce')
dates = dates.fillna(pd.to_datetime(df['date'], format='%d/%m/%Y', errors='coerce'))
df['date'] = dates.fillna(pd.to_datetime('1900-01-01'))  # Handle unresolved dates
```
Out-of-memory errors:
- Use the dtype parameter to specify column data types and reduce memory usage
- Process large files in chunks
- Use more efficient file formats
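The dtype tip can be checked directly: narrower numeric types and the category dtype usually cut memory use substantially (a sketch using an in-memory CSV):

```python
import io
import pandas as pd

csv_text = 'quantity,product\n10,Product A\n5,Product B\n8,Product A\n'

default_df = pd.read_csv(io.StringIO(csv_text))
compact_df = pd.read_csv(io.StringIO(csv_text),
                         dtype={'quantity': 'int32', 'product': 'category'})

print(default_df.memory_usage(deep=True).sum())
print(compact_df.memory_usage(deep=True).sum())  # smaller than the default
```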
10. Directions for further analysis
- Predictive analytics:
  - Predict future sales using time series models
  - Apply machine learning models to forecast product demand
- Customer segmentation:
  - Group customers based on purchasing behavior
  - Build an RFM model (recency, frequency, monetary value)
- Anomaly detection:
  - Identify abnormal sales records
  - Detect abnormal patterns in the data
- Geospatial analysis:
  - If the data contains geographic information, visualize it on a map
  - Analyze sales performance across regions
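As one concrete illustration of the directions above, the RFM idea can be sketched with a groupby over a hypothetical transaction log (the column names here are made up for the example):

```python
import pandas as pd

# Hypothetical transaction log: one row per purchase
tx = pd.DataFrame({
    'customer': ['C1', 'C1', 'C2', 'C3', 'C3', 'C3'],
    'date': pd.to_datetime(['2023-01-01', '2023-01-10', '2023-01-05',
                            '2023-01-02', '2023-01-08', '2023-01-09']),
    'amount': [100, 50, 200, 30, 30, 40],
})
now = pd.Timestamp('2023-01-11')

rfm = tx.groupby('customer').agg(
    recency=('date', lambda d: (now - d.max()).days),  # days since last purchase
    frequency=('date', 'count'),                        # number of purchases
    monetary=('amount', 'sum'),                         # total spend
)
print(rfm)
```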
That concludes the detailed steps for analyzing and processing Excel file data with Python. For more on working with Excel data in Python, see my other related articles!