1. Preparation
1. Install the necessary libraries
First, install the data processing and Excel libraries:

```shell
pip install pandas openpyxl xlrd
```
Note:
- pandas is the core data processing library
- openpyxl handles Excel files in .xlsx format
- xlrd handles the older .xls format (since xlrd 2.0.0 it no longer supports .xlsx)
2. Prepare Excel files
Suppose we have an Excel file called sales_data.xlsx containing the following data (note that quantity and sales are distinct columns: sales = quantity × unit price):

| date | product | quantity | unit price | sales |
|---|---|---|---|---|
| 2023-01-01 | Product A | 10 | 100 | 1000 |
| 2023-01-01 | Product B | 5 | 200 | 1000 |
| 2023-01-02 | Product A | 8 | 100 | 800 |
| 2023-01-02 | Product C | 12 | 150 | 1800 |
| ... | ... | ... | ... | ... |
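If you want to follow along without an existing spreadsheet, a small script like the following can generate the sample file first (a sketch; writing .xlsx with to_excel requires openpyxl):

```python
import pandas as pd

# Build the sample table shown above and write it out as an Excel file
df = pd.DataFrame({
    'date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
    'product': ['Product A', 'Product B', 'Product A', 'Product C'],
    'quantity': [10, 5, 8, 12],
    'unit price': [100, 200, 100, 150],
})
df['sales'] = df['quantity'] * df['unit price']  # sales = quantity x unit price
df.to_excel('sales_data.xlsx', index=False)
print(df['sales'].tolist())  # → [1000, 1000, 800, 1800]
```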
2. Read Excel files
1. Read with pandas
```python
import pandas as pd

# Read the entire worksheet
df = pd.read_excel('sales_data.xlsx')

# Show the first 5 rows of data
print(df.head())

# Read a specific worksheet (if there are multiple worksheets)
# df = pd.read_excel('sales_data.xlsx', sheet_name='Sheet1')

# Read specific columns only
# df = pd.read_excel('sales_data.xlsx', usecols=['date', 'product', 'quantity'])
```
2. Use openpyxl to read
```python
from openpyxl import load_workbook

# Load the workbook
wb = load_workbook('sales_data.xlsx')

# Get the active worksheet, or select one by name
sheet = wb.active  # or wb['Sheet1']

# Read the data row by row
data = []
for row in sheet.iter_rows(values_only=True):
    data.append(row)

# Convert to a DataFrame (optional)
import pandas as pd
df = pd.DataFrame(data[1:], columns=data[0])  # Assume the first row is the header
```
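To try this pattern without an existing file, here is a minimal round trip, a sketch that writes a tiny workbook (the file name demo.xlsx is made up for the example) and reads it back with the same iter_rows pattern:

```python
from openpyxl import Workbook, load_workbook

# Build a small workbook in memory and save it
wb = Workbook()
ws = wb.active
ws.append(['product', 'sales'])   # header row
ws.append(['Product A', 1000])    # one data row
wb.save('demo.xlsx')

# Read it back row by row
rows = [row for row in load_workbook('demo.xlsx').active.iter_rows(values_only=True)]
print(rows)  # → [('product', 'sales'), ('Product A', 1000)]
```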
3. Basic data operations
1. View data information
```python
# View basic information about the data
print(df.info())

# View a statistical summary
print(df.describe())

# View the column names
print(df.columns)
```
2. Data filtering
```python
# Filter data for a specific date
jan_data = df[df['date'] == '2023-01-01']

# Filter products with quantity greater than 5
high_sales = df[df['quantity'] > 5]

# Filter on multiple criteria
filtered_data = df[(df['date'] >= '2023-01-01') & (df['product'] == 'Product A')]
```
3. Data grouping and aggregation
```python
# Total quantity and total sales grouped by product
product_stats = df.groupby('product').agg({
    'quantity': 'sum',
    'sales': 'sum'
}).reset_index()
print(product_stats)

# Calculate total sales per day
daily_sales = df.groupby('date')['sales'].sum().reset_index()
```
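As a self-contained check of this grouping logic, the sample rows from section 1 can be inlined directly (no Excel file needed):

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['Product A', 'Product B', 'Product A', 'Product C'],
    'quantity': [10, 5, 8, 12],
    'sales': [1000, 1000, 800, 1800],
})

# Product A appears on two dates, so its two rows are combined into one
product_stats = df.groupby('product').agg({'quantity': 'sum', 'sales': 'sum'}).reset_index()
print(product_stats)
```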
4. Data sorting
```python
# Sort by sales in descending order
sorted_data = df.sort_values('sales', ascending=False)

# Sort by date (ascending), then sales (descending)
sorted_data = df.sort_values(['date', 'sales'], ascending=[True, False])
```
4. Data visualization
1. Use matplotlib to draw a chart
```python
import matplotlib.pyplot as plt

# Set a Chinese font (avoids problems displaying Chinese labels)
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

# Bar chart - total sales of each product
product_stats.plot(kind='bar', x='product', y='sales', title='Total sales of each product')
plt.ylabel('sales')
plt.show()

# Line chart - daily sales trend
daily_sales.plot(kind='line', x='date', y='sales', title='Daily Sales Trend')
plt.xlabel('date')
plt.ylabel('sales')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
```
2. Use seaborn for advanced visualization
```shell
pip install seaborn
```

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Set the style
sns.set(style="whitegrid")

# Box plot - sales distribution of each product
plt.figure(figsize=(10, 6))
sns.boxplot(x='product', y='sales', data=df)
plt.title('Sales distribution of each product')
plt.show()

# Heat map - correlation analysis
corr_matrix = df[['quantity', 'unit price', 'sales']].corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Variable correlation heat map')
plt.show()
```
5. Data processing and cleaning
1. Handle missing values
```python
# Check for missing values
print(df.isnull().sum())

# Fill missing values
df_filled = df.fillna({'quantity': 0, 'unit price': df['unit price'].mean()})

# Drop rows containing missing values
df_dropped = df.dropna()
```
2. Data type conversion
```python
# Convert the date column to datetime
df['date'] = pd.to_datetime(df['date'])

# Convert numeric columns
df['quantity'] = pd.to_numeric(df['quantity'], errors='coerce')
df['unit price'] = pd.to_numeric(df['unit price'], errors='coerce')
df['sales'] = pd.to_numeric(df['sales'], errors='coerce')
```
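The effect of errors='coerce' is easy to see on a small standalone Series: unparseable entries become NaN instead of raising an exception:

```python
import pandas as pd

s = pd.to_numeric(pd.Series(['10', 'N/A', '8']), errors='coerce')
print(s.tolist())  # → [10.0, nan, 8.0]
print(s.sum())     # NaN values are skipped in the sum → 18.0
```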
3. Data standardization
```python
from sklearn.preprocessing import StandardScaler

# Select the columns that need to be standardized
features = df[['quantity', 'unit price', 'sales']]

# Standardize (zero mean, unit variance)
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)

# Convert back to a DataFrame
scaled_df = pd.DataFrame(scaled_features, columns=features.columns)
```
6. Advanced analysis techniques
1. Time series analysis
```python
# Make sure the date column is datetime
df['date'] = pd.to_datetime(df['date'])

# Set the date as the index
df.set_index('date', inplace=True)

# Aggregate sales by week
weekly_sales = df.resample('W')['sales'].sum()

# Moving average
df['7-day moving average sales'] = df['sales'].rolling(window=7).mean()
```
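resample can be tried on a small synthetic daily series, which makes the weekly buckets easy to verify by hand:

```python
import pandas as pd

# 14 days of sales of 100 per day
idx = pd.date_range('2023-01-01', periods=14, freq='D')
s = pd.Series([100] * 14, index=idx)

# 'W' buckets end on Sundays, so the 14 days span three weekly labels
weekly = s.resample('W').sum()
print(weekly.tolist())
```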
2. Correlation analysis
```python
# Calculate the correlation matrix
corr_matrix = df[['quantity', 'unit price', 'sales']].corr()

# Visualize the correlations
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Variable correlation heat map')
plt.show()
```
3. Group aggregation and pivot tables

```python
# Group aggregation using groupby
grouped = df.groupby(['product', 'date']).agg({
    'quantity': 'sum',
    'sales': 'sum'
}).reset_index()

# Create a pivot table
pivot_table = df.pivot_table(
    values='sales',
    index='date',
    columns='product',
    aggfunc='sum',
    fill_value=0
)
print(pivot_table)
```
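A pivot table is easy to verify on inlined sample data; missing product/date combinations get fill_value:

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2023-01-01', '2023-01-01', '2023-01-02'],
    'product': ['Product A', 'Product B', 'Product A'],
    'sales': [1000, 1000, 800],
})
pivot = df.pivot_table(values='sales', index='date',
                       columns='product', aggfunc='sum', fill_value=0)
print(pivot)
# Product B has no row on 2023-01-02, so that cell is filled with 0
```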
7. Complete example
Here is a complete analysis process example:
```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Read the data
df = pd.read_excel('sales_data.xlsx')

# 2. Data cleaning
df['date'] = pd.to_datetime(df['date'])
df['quantity'] = pd.to_numeric(df['quantity'], errors='coerce').fillna(0)
df['unit price'] = pd.to_numeric(df['unit price'], errors='coerce').fillna(df['unit price'].mean())
df['sales'] = pd.to_numeric(df['sales'], errors='coerce').fillna(0)

# 3. Basic statistics
print("Basic statistics:")
print(df.describe())

# 4. Group statistics by product
product_stats = df.groupby('product').agg({
    'quantity': 'sum',
    'sales': 'sum',
    'unit price': 'mean'
}).sort_values('sales', ascending=False)
print("\nSales of each product:")
print(product_stats)

# 5. Time series analysis
df.set_index('date', inplace=True)
daily_sales = df.resample('D')['sales'].sum()

# 6. Visualization
plt.figure(figsize=(15, 10))

# Daily sales trend
plt.subplot(2, 2, 1)
daily_sales.plot(title='Daily Sales Trend')
plt.ylabel('sales')

# Sales comparison of each product
plt.subplot(2, 2, 2)
product_stats['sales'].plot(kind='bar', title='Total sales of each product')
plt.ylabel('sales')

# Relationship between quantity and unit price
plt.subplot(2, 2, 3)
sns.scatterplot(data=df, x='unit price', y='quantity', hue='product')
plt.title('Quantity vs. unit price')
plt.xlabel('unit price')
plt.ylabel('quantity')

# Product sales share
plt.subplot(2, 2, 4)
product_stats['sales'].plot(kind='pie', autopct='%1.1f%%', startangle=90)
plt.title('Product sales share')
plt.ylabel('')  # Remove the default ylabel

plt.tight_layout()
plt.show()
```
8. Performance optimization skills
For large Excel files, consider the following optimization methods:
Read only the required columns:

```python
df = pd.read_excel('large_file.xlsx', usecols=['date', 'product', 'sales'])
```
Chunked reading (note that pd.read_excel does not support chunksize, so convert the file to CSV first and chunk with read_csv):

```python
chunk_size = 10000
for chunk in pd.read_csv('very_large_file.csv', chunksize=chunk_size):
    process(chunk)  # Process each chunk of data
```
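Chunked reading can be demonstrated without a large file by using an in-memory CSV buffer; each chunk is an ordinary DataFrame:

```python
import io
import pandas as pd

# A small in-memory CSV stands in for the converted large file
csv = io.StringIO('sales\n' + '\n'.join(str(i) for i in range(10)))

total = 0
for chunk in pd.read_csv(csv, chunksize=4):
    total += chunk['sales'].sum()  # chunks of 4, 4, and 2 rows
print(total)  # → 45
```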
Use more efficient file formats:
- Convert Excel to CSV before processing (usually faster)
- Store intermediate data in Parquet or Feather format
Parallel processing:

```python
import dask.dataframe as dd

# Dask has no read_excel, so convert the Excel file to CSV (or Parquet)
# first, then process the large dataset in parallel
ddf = dd.read_csv('large_file.csv')
result = ddf.groupby('product')['sales'].sum().compute()
```
9. Frequently Asked Questions
Chinese text not displaying correctly:

```python
import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif'] = ['SimHei']  # Set a Chinese font
plt.rcParams['axes.unicode_minus'] = False    # Fix minus-sign display
```
Inconsistent date formats:

```python
# Try multiple date formats: parse with the first format, then fill the
# failures by parsing the remaining strings with the second format
dates = pd.to_datetime(df['date'], format='%Y-%m-%d', errors='coerce')
dates = dates.fillna(pd.to_datetime(df['date'], format='%d/%m/%Y', errors='coerce'))
df['date'] = dates.fillna(pd.to_datetime('1900-01-01'))  # Handle unresolved dates
```
Out-of-memory errors:
- Use the dtype parameter to specify column data types and reduce memory usage
- Process large files in chunks
- Use more efficient file formats
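The dtype tip can be checked directly: narrower numeric types and the category dtype usually cut memory use substantially (a sketch using an in-memory CSV):

```python
import io
import pandas as pd

csv_text = 'quantity,product\n10,Product A\n5,Product B\n8,Product A\n'

default_df = pd.read_csv(io.StringIO(csv_text))
compact_df = pd.read_csv(io.StringIO(csv_text),
                         dtype={'quantity': 'int32', 'product': 'category'})

print(default_df.memory_usage(deep=True).sum())
print(compact_df.memory_usage(deep=True).sum())  # smaller than the default
```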
10. Directions for further analysis
- Predictive analytics:
  - Predict future sales using time series models
  - Apply machine learning models to forecast product demand
- Customer segmentation:
  - Group customers based on purchasing behavior
  - Build an RFM model (recency, frequency, monetary value)
- Anomaly detection:
  - Identify abnormal sales records
  - Detect abnormal patterns in the data
- Geospatial analysis:
  - If the data contains geographic information, visualize it on a map
  - Analyze sales performance across regions
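As one concrete illustration of the directions above, the RFM idea can be sketched with a groupby over a hypothetical transaction log (the column names here are made up for the example):

```python
import pandas as pd

# Hypothetical transaction log: one row per purchase
tx = pd.DataFrame({
    'customer': ['C1', 'C1', 'C2', 'C3', 'C3', 'C3'],
    'date': pd.to_datetime(['2023-01-01', '2023-01-10', '2023-01-05',
                            '2023-01-02', '2023-01-08', '2023-01-09']),
    'amount': [100, 50, 200, 30, 30, 40],
})
now = pd.Timestamp('2023-01-11')

rfm = tx.groupby('customer').agg(
    recency=('date', lambda d: (now - d.max()).days),  # days since last purchase
    frequency=('date', 'count'),                        # number of purchases
    monetary=('amount', 'sum'),                         # total spend
)
print(rfm)
```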
That concludes the detailed steps for analyzing and processing Excel file data with Python. For more on working with Excel data in Python, see my other related articles!