1. Problem background
When analyzing an Excel file, missing values may appear as NaN, empty cells, or special symbols (such as ?). Handling these missing values manually is time-consuming and error-prone, so an automated solution is required. For example, you may encounter the following scenarios:
- Sales data: Sales in a certain month have not been recorded.
- User survey form: Some respondents did not fill in their age or gender.
- Sensor data: Equipment failure results in no record at some points in time.
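Before any filling can happen, the different markers have to be recognized as missing. The following is a minimal sketch, assuming a small hypothetical survey table (the column names and values are invented for illustration) in which missing entries show up as NaN, empty strings, or "?":

```python
import numpy as np
import pandas as pd

# Hypothetical survey data: missing entries appear as NaN, "" or "?"
df = pd.DataFrame({
    "age": [25, np.nan, 31, "?"],
    "gender": ["F", "", "M", "F"],
})

# Treat empty or whitespace-only strings and "?" as missing
df_normalized = df.replace(r"^\s*$", np.nan, regex=True).replace("?", np.nan)

# "age" held a "?" string, so convert it back to a numeric column
df_normalized["age"] = pd.to_numeric(df_normalized["age"])

# Count the missing values in each column
missing_per_column = df_normalized.isna().sum()
print(missing_per_column)
```

When loading directly from Excel, pd.read_excel also accepts an na_values argument (for example na_values=["?"]) that marks such symbols as missing at read time.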
2. Core tools and principles
Tool selection
- pandas: The de facto standard Python library for data processing, used here to read Excel files and manipulate the data.
- scikit-learn: A machine learning library whose SimpleImputer class provides an automated way to fill missing values.
Filling strategy
- Numerical data: Fill with the column mean (mean) or median (median).
- Categorical data: Fill with the mode (most_frequent).
- Extreme cases: If missing values make up too large a proportion, the column or row can be dropped outright.
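To make the three strategies concrete, here is a small sketch on invented data (the sales and color columns are hypothetical). It shows how a single outlier pulls the mean fill far away from the median fill, and how the mode is used for a text column:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical sales figures with one outlier (1000) and one missing value
sales = pd.DataFrame({"sales": [10.0, 12.0, np.nan, 14.0, 1000.0]})

mean_filled = SimpleImputer(strategy="mean").fit_transform(sales)
median_filled = SimpleImputer(strategy="median").fit_transform(sales)

# Hypothetical category column with one missing value
colors = pd.DataFrame({"color": ["red", "blue", "red", np.nan]}, dtype=object)
mode_filled = SimpleImputer(strategy="most_frequent").fit_transform(colors)

print(mean_filled[2, 0])    # mean of 10, 12, 14 and 1000
print(median_filled[2, 0])  # median of 10, 12, 14 and 1000
print(mode_filled[3, 0])    # most frequent color
```

The mean fill here is 259.0 while the median fill is 13.0, which is why the median is often the safer choice when outliers are present.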
3. Detailed explanation of the code implementation steps
The following walks through the complete implementation step by step:
Step 1: Read Excel file
    import pandas as pd

    # Read the Excel file (the path is left blank here; supply your own)
    df = pd.read_excel("")
Step 2: Separate numerical and category data
    # Separate numerical and non-numerical columns
    numeric_cols = df.select_dtypes(include=['number']).columns
    categorical_cols = df.select_dtypes(exclude=['number']).columns
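It can help to check what this split produces before filling anything. The sketch below uses an invented four-column table (all names are hypothetical) and shows that boolean columns land on the non-numerical side:

```python
import numpy as np
import pandas as pd

# Hypothetical table mixing float, integer, text and boolean columns
df = pd.DataFrame({
    "price": [9.5, np.nan, 12.0],
    "quantity": [1, 2, 3],
    "city": ["Paris", np.nan, "Rome"],
    "is_member": [True, False, True],
})

numeric_cols = df.select_dtypes(include=["number"]).columns
categorical_cols = df.select_dtypes(exclude=["number"]).columns

print(list(numeric_cols))      # float and integer columns
print(list(categorical_cols))  # everything else, including booleans
```

Because "number" does not cover booleans or dates, such columns are later filled with the mode like any other non-numerical column, which may or may not be what you want.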
Step 3: Fill in the missing numerical values (mean fill)
    from sklearn.impute import SimpleImputer

    # Create a numerical imputer (mean strategy)
    numeric_imputer = SimpleImputer(strategy='mean')

    # Fill the numeric columns and convert the result back to a DataFrame
    df_numeric = pd.DataFrame(
        numeric_imputer.fit_transform(df[numeric_cols]),
        columns=numeric_cols
    )
Step 4: Fill in the missing categorical values (mode fill)
    # Create a categorical imputer (mode strategy)
    categorical_imputer = SimpleImputer(strategy='most_frequent')

    # Fill the categorical columns and convert the result back to a DataFrame
    df_categorical = pd.DataFrame(
        categorical_imputer.fit_transform(df[categorical_cols]),
        columns=categorical_cols
    )
Step 5: Merge the processed data
    # Merge the numerical and categorical data
    df_cleaned = pd.concat([df_numeric, df_categorical], axis=1)
Step 6: Save the cleaned data
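    # Save as a new Excel file (the file name appears truncated here;
    # to_excel needs a path with an Excel extension such as .xlsx)
    df_cleaned.to_excel("cleaned_mx", index=False)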
4. Notes and extensions
Things to note
- Data type check:
  - Make sure select_dtypes correctly separates numerical and categorical columns (for example, object columns may contain text or dates and require additional processing).
- Outlier detection:
  - The mean is sensitive to outliers, so the median can be used instead (strategy='median').
- Deletion policy:
  - If a column has too many missing values (for example, more than 50%), you can drop it directly:
    df = df.dropna(thresh=len(df)*0.5, axis=1)
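The type check and the deletion policy above can be sketched together. This is an illustrative example on invented data (the column names are hypothetical): it computes the share of missing values per column, drops the columns above the 50% threshold, and parses a date column that was stored as text:

```python
import numpy as np
import pandas as pd

# Hypothetical sensor log: one column is mostly empty, dates are stored as text
df = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "temperature": [20.5, np.nan, 21.0, 22.5],
    "recorded_on": ["2024-01-01", "2024-01-02", np.nan, "2024-01-04"],
})

# Share of missing values in each column (0.0 to 1.0)
missing_ratio = df.isna().mean()

# Keep only the columns where at most half of the values are missing
df_reduced = df.loc[:, missing_ratio <= 0.5].copy()

# Convert the date text into a real datetime column; missing dates become NaT
df_reduced["recorded_on"] = pd.to_datetime(df_reduced["recorded_on"])

print(missing_ratio)
print(df_reduced.dtypes)
```

Inspecting missing_ratio first makes the threshold an explicit decision rather than a side effect of dropna.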
Extended features
- Visualize the distribution of missing values:
  Use the missingno library to quickly view the distribution of missing values:
    import missingno as msno
    import matplotlib.pyplot as plt

    # The plotting call was lost in the original text; matrix() is assumed here
    msno.matrix(df)
    plt.show()
- Customize the fill logic:
  - For time series data, interpolation can be used (interpolate()).
  - For categorical data, a specific value can be filled in (e.g. N/A or, as here, "Unknown"):
    df_categorical.fillna("Unknown", inplace=True)
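As an illustration of the interpolation option, the sketch below fills a gap in a hypothetical daily temperature series (the dates and readings are invented) with pandas' linear interpolation:

```python
import numpy as np
import pandas as pd

# Hypothetical daily sensor readings with a two-day gap
readings = pd.Series(
    [10.0, np.nan, np.nan, 16.0, 18.0],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
    name="temperature",
)

# Linear interpolation draws a straight line between the known neighbours
filled = readings.interpolate(method="linear")
print(filled)
```

The two missing days are filled with 12 and 14, the evenly spaced values between 10 and 16. Unlike a mean fill, this respects the order of the observations, which is usually what matters for time series.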
5. Complete code and examples
    import pandas as pd
    from sklearn.impute import SimpleImputer

    def clean_excel_file(file_path, output_path):
        """
        Automatically handle missing values in an Excel file:
        1. Numerical columns are filled with the mean
        2. Categorical columns are filled with the mode
        3. The cleaned data is saved
        """
        # Read the data
        df = pd.read_excel(file_path)

        # Separate numerical and categorical columns
        numeric_cols = df.select_dtypes(include=['number']).columns
        categorical_cols = df.select_dtypes(exclude=['number']).columns

        # Process the numerical columns
        numeric_imputer = SimpleImputer(strategy='mean')
        df_numeric = pd.DataFrame(
            numeric_imputer.fit_transform(df[numeric_cols]),
            columns=numeric_cols
        )

        # Process the categorical columns
        categorical_imputer = SimpleImputer(strategy='most_frequent')
        df_categorical = pd.DataFrame(
            categorical_imputer.fit_transform(df[categorical_cols]),
            columns=categorical_cols
        )

        # Merge the data and save it
        df_cleaned = pd.concat([df_numeric, df_categorical], axis=1)
        df_cleaned.to_excel(output_path, index=False)
        print(f"Data has been cleaned and saved to {output_path}")

    # Usage example (the input path is blank and the output name appears
    # truncated in the original; use your own .xlsx paths)
    clean_excel_file("", "cleaned_mx")
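One side effect of merging the numerical and categorical parts is that the columns come out regrouped rather than in their original order. The following is a sketch of a DataFrame-level variant that restores the original column order and index. The helper name fill_missing and the sample data are hypothetical, and it assumes no column is entirely empty (SimpleImputer drops such columns by default) and that missing entries are NaN rather than None:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer


def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: impute a DataFrame, keeping column order and index."""
    numeric_cols = df.select_dtypes(include=["number"]).columns
    categorical_cols = df.select_dtypes(exclude=["number"]).columns

    parts = []
    if len(numeric_cols) > 0:
        numeric_values = SimpleImputer(strategy="mean").fit_transform(df[numeric_cols])
        parts.append(pd.DataFrame(numeric_values, columns=numeric_cols, index=df.index))
    if len(categorical_cols) > 0:
        categorical_values = SimpleImputer(strategy="most_frequent").fit_transform(
            df[categorical_cols]
        )
        parts.append(
            pd.DataFrame(categorical_values, columns=categorical_cols, index=df.index)
        )

    # Put the columns back in the order they had in the input
    return pd.concat(parts, axis=1)[df.columns]


# Invented example: the text column comes first and the index is not 0..n-1
df = pd.DataFrame(
    {
        "name": pd.Series(["a", np.nan, "a"], dtype=object),
        "score": [1.0, np.nan, 3.0],
    }
)
df.index = [10, 11, 12]
cleaned = fill_missing(df)
print(cleaned)
```

Working on a DataFrame instead of file paths also makes the logic easy to test without touching any Excel files.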
Summary
With the methods above, you can quickly and automatically handle missing values in Excel files, laying the foundation for subsequent analysis. If more complex processing is required (such as interpolation or predictive filling), you can combine this approach with other libraries (such as clevercsv) or with the pandas interpolate method for further refinement.
Next step suggestions:
- Try replacing SimpleImputer with mode() and compare the results.
- Visually analyze the cleaned data (for example, with matplotlib or seaborn).
- Encapsulate the logic as a reusable function and integrate it into your data analysis workflow.
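For the first suggestion, a comparison could look like the sketch below. It uses an invented single-column table (the city values are hypothetical) and fills the same gap once with pandas' mode() and once with SimpleImputer:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical category column with one missing value
df = pd.DataFrame(
    {"city": pd.Series(["Rome", "Paris", np.nan, "Rome"], dtype=object)}
)

# pandas only: mode() returns a Series of modes, so take the first one
pandas_filled = df["city"].fillna(df["city"].mode().iloc[0])

# scikit-learn: fit_transform returns a 2D array, so take its only column
sklearn_filled = SimpleImputer(strategy="most_frequent").fit_transform(df[["city"]])[:, 0]

print(pandas_filled.tolist())
print(list(sklearn_filled))
```

On this data both approaches fill the gap with "Rome". The pandas version avoids the extra dependency, while SimpleImputer can be fitted once and then applied unchanged to new data.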