Complete Guide to Automatically Handling Missing Values of Excel Data with Python

1. Problem background

When analyzing an Excel file, missing values may appear as NaN, as empty cells, or as special symbols (such as ?). Handling these missing values manually is time-consuming and error-prone, so an automated solution is needed. For example, you may encounter the following scenarios:

Sales data: Sales in a certain month have not been recorded.
User survey form: Some respondents did not fill in their age or gender.
Sensor data: Equipment failure results in no record at some points in time.
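
Placeholder symbols such as ? are ordinary strings to pandas, so they are not counted as missing until they are converted to NaN. The following is a minimal sketch of that conversion, assuming a small made-up survey table in which "?" and the empty string are the only placeholders in use:

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; "?" and "" stand in for unanswered questions.
raw = pd.DataFrame({
    "age": [25, "?", 31, np.nan],
    "gender": ["F", "M", "", "?"],
})

# Convert the placeholder strings to real missing values.
normalized = raw.replace({"?": np.nan, "": np.nan})

# Count missing cells per column after normalization.
missing_counts = normalized.isna().sum()
print(missing_counts)
```

When the data comes straight from a workbook, pd.read_excel also accepts an na_values argument, which lets you declare such placeholders at read time instead.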

2. Core tools and principles

Tool selection

pandas: The de facto standard third-party library for data processing in Python, used here to read Excel files and manipulate the data.
scikit-learn: A machine learning library whose SimpleImputer class (in the sklearn.impute module) provides an automated way to fill missing values.

Filling strategy

Numerical data: Fill with the column mean (mean) or median (median).
Categorical data: Fill with the mode (most_frequent).
Extreme cases: If missing values make up too large a share of a column or row, that column or row can be deleted outright.
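
To make the difference between these strategies concrete, here is a small sketch using plain pandas on a made-up sales table (the column names and values are illustrative only). It shows how a single outlier shifts the mean while leaving the median almost untouched:

```python
import numpy as np
import pandas as pd

# Made-up sales table: one extreme value (10000) and one gap per column.
df = pd.DataFrame({
    "sales": [100.0, 110.0, 90.0, 10000.0, np.nan],
    "region": ["north", "south", "north", np.nan, "north"],
})

mean_fill = df["sales"].mean()      # pulled upward by the outlier
median_fill = df["sales"].median()  # robust to the outlier
mode_fill = df["region"].mode()[0]  # most frequent category

# Fill the numeric gap with the median and the categorical gap with the mode.
filled = df.assign(
    sales=df["sales"].fillna(median_fill),
    region=df["region"].fillna(mode_fill),
)
print(mean_fill, median_fill, mode_fill)
```

Here the mean is 2575.0 and the median is 105.0, which is why the median is usually the safer choice when a column contains extreme values.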

3. Detailed explanation of the code implementation steps

The following walks through the complete implementation step by step:

Step 1: Read Excel file

import pandas as pd

# Read the Excel file
df = pd.read_excel("")

Step 2: Separate numerical and category data

# Separate numerical and non-numerical columns
numeric_cols = df.select_dtypes(include=['number']).columns
categorical_cols = df.select_dtypes(exclude=['number']).columns
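
select_dtypes relies on the dtype pandas has stored, so a numeric column that contains even one stray string is classified as non-numeric. The sketch below assumes a hypothetical price column polluted by a "?" placeholder and uses pd.to_numeric with errors="coerce" to recover it before separating the columns:

```python
import pandas as pd

# Hypothetical frame: "price" holds numbers plus one placeholder string,
# so pandas stores the whole column as object.
df = pd.DataFrame({
    "price": [9.5, "?", 12.0],
    "label": ["a", "b", None],
})

before = list(df.select_dtypes(include=["number"]).columns)

# Coerce to numeric; anything that cannot be parsed becomes NaN.
df["price"] = pd.to_numeric(df["price"], errors="coerce")

after = list(df.select_dtypes(include=["number"]).columns)
print(before, after)
```

After the conversion, price is picked up as a numeric column and the former "?" cell is an ordinary missing value that the mean fill can handle.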

Step 3: Fill missing numerical values (mean fill)

from sklearn.impute import SimpleImputer

# Create a numerical imputer (mean strategy)
numeric_imputer = SimpleImputer(strategy='mean')

# Fill the numeric columns and convert the result back to a DataFrame
df_numeric = pd.DataFrame(
    numeric_imputer.fit_transform(df[numeric_cols]),
    columns=numeric_cols,
    index=df.index
)
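
One caveat with this step: by default SimpleImputer discards columns that contain no observed values at all, so the transformed array can end up with fewer columns than numeric_cols and the DataFrame construction then fails. A possible guard, sketched here on a made-up frame (the usable_cols name is an assumption for illustration), is to impute only the columns that have at least one value:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, np.nan],  # no observed values at all
})

# Keep only numeric columns with at least one observed value.
numeric_cols = df.select_dtypes(include=["number"]).columns
usable_cols = [c for c in numeric_cols if df[c].notna().any()]

imputer = SimpleImputer(strategy="mean")
df_numeric = pd.DataFrame(
    imputer.fit_transform(df[usable_cols]),
    columns=usable_cols,
    index=df.index,
)
print(df_numeric)
```

Columns that are skipped this way are good candidates for deletion, since there is nothing to estimate a fill value from.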

Step 4: Fill missing categorical values (mode fill)

# Create a categorical imputer (mode strategy)
categorical_imputer = SimpleImputer(strategy='most_frequent')

# Fill the categorical columns and convert the result back to a DataFrame
df_categorical = pd.DataFrame(
    categorical_imputer.fit_transform(df[categorical_cols]),
    columns=categorical_cols,
    index=df.index
)

Step 5: Merge the processed data

# Merge the numerical and categorical data
df_cleaned = pd.concat([df_numeric, df_categorical], axis=1)
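
Because the numeric and categorical halves are concatenated as two groups, the merged frame lists all numeric columns first instead of following the original column order. If the original layout matters, one option is to reselect the columns by df.columns, as in this sketch with made-up column names:

```python
import pandas as pd

# Stand-ins for the original frame and the two imputed halves.
df = pd.DataFrame({"name": ["x", "y"], "qty": [1.0, 2.0], "city": ["A", "B"]})
df_numeric = df[["qty"]]
df_categorical = df[["name", "city"]]

# Concatenation puts the numeric columns first...
df_cleaned = pd.concat([df_numeric, df_categorical], axis=1)

# ...so reselect by the original column list to restore the layout.
df_cleaned = df_cleaned[df.columns]
print(list(df_cleaned.columns))
```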

Step 6: Save the cleaned data

# Save as a new Excel file
df_cleaned.to_excel("cleaned_mx", index=False)

4. Notes and extensions

Things to note

Data type check:
- Make sure select_dtypes separates the numerical and categorical columns correctly (for example, object columns may contain text or dates and need additional processing).
Outlier detection:
- The mean is sensitive to outliers, so the median can be used instead (strategy='median').
Deletion policy:
- If a column has too many missing values (for example, more than 50%), you can delete it directly:

df = df.dropna(thresh=int(len(df) * 0.5), axis=1)
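
Before deleting anything, it can help to look at the share of missing cells per column. The sketch below is one way to do that, assuming a made-up two-column frame and an assumed 50% cutoff (max_missing_ratio is a hypothetical name, not part of pandas):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "mostly_empty": [np.nan, np.nan, np.nan, 4.0],  # 75% missing
    "mostly_full": [1.0, 2.0, np.nan, 4.0],         # 25% missing
})

max_missing_ratio = 0.5  # assumed cutoff; tune per dataset

# isna().mean() gives the fraction of missing cells in each column.
missing_ratio = df.isna().mean()

# Keep only the columns at or below the cutoff.
kept = df.loc[:, missing_ratio <= max_missing_ratio]
print(missing_ratio.to_dict(), list(kept.columns))
```

Working from an explicit ratio makes the threshold easy to report and adjust, whereas dropna(thresh=...) expresses the same idea as a minimum count of non-missing values.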

Extended features

Visualize the distribution of missing values:
Use the missingno library to quickly view the distribution of missing values:

import matplotlib.pyplot as plt
import missingno as msno

msno.matrix(df)  # one row per record; white gaps mark missing cells
plt.show()

Customize fill logic:
- For time series data, interpolation can be used (interpolate()).
- For categorical data, a fixed placeholder value can be filled in (e.g. "N/A" or "Unknown"):

df_categorical.fillna("Unknown", inplace=True)
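
For the time series case, the following is a minimal sketch of linear interpolation on made-up per-minute sensor readings. The timestamps and values are illustrative, and other interpolation methods may suit real data better:

```python
import numpy as np
import pandas as pd

# Hypothetical per-minute sensor readings with two dropped samples.
index = pd.date_range("2024-01-01", periods=5, freq="min")
readings = pd.Series([10.0, np.nan, np.nan, 16.0, 18.0], index=index)

# Linear interpolation fills each gap from its neighbouring readings.
filled = readings.interpolate(method="linear")
print(filled.tolist())
```

Unlike a mean fill, this keeps the trend between neighbouring readings, which is usually what you want when the gaps come from brief equipment failures.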

5. Complete code and examples

import pandas as pd
from sklearn.impute import SimpleImputer

def clean_excel_file(file_path, output_path):
    """
    Automatically handle missing values in an Excel file:
    1. Numerical columns are filled with the mean
    2. Categorical columns are filled with the mode
    3. The cleaned data is saved
    """
    # Read the data
    df = pd.read_excel(file_path)

    # Separate numerical and categorical columns
    numeric_cols = df.select_dtypes(include=['number']).columns
    categorical_cols = df.select_dtypes(exclude=['number']).columns

    # Process the numeric columns
    numeric_imputer = SimpleImputer(strategy='mean')
    df_numeric = pd.DataFrame(
        numeric_imputer.fit_transform(df[numeric_cols]),
        columns=numeric_cols,
        index=df.index
    )

    # Process the categorical columns
    categorical_imputer = SimpleImputer(strategy='most_frequent')
    df_categorical = pd.DataFrame(
        categorical_imputer.fit_transform(df[categorical_cols]),
        columns=categorical_cols,
        index=df.index
    )

    # Merge the data and save it
    df_cleaned = pd.concat([df_numeric, df_categorical], axis=1)
    df_cleaned.to_excel(output_path, index=False)
    print(f"Data has been cleaned and saved to {output_path}")

# Usage example
clean_excel_file("", "cleaned_mx")

Summary

With the methods above, you can quickly and automatically handle missing values in Excel files, laying the foundation for subsequent analysis. If more complex processing is required (such as interpolation or predictive imputation), you can combine this approach with other tools (such as clevercsv or the interpolate method of pandas) for further refinement.

Suggested next steps:

Try replacing SimpleImputer with the pandas mode() method and compare the results.
Analyze the cleaned data visually (for example, with matplotlib or seaborn).
Wrap the logic in a reusable function and integrate it into your data analysis workflow.
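
As a starting point for the first suggestion, here is a sketch of a pandas-only variant that uses mean() and mode() instead of SimpleImputer. The function name fill_missing_with_pandas and the sample data are assumptions for illustration, not part of the code above:

```python
import numpy as np
import pandas as pd

def fill_missing_with_pandas(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with the mean and other columns with the mode."""
    result = df.copy()
    for col in result.columns:
        if result[col].notna().sum() == 0:
            continue  # nothing to derive a fill value from
        if pd.api.types.is_numeric_dtype(result[col]):
            result[col] = result[col].fillna(result[col].mean())
        else:
            # mode() returns a Series; take its first entry.
            result[col] = result[col].fillna(result[col].mode()[0])
    return result

sample = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "gender": ["F", None, "F"],
})
cleaned = fill_missing_with_pandas(sample)
print(cleaned)
```

Running both versions on the same workbook and comparing the outputs column by column is a simple way to check whether the two approaches agree on your data.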

