SoFunction
Updated on 2024-12-16

Eight common ways Python cleans up data

Data cleaning is a very important step in the field of data science and machine learning. Uncleaned data may contain many problems such as missing values, outliers, duplicate values, and irrelevant features. These problems may negatively affect the analysis results and model training. In this article, we will introduce some common data cleaning methods in Python, including data previewing, missing value handling, outlier handling, data type conversion, duplicate value handling, data normalization, feature selection, and dealing with categorical data.

1. Data Preview

Before you start cleaning your data, you first need to preview the data. Using the pandas library makes it easy to view the data. Here are some common pandas functions that preview the data:

  • head(n): return the first n rows of the dataset.
  • tail(n): return the last n rows of the dataset.
  • info(): show the basic information of the dataset, including the number of non-null values in each column and the data type.
  • describe(): provides descriptive statistics of the dataset, including counts, mean, standard deviation, minimum and maximum values.

The sample code is as follows:

import pandas as pd  
  
# Read the data
df = pd.read_csv('')  
  
# Display the first 5 rows
print(df.head())

# Display the last 5 rows
print(df.tail())

# Display basic information
df.info()

# Display descriptive statistics
print(df.describe())

2. Missing value processing

Data may contain missing values, which may be unrecorded or unavailable for some reason. Common ways to deal with missing values are to delete rows or columns containing missing values, to fill in missing values, or to interpolate. Here are a few common pandas functions that deal with missing values:

  • fillna(value): fills the missing values with the specified values.
  • ffill(): fills missing values with the previous non-null value.
  • bfill(): fills missing values with the next non-null value.
  • dropna(): removes rows or columns containing missing values.

The sample code is as follows:

import pandas as pd  
  
# Read the data
df = pd.read_csv('')  
  
# Fill missing values with 0
df.fillna(0, inplace=True)

# Fill missing values with the previous non-null value
df['column_name'] = df['column_name'].ffill()

# Fill missing values with the next non-null value
df['column_name'] = df['column_name'].bfill()

# Delete rows containing missing values
df = df.dropna()
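
Interpolation, mentioned above, can be done with pandas' interpolate(), which by default fills each gap linearly between the surrounding values. A minimal sketch on a toy series:

```python
import numpy as np
import pandas as pd

# A toy series with gaps; interpolate() fills NaN by linear interpolation by default
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
filled = s.interpolate()
print(filled.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
```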

3. Handling of outliers

The data may also contain outliers, which can distort analysis results. Common strategies are to delete rows containing outliers, treat outliers as missing values, or correct them. The following are a few common ways to identify and handle outliers:

  • drop(): removes rows or columns, e.g. rows identified as outliers.
  • clip(lower, upper): clips values outside the specified range to the boundary values.
  • boxplot(): plots a boxplot, which helps identify outliers visually.
  • hist(): plots a histogram, which helps identify outliers visually.
  • scipy.stats.zscore(): computes how many standard deviations each value lies from the mean; values with an absolute z-score above a threshold (commonly 3) are often treated as outliers.
  • quantile(): returns quantiles, from which the interquartile range (IQR) can be computed; values below Q1 − 1.5·IQR or above Q3 + 1.5·IQR are commonly treated as outliers.
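
As a sketch of the IQR rule combined with drop-style filtering and clip() from the list above (the column name and data are made up):

```python
import pandas as pd

df = pd.DataFrame({'value': [10, 12, 11, 13, 12, 95]})  # 95 is an obvious outlier

# Compute the IQR fence
q1 = df['value'].quantile(0.25)
q3 = df['value'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Option 1: drop rows whose value falls outside the fence
cleaned = df[(df['value'] >= lower) & (df['value'] <= upper)]

# Option 2: clip values to the fence instead of dropping them
clipped = df['value'].clip(lower, upper)
```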

4. Data type conversion

In data analysis, it is often necessary to convert data to an appropriate type, for example converting a string to an integer or floating-point number, or converting a date-time string to a specific format. Data type conversions can be easily performed using pandas' astype() function. In addition, the to_datetime() function converts datetime strings to datetime objects.

The sample code is as follows:

import pandas as pd  
  
# Convert strings to integers
df['column_name'] = df['column_name'].astype(int)  
  
# Convert strings to floating point numbers
df['column_name'] = df['column_name'].astype(float)  
  
# Convert datetime strings to datetime objects
df['column_name'] = pd.to_datetime(df['column_name'])
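
Note that astype(int) raises an error if any value cannot be parsed. When the data may be dirty, pd.to_numeric() and pd.to_datetime() accept errors='coerce', which turns unparseable entries into NaN/NaT that can then be handled as missing values. A small sketch with made-up column names and data:

```python
import pandas as pd

df = pd.DataFrame({'n': ['1', '2', 'oops'], 'd': ['2024-01-01', 'bad', '2024-03-05']})

# errors='coerce' replaces unparseable entries with NaN / NaT instead of raising
df['n'] = pd.to_numeric(df['n'], errors='coerce')
df['d'] = pd.to_datetime(df['d'], errors='coerce')
```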

5. Duplicate value handling

The data may contain duplicate rows, and these duplicate values may interfere with data analysis. Duplicate rows can be easily removed using the drop_duplicates() function of pandas. You can define what constitutes a duplicate based on the values in one or more columns.

The sample code is as follows:

import pandas as pd  
  
# Delete duplicate rows
df = df.drop_duplicates()
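
To define duplicates based on the values of specific columns, as mentioned above, drop_duplicates() takes a subset parameter (the data below is made up):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2], 'score': [10, 20, 30]})

# Rows with the same 'id' count as duplicates; keep the first occurrence
deduped = df.drop_duplicates(subset=['id'], keep='first')
```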

6. Data standardization

Standardizing data to the same scale helps with comparison and analysis. This can be done with scikit-learn's StandardScaler, which transforms each feature to have mean 0 and standard deviation 1.

The sample code is as follows:

from sklearn.preprocessing import StandardScaler
  
# Create a standardizer
scaler = StandardScaler()

# Standardize the data (note: fit_transform returns a NumPy array, not a DataFrame)
df = scaler.fit_transform(df)
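
Because fit_transform() returns a NumPy array, the column labels are lost. A common pattern, sketched here with made-up data, is to wrap the result back into a DataFrame:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]})

# Standardize and keep the original column labels and index
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)
```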

7. Feature selection

Before a machine learning model is trained, it is critical to select the most predictive and representative features. This can help the model understand the data better and improve the prediction accuracy. Here are some common methods of feature selection:

Filter methods: select the most predictive features by computing a statistic, such as a correlation coefficient or chi-square score, for each feature. For example, compute correlations between features with the corr() function, or use scikit-learn's chi2() function to score features against the target variable.

The sample code is as follows:

import pandas as pd  
from sklearn.feature_selection import SelectKBest, chi2  
  
# Select the 10 features with the highest chi-square score against the target
kbest = SelectKBest(score_func=chi2, k=10)
X_new = kbest.fit_transform(df.drop('target_column', axis=1), df['target_column'])

Wrapper methods: select the most predictive features by training a model and evaluating feature importance, for example with a random forest or XGBoost model. The sample code is as follows:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel  
  
# Train a random forest classifier
clf = RandomForestClassifier()
clf.fit(df.drop('target_column', axis=1), df['target_column'])

# Create a feature selector that keeps features whose importance exceeds the threshold
sfm = SelectFromModel(clf, threshold=0.15, prefit=True)
X_new = sfm.transform(df.drop('target_column', axis=1))

8. Processing of category data

Categorical data, also known as nominal data, is a discrete variable. Such data usually needs to be encoded, for example with one-hot encoding or label encoding.

The following is an example of using pandas to process category data:

import pandas as pd  
  
# Read the data
df = pd.read_csv('')  
  
# Convert categorical data to one-hot encoding
df = pd.get_dummies(df, columns=['column_name'])

# Convert categorical data to label codes
df['column_name'] = df['column_name'].map({'Category 1': 1, 'Category 2': 2, 'Category 3': 3})

Data sorting

For some data analysis tasks the data needs to be sorted, for example to view trends in chronological order. pandas' sort_values() function makes sorting easy. The sample code is as follows:

import pandas as pd  
  
# Read the data
df = pd.read_csv('')  
  
# Sort by time column in ascending order
df = df.sort_values(by='time_column')

These are the data cleaning and preprocessing methods commonly used in Python. They are an important foundation for data analysis and machine learning, allowing us to extract useful information from messy data and making analysis and modeling easier and more accurate.

Summary

In this article, we have covered Python's main strengths and capabilities as a data analysis tool. With Python we can process large amounts of data quickly and efficiently, perform data cleaning and preprocessing, and carry out analysis and modeling. Python is widely applicable, whether in data science, machine learning, or areas such as web scraping and automation.

Although Python has many advantages, in practice we should also be flexible about choosing other tools, such as R, SAS, or SPSS, depending on the specific situation; these tools may be more specialized and efficient for certain data analysis tasks.

Finally, to better understand and master data analysis with Python, we suggest that readers learn not only Python's basic syntax but also study the related data analysis libraries, such as pandas, NumPy, and Matplotlib, in depth. Through practice and accumulated experience, we can keep improving our data analysis skills and achieve greater success in the data field.
