Data normalization is a common technique in machine learning and data mining; in my own deep learning work it is one of the most basic preprocessing steps.
Its main purpose is to handle feature vectors whose components are on very different scales, so that features with small absolute values are not drowned out by features with large absolute values.
Normalization also speeds up training and helps prevent exploding gradients.
Below are two figures taken from Prof. Hongyi Li's video.
The left figure shows the error surface without data normalization, and the right figure shows it after data normalization. The normalized data reaches the optimum in fewer iterations and converges faster.
I. [0, 1] Normalization
[0, 1] normalization is one of the most basic data scaling methods: it compresses the data into the range between 0 and 1.
The formula is as follows, where min and max are the minimum and maximum of the data:

x_norm = (x - min) / (max - min)
Code implementation:
def MaxMinNormalization(x, min, max):
    """[0,1] normalization using externally supplied min and max."""
    x = (x - min) / (max - min)
    return x
or
import numpy as np

def MaxMinNormalization(x):
    """[0,1] normalization using the min and max of x itself."""
    x = (x - np.min(x)) / (np.max(x) - np.min(x))
    return x
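As a quick sanity check, here is a minimal usage sketch of the second variant; the toy array is made up for illustration:

import numpy as np

x = np.array([1.0, 5.0, 10.0])
# min is 1 and max is 10, so the values map to 0, 4/9 and 1
print(MaxMinNormalization(x))  # [0.         0.44444444 1.        ]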
II. Z-score standardization
Z-score standardization scales the data using its mean and standard deviation; the standardized data has mean 0 and variance 1. This method works best when the original data is approximately Gaussian, otherwise the results may be poor.
The formula is as follows, where μ is the mean of the data and σ is its standard deviation:

x_std = (x - μ) / σ

It is easy to see why data processed this way has mean 0 and variance 1: subtracting μ shifts the mean to 0, and dividing by σ scales the variance from σ² to σ²/σ² = 1.
Code implementation:
def ZscoreNormalization(x, mean_, std_):
    """Z-score normalization with externally supplied mean and standard deviation."""
    x = (x - mean_) / std_
    return x
or
import numpy as np

def ZscoreNormalization(x):
    """Z-score normalization using the mean and standard deviation of x itself."""
    x = (x - np.mean(x)) / np.std(x)
    return x
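A quick numerical check of the mean-0 / standard-deviation-1 claim, again with a made-up toy array:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
z = ZscoreNormalization(x)
print(z.mean(), z.std())  # 0.0 1.0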
Addendum: Python data preprocessing: a thorough understanding of standardization and normalization
Data preprocessing
Different features in a dataset may have inconsistent magnitudes, and the gaps between their values can be so large that leaving them unprocessed may distort the results of the analysis. The data therefore need to be scaled proportionally so that they fall into a specific range, which makes comprehensive analysis easier.
There are two commonly used methods:
Maximum - Minimum Normalization: performs a linear transformation on the original data, mapping the data to the [0,1] interval
Z-Score normalization: mapping raw data to a distribution with mean 0 and standard deviation 1
Why standardization/normalization?
Improved model accuracy: after standardization/normalization, features on different scales become numerically comparable, which can greatly improve the accuracy of the classifier.
Accelerated model convergence: after standardization/normalization, the search for the optimal solution becomes noticeably smoother, and it is easier to converge to the optimum correctly.
As shown in the figure below:
Which machine learning algorithms need standardization and normalization?
(1) Models trained with gradient descent, and models that compute distances, need normalization. Without it, the loss contours are elongated and gradient descent follows a zigzag path, so convergence is slow and it is harder to reach the optimum; after normalization, gradient descent finds the optimum faster and accuracy may also improve. Examples: linear regression, logistic regression, AdaBoost, XGBoost, GBDT, SVM, neural networks. Distance-based models such as KNN and KMeans also need normalization (see the sketch after this list).
(2) Probabilistic models and tree-structured models do not require normalization, because they do not care about the scale of the variables, only about their distribution and the conditional probabilities between them, e.g., decision trees and random forests.
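To make point (1) concrete, here is a minimal KNN sketch, assuming scikit-learn and a synthetic two-feature dataset whose "age" and "salary" columns are made up to mimic the scale gap discussed below:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
age = rng.uniform(20, 60, size=200)             # feature on the order of tens
salary = rng.uniform(30_000, 90_000, size=200)  # feature on the order of tens of thousands
X = np.column_stack([age, salary])
y = (age > 40).astype(int)                      # the label depends only on age

knn_raw = KNeighborsClassifier().fit(X, y)
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X, y)

# Without scaling, Euclidean distance is dominated by the salary axis, which carries
# no information about the label; with scaling, both features contribute, so the
# scaled pipeline typically scores noticeably higher here.
print(knn_raw.score(X, y), knn_scaled.score(X, y))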
Thorough understanding of standardization and normalization
The example dataset contains one dependent variable (Purchased) and three independent variables (Country, Age, and Salary). Salary has a much wider range than Age, so if the data is fed directly into a distance-based model (e.g., KNN, KMeans), the model will be completely dominated by Salary.
# Import data
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('')
Fill missing values with the mean and encode the categorical variables:
df['Salary'].fillna(df['Salary'].mean(), inplace=True)
df['Age'].fillna(df['Age'].mean(), inplace=True)
df['Purchased'] = df['Purchased'].apply(lambda x: 0 if x == 'No' else 1)
df = pd.get_dummies(data=df, columns=['Country'])
Maximum - Minimum Normalization
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(df)
scaled_features = scaler.transform(df)
df_MinMax = pd.DataFrame(data=scaled_features,
                         columns=["Age", "Salary", "Purchased", "Country_France",
                                  "Country_Germany", "Country_spain"])
Z-Score standardization
from sklearn.preprocessing import StandardScaler

sc_X = StandardScaler()
sc_X = sc_X.fit_transform(df)
sc_X = pd.DataFrame(data=sc_X,
                    columns=["Age", "Salary", "Purchased", "Country_France",
                             "Country_Germany", "Country_spain"])
import seaborn as sns
import matplotlib.pyplot as plt
import statistics

plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']
fig, axes = plt.subplots(2, 3, figsize=(18, 12))

# Age: raw, min-max normalized, and standardized distributions
sns.distplot(df['Age'], ax=axes[0, 0])
sns.distplot(df_MinMax['Age'], ax=axes[0, 1])
axes[0, 1].set_title('Normalized std: %s' % statistics.stdev(df_MinMax['Age']))
sns.distplot(sc_X['Age'], ax=axes[0, 2])
axes[0, 2].set_title('Standardized std: %s' % statistics.stdev(sc_X['Age']))

# Salary: raw, min-max normalized, and standardized distributions
sns.distplot(df['Salary'], ax=axes[1, 0])
sns.distplot(df_MinMax['Salary'], ax=axes[1, 1])
axes[1, 1].set_title('Normalized std: %s' % statistics.stdev(df_MinMax['Salary']))
sns.distplot(sc_X['Salary'], ax=axes[1, 2])
axes[1, 2].set_title('Standardized std: %s' % statistics.stdev(sc_X['Salary']))
It can be seen that normalization produces a smaller standard deviation than standardization, and data scaled by normalization is more concentrated around the mean. This is because min-max scaling is "flat" and uniform, determined only by the two extreme values, whereas standardized scaling is more "elastic" and "dynamic" and depends strongly on the overall distribution of the sample.
Because of this, normalization does not handle outliers well, whereas standardization is robust to outliers and in many cases outperforms normalization.
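A small sketch of the outlier point, assuming scikit-learn's MinMaxScaler and StandardScaler and a made-up one-column dataset containing a single extreme value:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])  # one extreme outlier

print(MinMaxScaler().fit_transform(x).ravel())
# roughly [0, 0.001, 0.002, 0.003, 1] -- the outlier pins the range and squashes the rest near 0

print(StandardScaler().fit_transform(x).ravel())
# roughly [-0.50, -0.50, -0.50, -0.50, 2.00] -- values are in standard deviations, not a forced [0, 1] range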
The above is based on my personal experience; I hope it can serve as a reference, and I hope you will continue to support me. If there are any mistakes or anything I have not fully considered, please do not hesitate to point them out.