Data standardization (normalization) is a basic step in data mining. Different evaluation indicators often have different scales and units, and these differences can distort the results of data analysis. Standardizing the data removes the effect of scale and makes the indicators comparable. After the raw data are standardized, the indicators share the same order of magnitude and are suitable for comprehensive comparative evaluation. The following are commonly used normalization methods:
Min-Max Normalization
Also known as deviation normalization, this is a linear transformation of the original data that maps the resulting values into the range [0, 1]. The transformation function is:
$$x^{*} = \frac{x - \min}{\max - \min}$$
where max is the maximum value of the sample data and min is the minimum value of the sample data. One drawback of this method is that adding new data may change max and min, in which case they must be recomputed and the data rescaled.
The Python code for min-max normalization is as follows:
    import numpy as np

    arr = np.asarray([0, 10, 50, 80, 100])
    lo, hi = arr.min(), arr.max()  # minimum and maximum of the sample
    for x in arr:
        print((x - lo) / (hi - lo))

    # output
    # 0.0
    # 0.1
    # 0.5
    # 0.8
    # 1.0
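The drawback mentioned above (new data can shift max and min) can be illustrated with a minimal sketch. The helper name `min_max_scale` and the new sample value 120 are assumptions made for this illustration; they do not come from any library or from the original data.

```python
import numpy as np

def min_max_scale(values, lo, hi):
    """Map values linearly so that lo -> 0 and hi -> 1 (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    if hi == lo:
        raise ValueError("min and max are equal; scaling is undefined")
    return (values - lo) / (hi - lo)

train = np.array([0, 10, 50, 80, 100])
lo, hi = train.min(), train.max()
scaled_train = min_max_scale(train, lo, hi)

# A new sample beyond the old maximum falls outside [0, 1]
# when the old min and max are reused.
new_sample = np.array([120])
scaled_new = min_max_scale(new_sample, lo, hi)

# Recomputing min and max over the enlarged data set restores the
# [0, 1] range, but every previously scaled value changes as well.
combined = np.concatenate([train, new_sample])
rescaled = min_max_scale(combined, combined.min(), combined.max())

print(scaled_new)
print(rescaled)
```

In practice this means that values scaled before and after the data set grows are not directly comparable unless everything is rescaled with the same min and max.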
The purposes of using this method include:
1. improving robustness for features with very small variance;
2. preserving entries that are 0 in a sparse matrix (this holds only when a feature's minimum is 0, since min-max scaling maps the minimum to 0).
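As a small check of the second point, the sketch below scales two made-up columns, one whose minimum is 0 and one whose minimum is negative. The function name `min_max` and both columns are illustrative assumptions.

```python
import numpy as np

def min_max(col):
    """Scale a 1-D array to [0, 1] using its own min and max (hypothetical helper)."""
    col = np.asarray(col, dtype=float)
    return (col - col.min()) / (col.max() - col.min())

col_nonneg = np.array([0.0, 0.0, 3.0, 6.0])   # minimum is 0
col_mixed = np.array([-2.0, 0.0, 0.0, 2.0])   # minimum is negative

scaled_nonneg = min_max(col_nonneg)  # zero entries remain 0
scaled_mixed = min_max(col_mixed)    # zero entries are shifted away from 0

print(scaled_nonneg)
print(scaled_mixed)
```

Zero entries survive only in the first column, so sparsity is kept when the data are non-negative.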
The following scales the data to the range 0 to 1 using scikit-learn's MinMaxScaler class:
    from sklearn import preprocessing
    import numpy as np

    X = np.array([[1., -1., 2.],
                  [2., 0., 0.],
                  [0., 1., -1.]])
    min_max_scaler = preprocessing.MinMaxScaler()
    X_minMax = min_max_scaler.fit_transform(X)
Final output:
    array([[0.5       , 0.        , 1.        ],
           [1.        , 0.5       , 0.33333333],
           [0.        , 1.        , 0.        ]])
Note: These transformations are applied column by column, that is, to each feature independently.
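The column-wise behavior can be reproduced by hand with NumPy, which also shows how min and max learned from one data set are reused on another. This is a sketch of the arithmetic only, not scikit-learn's implementation; the names `X_train` and `X_test` and the test row are assumptions made for illustration.

```python
import numpy as np

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

# axis=0 reduces over rows, giving one min and one max per column.
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)
X_train_scaled = (X_train - col_min) / (col_max - col_min)

# A hypothetical unseen row is scaled with the training min and max,
# so its values may fall outside [0, 1].
X_test = np.array([[-3., -1., 4.]])
X_test_scaled = (X_test - col_min) / (col_max - col_min)

print(X_train_scaled)
print(X_test_scaled)
```

The first printed array matches the output shown above, since each column is scaled with its own minimum and maximum.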
It is also possible to specify the target range directly when constructing the class object, with feature_range=(min, max). The applied formula then becomes:
    X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    X_minmax = X_std * (max - min) + min
Here max and min are the bounds given in feature_range, not the maximum and minimum of the data.
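The effect of a custom target range can be sketched in plain NumPy. The function name `min_max_to_range` is hypothetical, and the range (-1, 1) is chosen only as an example; the code follows the two-step formula (scale to [0, 1], then stretch and shift) and does not handle constant columns, where the denominator would be zero.

```python
import numpy as np

def min_max_to_range(X, feature_range=(0.0, 1.0)):
    """Scale each column of X into feature_range (hypothetical helper)."""
    lo, hi = feature_range
    X = np.asarray(X, dtype=float)
    X_min = X.min(axis=0)
    X_max = X.max(axis=0)
    X_std = (X - X_min) / (X_max - X_min)   # each column now lies in [0, 1]
    return X_std * (hi - lo) + lo           # stretch and shift to [lo, hi]

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])
X_scaled = min_max_to_range(X, feature_range=(-1.0, 1.0))

print(X_scaled)
```

Each column's minimum lands on the lower bound and its maximum on the upper bound, with everything in between placed linearly.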
Z-score Standardization
Also known as mean normalization, this method standardizes the data using the mean and standard deviation of the original data. The processed data has a mean of 0 and a standard deviation of 1. It follows the standard normal distribution only if the original data were normally distributed, because the transformation is linear and does not change the shape of the distribution. The transformation function is:
$$x^{*} = \frac{x - \mu}{\sigma}$$
where μ is the mean of all the sample data and σ is the standard deviation of all the sample data.
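The Python code for z-score standardization is as follows:
    import numpy as np

    arr = np.asarray([0, 10, 50, 80, 100])
    mu, sigma = arr.mean(), arr.std()  # population standard deviation (ddof=0)
    for x in arr:
        print(f"{(x - mu) / sigma:.6f}")

    # output
    # -1.241010
    # -0.982467
    # 0.051709
    # 0.827340
    # 1.344428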
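A vectorized version makes it easy to confirm that the result has mean 0 and standard deviation 1. This is a minimal sketch: the function name `z_score` is hypothetical, and the choice of the population standard deviation (`ddof=0`, NumPy's default) is an assumption that matches the numbers above.

```python
import numpy as np

def z_score(values, ddof=0):
    """Standardize values to zero mean and unit standard deviation (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    sigma = values.std(ddof=ddof)
    if sigma == 0:
        raise ValueError("standard deviation is zero; z-score is undefined")
    return (values - values.mean()) / sigma

z = z_score([0, 10, 50, 80, 100])

print(z)
print(z.mean(), z.std())  # approximately 0 and 1, up to floating-point error
```

Passing `ddof=1` would use the sample standard deviation instead, which gives slightly smaller absolute values for a data set this small.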