The essence of machine learning is to discover the intrinsic features of data from a dataset, but those intrinsic features are often hidden behind extrinsic ones such as the scale of the samples and their distribution range. Data preprocessing is a series of operations designed to help machine learning models and algorithms find the intrinsic features of the data as effectively as possible; these operations mainly include standardization, normalization, regularization, discretization, and whitening.
1 Standardization
Assume that the sample set is a number of points on a two-dimensional plane, with the horizontal coordinate x distributed in the interval [0,100] and the vertical coordinate y distributed in the interval [0,1]. Obviously, the dynamic ranges of the x and y feature columns differ greatly, and their influence on a machine learning model (e.g., k-nearest neighbors or k-means clustering) can be significantly different. Standardization is designed to prevent a feature column with an overly large dynamic range from dominating the computation, and at the same time to improve model accuracy. In essence, standardization centers each feature column of the sample set by subtracting the column mean, and scales it by dividing by the column's standard deviation.
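Before turning to Scikit-learn, here is a minimal NumPy-only sketch of that formula (the three-element column x is just an assumed example): subtract the column mean, then divide by the column standard deviation.
>>> import numpy as np
>>> x = np.array([1., 2., 0.])        # an assumed feature column
>>> (x - x.mean()) / x.std()          # centered by the mean, scaled by the standard deviation
array([ 0.        ,  1.22474487, -1.22474487])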
Scikit-learn's preprocessing submodule provides the fast standardization function scale(), which directly returns the standardized dataset, as in the following code.
>>> import numpy as np
>>> from sklearn import preprocessing as pp
>>> d = np.array([[ 1., -5., 8.], [ 2., -3., 0.], [ 0., -1., 1.]])
>>> d_scaled = pp.scale(d)  # Standardize the dataset d
>>> d_scaled
array([[ 0.        , -1.22474487,  1.40487872],
       [ 1.22474487,  0.        , -0.84292723],
       [-1.22474487,  1.22474487, -0.56195149]])
>>> d_scaled.mean(axis=0)  # Each feature column of the standardized dataset has mean 0
array([0., 0., 0.])
>>> d_scaled.std(axis=0)  # Each feature column of the standardized dataset has standard deviation 1
array([1., 1., 1.])
The preprocessing submodule also provides the utility class StandardScaler, which stores the mean and standard deviation of the feature columns computed on the training set so that the same transformation can later be applied to the test set. In addition, StandardScaler lets you specify whether to center and whether to scale by the standard deviation via the with_mean and with_std parameters, as shown in the following code.
>>> import numpy as np
>>> from sklearn import preprocessing as pp
>>> X_train = np.array([[ 1., -5., 8.], [ 2., -3., 0.], [ 0., -1., 1.]])
>>> scaler = pp.StandardScaler().fit(X_train)
>>> scaler
StandardScaler(copy=True, with_mean=True, with_std=True)
>>> scaler.mean_  # Mean of each feature column of the training set
array([ 1., -3.,  3.])
>>> scaler.scale_  # Standard deviation of each feature column of the training set
array([0.81649658, 1.63299316, 3.55902608])
>>> scaler.transform(X_train)  # Standardized training set
array([[ 0.        , -1.22474487,  1.40487872],
       [ 1.22474487,  0.        , -0.84292723],
       [-1.22474487,  1.22474487, -0.56195149]])
>>> X_test = [[-1., 1., 0.]]  # Standardize the test set using the scaling criteria from the training set
>>> scaler.transform(X_test)
array([[-2.44948974,  2.44948974, -0.84292723]])
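When a model follows the scaler, the two steps are often chained in a Pipeline so that the training-set statistics are reused on new data automatically. The short sketch below continues the session above; the labels y and the KNeighborsClassifier are assumptions added purely for illustration, not part of the original example.
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.neighbors import KNeighborsClassifier
>>> y = [0, 1, 0]                        # assumed labels for the three training samples
>>> pipe = make_pipeline(pp.StandardScaler(),
...                      KNeighborsClassifier(n_neighbors=1)).fit(X_train, y)
>>> pred = pipe.predict(X_test)          # X_test is scaled with the training-set mean and std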
2 Normalization
Standardization centers with the mean of each feature column and scales with the standard deviation. If instead the data is centered with the minimum value of each feature column and scaled by the range (maximum minus minimum), i.e., each value has the column minimum subtracted and is then divided by the column range so that the data is compressed into the interval [0,1], this process is called normalization.
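As with standardization, the formula can be written directly in NumPy; the sketch below applies (x - min) / (max - min) column by column to the same matrix used in the examples that follow.
>>> import numpy as np
>>> X = np.array([[ 1., -5., 8.], [ 2., -3., 0.], [ 0., -1., 1.]])
>>> mn, mx = X.min(axis=0), X.max(axis=0)   # column-wise minimum and maximum
>>> (X - mn) / (mx - mn)                    # subtract the minimum, divide by the range
array([[0.5  , 0.   , 1.   ],
       [1.   , 0.5  , 0.   ],
       [0.   , 1.   , 0.125]])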
Scikit-learn's preprocessing submodule provides the MinMaxScaler class to implement normalization. MinMaxScaler has an important parameter, feature_range, which sets the range the data is compressed into; the default is [0,1].
>>> import numpy as np
>>> from sklearn import preprocessing as pp
>>> X_train = np.array([[ 1., -5., 8.], [ 2., -3., 0.], [ 0., -1., 1.]])
>>> scaler = pp.MinMaxScaler().fit(X_train)  # The default compression range is [0,1]
>>> scaler
MinMaxScaler(copy=True, feature_range=(0, 1))
>>> scaler.transform(X_train)
array([[0.5  , 0.   , 1.   ],
       [1.   , 0.5  , 0.   ],
       [0.   , 1.   , 0.125]])
>>> scaler = pp.MinMaxScaler(feature_range=(-2, 2)).fit(X_train)  # Set the compression range to [-2,2]
>>> scaler.transform(X_train)
array([[ 0. , -2. ,  2. ],
       [ 2. ,  0. , -2. ],
       [-2. ,  2. , -1.5]])
Because normalization is very sensitive to outliers, most machine learning algorithms prefer standardization for feature scaling. Standardization is usually the better choice for algorithms such as principal component analysis (PCA), clustering, logistic regression, support vector machines, and neural networks. Normalization is widely used when distance metrics, gradients, and covariance calculations are not involved and the data needs to be compressed into a specific interval, for example when quantifying pixel intensities in digital image processing, where normalization compresses the data into the interval [0,1].
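For instance, 8-bit grayscale pixel values lie in [0, 255], so dividing by 255 performs exactly this kind of normalization; the tiny image below is an assumed example.
>>> import numpy as np
>>> img = np.array([[0, 128, 255], [64, 32, 16]], dtype=np.uint8)  # an assumed 2x3 grayscale image
>>> img_norm = img / 255.0                  # pixel intensities compressed into the interval [0,1]
>>> print(img_norm.min(), img_norm.max())   # the extremes map to 0 and 1
0.0 1.0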
3 Regularization
Standardization and normalization operate on the feature columns of a dataset, whereas regularization is a row-wise operation that scales each sample to unit norm. Regularization is useful if you intend to use operations such as the dot product to quantify the similarity between samples.
Scikit-learn's preprocessing submodule provides a fast regularization function, normalize(), which directly returns the regularized dataset. normalize() uses the norm parameter to choose the l1 or l2 norm, with l2 as the default. The l1 condition means that the sum of the absolute values of a sample's elements equals 1; the l2 condition means that the arithmetic square root of the sum of the squares of a sample's elements equals 1, which corresponds to the magnitude (length) of the sample vector.
>>> import numpy as np
>>> from sklearn import preprocessing as pp
>>> X_train = np.array([[ 1., -5., 8.], [ 2., -3., 0.], [ 0., -1., 1.]])
>>> pp.normalize(X_train)  # Regularize using the l2 norm; each row has unit l2 norm
array([[ 0.10540926, -0.52704628,  0.84327404],
       [ 0.5547002 , -0.83205029,  0.        ],
       [ 0.        , -0.70710678,  0.70710678]])
>>> pp.normalize(X_train, norm='l1')  # Regularize using the l1 norm; each row has unit l1 norm
array([[ 0.07142857, -0.35714286,  0.57142857],
       [ 0.4       , -0.6       ,  0.        ],
       [ 0.        , -0.5       ,  0.5       ]])
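To verify the meaning of the two norms, the row norms can be recomputed with NumPy; the short sketch below simply re-runs normalize() on the same X_train and checks the rows.
>>> import numpy as np
>>> from sklearn import preprocessing as pp
>>> X_train = np.array([[ 1., -5., 8.], [ 2., -3., 0.], [ 0., -1., 1.]])
>>> l2_rows = pp.normalize(X_train)              # default l2 regularization
>>> np.sqrt((l2_rows ** 2).sum(axis=1))          # the l2 norm of every row is 1
array([1., 1., 1.])
>>> l1_rows = pp.normalize(X_train, norm='l1')   # l1 regularization
>>> np.abs(l1_rows).sum(axis=1)                  # the sum of absolute values of every row is 1
array([1., 1., 1.])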
4 Discretization
Discretization is the division of continuous features into discrete values; the most typical application is the binarization of grayscale images. Dividing the range of a continuous feature into several intervals (bins) is called K-bins discretization. Scikit-learn's preprocessing submodule provides the Binarizer class and the KBinsDiscretizer class: the former is used for binarization and the latter for K-bins discretization.
>>> import numpy as np
>>> from sklearn import preprocessing as pp
>>> X = np.array([[-2, 5, 11], [7, -1, 9], [4, 3, 7]])
>>> bina = pp.Binarizer(threshold=5)  # Specify a binarization threshold of 5
>>> bina.transform(X)
array([[0, 0, 1],
       [1, 0, 1],
       [0, 0, 1]])
>>> est = pp.KBinsDiscretizer(n_bins=[2, 2, 3], encode='ordinal').fit(X)
>>> est.transform(X)  # The three feature columns are discretized into 2, 2, and 3 bins
array([[0., 1., 2.],
       [1., 0., 1.],
       [1., 1., 0.]])
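The bin boundaries chosen by KBinsDiscretizer can be inspected after fitting through its bin_edges_ and n_bins_ attributes, which makes the ordinal codes above easier to interpret; the short sketch below simply continues the session above (by default the bin edges are quantile-based).
>>> edges = est.bin_edges_   # one array of bin boundaries per feature column
>>> counts = est.n_bins_     # the number of bins actually used for each column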
5 Whitening
The term whitening is a translation of the English word "whitening", and its effect is hard to grasp from the name alone. Data whitening has two purposes: one is to remove or reduce the correlation between the feature columns, and the other is to make the variance of each feature column equal to 1. Clearly, the first goal corresponds to principal component analysis (PCA), which reduces dimensionality and eliminates the feature dimensions with smaller variance; the second goal corresponds to standardization.
There are two types of whitening: PCA whitening and ZCA whitening. PCA whitening transforms the feature dimensions of the original data onto the principal component axes, which eliminates the correlation between the features and makes the variance of each principal component equal to 1. ZCA whitening rotates the result of PCA whitening back onto the feature axes of the original data; the ZCA whitening process usually does not reduce dimensionality.
Scikit-learn does not provide a dedicated whitening method, but PCA whitening can easily be implemented with the PCA class provided by the decomposition submodule. whiten is a parameter of the PCA class that controls whether the linear correlation between features is removed; its default value is False.
Suppose a girl has a pile of dating profiles at hand, and each candidate's information consists of several feature items such as age, height, weight, annual salary, number of properties, and number of cars. The whitening operation produces a dataset with fewer feature dimensions in which the differences between samples can be compared directly.
>>> import numpy as np
>>> from sklearn import preprocessing as pp
>>> from sklearn.decomposition import PCA
>>> ds = np.array([[25, 1.85, 70, 50, 2, 1],
...                [22, 1.78, 72, 22, 0, 1],
...                [26, 1.80, 85, 25, 1, 0],
...                [28, 1.70, 82, 100, 5, 2]])  # 4 samples, 6 feature columns
>>> m = PCA(whiten=True)  # Instantiate the principal component analysis class with whitening enabled
>>> m.fit(ds)  # Perform principal component analysis
PCA(whiten=True)
>>> d = m.transform(ds)  # Return the principal component analysis results
>>> d  # The feature columns are reduced from 6 to 4
array([[ 0.01001541, -0.99099492, -1.12597902, -0.03748764],
       [-0.76359767, -0.5681715 ,  1.15935316,  0.67477757],
       [-0.65589352,  1.26928222, -0.45686577, -1.8639689 ],
       [ 1.40947578,  0.28988421,  0.42349164,  1.2724972 ]])
>>> d.std(axis=0)  # Display the standard deviation of each feature column
array([0.8660254 , 0.8660254 , 0.8660254 , 1.17790433])
>>> d = pp.scale(d)  # Standardize
>>> d.std(axis=0)  # After standardization, each feature column has standard deviation 1
array([1., 1., 1., 1.])
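As a sanity check on what whiten=True does, the covariance matrix of PCA-whitened data should be close to the identity. Because the 4-sample example above is very small, the sketch below uses a larger assumed random dataset (the mixing matrix and seed are arbitrary choices for illustration).
>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> rng = np.random.default_rng(0)
>>> X = rng.normal(size=(200, 3)) @ np.array([[1., 2., 0.], [0., 1., 3.], [1., 0., 1.]])  # correlated features
>>> Xw = PCA(whiten=True).fit_transform(X)  # PCA whitening
>>> print(np.allclose(np.cov(Xw, rowvar=False), np.eye(3), atol=1e-6))  # covariance is ~ the identity
True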
Someone on GitHub has provided code for ZCA whitening; if needed, visit the mwv/zca repository on GitHub.
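Before reaching for an external implementation, the idea itself fits in a few lines of NumPy. The sketch below is an illustrative implementation under common conventions (not the code from that repository): center the data, eigendecompose the covariance matrix, rescale by the inverse square roots of the eigenvalues, and rotate back to the original feature axes; the small eps is a regularization term commonly added for numerical stability.
import numpy as np

def zca_whiten(X, eps=1e-5):
    """A minimal ZCA whitening sketch: decorrelate the feature columns while
    staying on the original feature axes (no dimensionality reduction)."""
    Xc = X - X.mean(axis=0)                  # center every feature column
    cov = np.cov(Xc, rowvar=False)           # covariance matrix of the features
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigendecomposition of the symmetric covariance
    # rotate to the principal axes, rescale, then rotate back (the "ZCA" step)
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return Xc @ W

X = np.random.rand(100, 5)                   # assumed data: 100 samples, 5 feature columns
X_white = zca_whiten(X)
# after whitening, the covariance matrix is close to the identity
print(np.allclose(np.cov(X_white, rowvar=False), np.eye(5), atol=1e-3))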
The above is a detailed discussion of standardization, normalization, regularization, discretization, and whitening for machine learning in Python. For more on machine learning with Python, please see my other related articles!