Data preprocessing is an important step before data analysis or machine learning training.
Through data preprocessing, we can:
- Improve data quality: handle missing values, outliers, and duplicates to increase the accuracy and reliability of the data.
- Integrate heterogeneous data: data may come from a variety of sources and structures, and is integrated into a single dataset prior to analysis and training.
- Improve data performance: algorithms can be made more efficient by transforming or reducing data values (e.g., removing dimensional differences).
This article introduces data scaling. Its main purpose is to eliminate the differences in magnitude between different features of the data, so that each feature has the same range of values. This prevents certain features from having an excessive influence on the model, which improves model performance.
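To make this concrete, here is a minimal sketch with invented numbers (the income and age values and their ranges are made up for illustration): before scaling, the feature with the larger magnitude dominates a simple distance comparison; after bringing both features to the same range, each feature contributes according to how different it really is.

```python
import numpy as np

# Invented samples: feature 0 is annual income, feature 1 is age
a = np.array([50000.0, 25.0])
b = np.array([52000.0, 58.0])

# Without scaling, the income difference (2000) swamps the age difference (33)
# in the Euclidean distance, so income alone decides how "similar" a and b look
print(np.linalg.norm(a - b))   # ~2000.27, almost entirely due to income

# After mapping both features to [0, 1] (assumed ranges: income 40000-80000,
# age 20-60), the large age difference is no longer drowned out by income
mins = np.array([40000.0, 20.0])
ranges = np.array([80000.0 - 40000.0, 60.0 - 20.0])
a_scaled = (a - mins) / ranges
b_scaled = (b - mins) / ranges
print(np.linalg.norm(a_scaled - b_scaled))   # ~0.83, dominated by the age gap
```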
1. Principles
There are several ways to scale data; Min-Max Scaling is one of the most commonly used.
The main steps are as follows:
- Compute the minimum value (min) and the maximum value (max) of the data column.
- Apply min-max scaling to each value in the column, i.e., convert it to a value within the **[0, 1] interval**.
The scaling formula is: `new_data = (data - min) / (max - min)`
The code to implement scaling is as follows:
```python
import numpy as np

# Implement the principle of data scaling by hand
data = np.array([10, 20, 30, 40, 50])
data_min = np.min(data)
data_max = np.max(data)
data_new = (data - data_min) / (data_max - data_min)
print("before scaling: {}".format(data))
print("after scaling: {}".format(data_new))

# Running results
# before scaling: [10 20 30 40 50]
# after scaling: [0.   0.25 0.5  0.75 1.  ]
```
Values are scaled to within the **[0,1] interval**.
This example only demonstrates the scaling process; in real scenarios it is better to use the functions in the scikit-learn library.
The `minmax_scale` function in scikit-learn is a ready-made data scaling function.
```python
import numpy as np
from sklearn import preprocessing as pp

data = np.array([10, 20, 30, 40, 50])
pp.minmax_scale(data, feature_range=(0, 1))

# Running results
# array([0.  , 0.25, 0.5 , 0.75, 1.  ])
```
Using the `minmax_scale` function from scikit-learn gives the same result: the data is also compressed into the **[0, 1] interval**.
This is why data scaling is sometimes referred to as normalization.
However, data scaling does not have to compress the data into the **[0, 1] interval**.
By adjusting the `feature_range` parameter, you can compress the data into any interval.
```python
# Compressed to [0, 1]
print(pp.minmax_scale(data, feature_range=(0, 1)))

# Compressed to [-1, 1]
print(pp.minmax_scale(data, feature_range=(-1, 1)))

# Compressed to [0, 5]
print(pp.minmax_scale(data, feature_range=(0, 5)))

# Running results
# [0.   0.25 0.5  0.75 1.  ]
# [-1.  -0.5  0.   0.5  1. ]
# [0.   1.25 2.5  3.75 5.  ]
```
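In real use the data is usually a two-dimensional feature matrix rather than a single column. By default, `minmax_scale` scales each column independently; the small matrix below is made-up example data:

```python
import numpy as np
from sklearn import preprocessing as pp

# Made-up feature matrix: each row is a sample, each column a feature
X = np.array([[10, 1000],
              [20, 3000],
              [30, 5000]], dtype=float)

# axis=0 (the default) scales every column to [0, 1] on its own
print(pp.minmax_scale(X, feature_range=(0, 1)))
# [[0.  0. ]
#  [0.5 0.5]
#  [1.  1. ]]
```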
2. Role
The main roles of data scaling are:
2.1 Harmonization of data scales
Through scaling, data of different magnitudes, scales, and units is converted to a unified scale.
This avoids distorted or misleading analysis results caused by inconsistent data scales.
2.2 Enhancing data comparability
Through scaling, converting data of different magnitudes, scales, and units to a unified scale makes comparisons between different data easier and more meaningful.
For example, when evaluating the performance of multiple samples, comparing values measured with different scales and units can lead to inaccurate or even misleading results.
This effect can be eliminated after a uniform scaling process, making the comparison more accurate and reliable.
2.3 Enhancing data stability
Through scaling, the range of the data values is adjusted to a relatively small interval.
This increases the numerical stability of the data and avoids analysis or computation errors caused by value ranges that are too large or too small.
2.4 Improving algorithmic efficiency and accuracy
Through scaling, the efficiency and accuracy of some algorithms can be improved.
For example, in neural network algorithms, if the scale of the input data is too large or too small, training can take too long to converge, and the accuracy and stability of the algorithm are also affected.
After scaling, both the training time and the accuracy of the algorithm can be improved.
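As a hedged sketch of this point (the dataset and model choice here are my own, not from the original text), scaling is commonly placed in front of the model in a scikit-learn Pipeline, so the same transformation is applied during training and prediction:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import MLPClassifier

# Example dataset whose features have very different magnitudes
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pipeline: scale every feature to [0, 1], then train a small neural network
model = make_pipeline(MinMaxScaler(), MLPClassifier(max_iter=1000, random_state=0))
model.fit(X_train, y_train)
print("accuracy with scaling:", model.score(X_test, y_test))

# The same network trained on unscaled data typically converges more slowly
# and may reach lower accuracy (results vary by dataset and hyperparameters)
raw_model = MLPClassifier(max_iter=1000, random_state=0)
raw_model.fit(X_train, y_train)
print("accuracy without scaling:", raw_model.score(X_test, y_test))
```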
3. Summary
In the scikit-learn library, Min-Max Scaling is not the only way to handle data scaling:
you can also use `StandardScaler` for standardization, or `RobustScaler` for scaling and centering that is robust to outliers, and so on.
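For reference, here is a minimal sketch of these two scalers on a made-up column (unlike `minmax_scale`, the scaler classes expect a 2-D array):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# Made-up column, shaped (n_samples, n_features)
data = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])

# StandardScaler: subtract the mean, divide by the standard deviation
print(StandardScaler().fit_transform(data).ravel())
# -> roughly [-1.41, -0.71, 0., 0.71, 1.41] (mean 0, unit variance)

# RobustScaler: subtract the median, divide by the interquartile range (IQR)
print(RobustScaler().fit_transform(data).ravel())
# -> [-1.  -0.5  0.   0.5  1. ]
```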
When performing data scaling, note that the scaling process is very sensitive to outliers.
If there are very large or very small outliers, they can distort the scaled data.
So, before scaling, it is a good idea to filter out the outliers.
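A minimal sketch of how a single outlier squeezes the min-max scaled values, and a simple (hypothetical) IQR-based filter applied before scaling; the numbers are invented:

```python
import numpy as np
from sklearn import preprocessing as pp

# Invented data with one extreme outlier
data = np.array([10, 20, 30, 40, 50, 10000])

# The outlier becomes 1 and squeezes every normal value close to 0
print(pp.minmax_scale(data, feature_range=(0, 1)))

# A simple filter: drop values far outside the interquartile range
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
mask = (data >= q1 - 1.5 * iqr) & (data <= q3 + 1.5 * iqr)

# After filtering, the remaining values spread evenly across [0, 1]
print(pp.minmax_scale(data[mask], feature_range=(0, 1)))
```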
This concludes this article on data scaling in Python sklearn data preprocessing. For more related sklearn data preprocessing content, please search my earlier articles or continue browsing the related articles below, and I hope you will keep supporting me!