
Python sklearn data preprocessing: data scaling in detail

Data preprocessing is an important step before data analysis or machine learning training.

Through data preprocessing, you can:

  • Improve data quality: handle missing values, outliers, and duplicates to increase the accuracy and reliability of the data.
  • Integrate different data: data may come from a variety of sources and structures; it is integrated into a single dataset before analysis and training.
  • Improve algorithm performance: algorithms can be made more efficient by transforming or reducing data values (e.g., making them dimensionless).

This article introduces data scaling. Its main purpose is to eliminate differences in magnitude between different features of the data, so that each feature has the same range of values. This avoids certain features having an excessive influence on the model, which improves model performance.

1. Principles

There are several ways to scale data; of these, the Min-Max Scaling algorithm is the most commonly used.
The main steps are as follows:

  • Calculate the minimum value (min) and maximum value (max) of the data column.
  • Apply min-max scaling to each value in the column, i.e. convert it to a value within the **[0, 1] interval**.

The scaling formula is: new_data = (data - min) / (max - min)

The code to implement scaling is as follows:

# Principles of data scaling implementation
import numpy as np

data = np.array([10, 20, 30, 40, 50])
data_min = data.min()
data_max = data.max()
data_new = (data - data_min) / (data_max - data_min)
print("before scaling: {}".format(data))
print("after scaling: {}".format(data_new))
# Running results
before scaling: [10 20 30 40 50]
after scaling: [0.   0.25 0.5  0.75 1.  ]

Values are scaled to within the **[0,1] interval**.

This example only demonstrates the scaling process; in real scenarios it is better to use the functions in the scikit-learn library.

The minmax_scale function in scikit-learn is a ready-made data scaling function.

import numpy as np
from sklearn import preprocessing as pp

data = np.array([10, 20, 30, 40, 50])
pp.minmax_scale(data, feature_range=(0, 1))
# Running results
array([0.  , 0.25, 0.5 , 0.75, 1.  ])

Using the minmax_scale function in scikit-learn gives the same result; the data is likewise compressed into the **[0, 1] interval**.

This is why the data scaling operation is sometimes referred to as normalization.

However, data scaling does not have to compress the data into the **[0, 1] interval**.

By adjusting the feature_range parameter, you can compress the data into an arbitrary interval.

# Compressed to [0, 1]
print(pp.minmax_scale(data, feature_range=(0, 1)))
# Compressed to [-1, 1]
print(pp.minmax_scale(data, feature_range=(-1, 1)))
# Compressed to [0, 5]
print(pp.minmax_scale(data, feature_range=(0, 5)))
# Running results
[0.   0.25 0.5  0.75 1.  ]
[-1.  -0.5  0.   0.5  1. ]
[0.   1.25 2.5  3.75 5.  ]
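
As a minimal sketch (my own illustration, not part of the original article), the feature_range mapping is equivalent to first scaling to [0, 1] and then stretching and shifting to the target interval [a, b]:

import numpy as np

data = np.array([10, 20, 30, 40, 50])
a, b = -1, 1  # target interval, same as feature_range=(-1, 1)

# Scale to [0, 1] first, then stretch and shift to [a, b]
scaled_01 = (data - data.min()) / (data.max() - data.min())
scaled_ab = scaled_01 * (b - a) + a
print(scaled_ab)
# [-1.  -0.5  0.   0.5  1. ]  -- matches minmax_scale(data, feature_range=(-1, 1))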

2. Role

The main roles of data scaling are:

2.1 Harmonization of data scales

Through scaling, data of different magnitudes, scales, and units is converted into a unified scale.

This avoids distorted or misleading analysis results caused by inconsistent data scales, as the example below shows.
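
As a small hedged example (the two-feature dataset below is made up for illustration), a column measured in the thousands and a column measured in the tens end up on the same [0, 1] scale after min-max scaling, since minmax_scale works column by column:

import numpy as np
from sklearn import preprocessing as pp

# Hypothetical data: column 0 is in the thousands, column 1 is in the tens
X = np.array([[3000, 25],
              [5000, 35],
              [8000, 45],
              [10000, 60]])

# Each column is scaled independently, so both features land in [0, 1]
print(pp.minmax_scale(X, feature_range=(0, 1)))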

2.2 Enhancing data comparability

Through scaling, converting data of different magnitudes, scales, and units into a unified scale makes comparisons between different data easier and more meaningful.

For example, when evaluating the performance of multiple samples, if different measures, scales, and units are used for comparison, it can lead to inaccurate or even misleading comparison results.

This effect can be eliminated after a uniform scaling process, making the comparison more accurate and reliable.

2.3 Enhancing data stability

Through scaling, the range of the data is adjusted to a relatively small interval.

This increases the stability of the data and avoids analytical or computational errors caused by a data distribution whose range is too large or too small.

2.4 Improving algorithmic efficiency and accuracy

Through scaling, the efficiency and accuracy of some algorithms can be improved.

For example, in neural network algorithms, if the scale of the input data is too large or too small, training can take too long to converge, and the accuracy and stability of the algorithm can also suffer.

After scaling, both the training time and the accuracy of the algorithm can be improved.
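
As a hedged sketch (the dataset and model below are made up for illustration, not from the original article), a common pattern is to put the scaler and the model into a scikit-learn Pipeline, so that the features are scaled before they reach the network:

from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data, purely for demonstration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Scale features to [0, 1] before they reach the neural network
model = make_pipeline(MinMaxScaler(feature_range=(0, 1)),
                      MLPClassifier(max_iter=1000, random_state=0))
model.fit(X, y)
print(model.score(X, y))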

3. Summary

In the scikit-learn library, the Min-Max Scaling shown above is not the only way to handle data scaling.

You can also use StandardScaler for standardization, or RobustScaler for scaling and centering based on robust statistics, and so on.
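
A minimal sketch of these two scalers on the same toy data used above (reshaped to a column, because the scaler classes expect 2D input):

import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

data = np.array([10, 20, 30, 40, 50]).reshape(-1, 1)  # scaler classes expect a 2D array

# StandardScaler: zero mean, unit variance
print(StandardScaler().fit_transform(data).ravel())

# RobustScaler: centers on the median and scales by the interquartile range,
# so it is less sensitive to outliers
print(RobustScaler().fit_transform(data).ravel())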

When performing data scaling, one point to note is that the scaling process is very sensitive to outliers.

If there are very large or very small outliers, the original structure of the data can be destroyed: the normal values end up squeezed into a very narrow range.

So, before scaling, it is a good idea to filter out the outliers.
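
A small sketch of the problem (the values are made up): a single extreme value squeezes all the other min-max scaled values toward 0.

from sklearn import preprocessing as pp

clean = [10, 20, 30, 40, 50]
with_outlier = [10, 20, 30, 40, 50, 5000]  # one extreme outlier

print(pp.minmax_scale(clean))         # spread evenly over [0, 1]
print(pp.minmax_scale(with_outlier))  # normal values crowd near 0, only the outlier sits at 1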

This concludes this article on data scaling in Python sklearn data preprocessing. For more on sklearn data preprocessing, please search my previous articles or continue browsing the related articles below. I hope you will continue to support me!