
Python data preprocessing: handling uneven sample distributions (oversampling and undersampling)

What is an uneven sample distribution:

Sample distribution imbalance means that the number of samples per class differs greatly. For example, if a dataset contains 1000 samples in total but one class accounts for only 10 of them, those 10 samples cannot possibly cover the full range of feature values for that class no matter how they are chosen; this is a case of severe sample distribution imbalance.

Why it is important to address the uneven distribution of samples:

Datasets with imbalanced classes are common in practice: for example, malicious order brushing, scalper orders, credit card fraud, electricity theft, equipment failure, and customer churn at large enterprises.

Sample imbalance means the minority class contains too few examples for its patterns to be learned reliably. Even if a classification model is obtained, it tends to over-rely on the limited number of minority samples and overfit; when the model is applied to new data, its accuracy and robustness will be poor.

Solutions to Uneven Sample Distribution:

Oversampling: balance is achieved by increasing the number of samples in the minority classes. The most direct method is simply to replicate the minority-class data; the drawback is that if the features are few, this easily leads to overfitting. Improved oversampling methods generate new synthetic samples for the minority classes by adding random noise or small perturbations to the data, or by following certain rules.
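For the simple-replication variant described above, a minimal NumPy sketch (toy arrays, not the dataset used later in this article) looks like this:

# Sketch: oversampling by simple replication (illustrative only)
import numpy as np
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 0, 0, 1])          # class 1 is the minority
minority_idx = np.flatnonzero(y == 1)
# Draw minority rows with replacement until the minority matches the majority size
extra_idx = np.random.choice(minority_idx, size=(y == 0).sum() - minority_idx.size, replace=True)
X_balanced = np.vstack([X, X[extra_idx]])
y_balanced = np.concatenate([y, y[extra_idx]])
# Both classes now contain 5 samples each; the new rows are exact copies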

Undersampling: balance is achieved by reducing the number of majority-class samples. The most direct way is to randomly remove some majority-class samples to shrink the majority class; the drawback is that some important information in the majority class may be lost.
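The random-removal variant can be sketched the same way, again with toy arrays:

# Sketch: undersampling by randomly dropping majority samples (illustrative only)
import numpy as np
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 0, 0, 1])          # class 0 is the majority
majority_idx = np.flatnonzero(y == 0)
keep_majority = np.random.choice(majority_idx, size=(y == 1).sum(), replace=False)
keep_idx = np.concatenate([keep_majority, np.flatnonzero(y == 1)])
X_balanced, y_balanced = X[keep_idx], y[keep_idx]
# One sample per class remains; the dropped majority rows are simply lost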

Setting weights: assign different weights to classes with different sample sizes (usually set inversely proportional to the class size).
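As a sketch of the "inversely proportional" rule: the weights can be computed directly with NumPy, mirroring the heuristic scikit-learn uses for class_weight='balanced' (n_samples / (n_classes * class_count)); the labels here are toy data matching the class counts used later in this article:

# Sketch: class weights inversely proportional to class frequency
import numpy as np
y = np.array([0] * 305 + [1] * 163 + [2] * 2532)
n_samples, n_classes = len(y), len(np.unique(y))
weights = n_samples / (n_classes * np.bincount(y))
class_weight = dict(enumerate(weights))
# {0: ~3.28, 1: ~6.13, 2: ~0.39} -- the rare classes receive the large weights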

Ensemble methods: each training set is built from all of the minority-class samples plus an equal-sized random draw from the majority class, so repeating this many times yields many training sets and many trained models. At prediction time the models are combined (e.g., by voting or weighted voting) to produce the final classification. This approach is similar to a random forest. Its drawback is that it is time-consuming and eats up computational resources.
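A rough sketch of this idea using plain scikit-learn (illustrative only; names such as X_toy are placeholders, and the imbalanced-learn ensemble classes shown later package the same steps):

# Sketch: ensemble of balanced subsets combined by majority voting
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
X_toy, y_toy = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
min_idx = np.flatnonzero(y_toy == 1)
maj_idx = np.flatnonzero(y_toy == 0)
models = []
for seed in range(10):                                    # 10 balanced training sets
    rng = np.random.RandomState(seed)
    sub_maj = rng.choice(maj_idx, size=min_idx.size, replace=False)
    idx = np.concatenate([min_idx, sub_maj])              # all minority + random majority subset
    models.append(LogisticRegression(max_iter=500).fit(X_toy[idx], y_toy[idx]))
votes = np.mean([m.predict(X_toy) for m in models], axis=0)  # fraction of models voting 1
y_pred = (votes >= 0.5).astype(int)                          # simple majority vote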

Python code:

# Generate unbalanced categorical datasets
from collections import Counter
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=3000, n_features=2, n_informative=2,
              n_redundant=0, n_repeated=0, n_classes=3,
              n_clusters_per_class=1,
              weights=[0.1, 0.05, 0.85],
              class_sep=0.8, random_state=2018)
Counter(y)
# Counter({2: 2532, 1: 163, 0: 305})

# Use RandomOverSampler to randomly sample from a small number of classes to add new samples to equalize the categories.
from imblearn.over_sampling import RandomOverSampler
 
ros = RandomOverSampler(random_state=0)
X_resampled, y_resampled = ros.fit_sample(X, y)
sorted(Counter(y_resampled).items())
# [(0, 2532), (1, 2532), (2, 2532)]
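# A practical note (a minimal sketch, assuming scikit-learn is available, reusing X, y and ros from above):
# resampling is normally applied only to the training split, so the test set keeps the real class distribution.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
X_train_res, y_train_res = ros.fit_sample(X_train, y_train)  # oversample the training data only
clf = LogisticRegression().fit(X_train_res, y_train_res)
clf.score(X_test, y_test)  # evaluated on the untouched, imbalanced test set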

# SMOTE: For a minority sample a, randomly select a nearest-neighbor sample b, and then randomly select a point c from the line connecting a and b as the new minority sample.
from imblearn.over_sampling import SMOTE
 
X_resampled_smote, y_resampled_smote = SMOTE().fit_sample(X, y)
 
sorted(Counter(y_resampled_smote).items())
# [(0, 2532), (1, 2532), (2, 2532)]

# ADASYN: Focuses on generating new minority class samples near those original samples that were misclassified based on the K nearest neighbor classifier
from imblearn.over_sampling import ADASYN

X_resampled_adasyn, y_resampled_adasyn = ADASYN().fit_sample(X, y)
 
sorted(Counter(y_resampled_adasyn).items())
# [(0, 2522), (1, 2520), (2, 2532)]

# The RandomUnderSampler function is a fast and very simple way to balance the data across classes: it randomly selects a subset of the majority-class data.
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=0)
X_resampled, y_resampled = rus.fit_sample(X, y)
 
sorted(Counter(y_resampled).items())
# [(0, 163), (1, 163), (2, 163)]

# With the SMOTE method above, oversampling samples that lie near the class boundary easily generates noisy data, so the samples need to be cleaned after oversampling.
# The TomekLinks and EditedNearestNeighbours methods fulfil this requirement; SMOTEENN and SMOTETomek combine them with SMOTE.
from imblearn.combine import SMOTEENN
smote_enn = SMOTEENN(random_state=0)
X_resampled, y_resampled = smote_enn.fit_sample(X, y)
 
sorted(Counter(y_resampled).items())
# [(0, 2111), (1, 2099), (2, 1893)]

from imblearn.combine import SMOTETomek
smote_tomek = SMOTETomek(random_state=0)
X_resampled, y_resampled = smote_tomek.fit_sample(X, y)
 
sorted(Counter(y_resampled).items())
# [(0, 2412), (1, 2414), (2, 2396)]

# Use SVM's class-weight adjustment to handle unbalanced samples; 'balanced' means the weights are set inversely proportional to the number of samples in each class.
from sklearn.svm import SVC
svm_model = SVC(class_weight='balanced')
svm_model.fit(X, y)

# EasyEnsemble balances the data by repeatedly random-undersampling the original dataset, producing several balanced subsets.
# EasyEnsemble has two important parameters: (i) n_subsets controls the number of subsets and (ii) replacement determines whether the sampling is done with or without replacement.
from imblearn.ensemble import EasyEnsemble
ee = EasyEnsemble(random_state=0, n_subsets=10)
X_resampled, y_resampled = ee.fit_sample(X, y)
sorted(Counter(y_resampled[0]).items())
# [(0, 163), (1, 163), (2, 163)]

# The BalanceCascade method uses a classifier (the estimator parameter) to ensure that misclassified samples are included in the next subset. The n_max_subset parameter controls the number of subsets, and bootstrap sampling can be enabled by setting bootstrap=True.
from imblearn.ensemble import BalanceCascade
from sklearn.linear_model import LogisticRegression
bc = BalanceCascade(random_state=0,
          estimator=LogisticRegression(random_state=0),
          n_max_subset=4)
X_resampled, y_resampled = bc.fit_sample(X, y)
 
sorted(Counter(y_resampled[0]).items())
# [(0, 163), (1, 163), (2, 163)]

# The BalancedBaggingClassifier resamples each subset before training each base estimator. In short, it combines the EasyEnsemble sampler with a classifier such as BaggingClassifier.
from sklearn.tree import DecisionTreeClassifier
from imblearn.ensemble import BalancedBaggingClassifier
bbc = BalancedBaggingClassifier(base_estimator=DecisionTreeClassifier(),
                ratio='auto',
                replacement=False,
                random_state=0)
bbc.fit(X, y)
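
# A short usage sketch: the fitted classifier behaves like any scikit-learn estimator;
# per-class metrics are more informative than plain accuracy on imbalanced data.
from sklearn.metrics import classification_report
y_pred = bbc.predict(X)
print(classification_report(y, y_pred))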

That is all I have to share about Python data preprocessing and handling uneven sample distributions (oversampling and undersampling). I hope it gives you a useful reference, and I hope you will continue to support this site.