Overview of sampling methods
Random sampling - suitable when the population is small
Each sampling unit has the same probability of being drawn, and the procedure is reproducible.
Random sampling is typically used when the population is small; its main characteristic is that individuals are drawn from the population one at a time.
1. Drawing of lots
2. Random number method: random number table, random number dice, or computer-generated random numbers (a short sketch of this follows).
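As a minimal sketch of the random number method, computer-generated random numbers can be used to draw a simple random sample with Python's standard library; the population and sample size below are made up for illustration.

import random

# Hypothetical population of 10 individuals, identified by their IDs
population = list(range(1, 11))

# Draw 4 individuals without replacement using computer-generated random numbers
random.seed(0)                        # fix the seed so the draw is reproducible
sample = random.sample(population, k=4)
print(sample)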
Stratified sampling - the population contains distinct subgroups whose differences affect the results
Stratified sampling divides the whole population into disjoint strata, then draws a certain number of individuals independently from each stratum according to a certain proportion; the individuals taken from all strata are combined into one sample. The smaller the within-stratum variation and the larger the between-stratum variation, the better.
After stratification, simple random sampling is carried out within each stratum; the number of individuals drawn from each stratum can be allocated in three general ways (a short sketch of the three allocations follows this list):
(1) Equal allocation: the same number of individuals is drawn from every stratum;
(2) Proportional allocation: the ratio of the number of individuals drawn from each stratum to the number of individuals in that stratum is the same for all strata;
(3) Optimal allocation: the ratio of the number of samples drawn from each stratum to the total number of samples drawn equals the ratio of that stratum's variance to the sum of the variances across all strata.
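A rough sketch of the three allocation rules; the stratum sizes, standard deviations and total sample size are made-up values, and the "optimal" allocation is computed exactly as stated in the text (the classical Neyman allocation additionally weights each stratum by its size).

import numpy as np

# Made-up strata: sizes and within-stratum standard deviations (illustrative only)
N_h = np.array([500, 300, 200])   # stratum sizes
S_h = np.array([4.0, 9.0, 2.0])   # within-stratum standard deviations
n = 100                           # total sample size

# (1) Equal allocation: the same number from every stratum
equal = np.full(len(N_h), n // len(N_h))

# (2) Proportional allocation: sample sizes proportional to stratum sizes
proportional = np.round(n * N_h / N_h.sum()).astype(int)

# (3) "Optimal" allocation as stated above: proportional to the stratum variances
#     (the classical Neyman allocation uses n_h proportional to N_h * S_h)
var_h = S_h ** 2
optimal = np.round(n * var_h / var_h.sum()).astype(int)

print(equal, proportional, optimal)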
import pandas as pd

def typicalSampling(group, typicalFracDict):
    # Sample each stratum with the fraction specified for its group name
    name = group.name
    frac = typicalFracDict[name]
    return group.sample(frac=frac)

def group_sample(data_set, label, typicalFracDict):
    # Stratified sampling
    # data_set: data set
    # label: name of the stratification variable
    # typicalFracDict: sampling proportion for each category
    result = data_set.groupby(label, group_keys=False).apply(typicalSampling, typicalFracDict)
    return result

data_set = pd.DataFrame({
    'id': [3566841, 6541227, 3512441, 3512441, 3512441, 3512441, 3512441, 3512441, 3512441, 3512441],
    'sex': ['male', 'Female', 'Female', 'male', 'Female', 'Female', 'male', 'Female', 'male', 'Female'],
    'level': ['high', 'low', 'middle', 'high', 'low', 'middle', 'high', 'low', 'middle', 'middle'],
})
label = 'sex'
typicalFracDict = {'male': 0.8, 'Female': 0.2}
result = group_sample(data_set, label, typicalFracDict)
print(result)
Cluster sampling
Cluster sampling, also known as whole-group sampling, is a method in which the units of the population are grouped into a number of non-crossing, non-repeating sets called clusters; the sample is then drawn using the clusters as sampling units.
When applying cluster sampling, the clusters must be representative, i.e., the differences between units within a cluster should be large and the differences between clusters should be small.
Implementation steps
The population is first divided into i clusters, then a number of clusters are drawn at random from these i clusters, and all individuals or units within the drawn clusters are surveyed. The sampling process can be divided into the following steps:
(1) Determine the cluster labels.
(2) Divide the population (N) into a number of non-overlapping parts, each of which is a cluster.
(3) Determine the number of clusters to draw according to the required sample size.
(4) Draw the determined number of clusters from the i clusters using simple random sampling or systematic sampling (a short sketch of these steps follows).
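A minimal sketch of these steps with pandas, using a made-up population in which each record carries a cluster label.

import random
import pandas as pd

# Hypothetical population: each record carries a cluster label (e.g. a class or a village)
df = pd.DataFrame({
    'cluster': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
    'value':   [1, 2, 3, 4, 5, 6, 7, 8],
})

random.seed(0)
clusters = df['cluster'].unique().tolist()            # (1)-(2) the non-overlapping clusters
n_clusters_to_draw = 2                                # (3) number of clusters to draw
chosen = random.sample(clusters, n_clusters_to_draw)  # (4) simple random sampling of clusters

# All individuals inside the chosen clusters form the sample
sample = df[df['cluster'].isin(chosen)]
print(sample)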
Systematic sampling - suitable when the population is large
Systematic sampling is also known as mechanical sampling or equal-interval sampling. When the number of individuals in the population is large, simple random sampling becomes laborious. In this case, the population can be divided into a number of balanced parts, and then, according to a pre-determined rule, one individual is drawn from each part to obtain the required sample; this kind of sampling is called systematic sampling.
import random

def RandomSampling(dataMat, number):
    # RandomSampling is not defined in the original snippet;
    # a simple random draw without replacement is assumed here as the fallback
    return random.sample(list(dataMat), number)

def SystematicSampling(dataMat, number):
    length = len(dataMat)
    k = int(length / number)            # sampling interval
    sample = []
    i = 0
    if k > 0:
        while len(sample) != number:
            sample.append(dataMat[i * k])
            i += 1
        return sample
    else:
        return RandomSampling(dataMat, number)
Oversampling
1、RandomOverSampler
Principle: Randomly draw samples (with replacement) from the minority class and add the drawn samples back to the dataset.
Cons: Repeated sampling often leads to severe overfitting
The mainstream oversampling approach is to artificially synthesize minority class samples in some way to achieve class balance, and the classic representative of this family is SMOTE.
from collections import Counter
from imblearn.over_sampling import RandomOverSampler

# X, y are assumed to be the feature matrix and labels of an imbalanced dataset
ros = RandomOverSampler(sampling_strategy={0: 700, 1: 200, 2: 150}, random_state=0)
X_resampled, y_resampled = ros.fit_resample(X, y)
print(Counter(y_resampled))
2、SMOTE
Principle: Interpolate between minority class samples to generate additional samples. For a minority class sample a, randomly select one of its nearest-neighbour minority samples b, then randomly pick a point c on the line segment joining a and b as a new minority class sample.
Specifically, for a minority class sample x_i, the K-nearest-neighbour method is used (the value of k must be specified in advance) to find the k minority class samples closest to x_i, where distance is defined as the Euclidean distance between samples in the n-dimensional feature space.
Then one of the k nearest neighbours, denoted x̂, is randomly selected and a new sample is generated with the formula x_new = x_i + rand(0, 1) × (x̂ − x_i), where rand(0, 1) is a random number in [0, 1].
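As a minimal numpy sketch of this interpolation step (not the library implementation), with made-up vectors for the minority sample and its chosen neighbour:

import numpy as np

rng = np.random.default_rng(42)

x_i = np.array([1.0, 2.0])        # a minority class sample
x_hat = np.array([3.0, 1.0])      # one of its k nearest minority class neighbours

# The new sample lies on the segment between x_i and x_hat
lam = rng.random()                # random number in [0, 1)
x_new = x_i + lam * (x_hat - x_i)
print(x_new)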
from collections import Counter
from imblearn.over_sampling import SMOTE

smo = SMOTE(sampling_strategy={0: 700, 1: 200, 2: 150}, random_state=42)
X_smo, y_smo = smo.fit_resample(X, y)
print(Counter(y_smo))
SMOTE randomly selects minority class samples to synthesize new samples without considering the surrounding samples, which tends to raise two concerns:
1) The newly synthesized sample will not provide much useful information if the selected minority sample is also surrounded by minority samples.
2) If the selected minority class sample is surrounded by majority class samples (such a sample may itself be noise), the newly synthesized samples will overlap heavily with the surrounding majority class samples, making classification difficult.
Overall we want the newly synthesized minority class samples to be near the boundary of the two classes, which often provides enough information for classification. And this is what the Border-line SMOTE algorithm below will do.
3、BorderlineSMOTE
This algorithm will first divide all the minority samples into three categories as shown below:
- noise: all k nearest-neighbour samples belong to the majority class
- danger: more than half of the k nearest-neighbour samples belong to the majority class
- safe: more than half of the k nearest-neighbour samples belong to the minority class
The Border-line SMOTE algorithm randomly selects only samples in the "danger" state and then generates new samples from them using the SMOTE procedure. Samples in the "danger" state are the minority class samples near the class border, and samples near the border are more likely to be misclassified. Thus Border-line SMOTE synthesizes samples only for those minority samples near the border, whereas SMOTE treats all minority samples equally.
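A rough sketch of the noise/danger/safe categorisation described above (not imblearn's internal code); X and y are assumed to be numpy arrays and minority_label is the minority class value.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def borderline_categories(X, y, minority_label, k=5):
    # Fit k-NN on the whole dataset (k+1 because the first neighbour is the point itself)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X[y == minority_label])
    categories = []
    for neighbours in idx[:, 1:]:                   # drop the point itself
        n_majority = np.sum(y[neighbours] != minority_label)
        if n_majority == k:
            categories.append('noise')              # all neighbours are majority class
        elif n_majority > k / 2:
            categories.append('danger')             # more than half are majority class
        else:
            categories.append('safe')
    return categories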
There are two variants of Borderline SMOTE: Borderline-1 SMOTE and Borderline-2 SMOTE. In Borderline-1 SMOTE, the neighbour x̂ in the synthesis formula above is a minority class sample, whereas in Borderline-2 SMOTE x̂ can be any sample among the k nearest neighbours.
from collections import Counter
from imblearn.over_sampling import BorderlineSMOTE

smo = BorderlineSMOTE(kind='borderline-1', sampling_strategy={0: 700, 1: 200, 2: 150}, random_state=42)
# kind='borderline-2' selects the second variant
X_smo, y_smo = smo.fit_resample(X, y)
print(Counter(y_smo))
4、ADASYN
Principle: Some mechanism is used to automatically determine how many synthetic samples need to be generated for each minority class sample, rather than synthesizing the same number of samples for each minority class sample as SMOTE does. First determine the number of samples that need to be synthesized for the minority samples (which is positively correlated with the number of majority class samples around the minority samples), and then synthesize the samples using SMOTE.
Cons: ADASYN is susceptible to outliers; if the K nearest neighbours of a minority class sample are all majority class samples, its weight becomes very large, and many samples are then generated around it.
from imblearn.over_sampling import ADASYN

ana = ADASYN(sampling_strategy={0: 800, 2: 300, 1: 400}, random_state=0)
X_ana, y_ana = ana.fit_resample(X, y)
The samples synthesized by SMOTE are distributed fairly evenly, while those synthesized by Border-line SMOTE are concentrated at the class boundaries. The characteristic of ADASYN is that the more majority class samples surround a minority class sample, the more samples the algorithm generates for it; most of the generated samples therefore come from minority samples that were originally closer to the majority class.
5、KMeansSMOTE
Principle: Apply KMeans clustering before oversampling using SMOTE.
KMeansSMOTE consists of three steps: clustering, filtering and oversampling. In the clustering step, k-means is used to cluster the data into k groups. The filtering step selects clusters for oversampling, retaining those with a high proportion of minority class samples, and then allocates the number of synthetic samples, assigning more to clusters where minority samples are sparsely distributed. Finally, the oversampling step applies SMOTE within each selected cluster to achieve the target ratio of minority and majority instances.
from collections import Counter
from imblearn.over_sampling import KMeansSMOTE

kms = KMeansSMOTE(sampling_strategy={0: 800, 2: 300, 1: 400}, random_state=42)
X_kms, y_kms = kms.fit_resample(X, y)
print(Counter(y_kms))
6、SMOTENC
A SMOTE variant that can handle datasets containing categorical features.
from imblearn.over_sampling import SMOTENC

# categorical_features gives the column indices of the categorical features
sm = SMOTENC(random_state=42, categorical_features=[18, 19])
7、SVMSMOTE
Principle: A support vector machine classifier is first trained to find the support vectors, and new minority class samples are then synthesized near these support vectors using SMOTE-style interpolation.
from collections import Counter
from imblearn.over_sampling import SVMSMOTE

svmm = SVMSMOTE(sampling_strategy={0: 800, 2: 300, 1: 400}, random_state=42)
X_svmm, y_svmm = svmm.fit_resample(X, y)
print(Counter(y_svmm))
Downsampling (undersampling)
1、RandomUnderSampler(number of undersamples can be controlled)
Principle: Randomly select some of the samples from the majority class to cull out.
Disadvantage: The rejected samples may contain some important information, resulting in a poorly learned model.
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(sampling_strategy={0: 50, 2: 100, 1: 100}, random_state=0)
X_resampled, y_resampled = rus.fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))
2、NearMiss(number of undersamples can be controlled)
Principle: The most representative samples of the majority class are selected for training, mainly to alleviate the information loss problem of random undersampling.
NearMiss uses heuristic rules to select the samples; depending on the rule, it comes in three versions, chosen with the version parameter:
- NearMiss-1: select the majority class samples whose average distance to their K nearest minority class samples is smallest
- NearMiss-2: select the majority class samples whose average distance to their K farthest minority class samples is smallest
- NearMiss-3: for each minority class sample, keep its K nearest majority class samples, so that every minority class sample is surrounded by some majority class samples
NearMiss-1 and NearMiss-2 have a high computational overhead because the K nearest neighbours must be computed for each majority class sample. In addition, NearMiss-1 is susceptible to outliers.
from collections import Counter
from imblearn.under_sampling import NearMiss

nm1 = NearMiss(sampling_strategy={0: 50, 2: 100, 1: 100}, version=1)
X_resampled_nm1, y_resampled = nm1.fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))
3、ClusterCentroids(number of undersamples can be controlled)
Principle: k-means clustering is applied to the samples of each class to be under-sampled, and the cluster centroids are used to replace the original samples of that class.
from collections import Counter
from imblearn.under_sampling import ClusterCentroids

cc = ClusterCentroids(sampling_strategy={0: 700, 1: 100, 2: 90}, random_state=0)
X_resampled, y_resampled = cc.fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))
4、TomekLinks(Data cleansing methodology with no control over the number of undersamples)
Principle: A Tomek Link represents a pair of samples that are closest to each other in terms of distance between different categories, i.e., the two samples are each other's nearest neighbors and belong to different categories. So if two samples form a Tomek Link, either one of them is noise or both samples are near the boundary. By removing the Tomek Link, the overlapping samples can be "cleaned" so that the nearest neighbor samples all belong to the same category and can be better classified.
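As a small sketch of this definition (not imblearn's implementation), Tomek links can be found by checking mutual nearest-neighbour pairs with different labels; X and y are assumed to be numpy arrays.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def tomek_links(X, y):
    # Nearest neighbour of every sample (index 0 is the sample itself, so take index 1)
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    _, idx = nn.kneighbors(X)
    nearest = idx[:, 1]
    links = []
    for i, j in enumerate(nearest):
        # i and j form a Tomek link if they are each other's nearest neighbour
        # and belong to different classes
        if nearest[j] == i and y[i] != y[j] and i < j:
            links.append((i, j))
    return links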
from collections import Counter
from imblearn.under_sampling import TomekLinks

tl = TomekLinks(sampling_strategy='all')
X_resampled, y_resampled = tl.fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))
The sampling_strategy parameter of TomekLinks controls which samples of a Tomek link are removed. By default (sampling_strategy='auto'), only the majority class sample of each link is removed; with sampling_strategy='all', both samples are removed.
5、EditedNearestNeighbours(Data cleansing methodology with no control over the number of undersamples)
Principle: A sample belonging to the majority class is removed if its K nearest neighbours disagree with its label: with kind_sel='mode' it is removed when more than half of the neighbours belong to another class, and with kind_sel='all' it is removed when any neighbour belongs to another class (all neighbours must agree for the sample to be kept).
from collections import Counter
from imblearn.under_sampling import EditedNearestNeighbours

enn = EditedNearestNeighbours(kind_sel='all')
X_res, y_res = enn.fit_resample(X, y)
print(sorted(Counter(y_res).items()))
6、RepeatedEditedNearestNeighbours (Data cleansing methodology with no control over the number of undersamples)
Principle: repeat EditedNearestNeighbours many times (parameter max_iter controls the number of iterations)
# Undersampling with RepeatedEditedNearestNeighbours
from collections import Counter
from imblearn.under_sampling import RepeatedEditedNearestNeighbours

renn = RepeatedEditedNearestNeighbours(kind_sel='all', max_iter=101)
X_res, y_res = renn.fit_resample(X, y)
print(sorted(Counter(y_res).items()))
7、ALLKNN(Data cleansing methodology with no control over the number of undersamples)
Principle: Apply EditedNearestNeighbours repeatedly, increasing the number of nearest neighbours at each iteration.
from collections import Counter
from imblearn.under_sampling import AllKNN

allknn = AllKNN(kind_sel='all')
X_res, y_res = allknn.fit_resample(X, y)
print(sorted(Counter(y_res).items()))
8、CondensedNearestNeighbour (Data cleansing methodology with no control over the number of undersamples)
The nearest-neighbour rule is applied iteratively to decide whether a sample should be kept or removed. The specific steps are as follows:
1) Put all minority class samples into set C.
2) Add one sample from the majority class (the class to be under-sampled) to set C, and put all other majority class samples into set S.
3) Train a 1-NN classifier on set C and use it to classify the samples in set S.
4) Add the samples from set S that are misclassified to set C.
5) Repeat the above process until no more samples are added to set C (a sketch of these steps follows).
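A simplified sketch of these steps (for illustration only, not imblearn's implementation); X and y are assumed to be numpy arrays, and minority_label marks the class that is not under-sampled.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def condensed_nn_sketch(X, y, minority_label, seed=0):
    rng = np.random.RandomState(seed)
    # 1) C starts with all minority class samples
    C = list(np.where(y == minority_label)[0])
    majority = np.where(y != minority_label)[0]
    # 2) add one randomly chosen majority sample to C; the rest go into S
    first = rng.choice(majority)
    C.append(first)
    S = [i for i in majority if i != first]
    added = True
    while added and S:                       # 5) repeat until nothing is added to C
        added = False
        # 3) train a 1-NN classifier on C and classify the samples in S
        knn = KNeighborsClassifier(n_neighbors=1).fit(X[C], y[C])
        pred = knn.predict(X[S])
        # 4) move the misclassified samples from S into C
        mis = [i for i, p in zip(S, pred) if p != y[i]]
        if mis:
            C.extend(mis)
            S = [i for i in S if i not in mis]
            added = True
    return X[C], y[C]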
from collections import Counter
from imblearn.under_sampling import CondensedNearestNeighbour

cnn = CondensedNearestNeighbour(random_state=0)
X_res, y_res = cnn.fit_resample(X, y)
print(sorted(Counter(y_res).items()))
The CondensedNearestNeighbour method is sensitive to noisy data and tends to add noisy samples to set C.
9、OneSidedSelection (Data cleansing methodology with no control over the number of undersamples)
Principle: On top of CondensedNearestNeighbour, the TomekLinks method is used to remove noisy data (majority class samples).
from collections import Counter
from imblearn.under_sampling import OneSidedSelection

oss = OneSidedSelection(random_state=0)
X_resampled, y_resampled = oss.fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))
10、NeighbourhoodCleaningRule (Data cleansing methodology with no control over the number of undersamples)
Principle: Cleans the data by combining EditedNearestNeighbours with the removal of majority class samples found in the neighbourhood of misclassified minority class samples.
from collections import Counter
from imblearn.under_sampling import NeighbourhoodCleaningRule

ncr = NeighbourhoodCleaningRule()
X_resampled, y_resampled = ncr.fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))
11、InstanceHardnessThreshold(Data cleansing methodology with no control over the number of undersamples)
A classifier is applied to the data and samples with probabilities below a threshold are eliminated.
from collections import Counter
from sklearn.linear_model import LogisticRegression
from imblearn.under_sampling import InstanceHardnessThreshold

iht = InstanceHardnessThreshold(random_state=0, estimator=LogisticRegression())
X_resampled, y_resampled = iht.fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))
12、EasyEnsemble(Controllable quantity)
Randomly sample from the majority class a subset whose size equals the number of minority class samples, then combine this subset with the minority class samples to train a model; repeat this for n iterations. Although each subset contains fewer samples than the whole dataset, the total amount of information is not reduced once the models are combined.
from collections import Counter
# EasyEnsemble is available in older versions of imbalanced-learn;
# it was removed in imblearn 0.6+, where EasyEnsembleClassifier replaces it
from imblearn.ensemble import EasyEnsemble

ee = EasyEnsemble(sampling_strategy={0: 500, 1: 199, 2: 89}, random_state=0, n_subsets=10)
X_resampled, y_resampled = ee.fit_resample(X, y)
print(X_resampled.shape)
print(y_resampled.shape)
print(sorted(Counter(y_resampled[0]).items()))
There are two very important parameters:
(i) n_subsets controls the number of subsets.
(ii) replacement determines whether the random sampling is done with or without replacement.
13、BalanceCascade(controllable quantities)
In the nth round of training, a subset of the majority class samples is combined with the minority class samples to train a base learner H. After training, the majority class samples that H classifies correctly are removed. In the (n+1)th round, a new subset is drawn from the remaining majority class samples and combined with the minority class samples for training.
Similarly, the n_max_subset parameter controls the number of subsets, and bootstrapping can be used by setting bootstrap=True.
from collections import Counter
from sklearn.linear_model import LogisticRegression
# BalanceCascade is available in older versions of imbalanced-learn (removed in 0.6+)
from imblearn.ensemble import BalanceCascade

bc = BalanceCascade(random_state=0, estimator=LogisticRegression(random_state=0), n_max_subset=4)
X_resampled, y_resampled = bc.fit_resample(X, y)
print(X_resampled.shape)
print(sorted(Counter(y_resampled[0]).items()))
Combined oversampling and downsampling
The disadvantage of the SMOTE algorithm is that the generated minority samples tend to overlap with the surrounding majority samples and are hard to classify, while data cleaning techniques deal precisely with such overlapping samples. The two can therefore be combined into a pipeline: oversample first, then clean. The main methods are SMOTE + ENN and SMOTE + Tomek, of which SMOTE + ENN usually removes more overlapping samples.
1、SMOTEENN
from collections import Counter
from imblearn.combine import SMOTEENN

smote_enn = SMOTEENN(random_state=0)
X_resampled, y_resampled = smote_enn.fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))
2、 SMOTETomek
from collections import Counter
from imblearn.combine import SMOTETomek

smote_tomek = SMOTETomek(sampling_strategy={0: 700, 1: 300, 2: 200}, random_state=0)
X_resampled, y_resampled = smote_tomek.fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))
Summary
The above is based on my personal experience; I hope it gives you a useful reference, and I hope you will continue to support me.