Overview of sampling methods
Random sampling - suitable when the population is small
Each sampling unit has the same probability of being drawn, and the procedure is reproducible.
Random sampling is typically used when the population is small; its main characteristic is that individuals are drawn from the population one at a time.
1. Drawing of lots
2. Random number method: random number table, random number dice, or computer-generated random numbers (a short sketch of this follows).
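As a minimal sketch of the random number method, computer-generated random numbers can be used to draw a simple random sample with Python's standard library; the population and sample size below are made up for illustration.

import random

# Hypothetical population of 10 individuals, identified by their IDs
population = list(range(1, 11))

# Draw 4 individuals without replacement using computer-generated random numbers
random.seed(0)                        # fix the seed so the draw is reproducible
sample = random.sample(population, k=4)
print(sample)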
Stratified sampling - the population contains distinct subgroups whose differences affect the results
Stratified sampling divides the whole population into disjoint strata, then draws a certain number of individuals independently from each stratum according to a certain proportion; the individuals taken from all strata are combined into one sample. The smaller the within-stratum variation and the larger the between-stratum variation, the better.
After stratification, simple random sampling is carried out within each stratum; the number of individuals drawn from each stratum can be allocated in three general ways (a short sketch of the three allocations follows this list):
(1) Equal allocation: the same number of individuals is drawn from every stratum;
(2) Proportional allocation: the ratio of the number of individuals drawn from each stratum to the number of individuals in that stratum is the same for all strata;
(3) Optimal allocation: the ratio of the number of samples drawn from each stratum to the total number of samples drawn equals the ratio of that stratum's variance to the sum of the variances across all strata.
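A rough sketch of the three allocation rules; the stratum sizes, standard deviations and total sample size are made-up values, and the "optimal" allocation is computed exactly as stated in the text (the classical Neyman allocation additionally weights each stratum by its size).

import numpy as np

# Made-up strata: sizes and within-stratum standard deviations (illustrative only)
N_h = np.array([500, 300, 200])   # stratum sizes
S_h = np.array([4.0, 9.0, 2.0])   # within-stratum standard deviations
n = 100                           # total sample size

# (1) Equal allocation: the same number from every stratum
equal = np.full(len(N_h), n // len(N_h))

# (2) Proportional allocation: sample sizes proportional to stratum sizes
proportional = np.round(n * N_h / N_h.sum()).astype(int)

# (3) "Optimal" allocation as stated above: proportional to the stratum variances
#     (the classical Neyman allocation uses n_h proportional to N_h * S_h)
var_h = S_h ** 2
optimal = np.round(n * var_h / var_h.sum()).astype(int)

print(equal, proportional, optimal)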
import pandas as pd

def typicalSampling(group, typicalFracDict):
    # Sample each stratum with the fraction specified for its group name
    name = group.name
    frac = typicalFracDict[name]
    return group.sample(frac=frac)

def group_sample(data_set, label, typicalFracDict):
    # Stratified sampling
    # data_set: data set
    # label: name of the stratification variable
    # typicalFracDict: sampling proportion for each category
    result = data_set.groupby(label, group_keys=False).apply(typicalSampling, typicalFracDict)
    return result

data_set = pd.DataFrame({
    'id': [3566841, 6541227, 3512441, 3512441, 3512441, 3512441, 3512441, 3512441, 3512441, 3512441],
    'sex': ['male', 'Female', 'Female', 'male', 'Female', 'Female', 'male', 'Female', 'male', 'Female'],
    'level': ['high', 'low', 'middle', 'high', 'low', 'middle', 'high', 'low', 'middle', 'middle'],
})
label = 'sex'
typicalFracDict = {'male': 0.8, 'Female': 0.2}
result = group_sample(data_set, label, typicalFracDict)
print(result)
Cluster sampling
Cluster sampling, also known as whole-group sampling, is a method in which the units of the population are grouped into a number of non-crossing, non-repeating sets called clusters; the sample is then drawn using the clusters as sampling units.
When applying cluster sampling, the clusters must be representative, i.e., the differences between units within a cluster should be large and the differences between clusters should be small.
Implementation steps
The population is first divided into i clusters, then a number of clusters are drawn at random from these i clusters, and all individuals or units within the drawn clusters are surveyed. The sampling process can be divided into the following steps:
(1) Determine the cluster labels.
(2) Divide the population (N) into a number of non-overlapping parts, each of which is a cluster.
(3) Determine the number of clusters to draw according to the required sample size.
(4) Draw the determined number of clusters from the i clusters using simple random sampling or systematic sampling (a short sketch of these steps follows).
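A minimal sketch of these steps with pandas, using a made-up population in which each record carries a cluster label.

import random
import pandas as pd

# Hypothetical population: each record carries a cluster label (e.g. a class or a village)
df = pd.DataFrame({
    'cluster': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
    'value':   [1, 2, 3, 4, 5, 6, 7, 8],
})

random.seed(0)
clusters = df['cluster'].unique().tolist()            # (1)-(2) the non-overlapping clusters
n_clusters_to_draw = 2                                # (3) number of clusters to draw
chosen = random.sample(clusters, n_clusters_to_draw)  # (4) simple random sampling of clusters

# All individuals inside the chosen clusters form the sample
sample = df[df['cluster'].isin(chosen)]
print(sample)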
Systematic sampling - suitable when the population is large
Systematic sampling is also known as mechanical sampling or equal-interval sampling. When the number of individuals in the population is large, simple random sampling becomes laborious. In this case, the population can be divided into a number of balanced parts, and then, according to a pre-determined rule, one individual is drawn from each part to obtain the required sample; this kind of sampling is called systematic sampling.
import random

def RandomSampling(dataMat, number):
    # RandomSampling is not defined in the original snippet;
    # a simple random draw without replacement is assumed here as the fallback
    return random.sample(list(dataMat), number)

def SystematicSampling(dataMat, number):
    length = len(dataMat)
    k = int(length / number)            # sampling interval
    sample = []
    i = 0
    if k > 0:
        while len(sample) != number:
            sample.append(dataMat[i * k])
            i += 1
        return sample
    else:
        return RandomSampling(dataMat, number)
Oversampling
1、RandomOverSampler
Principle: Randomly draw samples (with replacement) from the minority class and add the drawn samples back to the dataset.
Cons: Repeated sampling often leads to severe overfitting
The mainstream oversampling approach is to artificially synthesize minority class samples in some way to achieve class balance, and the classic representative of this family is SMOTE.
from collections import Counter
from imblearn.over_sampling import RandomOverSampler

# X, y are assumed to be the feature matrix and labels of an imbalanced dataset
ros = RandomOverSampler(sampling_strategy={0: 700, 1: 200, 2: 150}, random_state=0)
X_resampled, y_resampled = ros.fit_resample(X, y)
print(Counter(y_resampled))
2、SMOTE
Principle: Interpolate between minority class samples to generate additional samples. For a minority class sample a, randomly select one of its nearest-neighbour minority samples b, then randomly pick a point c on the line segment joining a and b as a new minority class sample.
Specifically, for a minority class sample x_i, the K-nearest-neighbour method is used (the value of k must be specified in advance) to find the k minority class samples closest to x_i, where distance is defined as the Euclidean distance between samples in the n-dimensional feature space.
Then one of the k nearest neighbours, denoted x̂, is randomly selected and a new sample is generated with the formula x_new = x_i + rand(0, 1) × (x̂ − x_i), where rand(0, 1) is a random number in [0, 1].
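As a minimal numpy sketch of this interpolation step (not the library implementation), with made-up vectors for the minority sample and its chosen neighbour:

import numpy as np

rng = np.random.default_rng(42)

x_i = np.array([1.0, 2.0])        # a minority class sample
x_hat = np.array([3.0, 1.0])      # one of its k nearest minority class neighbours

# The new sample lies on the segment between x_i and x_hat
lam = rng.random()                # random number in [0, 1)
x_new = x_i + lam * (x_hat - x_i)
print(x_new)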
from collections import Counter
from imblearn.over_sampling import SMOTE

smo = SMOTE(sampling_strategy={0: 700, 1: 200, 2: 150}, random_state=42)
X_smo, y_smo = smo.fit_resample(X, y)
print(Counter(y_smo))
SMOTE randomly selects minority class samples to synthesize new samples without considering the surrounding samples, which tends to raise two concerns:
1) The newly synthesized sample will not provide much useful information if the selected minority sample is also surrounded by minority samples.
2) If the selected minority class sample is surrounded by majority class samples (such a sample may itself be noise), the newly synthesized samples will overlap heavily with the surrounding majority class samples, making classification difficult.
Overall we want the newly synthesized minority class samples to be near the boundary of the two classes, which often provides enough information for classification. And this is what the Border-line SMOTE algorithm below will do.
3、BorderlineSMOTE
This algorithm will first divide all the minority samples into three categories as shown below:
- noise: all k nearest-neighbour samples belong to the majority class
- danger: more than half of the k nearest-neighbour samples belong to the majority class
- safe: more than half of the k nearest-neighbour samples belong to the minority class
The Border-line SMOTE algorithm randomly selects only samples in the "danger" state and then generates new samples from them using the SMOTE procedure. Samples in the "danger" state are the minority class samples near the class border, and samples near the border are more likely to be misclassified. Thus Border-line SMOTE synthesizes samples only for those minority samples near the border, whereas SMOTE treats all minority samples equally.
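A rough sketch of the noise/danger/safe categorisation described above (not imblearn's internal code); X and y are assumed to be numpy arrays and minority_label is the minority class value.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def borderline_categories(X, y, minority_label, k=5):
    # Fit k-NN on the whole dataset (k+1 because the first neighbour is the point itself)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X[y == minority_label])
    categories = []
    for neighbours in idx[:, 1:]:                   # drop the point itself
        n_majority = np.sum(y[neighbours] != minority_label)
        if n_majority == k:
            categories.append('noise')              # all neighbours are majority class
        elif n_majority > k / 2:
            categories.append('danger')             # more than half are majority class
        else:
            categories.append('safe')
    return categories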
There are two variants of Borderline SMOTE: Borderline-1 SMOTE and Borderline-2 SMOTE. In Borderline-1 SMOTE, the neighbour x̂ in the synthesis formula above is a minority class sample, whereas in Borderline-2 SMOTE x̂ can be any sample among the k nearest neighbours.
from collections import Counter
from imblearn.over_sampling import BorderlineSMOTE

smo = BorderlineSMOTE(kind='borderline-1', sampling_strategy={0: 700, 1: 200, 2: 150}, random_state=42)
# kind='borderline-2' selects the second variant
X_smo, y_smo = smo.fit_resample(X, y)
print(Counter(y_smo))
4、ADASYN
Principle: Some mechanism is used to automatically determine how many synthetic samples need to be generated for each minority class sample, rather than synthesizing the same number of samples for each minority class sample as SMOTE does. First determine the number of samples that need to be synthesized for the minority samples (which is positively correlated with the number of majority class samples around the minority samples), and then synthesize the samples using SMOTE.
Cons: ADASYN is susceptible to outliers; if the K nearest neighbours of a minority class sample are all majority class samples, its weight becomes very large, and many samples are then generated around it.
from imblearn.over_sampling import ADASYN

ana = ADASYN(sampling_strategy={0: 800, 2: 300, 1: 400}, random_state=0)
X_ana, y_ana = ana.fit_resample(X, y)
The samples synthesized by SMOTE are distributed fairly evenly, while those synthesized by Border-line SMOTE are concentrated at the class boundaries. The characteristic of ADASYN is that the more majority class samples surround a minority class sample, the more samples the algorithm generates for it; most of the generated samples therefore come from minority samples that were originally closer to the majority class.
5、KMeansSMOTE
Principle: Apply KMeans clustering before oversampling using SMOTE.
KMeansSMOTE consists of three steps: clustering, filtering and oversampling. In the clustering step, k-means is used to cluster the data into k groups. The filtering step selects clusters for oversampling, retaining those with a high proportion of minority class samples, and then allocates the number of synthetic samples, assigning more to clusters where minority samples are sparsely distributed. Finally, the oversampling step applies SMOTE within each selected cluster to achieve the target ratio of minority and majority instances.
from collections import Counter
from imblearn.over_sampling import KMeansSMOTE

kms = KMeansSMOTE(sampling_strategy={0: 800, 2: 300, 1: 400}, random_state=42)
X_kms, y_kms = kms.fit_resample(X, y)
print(Counter(y_kms))
6、SMOTENC
A SMOTE variant that can handle datasets containing categorical features.
from imblearn.over_sampling import SMOTENC

# categorical_features gives the column indices of the categorical features
sm = SMOTENC(random_state=42, categorical_features=[18, 19])
7、SVMSMOTE
Principle: A support vector machine classifier is first trained to find the support vectors, and new minority class samples are then synthesized near these support vectors using SMOTE-style interpolation.
from collections import Counter
from imblearn.over_sampling import SVMSMOTE

svmm = SVMSMOTE(sampling_strategy={0: 800, 2: 300, 1: 400}, random_state=42)
X_svmm, y_svmm = svmm.fit_resample(X, y)
print(Counter(y_svmm))
Downsampling (undersampling)
1、RandomUnderSampler(number of undersamples can be controlled)
Principle: Randomly select some of the samples from the majority class to cull out.
Disadvantage: The rejected samples may contain some important information, resulting in a poorly learned model.
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(sampling_strategy={0: 50, 2: 100, 1: 100}, random_state=0)
X_resampled, y_resampled = rus.fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))
2、NearMiss(number of undersamples can be controlled)
Principle: The most representative samples of the majority class are selected for training, mainly to alleviate the information loss problem of random undersampling.
NearMiss uses heuristic rules to select the samples; depending on the rule, it comes in three versions, chosen with the version parameter:
- NearMiss-1: select the majority class samples whose average distance to their K nearest minority class samples is smallest
- NearMiss-2: select the majority class samples whose average distance to their K farthest minority class samples is smallest
- NearMiss-3: for each minority class sample, keep its K nearest majority class samples, so that every minority class sample is surrounded by some majority class samples
NearMiss-1 and NearMiss-2 have a high computational overhead because the K nearest neighbours must be computed for each majority class sample. In addition, NearMiss-1 is susceptible to outliers.
from collections import Counter
from imblearn.under_sampling import NearMiss

nm1 = NearMiss(sampling_strategy={0: 50, 2: 100, 1: 100}, version=1)
X_resampled_nm1, y_resampled = nm1.fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))
3、ClusterCentroids(number of undersamples can be controlled)
Principle: k-means clustering is applied to the samples of each class to be under-sampled, and the cluster centroids are used to replace the original samples of that class.
from collections import Counter
from imblearn.under_sampling import ClusterCentroids

cc = ClusterCentroids(sampling_strategy={0: 700, 1: 100, 2: 90}, random_state=0)
X_resampled, y_resampled = cc.fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))
4、TomekLinks(Data cleansing methodology with no control over the number of undersamples)
Principle: A Tomek Link represents a pair of samples that are closest to each other in terms of distance between different categories, i.e., the two samples are each other's nearest neighbors and belong to different categories. So if two samples form a Tomek Link, either one of them is noise or both samples are near the boundary. By removing the Tomek Link, the overlapping samples can be "cleaned" so that the nearest neighbor samples all belong to the same category and can be better classified.
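As a small sketch of this definition (not imblearn's implementation), Tomek links can be found by checking mutual nearest-neighbour pairs with different labels; X and y are assumed to be numpy arrays.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def tomek_links(X, y):
    # Nearest neighbour of every sample (index 0 is the sample itself, so take index 1)
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    _, idx = nn.kneighbors(X)
    nearest = idx[:, 1]
    links = []
    for i, j in enumerate(nearest):
        # i and j form a Tomek link if they are each other's nearest neighbour
        # and belong to different classes
        if nearest[j] == i and y[i] != y[j] and i < j:
            links.append((i, j))
    return links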
from collections import Counter
from imblearn.under_sampling import TomekLinks

tl = TomekLinks(sampling_strategy='all')
X_resampled, y_resampled = tl.fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))
The sampling_strategy parameter of TomekLinks controls which samples of a Tomek link are removed. By default (sampling_strategy='auto'), only the majority class sample of each link is removed; with sampling_strategy='all', both samples are removed.
5、EditedNearestNeighbours(Data cleansing methodology with no control over the number of undersamples)
Principle: A sample belonging to the majority class is removed if its K nearest neighbours disagree with its label: with kind_sel='mode' it is removed when more than half of the neighbours belong to another class, and with kind_sel='all' it is removed when any neighbour belongs to another class (all neighbours must agree for the sample to be kept).
from collections import Counter
from imblearn.under_sampling import EditedNearestNeighbours

enn = EditedNearestNeighbours(kind_sel='all')
X_res, y_res = enn.fit_resample(X, y)
print(sorted(Counter(y_res).items()))
6、RepeatedEditedNearestNeighbours (Data cleansing methodology with no control over the number of undersamples)
Principle: repeat EditedNearestNeighbours many times (parameter max_iter controls the number of iterations)
# Undersampling with RepeatedEditedNearestNeighbours
from collections import Counter
from imblearn.under_sampling import RepeatedEditedNearestNeighbours

renn = RepeatedEditedNearestNeighbours(kind_sel='all', max_iter=101)
X_res, y_res = renn.fit_resample(X, y)
print(sorted(Counter(y_res).items()))
7、ALLKNN(Data cleansing methodology with no control over the number of undersamples)
Principle: Apply EditedNearestNeighbours repeatedly, increasing the number of nearest neighbours at each iteration.
from collections import Counter
from imblearn.under_sampling import AllKNN

allknn = AllKNN(kind_sel='all')
X_res, y_res = allknn.fit_resample(X, y)
print(sorted(Counter(y_res).items()))
8、CondensedNearestNeighbour (Data cleansing methodology with no control over the number of undersamples)
The nearest-neighbour rule is applied iteratively to decide whether a sample should be kept or removed. The specific steps are as follows:
1) Put all minority class samples into set C.
2) Add one sample from the majority class (the class to be under-sampled) to set C, and put all other majority class samples into set S.
3) Train a 1-NN classifier on set C and use it to classify the samples in set S.
4) Add the samples from set S that are misclassified to set C.
5) Repeat the above process until no more samples are added to set C (a sketch of these steps follows).
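A simplified sketch of these steps (for illustration only, not imblearn's implementation); X and y are assumed to be numpy arrays, and minority_label marks the class that is not under-sampled.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def condensed_nn_sketch(X, y, minority_label, seed=0):
    rng = np.random.RandomState(seed)
    # 1) C starts with all minority class samples
    C = list(np.where(y == minority_label)[0])
    majority = np.where(y != minority_label)[0]
    # 2) add one randomly chosen majority sample to C; the rest go into S
    first = rng.choice(majority)
    C.append(first)
    S = [i for i in majority if i != first]
    added = True
    while added and S:                       # 5) repeat until nothing is added to C
        added = False
        # 3) train a 1-NN classifier on C and classify the samples in S
        knn = KNeighborsClassifier(n_neighbors=1).fit(X[C], y[C])
        pred = knn.predict(X[S])
        # 4) move the misclassified samples from S into C
        mis = [i for i, p in zip(S, pred) if p != y[i]]
        if mis:
            C.extend(mis)
            S = [i for i in S if i not in mis]
            added = True
    return X[C], y[C]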
from collections import Counter
from imblearn.under_sampling import CondensedNearestNeighbour

cnn = CondensedNearestNeighbour(random_state=0)
X_res, y_res = cnn.fit_resample(X, y)
print(sorted(Counter(y_res).items()))
The CondensedNearestNeighbour method is sensitive to noisy data and tends to add noisy samples to set C.
9、OneSidedSelection (Data cleansing methodology with no control over the number of undersamples)
Principle: On top of CondensedNearestNeighbour, the TomekLinks method is used to remove noisy data (majority class samples).
from collections import Counter
from imblearn.under_sampling import OneSidedSelection

oss = OneSidedSelection(random_state=0)
X_resampled, y_resampled = oss.fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))
10、NeighbourhoodCleaningRule (Data cleansing methodology with no control over the number of undersamples)
Principle: Cleans the data by combining EditedNearestNeighbours with the removal of majority class samples found in the neighbourhood of misclassified minority class samples.
from collections import Counter
from imblearn.under_sampling import NeighbourhoodCleaningRule

ncr = NeighbourhoodCleaningRule()
X_resampled, y_resampled = ncr.fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))
11、InstanceHardnessThreshold(Data cleansing methodology with no control over the number of undersamples)
A classifier is applied to the data and samples with probabilities below a threshold are eliminated.
from collections import Counter
from sklearn.linear_model import LogisticRegression
from imblearn.under_sampling import InstanceHardnessThreshold

iht = InstanceHardnessThreshold(random_state=0, estimator=LogisticRegression())
X_resampled, y_resampled = iht.fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))
12、EasyEnsemble(Controllable quantity)
Randomly sample from the majority class a subset whose size equals the number of minority class samples, then combine this subset with the minority class samples to train a model; repeat this for n iterations. Although each subset contains fewer samples than the whole dataset, the total amount of information is not reduced once the models are combined.
from collections import Counter
# EasyEnsemble is available in older versions of imbalanced-learn;
# it was removed in imblearn 0.6+, where EasyEnsembleClassifier replaces it
from imblearn.ensemble import EasyEnsemble

ee = EasyEnsemble(sampling_strategy={0: 500, 1: 199, 2: 89}, random_state=0, n_subsets=10)
X_resampled, y_resampled = ee.fit_resample(X, y)
print(X_resampled.shape)
print(y_resampled.shape)
print(sorted(Counter(y_resampled[0]).items()))
There are two very important parameters:
(i) n_subsets controls the number of subsets.
(ii) replacement determines whether the random sampling is done with or without replacement.
13、BalanceCascade(controllable quantities)
In the nth round of training, a subset of the majority class samples is combined with the minority class samples to train a base learner H. After training, the majority class samples that H classifies correctly are removed. In the (n+1)th round, a new subset is drawn from the remaining majority class samples and combined with the minority class samples for training.
Similarly, the n_max_subset parameter controls the number of subsets, and bootstrapping can be used by setting bootstrap=True.
from collections import Counter
from sklearn.linear_model import LogisticRegression
# BalanceCascade is available in older versions of imbalanced-learn (removed in 0.6+)
from imblearn.ensemble import BalanceCascade

bc = BalanceCascade(random_state=0, estimator=LogisticRegression(random_state=0), n_max_subset=4)
X_resampled, y_resampled = bc.fit_resample(X, y)
print(X_resampled.shape)
print(sorted(Counter(y_resampled[0]).items()))
Combined oversampling and downsampling
The disadvantage of the SMOTE algorithm is that the generated minority samples tend to overlap with the surrounding majority samples and are hard to classify, while data cleaning techniques deal precisely with such overlapping samples. The two can therefore be combined into a pipeline: oversample first, then clean. The main methods are SMOTE + ENN and SMOTE + Tomek, of which SMOTE + ENN usually removes more overlapping samples.
1、SMOTEENN
from collections import Counter
from imblearn.combine import SMOTEENN

smote_enn = SMOTEENN(random_state=0)
X_resampled, y_resampled = smote_enn.fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))
2、 SMOTETomek
from collections import Counter
from imblearn.combine import SMOTETomek

smote_tomek = SMOTETomek(sampling_strategy={0: 700, 1: 300, 2: 200}, random_state=0)
X_resampled, y_resampled = smote_tomek.fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))
Summary
The above is based on my personal experience; I hope it gives you a useful reference, and I hope you will continue to support me.