SoFunction
Updated on 2024-11-19

Data binning in Python using percentiles

Most readers are probably familiar with percentiles. The following explanation is quoted from Baidu Encyclopedia.

Percentile: if a set of data is sorted from smallest to largest and the corresponding cumulative percentage is calculated, the data value at a given cumulative percentage is called the percentile for that percentage. In other words, when a set of n observed values is ordered by size, the value at the p% position is called the pth percentile.
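
As a quick illustration of the definition above, here is a minimal sketch using `np.percentile` on a small made-up array. The sample values are hypothetical, and NumPy's default linear interpolation is only one of several conventions for locating the p% position.

```python
import numpy as np

# Hypothetical sample, already sorted for readability
values = np.array([15, 20, 35, 40, 50])

# The value at the 50% position (the median) and at the 25% position
p50 = np.percentile(values, 50)
p25 = np.percentile(values, 25)

# A position that falls between two observations is interpolated linearly
p90 = np.percentile(values, 90)

print(p50, p25, p90)
```

With five values, the 50% position lands exactly on the third observation, while the 90% position falls between the fourth and fifth, so the result there is interpolated.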

Since evenly spaced percentiles divide the data into parts that each contain the same number of observations, this method can also be used for equal-frequency binning.

import pandas as pd
import numpy as np
import random
t = pd.DataFrame(columns=['l', 's'])
# Randomly generate 1000 integers from 0 to 999
t['l'] = [random.randint(0, 999) for _ in range(1000)]
# Define s as 1 so that summing s counts the rows
t['s'] = 1
# Compute the cut points: the 0th, 10th, ..., 100th percentiles
l_bin = []
for i in range(0, 101, 10):
    l_bin.append(np.percentile(t['l'], i))
# Add a very small number to the last cut point; otherwise the maximum value
# (e.g. 999) falls outside the right-open last bin and is marked as NaN
l_bin[-1] += 1 / 1e10
print('Quantile cut points:', np.round(l_bin, 2))
# Cut the random numbers into bins; right=False means left-closed, right-open
t['box'] = pd.cut(t['l'], l_bin, right=False)
tj = t.groupby('box', observed=False)['s'].agg('sum')
print('Bin statistics')
print(tj)
# Generate new labels from the left edge of each bin
label = []
for i in range(len(l_bin) - 1):
    label.append(str(l_bin[i].round(4)) + '+')
# Build a dictionary mapping the original labels to the custom new labels
list_box_td = list(set(t['box']))
list_box_td.sort()
dict_t = dict(zip(list_box_td, label))
# Replacement based on the dictionary (a plain lookup per row)
t['new_box'] = [dict_t[b] for b in t['box']]
print('New bin statistics')
tj = t.groupby('new_box')['s'].agg('sum')
print(tj)
del t['s']
print(t.head())

Output results:

Quantile cut points: [  0.   90.9 194.6 290.  386.  473.5 589.  688.  783.2 884.2
 997. ]
Bin statistics
box
[0.0, 90.9)       100
[90.9, 194.6)     100
[194.6, 290.0)     99
[290.0, 386.0)     99
[386.0, 473.5)    102
[473.5, 589.0)     99
[589.0, 688.0)    100
[688.0, 783.2)    101
[783.2, 884.2)    100
[884.2, 997.0)    100
Name: s, dtype: int64
New bin statistics
new_box
0.0+      100
194.6+     99
290.0+     99
386.0+    102
473.5+     99
589.0+    100
688.0+    101
783.2+    100
884.2+    100
90.9+     100
Name: s, dtype: int64
     l             box new_box
0  253  [194.6, 290.0)  194.6+
1  468  [386.0, 473.5)  386.0+
2  130   [90.9, 194.6)   90.9+
3  476  [473.5, 589.0)  473.5+
4  656  [589.0, 688.0)  589.0+

You can see that each bin contains about 100 values. Based on this method, the bins can also be given custom labels.
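
One possible shortcut for the custom labels is to pass them directly to `pd.cut` through its `labels` parameter instead of building a replacement dictionary. The sketch below is only an illustration under stated assumptions: the data is a hypothetical shuffled range 0..999 (so every bin holds exactly 100 values), and the names `values`, `edges`, `labels` and `binned` are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the integers 0..999 in shuffled order
rng = np.random.default_rng(0)
values = pd.Series(rng.permutation(1000), name="l")

# Decile cut points; nudge the last edge up so the maximum
# still falls inside the right-open final bin
edges = np.percentile(values, np.arange(0, 101, 10))
edges[-1] += 1e-10

# One custom label per bin, built from the left edge of each bin
labels = [f"{edge:.1f}+" for edge in edges[:-1]]
binned = pd.cut(values, edges, right=False, labels=labels)

# Count the rows per bin, keeping the bins in their natural order
counts = binned.value_counts(sort=False)
print(counts)
```

Because the labels are attached when the bins are created, they stay in bin order rather than being sorted as strings. pandas also provides `pd.qcut`, which computes quantile-based bins in a single call.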

Extension: calculating percentile ranks at dynamic time points in Python

[Description]

1. Dynamic time point: each calculation uses only the data up to and including the current row, i.e. the cumulative rows (calculated many times, once per row);

2. Static time point (current time): the calculation uses all rows of the data frame (calculated once).

[Code]

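import numpy as np
import pandas as pd
test = pd.DataFrame(np.random.randint(1, 10, size=10), columns=['value'])  # Random integers in [1, 9]; randint excludes the upper bound
test['pct_sf'] = test.index.map(lambda x: test.loc[:x, 'value'].rank(pct=True)[x])  # Dynamic time points
test['pct'] = test['value'].rank(pct=True)  # Current point in time
test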
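
To make the difference between the two kinds of time point concrete, here is a small worked sketch on fixed, hypothetical values rather than random ones. The names `s`, `static_pct` and `dynamic_pct` are made up for this example. It assumes the default `average` tie handling of `Series.rank`.

```python
import pandas as pd

# Hypothetical fixed values so the result can be checked by hand
s = pd.Series([3, 1, 4, 1, 5], name="value")

# Static: rank every value once against the full column
static_pct = s.rank(pct=True)

# Dynamic: for each row, rank only against the rows seen so far
# and keep the rank of the newest row
dynamic_pct = pd.Series(
    [s.iloc[: i + 1].rank(pct=True).iloc[-1] for i in range(len(s))],
    index=s.index,
)

print(pd.DataFrame({"value": s, "pct_sf": dynamic_pct, "pct": static_pct}))
```

The first row always gets a dynamic rank of 1.0 because it is compared only with itself, whereas its static rank depends on every later value. Newer pandas releases (1.4 and later) also provide `Series.expanding().rank(pct=True)`, which should give the same dynamic ranks without the explicit loop.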

That is everything I have to share on using Python percentiles to bin data. I hope it serves as a useful reference, and thank you for your support.