As for percentiles, I believe we are all familiar with them. The following explanation is quoted from Baidu Encyclopedia.
Percentile: if a set of data is sorted from smallest to largest and the corresponding cumulative percentage is calculated, the data value corresponding to a given percentage is called the percentile for that percentage. It can be expressed as follows: a set of n observed values is ordered by numerical size, and the value at the p% position is called the pth percentile.
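To make the definition concrete, here is a minimal sketch using NumPy's `np.percentile`. The sample values are made up for illustration. Note that `np.percentile` interpolates linearly between neighbouring values by default, so in general the result need not be one of the original data points.

```python
import numpy as np

# Hypothetical sample data, used only for illustration
values = np.array([40, 15, 50, 20, 35])

# Sort from smallest to largest, as in the definition
# (np.percentile would also sort internally)
sorted_values = np.sort(values)

# The 0th and 100th percentiles are the minimum and maximum
p0 = np.percentile(sorted_values, 0)
p100 = np.percentile(sorted_values, 100)

# The 50th percentile is the median
p50 = np.percentile(sorted_values, 50)

# With the default linear interpolation, the 25th percentile sits at
# position 0.25 * (n - 1) = 1.0 in the sorted data
p25 = np.percentile(sorted_values, 25)

print(p0, p25, p50, p100)
```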
Since percentiles divide the sorted data into parts that each hold an equal share of the observations, the same idea can be used for equal-frequency binning.
import random

import numpy as np
import pandas as pd

t = pd.DataFrame(columns=['l', 's'])
# Randomly generate 1000 integers from 0 to 999
t['l'] = [random.randint(0, 999) for _ in range(1000)]
# Set s to 1 so that counts can be obtained by summing
t['s'] = 1
# Compute the percentiles 0, 10, ..., 100 to use as bin edges
l_bin = []
for i in range(0, 101, 10):
    l_bin.append(np.percentile(t['l'], i))
# Add a very small number to the last edge; otherwise the maximum value
# (e.g. 999) falls outside the half-open last bin and is marked as NaN
l_bin[-1] += 1 / 1e10
print('Quantiles:', np.array(l_bin).round(2))
# Cut the random numbers into bins; right=False makes them left-closed, right-open
t['box'] = pd.cut(t['l'], l_bin, right=False)
tj = t.groupby('box')['s'].agg('sum')
print('Binning statistics')
print(tj)
# Generate new labels
label = []
for i in range(len(l_bin) - 1):
    label.append(str(l_bin[i].round(4)) + '+')
# Build a dictionary from the original labels to the custom new labels
list_box_td = list(set(t['box']))
list_box_td.sort()
dict_t = dict(zip(list_box_td, label))
# Replace the labels according to the dictionary
# (the categorical column is converted to plain objects first)
t['new_box'] = t['box'].astype(object).map(dict_t)
print('New binning statistics')
tj = t.groupby('new_box')['s'].agg('sum')
print(tj)
del t['s']
print(t.head())
Output results:
Quantiles: [  0.   90.9 194.6 290.  386.  473.5 589.  688.  783.2 884.2 997. ]
Binning statistics
box
[0.0, 90.9)       100
[90.9, 194.6)     100
[194.6, 290.0)     99
[290.0, 386.0)     99
[386.0, 473.5)    102
[473.5, 589.0)     99
[589.0, 688.0)    100
[688.0, 783.2)    101
[783.2, 884.2)    100
[884.2, 997.0)    100
Name: s, dtype: int64
New binning statistics
new_box
0.0+      100
194.6+     99
290.0+     99
386.0+    102
473.5+     99
589.0+    100
688.0+    101
783.2+    100
884.2+    100
90.9+     100
Name: s, dtype: int64
     l             box new_box
0  253  [194.6, 290.0)  194.6+
1  468  [386.0, 473.5)  386.0+
2  130   [90.9, 194.6)   90.9+
3  476  [473.5, 589.0)  473.5+
4  656  [589.0, 688.0)  589.0+
You can see that each bin contains about 100 numbers. With this method, the bin labels can also be customized.
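As a possible shortcut, pandas also offers `pd.qcut`, which computes the quantile edges and assigns the bins in a single call. This is a different route from the `np.percentile` plus `pd.cut` approach shown here, not the code that produced the output above. The seed, the variable names and the `D1` to `D10` labels are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Assumed setup: 1000 random integers from 0 to 999, seeded so the run is repeatable
rng = np.random.default_rng(0)
df = pd.DataFrame({'l': rng.integers(0, 1000, size=1000)})

# Hypothetical custom labels, one per decile bin
labels = [f'D{i}' for i in range(1, 11)]

# q=10 asks for 10 equal-frequency bins; labels=... names them directly
df['bin'] = pd.qcut(df['l'], q=10, labels=labels)

# Each bin should hold roughly 100 of the 1000 values
counts = df['bin'].value_counts().sort_index()
print(counts)
```

Unlike `pd.cut` with `right=False`, `pd.qcut` builds right-closed bins and includes the minimum in the first one, so no small offset on the last edge is needed. It can raise an error when two quantile edges coincide, which may happen with heavily repeated values.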
Supplementary extension: calculating percentiles at dynamic time points in Python
[Description]
1. Dynamic time point: each calculation uses the data up to and including the current row, i.e. the cumulative rows (calculated many times, once per row);
2. Static time point (current time): the calculation uses all rows of the data frame (calculated once).
[Code]
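import numpy as np
import pandas as pd

# Generate 10 random integers in [1, 10); the upper bound of np.random.randint is exclusive
test = pd.DataFrame(np.random.randint(1, 10, size=10), columns=['value'])
# Dynamic time point: rank each row within the rows up to and including it
test['pct_sf'] = test.index.map(lambda x: test.loc[:x, 'value'].rank(pct=True)[x])
# Static (current) time point: rank each row within the whole column
test['pct'] = test['value'].rank(pct=True)
print(test)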
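The per-row lambda recomputes a rank for every prefix of the column. If your pandas version provides `Series.expanding().rank` (added in pandas 1.4, as far as I know), the same dynamic percentile can be sketched with an expanding window instead. The sample values and the names `demo`, `pct_loop` and `pct_expanding` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Small fixed sample so the result can be checked by hand
demo = pd.DataFrame({'value': [3, 1, 2, 2, 5]})

# Dynamic time point, written explicitly: rank row x within rows 0..x
demo['pct_loop'] = [
    demo.loc[:x, 'value'].rank(pct=True).loc[x] for x in demo.index
]

# Same idea via an expanding window: rank of the current row
# among all rows seen so far
demo['pct_expanding'] = demo['value'].expanding().rank(pct=True)

# Static time point: rank each row within the whole column, computed once
demo['pct'] = demo['value'].rank(pct=True)

print(demo)
```

For this sample, the value 1 in the second row gets a dynamic percentile of 0.5 (lowest of two values seen so far) but a static percentile of 0.2 (lowest of all five values), which is exactly the difference between the two notions.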
That is everything I have to share about using Python to calculate percentiles for data binning. I hope it serves as a useful reference, and I hope you will continue to support me.