SoFunction
Updated on 2024-11-19

pandas data grouping and aggregation operation methods

《Python for Data Analysis》

GroupBy

Grouping operations: split-apply-combine

A DataFrame can be grouped on its rows (axis=0) or its columns (axis=1). A function is then applied to each group to produce a new value, and the results of all those function calls are combined into the final result object.

The size method of a GroupBy object returns a Series containing the number of rows in each group.
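As a minimal sketch of split-apply-combine and size, the example below uses a small made-up DataFrame; the column names key1, key2, data1 and data2 are illustrative assumptions, not data from the book.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "data2": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Split the rows by key1, apply mean to data1, combine into one Series.
means = df.groupby("key1")["data1"].mean()

# size() counts the rows that fall into each (key1, key2) group.
sizes = df.groupby(["key1", "key2"]).size()

print(means)
print(sizes)
```

Here means is indexed by the distinct values of key1, and sizes is indexed by the (key1, key2) pairs that actually occur in the data.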

Iterating over groups

# Each iteration yields the group key(s) and the matching chunk of data
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print(k1, k2)
    print(group)

Selecting a column or subset of columns

df.groupby(['key1', 'key2'])[['data2']].mean()

Grouping by Dictionary or Series

Simply pass the dictionary or Series to groupby.
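A minimal sketch of grouping by a mapping, assuming a hypothetical people DataFrame indexed by name and a made-up team mapping. The dictionary (or Series) maps each index label to the group it belongs to.

```python
import pandas as pd

# Hypothetical data: one score per person, indexed by name.
people = pd.DataFrame(
    {"score": [1, 2, 3, 4]},
    index=["Joe", "Steve", "Wes", "Jim"],
)

# Illustrative mapping from index label to group label.
team = {"Joe": "red", "Steve": "blue", "Wes": "red", "Jim": "blue"}

# The dict values decide which group each row lands in.
by_dict = people.groupby(team).sum()

# A Series with the same labels and values works the same way.
by_series = people.groupby(pd.Series(team)).sum()

print(by_dict)
```

Both calls produce one row per team, with the scores of its members summed.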

Grouping by function

people.groupby(len).sum()  # group rows by the length of each index label (here, people's names)

Grouping by index level

Hierarchically indexed data can be aggregated by index level: pass the level number or name via the level keyword.

hier_df.groupby(level='cty', axis=1).count()
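The sketch below counts non-NA values per column level on a made-up frame; the level names cty and tenor and the data are assumptions for illustration. Because recent pandas releases deprecate axis=1 in groupby, this version transposes the frame, groups on the row level, and transposes back, which is one possible workaround rather than the only one.

```python
import pandas as pd

# Hypothetical column MultiIndex with levels named cty and tenor.
columns = pd.MultiIndex.from_arrays(
    [["US", "US", "US", "JP", "JP"], [1, 3, 5, 1, 3]],
    names=["cty", "tenor"],
)
hier_df = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0, 5.0],
     [6.0, None, 8.0, 9.0, 10.0]],
    columns=columns,
)

# Transpose so the column levels become row levels, group on cty,
# count the non-NA values, then transpose back.
counts = hier_df.T.groupby(level="cty").count().T

print(counts)
```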

Data aggregation

Optimized groupby methods

Function name   Description
count           Number of non-NA values in the group
sum             Sum of non-NA values
mean            Mean of non-NA values
median          Arithmetic median of non-NA values
std, var        Unbiased (n-1 denominator) standard deviation and variance
min, max        Minimum and maximum of non-NA values
prod            Product of non-NA values
first, last     First and last non-NA values

For the descriptive statistics methods above, the function name can be passed to the agg method as a string. For example: grouped.agg(['mean', 'std'])

To use your own aggregation function, pass it to the aggregate or agg method:

def peak_to_peak(arr):
    return arr.max() - arr.min()
grouped.agg(peak_to_peak)

Column-wise and multiple function application. You can use different aggregation functions for different columns, or apply several functions at once.

If you pass a list of functions or function names, the columns of the resulting DataFrame are named after the corresponding functions.

If you pass a list of (name, function) tuples, the first element of each tuple is used as the column name in the resulting DataFrame.

To apply different aggregation functions to different columns, pass agg a dictionary that maps column names to functions.

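import numpy as np
grouped.agg(['mean', 'std', peak_to_peak])  # 1: list of functions
grouped.agg([('foo', 'mean'), ('bar', np.std)])  # 2: (name, function) tuples
functions = ['count', 'mean', 'max']
result = grouped[['tip', 'bill']].agg(functions)  # 3: several functions on selected columns
grouped.agg({'tip' : np.max, 'bill' : 'sum'})  # 4: dict of column name -> function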
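The four agg patterns can be tried end to end on a small made-up tips table. The data below is invented for illustration, and string function names are used throughout instead of NumPy functions, which is a choice of this sketch.

```python
import pandas as pd

# Hypothetical stand-in for the tips data set.
tips = pd.DataFrame({
    "smoker": ["No", "No", "Yes", "Yes"],
    "tip": [1.0, 3.0, 2.0, 6.0],
    "bill": [10.0, 20.0, 10.0, 30.0],
})
grouped = tips.groupby("smoker")


def peak_to_peak(arr):
    """Range of the values in one group."""
    return arr.max() - arr.min()


# A list of functions: result columns are named after the functions.
multi = grouped["tip"].agg(["mean", peak_to_peak])

# (name, function) tuples: the first element becomes the column name.
named = grouped["tip"].agg([("foo", "mean"), ("bar", "std")])

# A dict: a different aggregation for each column.
per_column = grouped.agg({"tip": "max", "bill": "sum"})

print(multi)
print(named)
print(per_column)
```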

Group-level operations and transformations

transform

transform applies a function to each group and then places the results in the appropriate locations, so the output is aligned with the original rows. If a group produces a scalar value, that value is broadcast across the rows of the group.
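A minimal sketch of the broadcasting behaviour, using an invented two-column frame: the group mean is a scalar per group, and transform repeats it on every row of that group so it can be subtracted directly.

```python
import pandas as pd

# Hypothetical data; the names key and value are illustrative.
df = pd.DataFrame({
    "key": ["a", "b", "a", "b"],
    "value": [1.0, 2.0, 3.0, 6.0],
})

# One mean per group, broadcast back onto the original rows.
group_mean = df.groupby("key")["value"].transform("mean")

# Because the result shares df's index, row-wise arithmetic just works.
demeaned = df["value"] - group_mean

print(group_mean.tolist())
print(demeaned.tolist())
```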

apply

The general form of "split-apply-combine".

tips.groupby('smoker').apply(top) calls the top function on each piece of the DataFrame. The results are then glued together with pandas.concat and labeled with the group names, so the final result has a hierarchical index whose inner level contains index values from the original DataFrame.

Disabling group keys: by default, the group keys are combined with the index of the original object to form a hierarchical index in the result. Pass group_keys=False to groupby to disable this: tips.groupby('smoker', group_keys=False).apply(top)

Calling describe on a GroupBy object is equivalent to f = lambda x: x.describe(); grouped.apply(f).
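The following sketch shows apply with and without group keys. The tips data and the body of top are assumptions made for illustration (the source only names the function), and the columns are selected explicitly before apply so that the grouping column is not passed to top.

```python
import pandas as pd

# Hypothetical stand-in for the tips data set.
tips = pd.DataFrame({
    "smoker": ["No", "No", "No", "Yes", "Yes", "Yes"],
    "tip_pct": [0.10, 0.20, 0.15, 0.30, 0.05, 0.25],
    "total_bill": [20.0, 15.0, 30.0, 10.0, 40.0, 12.0],
})


def top(df, n=2, column="tip_pct"):
    """Assumed helper: the n rows with the largest values in column."""
    return df.sort_values(by=column).tail(n)


cols = ["tip_pct", "total_bill"]

# Default: group names form the outer level of a hierarchical index.
with_keys = tips.groupby("smoker")[cols].apply(top)

# group_keys=False keeps only the original row labels.
without_keys = tips.groupby("smoker", group_keys=False)[cols].apply(top)

print(with_keys)
print(without_keys)
```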

Data aggregation tools

Pivot tables: pivot_table

A pivot table aggregates data by one or more keys and arranges the results in a rectangle, with some of the group keys along the rows and others along the columns.

tips.pivot_table('tip_pct', index=['time', 'size', 'smoker'], 
columns='day', aggfunc='mean', fill_value=0)

Parameter name  Description
values          Name(s) of the column(s) to aggregate; by default, all numeric columns are aggregated
index           Column names or other group keys to group by on the rows of the resulting pivot table
columns         Column names or other group keys to group by on the columns of the resulting pivot table
aggfunc         Aggregation function or list of functions; defaults to "mean". Can be any function valid in a groupby context
fill_value      Value used to replace missing values in the result table
margins         Add row/column subtotals and a grand total; defaults to False
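To see fill_value and margins in action, here is a small sketch on invented data. The column names follow the tips example above, but the values are made up.

```python
import pandas as pd

# Hypothetical data; there is deliberately no (Yes, Sat) combination.
tips = pd.DataFrame({
    "day": ["Fri", "Fri", "Sat", "Sat"],
    "smoker": ["No", "Yes", "No", "No"],
    "tip_pct": [0.10, 0.20, 0.30, 0.10],
})

table = tips.pivot_table(
    values="tip_pct",
    index="smoker",     # group keys along the rows
    columns="day",      # group keys along the columns
    aggfunc="mean",
    fill_value=0,       # shown where a row/column combination has no data
    margins=True,       # adds an "All" row and column
)

print(table)
```

The "All" entries are computed from the underlying rows, not by averaging the cell means, so the bottom-right cell is the mean of every tip_pct value.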

Cross-tabulation: crosstab

A cross-tabulation is a special case of a pivot table that computes group frequencies.

pd.crosstab([tips.time, tips.day], tips.smoker, margins=True)
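A minimal runnable sketch of a frequency table with two row keys and one column key, using made-up time, day and smoker values:

```python
import pandas as pd

# Hypothetical categorical data.
tips = pd.DataFrame({
    "time": ["Dinner", "Dinner", "Lunch", "Lunch", "Dinner"],
    "day": ["Fri", "Sat", "Fri", "Fri", "Sat"],
    "smoker": ["No", "Yes", "No", "Yes", "No"],
})

# Rows are (time, day) pairs, columns are smoker values, cells are counts.
freq = pd.crosstab(
    [tips["time"], tips["day"]],
    tips["smoker"],
    margins=True,  # adds "All" totals for rows and columns
)

print(freq)
```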
