SoFunction
Updated on 2024-11-18

Grouping and Aggregation in Python Pandas

Tip: in PyCharm, hover over a function and press Ctrl+Q for a quick look at its documentation, or Ctrl+P to see its basic parameters.

apply(), applymap() and map()

apply() and applymap() are DataFrame methods, and map() is a Series method.

apply() operates on one row or one column of a DataFrame at a time. applymap() operates on every element of a DataFrame. map() likewise operates on every element, but of a Series.

apply() processes the contents of a DataFrame in batches, which is more concise than writing an explicit loop. Its signature is apply(func, axis=0, ...), where func is the function to call and axis selects the direction: axis=0 passes each column to func, axis=1 passes each row.
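To make the axis argument concrete, here is a minimal sketch on a small hand-written frame. The data and the helper name value_range are made up for illustration; only DataFrame.apply and its axis parameter come from pandas.

```python
import pandas as pd

# Small deterministic frame so the results can be checked by hand.
frame = pd.DataFrame(
    {"b": [1.0, 4.0, 2.0], "d": [10.0, 30.0, 20.0]},
    index=["Utah", "Ohio", "Texas"],
)

def value_range(x):
    # x is one column (axis=0) or one row (axis=1), passed in as a Series.
    return x.max() - x.min()

col_range = frame.apply(value_range)          # axis=0 is the default: one result per column
row_range = frame.apply(value_range, axis=1)  # axis=1: one result per row

print(col_range)
print(row_range)
```

With axis=0 the result is indexed by the column names b and d; with axis=1 it is indexed by the row labels.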

map() works much like Python's built-in map(), e.g. df['one'].map(np.sqrt).
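The comparison with the built-in can be sketched as follows. The Series and its values are invented for the example. One practical difference worth noting is that Series.map keeps the index and also accepts a dict as a lookup table.

```python
import math

import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0], index=["x", "y", "z"], name="one")

# Series.map calls the function on each element and keeps the index.
roots = s.map(np.sqrt)

# The built-in map visits the same elements but returns a plain iterator.
builtin_roots = list(map(math.sqrt, s))

# Series.map also accepts a dict and uses it as a lookup table.
labels = s.map({1.0: "one", 4.0: "four", 9.0: "nine"})

print(roots)
print(builtin_roots)
print(labels)
```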

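import numpy as np
from pandas import Series, DataFrame

# 4x3 frame of random values
frame = DataFrame(np.random.randn(4, 3),
                  columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])
print(frame)
print()

# apply: f receives one column (or one row) at a time
f = lambda x: x.max() - x.min()
print(frame.apply(f))          # one value per column
print(frame.apply(f, axis=1))  # one value per row

# f may also return a Series, giving several values per column
def f(x):
    return Series([x.min(), x.max()], index=['min', 'max'])
print(frame.apply(f))
print()

print('applymap and map')
_format = lambda x: '%.2f' % x
# applymap: every element of the DataFrame (renamed DataFrame.map in pandas 2.1+)
print(frame.applymap(_format))
# map: every element of a Series
print(frame['e'].map(_format))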

Groupby

groupby is the most commonly used and most effective grouping function in pandas. It is usually combined with statistical functions such as sum(), count() and mean().

The DataFrameGroupBy object returned by the groupby method does not hold any computed result; it only records the intermediate information needed to group by df['key1']. When you apply a function or another aggregation to the grouped data, pandas uses the information recorded in the groupby object to split df into chunks quickly and returns the result.
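A small sketch of this deferred behaviour, using made-up values in place of random data: the groupby object only knows which row labels belong to which key, and numbers are computed once an aggregation is requested.

```python
import pandas as pd

df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "data1": [1.0, 2.0, 3.0, 4.0, 6.0],
})

grouped = df.groupby(df["key1"])

# Nothing has been aggregated yet; grouped records which rows belong to each key.
print(type(grouped).__name__)
group_rows = {key: list(rows) for key, rows in grouped.groups.items()}
print(group_rows)

# The computation happens here, when an aggregation is requested.
means = grouped["data1"].mean()
print(means)
```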

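df = DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                'key2': ['one', 'two', 'one', 'two', 'one'],
                'data1': np.random.randn(5),
                'data2': np.random.randn(5)})
grouped = df.groupby(df['key1'])
# numeric_only=True skips the string column key2
print(grouped.mean(numeric_only=True))

# Grouping by function: the function is called on each index label
df.groupby(lambda x: 'even' if x % 2 == 0 else 'odd').mean(numeric_only=True)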
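When a function is passed to groupby, pandas calls it on each index label rather than on the row values. The sketch below uses invented values and a hypothetical helper named parity to show which rows end up in which group.

```python
import pandas as pd

# Default integer index 0..4; the values are chosen so the two groups differ.
df = pd.DataFrame({"data1": [10.0, 20.0, 30.0, 40.0, 80.0]})

def parity(label):
    # Called once per index label (0, 1, 2, 3, 4), not per value in data1.
    return "even" if label % 2 == 0 else "odd"

by_parity = df.groupby(parity)["data1"].mean()
print(by_parity)
```

Rows 0, 2 and 4 form the "even" group and rows 1 and 3 the "odd" group.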

Aggregate agg()

After grouping, agg(func) applies the function func to one or more columns (or rows, depending on axis = 0/1) of the grouped data. For example, grouped['data1'].agg('mean') computes the mean of the data1 column within each group. agg can also act on several columns (rows) and take several functions at the same time.
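As a minimal sketch of the single-column case, with hand-picked values instead of random ones: agg accepts either the name of a built-in aggregation or a callable. The helper spread is hypothetical and only there to show the callable form.

```python
import pandas as pd

df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "data1": [1.0, 2.0, 3.0, 5.0, 6.0],
})
grouped = df.groupby("key1")

# A string names a built-in aggregation; this matches calling .mean() directly.
data1_mean = grouped["data1"].agg("mean")
same_result = grouped["data1"].mean()

def spread(values):
    # values is the data1 Series of one group.
    return values.max() - values.min()

# A callable is applied to each group's Series in turn.
data1_spread = grouped["data1"].agg(spread)

print(data1_mean)
print(data1_spread)
```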

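df = DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                'key2': ['one', 'two', 'one', 'two', 'one'],
                'data1': np.random.randn(5),
                'data2': np.random.randn(5)})
grouped = df.groupby('key1')
# Select the numeric columns; the mean of the string column key2 is undefined
print(grouped[['data1', 'data2']].agg('mean'))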

 

         data1     data2
key1
a     0.749117  0.220249
b    -0.567971 -0.126922

apply() and agg() are functionally similar. apply() is often used for per-group work such as filling in missing data or computing the top N values, and it can produce a hierarchical index.
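The two typical uses mentioned here can be sketched as follows. The data is invented, and passing group_keys=False in the first call is a choice made for this example so that the filled result keeps the original row labels.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "data1": [1.0, np.nan, 10.0, np.nan, 5.0],
    "data2": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Fill each missing value with the mean of its own group.
filled = (
    df.groupby("key1", group_keys=False)["data1"]
      .apply(lambda s: s.fillna(s.mean()))
      .sort_index()
)

# Top 2 values per group; the result carries a two-level (key1, row label) index.
top2 = df.groupby("key1")["data2"].apply(lambda s: s.nlargest(2))

print(filled)
print(top2)
```

The second result is where the hierarchical index comes from: the outer level is the group key and the inner level is the original row label.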

agg can also be given several functions at once, and they are applied to multiple columns at the same time.

df = DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                'key2': ['one', 'two', 'one', 'two', 'one'],
                'data1': np.random.randn(5),
                'data2': np.random.randn(5)})
grouped = df.groupby('key1')
print(grouped[['data1', 'data2']].agg(['sum', 'mean']))
# apply works here too, except that it takes only one function;
# otherwise the two are largely interchangeable
print(grouped.apply(lambda g: g.sum()))

         data1               data2         
           sum      mean       sum      mean
key1                                       
a     2.780273  0.926758 -1.561696 -0.520565
b    -0.308320 -0.154160 -1.382162 -0.691081


         data1     data2 key1       key2
key1                                   
a     2.780273 -1.561696  aaa  onetwoone
b    -0.308320 -1.382162   bb     onetwo

apply and agg do much the same job, but agg is more convenient when you need several functions.
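One way agg handles several functions is to take a dict that maps each column to its own list of functions. Keyword arguments of the form name=(column, function) give flat, readable column names instead. The sketch below uses made-up data, and the output names total and peak are arbitrary.

```python
import pandas as pd

df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "data1": [1.0, 2.0, 3.0, 5.0, 6.0],
    "data2": [10.0, 20.0, 30.0, 40.0, 60.0],
})
grouped = df.groupby("key1")

# Different functions for different columns; the columns become two-level.
per_column = grouped.agg({"data1": ["sum", "mean"], "data2": ["max"]})

# Named aggregation: each keyword becomes one flat output column.
named = grouped.agg(total=("data1", "sum"), peak=("data2", "max"))

print(per_column)
print(named)
```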

apply itself is very flexible. It is useful when the grouping is followed not by an aggregation but by a quick look at each group, for example with describe().

print(grouped[['data1', 'data2']].apply(lambda x: x.describe()))

 

               data1     data2
key1
a    count  3.000000  3.000000
     mean  -0.887893 -1.042878
     std    0.777515  1.551220
     min   -1.429440 -2.277311
     25%   -1.333350 -1.913495
     50%   -1.237260 -1.549679
     75%   -0.617119 -0.425661
     max    0.003021  0.698357
b    count  2.000000  2.000000
     mean  -0.078983  0.106752
     std    0.723929  0.064191
     min   -0.590879  0.061362
     25%   -0.334931  0.084057
     50%   -0.078983  0.106752
     75%    0.176964  0.129447
     max    0.432912  0.152142

In addition, apply can change the dimension of the returned data.
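A rough illustration of this point, on invented data: the shape of the result depends on what the applied function returns. A scalar gives one value per group, while a Series gives several rows per group.

```python
import pandas as pd

df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "data1": [1.0, 2.0, 3.0, 5.0, 6.0],
})
by_key = df.groupby("key1")["data1"]

# The function returns a scalar: one value per group, indexed by key1.
as_scalar = by_key.apply(lambda s: s.max() - s.min())

# The function returns a Series: each group contributes several rows,
# so the result gets a two-level (key1, statistic) index.
as_stats = by_key.apply(lambda s: pd.Series({"min": s.min(), "max": s.max()}))

# unstack() turns the inner level into columns, giving a 2x2 table.
as_table = as_stats.unstack()

print(as_scalar)
print(as_stats)
print(as_table)
```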

/pandas-docs/stable/

pandas also offers pivot tables (pivot_table) and cross-tabulations (crosstab), but I haven't used them.
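For orientation only, here is a minimal sketch of what these two functions do. It is not taken from the article; it uses made-up data with the same key1/key2 layout as the earlier examples.

```python
import pandas as pd

df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, 2.0, 3.0, 5.0, 6.0],
})

# Mean of data1 for every key1/key2 combination.
table = pd.pivot_table(df, values="data1", index="key1",
                       columns="key2", aggfunc="mean")

# How often each key1/key2 pair occurs.
counts = pd.crosstab(df["key1"], df["key2"])

print(table)
print(counts)
```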

This is the whole content of this article.