SoFunction
Updated on 2024-11-18

pandas group aggregation(agg,transform,apply)

In daily data analysis, it is often necessary to divide the data into different groups according to a certain (multiple) fields for analysis, such as the e-commerce field will be divided into the total sales of the country according to the provinces, to analyze the change of sales in each province, the social field will be divided into the users according to the portrait (gender, age), to study the use of the user's situation and preferences, and so on. In Pandas, the above data processing operations are mainly completed using groupby, this article introduces the basic principles of groupby and the corresponding agg, transform and apply operations.

1. groupby grouping

For the convenience of subsequent illustrations, 10 samples of data generated by the simulation are used, and the code and data are as follows:

import pandas as pd
import numpy as np

data = ({
  "company": ["A", "B", "C", "A", "B", "C", "A", "B", "A", "A"],
  "salary": [10000, 10000, 50000, 50000, 40000, 50000, 30000, 10000, 20000, 40000],
  "age": [25, 30, 35, 40, 45, 20, 25, 30, 30, 35]
}
)
print('data = \n', data)

The output of the above code is as follows:

data = 
  company  salary  age
0       A   10000   25
1       B   10000   30
2       C   50000   35
3       A   50000   40
4       B   40000   45
5       C   50000   20
6       A   30000   25
7       B   10000   30
8       A   20000   30
9       A   40000   35

1.1 Unilinear grouping

group = ("company")
print('group = ', group)  
print('list(group) = \n', list(group))

for index, data in group:
    print('index = ', index)
    print('data = \n', data)

The output of the above code is as follows:

group =  < object at 0x0000000001DA85E0>
list(group) = 
 [('A',   company  salary  age
0       A   10000   25
3       A   50000   40
6       A   30000   25
8       A   20000   30
9       A   40000   35), ('B',   company  salary  age
1       B   10000   30
4       B   40000   45
7       B   10000   30), ('C',   company  salary  age
2       C   50000   35
5       C   50000   20)]
index =  A
data = 
   company  salary  age
0       A   10000   25
3       A   50000   40
6       A   30000   25
8       A   20000   30
9       A   40000   35
index =  B
data = 
   company  salary  age
1       B   10000   30
4       B   40000   45
7       B   10000   30
index =  C
data = 
   company  salary  age
2       C   50000   35
5       C   50000   20

1.2 Multi-column grouping

df_gb = (['company', 'salary'])
for (index1, index2), data in df_gb:
    print((index1, index2))
    print('data = \n', data)

The output of the above code is as follows:

('A', 10000)
data = 
   company  salary  age
0       A   10000   25
('A', 20000)
data = 
   company  salary  age
8       A   20000   30
('A', 30000)
data = 
   company  salary  age
6       A   30000   25
('A', 40000)
data = 
   company  salary  age
9       A   40000   35
('A', 50000)
data = 
   company  salary  age
3       A   50000   40
('B', 10000)
data = 
   company  salary  age
1       B   10000   30
7       B   10000   30
('B', 40000)
data = 
   company  salary  age
4       B   40000   45
('C', 50000)
data = 
   company  salary  age
2       C   50000   35
5       C   50000   20

1.3 Basic principles of groupby

In pandas, the code to implement the grouping operation is very simple and requires only one line of code, where the above dataset is divided according to the company field

group = (“company”)

will get a DataFrameGroupBy object, then this generated DataFrameGroupBy is what? The result of the return is its memory address, and is not conducive to intuitive understanding, in order to see what exactly is inside the group, here to convert the group into a list of forms to see. Or for loop iteration out to see.

< object at 0x000002B7E2650240>

Converted into the form of a list, you can see that the list consists of three tuples, each tuple, the first element is the group (here is in accordance with the company for grouping, so finally divided into A, B, C), the second element is the corresponding group under the DataFrame, the entire process can be illustrated as follows:

groupby

To summarize, the process of groupby is the original DataFrame according to the groupby field (here is company), divided into a number of grouped DataFrame, is divided into as many groups as there are as many grouped DataFrames. so that after groupby a series of operations (such as agg, apply, etc.), are based on sub-DataFrame operations. So, after groupby, a series of operations (e.g. agg, apply, etc.) are based on the operations of sub-DataFrames. Understand this, it is also basically clear Pandas groupby operation of the main principles. Here to talk about the common operations after groupby.

2. groupby|agg aggregation

Aggregation is a very common operation after groupby, and those who can write SQL should be very familiar with it. Aggregate operations can be used to sum, average, maximum, minimum, etc. The following table lists the common aggregation operations in Pandas.

pandas agg

2.1 Aggregation of all data after grouping

Aggregate other columns after grouping by default

df_agg = ('company').agg(['min', 'mean', 'max'])
print(df_agg)

The output of the above code is as follows:

        salary               age          
           min   mean    max min  mean max
company                                   
A        10000  30000  50000  25  31.0  40
B        10000  20000  40000  30  35.0  45
C        50000  50000  50000  20  27.5  35

2.2 Aggregation of grouped partial columns

Certain situations, where only part of the data needs to be subjected to different aggregation operations, can be constructed from dictionaries

print(('company').agg({'age': ['min', 'mean', 'max']}))

The output of the above code is as follows:

        age          
        min  mean max
company              
A        25  31.0  40
B        30  35.0  45
C        20  27.5  35
        age           salary

print(('company').agg({'age': ['min', 'mean', 'max'], 'salary': 'min'}))

The output of the above code is as follows:

        age           salary
        min  mean max    min
company                     
A        25  31.0  40  10000
B        30  35.0  45  10000
C        20  27.5  35  50000

print(('company').agg({'salary': 'median', 'age': 'mean'}))

The output of the above code is as follows:

         salary   age
company              
A         30000  31.0
B         10000  35.0
C         50000  27.5

2.3 Diagrammatic representation of the agg polymerization process

pandas agg

3. groupby|transform

What kind of data manipulation is transform? What is the difference with agg? In order to better understand the difference between transform and agg, the following comparison from the actual application scenarios.

In the above agg, we learned how to find the average salary of employees of different companies, if now need to add a new column avg_salary in the original dataset, on behalf of the average salary of the company where the employee is located (the same company employees have the same average salary), how to achieve it? If you follow the normal steps to calculate, you need to first find the average salary of different companies, and then in accordance with the employees and the company's corresponding relationship to fill the corresponding position, do not have to transform, then, the realization of the code is as follows:

avg_salary_dict = ('company')['salary'].mean().to_dict()
data['avg_salary'] = data['company'].map(avg_salary_dict)
print('data = \n', data)

The output of the above code is as follows:

data = 
   company  salary  age  avg_salary
0       A   10000   25       30000
1       B   10000   30       20000
2       C   50000   35       50000
3       A   50000   40       30000
4       B   40000   45       20000
5       C   50000   20       50000
6       A   30000   25       30000
7       B   10000   30       20000
8       A   20000   30       30000
9       A   40000   35       30000

Only one line of code is needed if transform is used:

data['avg_salary'] = ('company')['salary'].transform('mean')
print('data = \n', data)

The output of the above code is as follows:

data = 
   company  salary  age  avg_salary
0       A   10000   25       30000
1       B   10000   30       20000
2       C   50000   35       50000
3       A   50000   40       30000
4       B   40000   45       20000
5       C   50000   20       50000
6       A   30000   25       30000
7       B   10000   30       20000
8       A   20000   30       30000
9       A   40000   35       30000

3.1 Illustration of the transform realization process

Or to illustrate the way to see the groupby transform after the realization of the process (in order to more intuitive display, the figure added the company column, the actual code in accordance with the above only salary column):

pandas

The big box in the figure is the transform and agg are not the same place, for the agg, will be calculated to get A, B, C company corresponding to the mean and return directly, but for the transform, it will be for each piece of data to get the corresponding results, the same group of samples within the same group will have the same value, the group after the mean will be in accordance with the original index of the order of the return results, if there is no understanding of the can take this picture and agg that comparison.

4. groupby|apply

apply should be an old friend, it is more flexible compared to agg and transform, able to pass any custom function to achieve complex data manipulation. In the Pandas data processing three axes - map, apply, applymap details, introduced the use of apply, then use apply after groupby and previously introduced what is the difference?

The difference is there, but the whole realization principle is basically the same. The difference between the two is that, for the groupby after the apply, after the grouping of sub-DataFrame as a parameter passed into the specified function, the basic unit of operation is DataFrame, while the basic unit of operation of the previous introduction of the apply is Series. still a case to introduce the use of the groupby after the apply.

Suppose I now need to get the data of the oldest employees of each company, how to realize it? This can be achieved with the following code:

def get_oldest_staff(x):
    df = x.sort_values(by= 'age', ascending = True)
    return [-1, :]


oldest_staff = ('company', as_index = False).apply(get_oldest_staff)
print('oldest_staff = \n', oldest_staff)

The output of the above code is as follows:

oldest_staff = 
   company  salary  age
0       A   50000   40
1       B   40000   45
2       C   50000   35

The flowchart is shown below:

apply

You can see that the apply and the principle of the role described in the previous article is basically the same, but the parameters passed into the function from the Series changed to the group DataFrame here.

Finally, about the use of apply, here is a small suggestion, although it is said that apply has more flexibility, but the running efficiency of apply will be slower than agg and transform. So, after groupby can use agg and transform to solve the problem or prioritize the use of these two methods, really can not be solved before considering the use of apply operation.

to this article on pandas grouping aggregation (agg, transform, apply) of the article is introduced to this, more related pandas grouping aggregation content, please search for my previous posts or continue to browse the following related articles I hope you will support me in the future!