Using Pandas to Implement MySQL Window Functions Solution

I. Preface

Environment:
windows 11 64 bit
Python3.9
MySQL8
pandas1.4.2

This article focuses on window functions in MySQLrow_number()、lead()/lag()、rank()/dense_rank()、first_value()、count()、sum()How to implement it using pandas and also what is the difference between the two.

Note: Python is a very flexible language, to achieve the same goal or a variety of ways, I provide just one of the solutions, you have other methods are welcome to leave a message to discuss.

II. Grammatical comparison

data sheet

The data used in this case are as follows.
The syntax for building this dataset using Python is as follows:

import pandas as pd
import numpy as np

df1 = ({ 'col1' : list(range(1,7))
                    ,'col2' : ['AA','AA','AA','BB','BB','BB']#list('AABCA')
                    ,'col3' : ['X',,'Da','Xi','Xa','xa']
                    ,'col4' : [10,5,3,5,2,None]
                    ,'col5' : [90,60,60,80,50,50]
                    ,'col6' : ['Abc','Abc','bbb','Cac','Abc','bbb']
                   })
df2 = ({'col2':['AA','BB','CC'],'col7':[1,2,3],'col4':[5,6,7]})
df3 = ({'col2':['AA','DD','CC'],'col8':[5,7,9],'col9':['abc,bcd,fgh','rst,xyy,ijk','nml,opq,wer']})

Note: You can just run the code directly from the jupyter cell. The following sections all use thedf1、df2、df3Calls the corresponding data.

The syntax for building this dataset using MySQL is as follows:

with t1 as(
  select  1 as col1, 'AA' as col2, 'X' as col3, 10.0 as col4, 90 as col5, 'Abc' as col6 union all
  select  2 as col1, 'AA' as col2, null as col3, 5.0 as col4, 60 as col5, 'Abc' as col6 union all
  select  3 as col1, 'AA' as col2, 'Da' as col3, 3.0 as col4, 60 as col5, 'bbb' as col6 union all
  select  4 as col1, 'BB' as col2, 'Xi' as col3, 5.0 as col4, 80 as col5, 'Cac' as col6 union all
  select  5 as col1, 'BB' as col2, 'Xa' as col3, 2.0 as col4, 50 as col5, 'Abc' as col6 union all
  select  6 as col1, 'BB' as col2, 'xa' as col3, null as col4, 50 as col5, 'bbb' as col6 
)
,t2 as(
  select  'AA' as col2, 1 as col7, 5 as col4 union all
  select  'BB' as col2, 2 as col7, 6 as col4 union all
  select  'CC' as col2, 3 as col7, 7 as col4 
)
,t3 as(
  select  'AA' as col2, 5 as col8, 'abc,bcd,fgh' as col9 union all
  select  'DD' as col2, 7 as col8, 'rst,xyy,ijk' as col9 union all
  select  'CC' as col2, 9 as col8, 'nml,opq,wer' as col9 
)
select * from t1;

Note: Just put the code directly into the MySQL code runtime box and run it. When you run the SQL code later, by default, you will bring the dataset (lines 1~18 of the code) and only show the query statement, such as line 19.

The correspondence is as follows:

Python datasets	MySQL Data Sets
df1	t1
df2	t2
df3	t3

row_number()

row_number()It is the calculation of row numbers for the retrieved data, incremented from 1. Typically involves grouping fields and sorting fields, with unique row numbers in each grouping.
MySQLrow_number()functions are available in Python using thegroupby()+rank()Achieve a similar effect.

groupby()For single-column aggregation, just pass in the column names, such asgroupby('col2'); in case of multiple columns, pass a list such asgroupby(['col2','col6'])。
rank()Only one column can be sorted, e.g.df.(); when there are multiple columns to sort, you can use thesort_values(['col6','col5']Sort, then aggregate, then use the accumulator functioncumcount()or sorting functionrank()。

Also, note that sorted fields with duplicate values are returned randomly in MySQL, whereas Python defaults to using theindexColumns are further sorted.
Specific examples are listed below:

1. Single-column grouping, single-column sorting
When both grouping and sorting have only one column, using thegroupby()single-column aggregation plusrank()Sorting a single column is sufficient.

multilingualism	Python	MySQL
coding	df1_1 = () df1_1[‘label’] = df1_1.groupby(‘col2’)[‘col5’].rank(ascending=False,method=‘first’) df1_1[[‘col2’,‘col5’,‘label’]]	select col2,col5,row_number()over(partition by col2 order by col5 desc) label from t1;
in the end

2、Multi-column grouping, single-column sorting
When there are multiple columns grouped together, a list is passed to thegroupby()function.

multilingualism	Python	MySQL
coding	df1_1 = () df1_1[‘label’] = df1_1.groupby([‘col2’,‘col6’])[‘col5’].rank(ascending=True,method=‘first’) df1_1[[‘col2’,‘col6’,‘col5’,‘label’]]	select col2,col6,col5,row_number()over(partition by col2,col5 order by col5) label from t1;
in the end

3、Single column grouping, multi-column sorting
If it is a multi-column sort, it is relatively complex, as follows [Python1] first use thesort_values()Order them, and then use thegroupby()aggregation, and then use therank()to add the sort order; and [Python2] is the same as [Python1] for the first 2 steps, with the final step using thecumcount()Realization numbering.

multilingualism	Python	MySQL
coding	【Python1】 df1_1 = () df1_1[‘label’] = df1_1.sort_values([‘col6’,‘col5’],ascending=[False,True]).groupby([‘col2’])[‘col2’].rank(ascending=False,method=‘first’) df1_1[[‘col2’,‘col6’,‘col5’,‘label’]] 【Python2】 df1_1 = () df1_1[‘label’] = df1_1.sort_values([‘col6’,‘col5’],ascending=[False,True]).groupby([‘col2’]).cumcount()+1 df1_1[[‘col2’,‘col6’,‘col5’,‘label’]]	select col2,col6,col5,row_number()over(partition by col2 order by col6 desc,col5) label from t1;
in the end

3、Multi-column grouping, multi-column sorting
Multi-column grouping and multi-column sorting directly in [3、Single column grouping, multi-column sorting] is based on adding multiple grouping fields to thegroupby([])The list in the No further details will be given.

lead()/lag()

lead()is to take a column value backward from the current row, which can also be interpreted as moving the specified column upward; while thelag()On the contrary, is the current row forward to take the value of the column, can also be interpreted as the specified column will be moved down.
In conjunction with the sorting, the two can be interchanged, ie:

formallead()= = in reverse orderlag()
invertedlead()= = positively orderedlag()

In Python, this can be done with theshift()function implements column value movement up and down when passed apositive numberWhen the column valuemove down, when passed anegative numberWhen the column valueupward。
Note: For single/multi-column grouping and single/multi-column sorting, refer to therow_number(), without further ado.

1. Move 1 row
To move 1 row, MySQL directly uses thelead(col1)/lag(col1)Simply, use thelead(col1,1)/lag(col1,1)It's also no problem to combine it with the lift order to realize the up and down movement of the column values.
In Python, theshift(-1)maybeshift(1)achieve the same effect. The following example combines thecol1Downshift, so useshift(-1)。

multilingualism	Python	MySQL
coding	df1_1 = () df1_1[‘col1_2’] = df1_1.groupby([‘col2’]).(-1) df1_1[[‘col2’,‘col1’,‘col1_2’]].sort_values([‘col2’,‘col1’],ascending=[True,True])	【MySQL1】 select col2,col1,lead(col1)over(partition by col2 order by col1) col1_2 from t1; 【MySQL2】 select col2,col1,lag(col1)over(partition by col2 order by col1 desc) col1_2 from t1;
in the end

2、Move multiple lines
When moving multiple rows, you need to specify the number of rows to move in MySQL, as in the following example, to move 2 rows, use thelead(col1,2)maybelag(col1,2)The column values can be moved up and down in combination with the lift order.
In Python, the changes are passed to theshift()The value of the function's argument is sufficient, as in the following example, using theshift(2)Move up 2 rows.

multilingualism	Python	MySQL
coding	df1_1 = () df1_1['col1_2'] = df1_1.groupby(['col2']). (2) # Control by shift df1_1[[‘col2’,‘col1’,‘col1_2’]].sort_values([‘col2’,‘col1’],ascending=[True,True])	【MySQL1】 select col2,col1,lead(col1,2)over(partition by col2 order by col1 desc) col1_2 from t1; 【MySQL2】 select col2,col1,lag(col1,2)over(partition by col2 order by col1) col1_2 from t1;
in the end

rank()/dense_rank()

rank()respond in singingdense_rank()Used to calculate rankings.rank()The ranking may be discontinuous, that is, when there is a duplicate value, the smaller rankings will be used side by side, while the rankings after the duplicate value will be in accordance with the number of duplicates superimposed on each other backward, such as a set of numbers (10,20,20,30), which is (1,2,2,4) in ascending order; anddense_rank()The rankings are consecutive, or the example above, which in ascending order is (1,2,2,3).
In Python, the sorting is also done by therank()function implementation, it's just that themethodcap (a poem)row_number()Used differently. Realizationrank()The effect of makingmethod='min'while realizingdense_rank()The effect of usingmethod='dense'. In addition to these two and inrow_number()void voidmethod='first'Andaveragerespond in singingmax。averageThe logic is that after all values are sorted consecutively without duplicates, the rankings of the duplicates within the grouping are averaged, or the example above, which in ascending order is (1,2.5,2.5,4).maxcap (a poem)minInstead, the ranking that is used is the one that takes the larger of the duplicate values within the grouping to sort, or the example above, which in ascending order is (1,3,3,4).
Similarly, sorted fields with duplicate values are returned randomly in MySQL, while Python defaults to using theindexColumns are further sorted.

Note: For single/multi-column grouping and single/multi-column sorting, refer to therow_number(), without further ado.
1、rank()
Python usesrank(method='min')Implementing the MySQLrank()Window Functions.

multilingualism	Python	MySQL
coding	df1_1 = () df1_1[‘label’] = df1_1.groupby([‘col2’])[‘col5’].rank(ascending=True,method=‘min’) df1_1[[‘col2’,‘col5’,‘label’]]	select col2,col5,rank()over(partition by col2 order by col5) col1_2 from t1;
in the end

2、dense_rank()
Python usesrank(method='dense')Implementing the MySQLrank()Window Functions.

multilingualism	Python	MySQL
coding	df1_1 = () df1_1[‘label’] = df1_1.groupby([‘col2’])[‘col5’].rank(ascending=True,method=‘dense’) df1_1[[‘col2’,‘col5’,‘label’]]	select col2,col5,dense_rank()over(partition by col2 order by col5) col1_2 from t1;
in the end

first_value()

Window Functions in MySQLfirst_value()is to take the first value, can be used to take the first value of the default order of the data, but also by sorting, take the maximum or minimum value of a column.
In Pandas, there are functions with the same functionalityfirst()。
However.first_value()is a window function and does not affect other fields within the form, but thefirst()is a normal function that returns only the row corresponding to the first value in the form, so in Python you have to implement thefirst_value()The same result for the window function requires that thefirst()The result of the function is then linked back to the original table via a table join (see below for an example). In Python, there is also alast()function, and thefirst()Conversely, the same effect can be achieved in combination with sorting, and thefirst()Interchangeable, readers can test for themselves without further ado.

Note: For single/multi-column grouping and single/multi-column sorting, refer to therow_number(), without further ado.
1、Take the maximum value
In MySQL, a pair ofcol5descending order, it is possible to pass thefirst_value()to get the maximum value. Similarly, in Python, usingsort_values()treat (sb a certain way)col5By performing a descending order, it is possible to pass thefirst()to obtain the maximum value, and thenmerge()Return to the original table.

multilingualism	Python	MySQL
coding	df1_1 = () df1_2 = df1_1.sort_values(['col5'],ascending=[False]).groupby(['col2']).first(). reset_index()[['col2','col5']] # It's better to add an ordering df1[[‘col2’,‘col5’]].merge(df1_2,on = ‘col2’,how = ‘left’,suffixes=(‘’,‘_2’))	select col2,col5,first_value(col5)over(partition by col2 order by col5 desc) col5_2 from t1;
in the end

2、Take the minimum value
Taking the minimum value is a change from taking the maximum value of thecol5The ordering can be changed from descending to ascending.

multilingualism	Python	MySQL
coding	df1_1 = () df1_2 = df1_1.sort_values([‘col5’],ascending=[True]).groupby([‘col2’]).first().reset_index()[[‘col2’,‘col5’]] df1[[‘col2’,‘col5’]].merge(df1_2,on = ‘col2’,how = ‘left’,suffixes=(‘’,‘_2’))	select col2,col5,first_value(col5)over(partition by col2 order by col5) col5_2 from t1;
in the end

count()/sum()

MySQL Aggregate Functionscount()cap (a poem)sum()etc., can also be addedover()Realize the effect of the window function.

count()It can be used to find the number of individuals within each grouping, or to totalize the values of a column within a grouping.
sum()It can be used to sum the values of a column in each group, or to add the values of a column in a group.

In Python, for totalizing and accumulating, you can use thegroupby()+cumcount()cap (a poem)groupby()+cumsum()is realized (examples 1 and 2 below), while for counting and summing within a group, it can be done with thegroupby()+count()cap (a poem)groupby()+sum()realization (examples 3 and 4 below).

Note: For single/multi-column grouping and single/multi-column sorting, refer to therow_number(), without further ado.
1. Cumulative in ascending order
Python usessort_values()+groupby()+cumcount()Implementing MySQLcount(<col_name>)over(partition by <col_name> order by <col_name>)Effect.

multilingualism	Python	MySQL
coding	df1_1 = () df1_1[‘col5_2’] = df1_1.sort_values([‘col5’,‘col1’],ascending=[True,False]).groupby(‘col2’).()+1 df1_1[[‘col2’,‘col5’,‘col5_2’]]	select col2,col5,count(col5)over(partition by col2 order by col5,col1) col5_2 from t1;
in the end

2. Ascending Accumulation
Python usessort_values()+groupby()+cumsum()Implementing MySQLsum(<col_name>)over(partition by <col_name> order by <col_name>)Effect.

multilingualism	Python	MySQL
coding	df1_1 = () df1_1[‘col5_2’] = df1_1.sort_values([‘col5’,‘col1’],ascending=[True,False]).groupby(‘col2’).() df1_1[[‘col2’,‘col5’,‘col5_2’]]	select col2,col5,sum(col5)over(partition by col2 order by col5,col1) col5_2 from t1;
in the end

3. Group counting
Python usessort_values()+groupby()+count()Implementing MySQLcount(<col_name>)over(partition by <col_name>)Effect.

multilingualism	Python	MySQL
coding	df1_1 = () df1_2 = df1_1.sort_values([‘col5’,‘col1’],ascending=[True,False]).groupby(‘col2’).().reset_index() df1_1[[‘col2’,‘col5’]].merge(df1_2,how=‘left’,on=‘col2’,suffixes=(‘’,‘_2’))	select col2,col5,count(col5)over(partition by col2) col5_2 from t1;
in the end

4. Summation in groups
Python usessort_values()+groupby()+sum()Implementing MySQLsum(<col_name>)over(partition by <col_name>)Effect.

multilingualism	Python	MySQL
coding	df1_1 = () df1_2 = df1_1.sort_values([‘col5’,‘col1’],ascending=[True,False]).groupby(‘col2’).().reset_index() df1_1[[‘col2’,‘col5’]].merge(df1_2,how=‘left’,on=‘col2’,suffixes=(‘’,‘_2’))	select col2,col5,sum(col5)over(partition by col2) col5_2 from t1;
in the end

III. Summary

MySQL's window function effects, in Python, basically require multiple steps and combined processing using multiple functions. Window functions involve grouping fields and sorting fields, which in Python are handled using the correspondinggroupby()cap (a poem)sort_values()So basically, the effects of window functions in Python require the use of these two functions to assist in the processing of data. The rest of the aggregated form is modified according to the characteristics of the aggregated window function, which corresponds to the following:

MySQL Window Functions	Python Correspondence Functions
row_number()	rank()
lead()/lag()	shift()
rank()/dense_rank()	rank()
first_value()	first()
count()	count()、cumcount()
sum()	sum()、cumsum()

To this point this article on the use of Pandas to achieve the solution to the MySQL window function is introduced to this article, more related to the Pandas window function content please search for my previous articles or continue to browse the following related articles I hope you will support me in the future more!