SoFunction
Updated on 2024-11-19

Computing the Cartesian product of two DataFrames in Python with a for loop

Merging two DataFrames that have no common column amounts to computing their Cartesian product: every row of one table is paired with every row of the other.

The final result is as follows
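As a small illustration of the target result, the sketch below builds two hypothetical tables (`df_a` and `df_b`, which are not the original data) and pairs their rows. It assumes pandas 1.2 or later and uses the built-in `how="cross"` option of `DataFrame.merge` instead of the loop-based functions discussed in this article.

```python
import pandas as pd

# Hypothetical sample tables with no column in common
df_a = pd.DataFrame({"letter": ["a", "b"]})
df_b = pd.DataFrame({"number": [1, 2, 3]})

# how="cross" pairs every row of df_a with every row of df_b
product = df_a.merge(df_b, how="cross")

# 2 rows x 3 rows = 6 rows: (a,1), (a,2), (a,3), (b,1), (b,2), (b,3)
print(product)
```

The result has one row per pair, so its length is always the product of the two row counts.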

The following code is adapted from other people's code:

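import pandas as pd

def cartesian_df(A, B):
    # Pair every row of A with every row of B and collect the combined rows
    rows = []
    for _, A_row in A.iterrows():
        for _, B_row in B.iterrows():
            # pd.concat joins the two rows (Series.append was removed in pandas 2.0)
            rows.append(pd.concat([A_row, B_row]))
    # Build the result once, with A's columns followed by B's
    return pd.DataFrame(rows, columns=list(A) + list(B)).reset_index(drop=True)
# Caveat: this approach fails if the two tables have duplicate column names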

This code loops over every pair of rows from the two tables, so it runs slowly: its complexity should be O(m*n), where m is the number of rows in table A and n is the number of rows in table B.

I optimized the code above because the tables I was merging had many rows and the run time was too long.

The idea is to use the DataFrame merge function: first copy table A in a loop, once per row of table B, adding the loop counter as a column, then join the copies to table B directly with merge. The loop now runs only n times, so the complexity should be O(n), where n is the number of rows in table B. The code is as follows:

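import pandas as pd

def cartesian_df(df_a, df_b):
  'Find the Cartesian product of two dataframes'
  # Copy df_a n times, tagging each copy with its loop counter.
  # assign() returns a new frame, so the caller's df_a is left unchanged.
  # (Assumes df_b has at least one row; pd.concat rejects an empty list.)
  copies = []
  for i in range(df_b.shape[0]):
    copies.append(df_a.assign(merge_index=i))
  # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
  new_df_a = pd.concat(copies, ignore_index=True)
  # Number the rows of df_b 0..n-1 on a copy, so the input is not modified
  df_b = df_b.reset_index(drop=True)
  df_b['merge_index'] = df_b.index
  # Merge on the counter, then drop the helper column
  new_df = pd.merge(new_df_a, df_b, on=['merge_index'], how='left').drop(['merge_index'], axis=1)
  return new_df

# Neither original table may already have a column named 'merge_index'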
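The same helper-column idea can be pushed one step further so that no Python loop is needed at all. The sketch below is an alternative, not the method described above: it gives both tables a constant key and merges on it. The function name `cartesian_df_constant_key`, the default key name `_cross_key`, and the sample tables are hypothetical and chosen only for illustration.

```python
import pandas as pd

def cartesian_df_constant_key(df_a, df_b, key="_cross_key"):
    """Cartesian product via a constant helper column (illustrative helper)."""
    # The helper column must not clash with an existing column name
    if key in df_a.columns or key in df_b.columns:
        raise ValueError(f"column name {key!r} is already in use")
    # assign() returns new frames, so the inputs are left unchanged
    left = df_a.assign(**{key: 1})
    right = df_b.assign(**{key: 1})
    # Every left row matches every right row because the key is constant
    return left.merge(right, on=key).drop(columns=key)

# Hypothetical sample tables
df_a = pd.DataFrame({"letter": ["a", "b"]})
df_b = pd.DataFrame({"number": [1, 2, 3]})

result = cartesian_df_constant_key(df_a, df_b)
print(result.shape)  # (6, 2)
```

Because every row carries the same key value, a single merge produces all row pairs, and the helper column is dropped before the result is returned.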

I tested with one table of 8 rows and one table of 142 rows. The pre-optimization method took 5.560689926147461 seconds.

The optimized method took 0.1296539306640625 seconds (with the 142-row table as table B).

Since the loop runs once per row of table B, passing the table with fewer rows as table B should be faster. In that test the time was 0.021603107452392578 seconds (with the 8-row table as table B).
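To make the reasoning behind these timings concrete, the short sketch below counts loop iterations only (it does not measure time) for the 8-row and 142-row tables. The variable names are illustrative.

```python
# Row counts taken from the test described above
rows_small, rows_large = 8, 142

# Nested-loop version: one step per pair of rows
nested_steps = rows_small * rows_large

# Merge-based version: one loop pass per row of table B
steps_large_as_b = rows_large
steps_small_as_b = rows_small

print(nested_steps, steps_large_as_b, steps_small_as_b)  # 1136 142 8
```

The output table has 1136 rows in every case; only the number of Python-level loop passes changes.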

This speed meets expectations: there is essentially no perceptible wait, so the optimization is complete.
