
Using pandas merge for 100x faster operations

For a non-continuous dataset, a sampling point every seven days is needed for data visualization. The requirement is to select the most recent earlier data as the current day's data to display: for example, if today is 2019-06-18 and the dataset only contains 2019-06-15, then the data from 2019-06-15 is plotted as the data for 2019-06-18.

Because there is a sampling point every seven days, the data selected for display differs from day to day. When time moves forward one day to 2019-06-19, the latest sampling point becomes 2019-06-19 and the second one 2019-06-12. An algorithm is therefore needed to quickly select (and fill in) a series of data based on the current date for visualization purposes.

A very straightforward implementation:

1. First, generate a target time series from a certain start date to today, one date every seven days.

2. Map each of these dates to a date that actually exists in the dataset, e.g. {"2019-06-18": "2019-06-15", ...}.

3. Pull the corresponding rows out of the dataset according to the map and concatenate them.

The code is as follows:

import pandas as pd

target_dates = pd.date_range(end=now, periods=100, freq="7D")
full_dates = pd.date_range(start, now).tolist()
org_dates = set(df["date"])  # dates that actually exist in the dataset

# Map every calendar day to the most recent date present in the dataset
date_map = {}
last_date = None
for d in full_dates:
    if d in org_dates:
        date_map[d] = d
        last_date = d
    elif last_date is not None:
        date_map[d] = last_date
    else:
        continue

# Pull the rows for each target date and concatenate them one at a time
new_df = pd.DataFrame()
for td in target_dates:
    if td in date_map:  # skip target days before the first data point
        new_df = pd.concat([new_df, df[df["date"] == date_map[td]]])

This algorithm took about ten minutes to process a dataset approaching ten million rows. Closer inspection revealed that the amount of dataframe data copied at each concatenation is not small, and the total number of concatenations reaches tens of thousands.

So the question became how to avoid such high-frequency concatenation of dataframes.
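A common first step is to avoid growing the dataframe inside the loop at all: collect the per-day slices in a plain Python list and concatenate a single time at the end. A minimal sketch, reusing df, date_map and target_dates from the code above:

import pandas as pd

# Gather the slices first, then concatenate once at the end;
# this avoids re-copying the accumulated result on every iteration.
chunks = [df[df["date"] == date_map[td]] for td in target_dates if td in date_map]
new_df = pd.concat(chunks, ignore_index=True)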

Eventually an elegant way emerged: simply modify step 3 above by converting the map of dates into a dataframe, then do a single merge operation with the original dataset.

import pandas as pd

target_dates = pd.date_range(end=now, periods=100, freq="7D")
full_dates = pd.date_range(start, now).tolist()
org_dates = set(df["date"])  # dates that actually exist in the dataset

# Map every calendar day to the most recent date present in the dataset
date_map = {}
last_date = None
for d in full_dates:
    if d in org_dates:
        date_map[d] = d
        last_date = d
    elif last_date is not None:
        date_map[d] = last_date
    else:
        continue

#### main change is from here #####
# Turn the date map into a dataframe, then join it against the dataset once
date_map_list = []
for td in target_dates:
    if td in date_map:  # skip target days before the first data point
        date_map_list.append({"target_date": td, "org_date": date_map[td]})
date_map_df = pd.DataFrame(date_map_list)
new_df = date_map_df.merge(df, left_on=["org_date"], right_on=["date"], how="inner")

After the improvement, all the loop operations work on tiny amounts of data, and the single final merge retrieves all the useful rows; the runtime dropped to around 5 seconds, a huge performance gain.
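As an aside, pandas also ships a built-in for exactly this kind of "most recent earlier value" lookup: pd.merge_asof. A minimal sketch under the same assumptions as the code above (a dataframe df with a datetime "date" column, and now defined earlier); merge_asof requires both sides sorted on their keys:

import pandas as pd

# Sketch: build the target -> source date map with pd.merge_asof on the
# unique dataset dates, then pull all matching rows with a single merge.
org_dates = df[["date"]].drop_duplicates().sort_values("date")
targets = pd.DataFrame({"target_date": pd.date_range(end=now, periods=100, freq="7D")})
date_map_df = pd.merge_asof(
    targets, org_dates,
    left_on="target_date", right_on="date",
    direction="backward",  # take the most recent date at or before each target
)
new_df = date_map_df.dropna(subset=["date"]).merge(df, on="date", how="inner")

This removes the manual mapping loop entirely; whether it outperforms the hand-rolled map depends on the pandas version and the data layout.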

Addendum: a pitfall of merging Pandas DataFrames with merge (duplicate join keys)

The pitfalls I've encountered in actual development, together with the documentation I've consulted, are summarized as follows:

left = pd.DataFrame({'A': [1, 2], 'B': [1, 2]})

right = pd.DataFrame({'A': [4, 5, 6], 'B': [2, 2, 2]})

result = pd.merge(left, right, on='B', how='outer')

Warning: joining/merging on duplicate keys may return a frame that is a multiplication of the row dimensions, which can lead to memory overflow. Deduplicate key values before joining large DataFrames.

Check for duplicate keys
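You can check for duplicated keys directly before merging; a minimal sketch against the frames above:

# Quick duplicate-key checks before a merge
print(left['B'].is_unique)            # True -> safe as the "one" side
print(right['B'].duplicated().sum())  # number of duplicated key values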

Alternatively, if you know that the DataFrame on the right contains duplicate keys but want to make sure the DataFrame on the left has none, you can pass the validate='one_to_many' parameter; as long as the left keys are unique, no exception is raised.

pd.merge(left, right, on='B', how='outer', validate="one_to_many")

# Printed result:
   A_x  B  A_y
0    1  1  NaN
1    2  2  4.0
2    2  2  5.0
3    2  2  6.0

Parameters:

validate : str, optional
If specified, checks if merge is of specified type.

“one_to_one” or “1:1”: check if merge keys are unique in both left and right datasets.
“one_to_many” or “1:m”: check if merge keys are unique in left dataset.
“many_to_one” or “m:1”: check if merge keys are unique in right dataset.
“many_to_many” or “m:m”: allowed, but does not result in checks.
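For completeness: when the specified check fails, merge raises a MergeError rather than silently multiplying rows. A minimal sketch (the exact message wording varies across pandas versions):

import pandas as pd
from pandas.errors import MergeError

left = pd.DataFrame({'A': [1, 2], 'B': [2, 2]})        # duplicate keys on the left
right = pd.DataFrame({'A': [4, 5, 6], 'B': [2, 2, 2]})

try:
    pd.merge(left, right, on='B', how='outer', validate="one_to_many")
except MergeError as err:
    print(err)  # the left keys are not unique, so the one_to_many check fails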

Official documentation link:

The Pandas documentation on merge

The above is my personal experience; I hope it can serve as a reference for you, and I hope you will continue to support me. If there is any mistake or anything not fully considered, please do not hesitate to point it out.