I am trying to perform some aggregations, ideally in parallel.
I use fastai’s parallel function quite frequently, but this is my first time using it with pandas.
A reproducible example:
from functools import partial
from fastcore.parallel import parallel
import seaborn as sns
import pandas as pd
df = sns.load_dataset('tips')
df.head()
# Some example aggregation function
def aggreg(dataf: pd.DataFrame, group_list: list, measure: str, measure_preds: str) -> pd.DataFrame:
    dataf = (dataf
        .groupby(group_list)
        .agg(
            measure = (measure, "sum"),
            measure_preds = (measure_preds, "sum"))
        .reset_index()
    )
    return print(dataf.head())
# Use partial because fastai's parallel lets us vary only one parameter
fun_to_paral = partial(aggreg, dataf = df, measure = "tip", measure_preds = "total_bill")
parallel(
    fun_to_paral,
    # List of lists, each giving one grouping to use
    [["sex"], ["sex", "day"], ["sex", "smoker", "day"]],
    n_workers=1,
    progress=True,
    threadpool=True,
)
I am getting the error: aggreg() got multiple values for argument 'dataf'
However, I do not see why this is happening.
How can I make this work with fastai's parallel, or parallelize it elegantly using a different framework?
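To narrow the problem down, here is a minimal sketch that reproduces the same message using only the standard library. It assumes (not verified against the fastcore source) that parallel hands each item to the function as its first positional argument. The names aggreg_stub and run_one are hypothetical stand-ins, and the string "df" takes the place of the real DataFrame. The sketch also shows one possible workaround with concurrent.futures.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


# Hypothetical stand-in with the same parameter order as aggreg() above;
# it only reports how its arguments were bound.
def aggreg_stub(dataf, group_list, measure, measure_preds):
    return (dataf, tuple(group_list), measure, measure_preds)


groupings = [["sex"], ["sex", "day"], ["sex", "smoker", "day"]]

# Same pattern as in the question: dataf is fixed by keyword.
bound = partial(aggreg_stub, dataf="df", measure="tip", measure_preds="total_bill")

# A positional call fills the first parameter (dataf), which the partial
# already supplies by keyword, so Python raises TypeError.
try:
    bound(groupings[0])
    error_message = None
except TypeError as exc:
    error_message = str(exc)


# One possible workaround: a small wrapper that forwards the item by keyword.
def run_one(group_list):
    return bound(group_list=group_list)


# Run the wrapper over all groupings in a thread pool.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(run_one, groupings))
```

If that assumption about positional passing holds, the same wrapper could presumably be given to parallel in place of fun_to_paral. Another option would be to make group_list the first parameter of aggreg so that the positional item lands on it.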