I would like to ask how to utilize fastai's parallel function when working with pandas data frames

petrhrobar · October 22, 2021, 6:43pm

I am trying to run some aggregations, ideally in parallel.
I use fastai's parallel function quite often, but this is my first time using it with pandas.

Here is a minimal example:

from functools import partial
from fastcore.parallel import parallel

import seaborn as sns
import pandas as pd

df = sns.load_dataset('tips')
df.head()

# Some example aggregation function

def aggreg(dataf: pd.DataFrame, group_list: list, measure: str, measure_preds: str) -> pd.DataFrame:
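    # Sum both measures within each group and flatten the group keys back into columns
    dataf = (
        dataf
        .groupby(group_list)
        .agg(
            measure=(measure, "sum"),
            measure_preds=(measure_preds, "sum"),
        )
        .reset_index()
    )

    # Show a preview, then return the frame so the annotated return type holds
    print(dataf.head())
    return dataf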

# Use partial because fastai's parallel lets us vary only one parameter
fun_to_paral = partial(aggreg, dataf = df, measure = "tip", measure_preds = "total_bill")

parallel(
    fun_to_paral,
    # List of lists: the groupings to use, one per call
    [["sex"], ["sex", "day"], ["sex", "smoker", "day"]],
    n_workers=1,
    progress=True,
    threadpool=True,
)

I am getting an error that aggreg() got multiple values for argument 'dataf'. However, I do not see why this might be happening.

I would like to ask how to make this work within fastai, or how to elegantly parallelize it using a different framework.
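
For reference, here is a minimal sketch of what seems to be going on, under a few assumptions. As far as I can tell, parallel hands each item to the function as its first positional argument. Since dataf is the first parameter of aggreg and is already bound by keyword in the partial, the grouping list lands in the dataf slot as well, which would explain the "multiple values" error. The sketch below reproduces that with a plain call and then avoids it by putting the varying argument first. The names toy, aggreg_by and aggreg_original_order are made up for illustration, the toy frame stands in for the tips dataset, and the standard library's ThreadPoolExecutor is swapped in for fastai's parallel so the example has no extra dependencies.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial

import pandas as pd

# Hypothetical stand-in for the tips dataset (made-up values, illustration only).
toy = pd.DataFrame({
    "sex": ["F", "M", "F", "M"],
    "day": ["Sat", "Sat", "Sun", "Sun"],
    "tip": [1.0, 2.0, 3.0, 4.0],
    "total_bill": [10.0, 20.0, 30.0, 40.0],
})


def aggreg_by(group_list, dataf, measure, measure_preds):
    """Same aggregation as aggreg, but the varying argument comes first."""
    return (
        dataf
        .groupby(group_list)
        .agg(measure=(measure, "sum"), measure_preds=(measure_preds, "sum"))
        .reset_index()
    )


def aggreg_original_order(dataf, group_list, measure, measure_preds):
    """Mirrors the parameter order of aggreg from the question."""
    return aggreg_by(group_list, dataf, measure, measure_preds)


# Reproduce the error: dataf is bound by keyword, then the item arrives
# positionally and is assigned to dataf a second time.
bad = partial(aggreg_original_order, dataf=toy, measure="tip", measure_preds="total_bill")
try:
    bad(["sex"])
    error_message = None
except TypeError as err:
    error_message = str(err)

# Workaround: with group_list first, the positional item goes where it belongs.
good = partial(aggreg_by, dataf=toy, measure="tip", measure_preds="total_bill")
groupings = [["sex"], ["sex", "day"]]

# Standard-library thread pool, used here in place of fastai's parallel.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(good, groupings))

for frame in results:
    print(frame)
```

If the assumption about positional passing holds, the reordered partial should also be usable as the first argument to parallel, but I have not verified that here. Wrapping the original function instead, for example lambda g: aggreg(df, g, "tip", "total_bill"), should be fine with a thread pool, though a lambda may not survive pickling if a process pool is used.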
