Adding a custom sampler to databunch

Hi!

I’m trying to use a custom weighted sampler for my classifier because of imbalanced data. Is there any way to do that? I thought about passing **dl_kwargs to the databunch() factory method, but that has two problems:

  • shuffle is forced to True for train_dl and to False for the other dataloaders, which prevents me from setting it manually to False for train_dl (the same argument can’t be passed twice)
  • I can’t pass a sampler that will only be used for train_dl, since **dl_kwargs is passed to all dataloaders

I assume I’ll have to create the dataloaders myself and use DataBunch.__init__, but I wanted to be sure I wasn’t missing something simpler first.

Thanks!


You can change any dataloader with the new method:

data.train_dl = data.train_dl.new(shuffle=False, sampler=my_sampler)
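For context, here is a minimal sketch (plain Python, hypothetical helper name) of how per-sample weights for a weighted sampler are usually computed from the training labels — these are the weights you would then hand to something like torch.utils.data.WeightedRandomSampler as my_sampler:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Return one weight per sample, inversely proportional to the
    frequency of that sample's class, so rare classes are drawn
    about as often as common ones."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]

# Toy imbalanced label list: four samples of class 0, one of class 1.
weights = inverse_frequency_weights([0, 0, 0, 0, 1])
# Each class ends up with equal total weight: 4 * 0.25 == 1 * 1.0
```

With real data you would build labels from the training set and pass the resulting weights (plus a num_samples and replacement=True) to the sampler constructor.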

Oh perfect, thanks!

@sgugger I’m having this exact problem too, and I like this solution, but I’m not sure how I would give my data to the new dataloader. I believe I need to get the PyTorch dataset out of the fastai data block.

data = (TabularList.from_df(df, path=path, cat_names=cat_vars, cont_names=cont_vars, procs=procs,)
        .split_by_idx(valid_idx)
        .label_from_df(cols=dep_var, label_cls=CategoryList)
        .databunch()
        )

I have tried removing the .databunch() from the above and then doing the following:

train_dataset = data.train
data = data.databunch()
data.train_dl = data.train_dl.new(train_dataset, shuffle=False, sampler=sampler)

but I get the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-24-1e5ac58e2822> in <module>
----> 1 data.train_dl = data.train_dl.new(train_dataset, shuffle=False, sampler=sampler)

TypeError: new() takes 1 positional argument but 2 were given

I also think it might be easier just to build the PyTorch datasets and dataloaders with PyTorch and then create a databunch from them. However, with this option I’m not sure how to create the initial PyTorch datasets since I can’t find how Fastai does it for tabular problems.

Thanks for the help.

@florobax How did you end up solving this?

The method new doesn’t need a dataset to be specified, so you just need to do something like:

data = (TabularList.from_df(df, path=path, cat_names=cat_vars, cont_names=cont_vars, procs=procs,)
        .split_by_idx(valid_idx)
        .label_from_df(cols=dep_var, label_cls=CategoryList)
        .databunch()
        )
data.train_dl = data.train_dl.new(shuffle=False, sampler=sampler) 

Fastai automatically creates a new dataloader that uses the same dataset but with the modified keyword arguments you pass.
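As a rough mental model (not fastai’s actual implementation), new follows a common pattern: keep the dataloader’s original construction arguments, merge in the overrides, and rebuild around the same dataset. A toy sketch with hypothetical names:

```python
class LoaderLike:
    """Toy stand-in for a dataloader wrapper whose `new` method
    rebuilds it with the same dataset but overridden kwargs."""
    def __init__(self, dataset, **kwargs):
        self.dataset = dataset
        self.kwargs = kwargs

    def new(self, **overrides):
        # Same dataset; stored kwargs updated by the overrides.
        merged = {**self.kwargs, **overrides}
        return LoaderLike(self.dataset, **merged)

dl = LoaderLike(dataset=[1, 2, 3], shuffle=True, batch_size=64)
dl2 = dl.new(shuffle=False, sampler="my_sampler")
# dl2 keeps batch_size=64 and the same dataset, with shuffle
# overridden to False and the sampler added.
```

This is why no dataset argument is needed: the override-only call is the whole interface.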


Wow, thanks for the fast reply @florobax!

If I try that I get this error when calling learn.fit_one_cycle():

AssertionError: Your training dataloader is empty, can't train a model.
        Use a smaller batch size (batch size=64 for 338562 elements).

Also, is there a way to get the PyTorch dataset since the sampler I want to use (https://github.com/ufoym/imbalanced-dataset-sampler) requires it: sampler=ImbalancedDatasetSampler(train_dataset)?

Well, I don’t know why it says your dataloader is empty while at the same time reporting that it has 338562 elements. Did you try a smaller batch size as it suggests?
If you want to give it the dataset, you can access it through data.train_ds.
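For what it’s worth, the sampler linked above essentially draws dataset indices with probability inversely proportional to each class’s frequency. A seeded, pure-Python sketch of that idea (hypothetical names, no torch dependency):

```python
import random
from collections import Counter

def sample_balanced_indices(labels, num_samples, seed=0):
    """Draw dataset indices with replacement, weighting each index
    inversely to its class frequency — the core idea behind an
    imbalanced-dataset sampler."""
    counts = Counter(labels)
    weights = [1.0 / counts[label] for label in labels]
    rng = random.Random(seed)
    return rng.choices(range(len(labels)), weights=weights, k=num_samples)

labels = [0] * 9 + [1]          # heavily imbalanced toy labels
idxs = sample_balanced_indices(labels, num_samples=1000)
# Roughly half the draws should hit the single minority sample (index 9),
# since its weight (1.0) equals the combined weight of the nine others.
```

The real ImbalancedDatasetSampler needs the dataset object because it reads the labels from it to build these weights, which is why data.train_ds is what you want to hand it.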

Whoops. I had my sampler parameters messed up. Thanks for the help.
