I created a databunch using the code below. The dataframe consists of ~1500 rows, split equally across two classes ("positive", "negative") in the label column.
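For reference, here is a hypothetical minimal stand-in for sem_df_trn and sem_df_val (my actual text is different, but the shape and the 50/50 label split are the same):

import pandas as pd

# ~1500 rows with an equal label split, shuffled so the classes interleave
df = pd.DataFrame({
    "label": ["positive"] * 750 + ["negative"] * 750,
    "text": ["an example sentence"] * 1500,
})
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
sem_df_trn, sem_df_val = df.iloc[:1200], df.iloc[1200:]

The databunch is created like this: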
data_clas = TextClasDataBunch.from_df(path=".", train_df=sem_df_trn[["label", "text"]], valid_df=sem_df_val[["label", "text"]], vocab=data_lm.vocab, bs=100)
a = data_clas.one_batch()  # returns an (x, y) tuple for one training batch
print(a[1])         # the label tensor; every element is the same class
print(a[1].sum())   # counts how many class-1 labels are in the batch
print(a[1].size())  # torch.Size([100]), matching bs=100
My dataframe is shuffled: if I do a df.head(100), I see that both classes are represented. But the batch coming out of the databunch does not have equal representation of positive and negative. If I increase the batch size to a very large number, I see that the batch is mostly positives at the beginning followed by mostly negatives towards the end. Any idea what is going on here?
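Here is a sketch of how I'm checking this (it assumes the two classes are encoded as 0/1 in the label tensor, which is what one_batch() gives me here):

# class balance in the first 100 rows of the shuffled dataframe
print(sem_df_trn.head(100)["label"].value_counts())

# per-batch balance: y.sum() counts class-1 labels, so a balanced
# batch of 100 should sum to roughly 50
for i, (x, y) in enumerate(data_clas.train_dl):
    print(f"batch {i}: {int(y.sum())}/{y.size(0)} labels are class 1")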
fastai version: 1.0.60
torch: 1.4.0
OS: Debian Stretch