I created a databunch using the code below. The dataframe consists of ~1500 rows, split equally across two classes ("positive", "negative") in the label column.
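For reference, here is a hypothetical minimal stand-in for sem_df_trn and sem_df_val (my actual text is different, but the shape and the 50/50 label split are the same):

import pandas as pd

# ~1500 rows with an equal label split, shuffled so the classes interleave
df = pd.DataFrame({
    "label": ["positive"] * 750 + ["negative"] * 750,
    "text": ["an example sentence"] * 1500,
})
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
sem_df_trn, sem_df_val = df.iloc[:1200], df.iloc[1200:]

The databunch is created like this: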
data_clas = TextClasDataBunch.from_df(path=".", train_df=sem_df_trn[["label", "text"]], valid_df=sem_df_val[["label", "text"]], vocab=data_lm.vocab, bs=100)
a = data_clas.one_batch()  # returns an (x, y) tuple for one training batch
print(a[1])         # the label tensor; every element is the same class
print(a[1].sum())   # counts how many class-1 labels are in the batch
print(a[1].size())  # torch.Size([100]), matching bs=100
My dataframe is shuffled: if I do a df.head(100), I see that both classes are represented. But the batch coming out of the databunch does not have equal representation of positive and negative. If I increase the batch size to a very large number, I see that the batch is mostly positives at the beginning followed by mostly negatives towards the end. Any idea what is going on here?
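Here is a sketch of how I'm checking this (it assumes the two classes are encoded as 0/1 in the label tensor, which is what one_batch() gives me here):

# class balance in the first 100 rows of the shuffled dataframe
print(sem_df_trn.head(100)["label"].value_counts())

# per-batch balance: y.sum() counts class-1 labels, so a balanced
# batch of 100 should sum to roughly 50
for i, (x, y) in enumerate(data_clas.train_dl):
    print(f"batch {i}: {int(y.sum())}/{y.size(0)} labels are class 1")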
fastai version: 1.0.60
torch: 1.4.0
OS: Debian Stretch