Wrong target y when loading test set with TextClasDataBunch

Harudori · January 3, 2019, 4:03pm

Hi everyone !

I’m working on a simple binary text classification problem using text_classifier_learner.
The original full dataset file is named ‘full_dataset.csv’.
I have already split this dataset into train, valid and test DataFrames.
Each set has 2 columns: Target and Text (Target being the first column in each set).
The target is either 0 or 1, and both values appear in the target column of each set.
I loaded my data as follows:

data_lm = TextLMDataBunch.from_csv(PATH, 'full_dataset.csv')
data_clas = TextClasDataBunch.from_df(PATH, df_train, df_valid, df_test, vocab=data_lm.train_ds.vocab, bs=32)

I’m new to both deep learning and fastai, and I don’t know why data_clas.test_ds.y contains only Category 1.
Has anyone else run into the same problem?

PS: data_clas.train_ds, data_clas.valid_ds and data_clas.test_ds.x look normal; they all match what I have in the original sets.

Thanks in advance

sgugger · January 3, 2019, 4:27pm

The test dataset is unlabeled in fastai, so it is given a fake label that is repeated for every item (in your version of fastai this is the first label in the training set; it will soon be something called EmptyLabel). If you want to validate on two different datasets, you should create two DataBunch objects, one for each of them.
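
A quick way to see the placeholder labels described above is to compare label counts in the source DataFrame with those in the loaded dataset. The sketch below is illustrative only: `label_counts` is a hypothetical helper, and the two lists are made-up stand-ins for the Target column of df_test and for the labels found in data_clas.test_ds.y.

```python
from collections import Counter


def label_counts(labels):
    """Count how often each label occurs and return a plain dict."""
    return dict(Counter(labels))


# Made-up stand-ins for illustration, not data from this thread:
source_labels = [0, 1, 1, 0, 1]  # e.g. the Target column of df_test
loaded_labels = [1, 1, 1, 1, 1]  # e.g. the labels seen in data_clas.test_ds.y

print(label_counts(source_labels))
print(label_counts(loaded_labels))
```

If the source has both classes but the loaded test set reports a single one, the labels were most likely replaced by the placeholder rather than read from the DataFrame.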

Harudori · January 3, 2019, 4:29pm

Oh okay ! Will do
Thanks for the quick reply

Harudori · January 10, 2019, 1:37pm

So now I have :

data_lm = TextLMDataBunch.from_csv(PATH, 'full_dataset.csv')
data_clas = TextClasDataBunch.from_df(PATH, train_df=df_train, valid_df=df_valid, vocab=data_lm.train_ds.vocab, bs=32)
data_clas_test = TextClasDataBunch.from_df(PATH, train_df=df_train, valid_df=df_test, vocab=data_lm.train_ds.vocab, bs=32)

learn = text_classifier_learner(data_clas, drop_mult=0.5)
learn.loss_func = nn.CrossEntropyLoss()
learn.fit(10)

I would like to get all the predictions and targets for the valid_ds of data_clas_test using the model I trained with data_clas.
I don’t know if there is a function to do that. I tried get_preds(), but it only returns the predictions and targets for the train or valid dataset of the DataBunch the learner was created with (data_clas, not data_clas_test).
I also tried validate(), but what I want is all the predictions for my test set.

For now I have found a workaround: going through df_test manually and predicting line by line, but it’s rather slow.
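
One possible way to avoid the line-by-line loop, sketched under assumptions rather than confirmed in this thread: in fastai v1, get_preds() evaluates the validation set of whatever DataBunch the learner currently holds, so temporarily pointing the learner at data_clas_test should return batched predictions and targets for df_test. The helper name `get_preds_on` is hypothetical, and it only relies on the learner exposing a `data` attribute and a `get_preds()` method; please check it against your installed fastai version.

```python
def get_preds_on(learn, other_data):
    """Return (predictions, targets) for the validation set of `other_data`.

    Hypothetical helper. Assumes `learn` has a `data` attribute and a
    `get_preds()` method that, by default, runs on the validation set of
    `learn.data` (the fastai v1 default).
    """
    original_data = learn.data      # remember the DataBunch used for training
    learn.data = other_data         # point the learner at the held-out DataBunch
    try:
        return learn.get_preds()    # batched inference on other_data's valid set
    finally:
        learn.data = original_data  # always restore the original DataBunch


# Usage with the objects from the post above (not run here):
# preds, targets = get_preds_on(learn, data_clas_test)
# predicted_classes = preds.argmax(dim=1)
```

Note that text batches are typically sorted by length, so the rows may not come back in the order of df_test, although predictions and targets stay aligned with each other. Some fastai v1 releases accept an ordered=True argument to get_preds() for text learners; treat that as an assumption to verify for your version.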