Hi everyone!
I’m working on a simple binary text classification task using text_classifier_learner.
The original full dataset file is named ‘full_dataset.csv’.
I have already split this dataset into Train, Valid and Test DataFrames.
Each set has 2 columns: Target and Text (Target is the first column in each dataset).
The target is either 0 or 1, and each set contains both values in its target column.
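A quick way to double-check that claim about the splits is to count the labels in each one. This is only a minimal sketch: the helper names `class_counts` and `has_both_classes` are made up for illustration, and the lists below are toy stand-ins for the real Target columns.

```python
from collections import Counter

def class_counts(targets):
    """Count how often each label occurs in an iterable of targets."""
    return Counter(targets)

def has_both_classes(targets, classes=(0, 1)):
    """Return True if every expected class appears at least once."""
    counts = class_counts(targets)
    return all(counts[c] > 0 for c in classes)

# Toy stand-ins for the Target columns of the three splits (made-up values).
splits = {
    "train": [0, 1, 1, 0, 1],
    "valid": [1, 0, 0],
    "test": [0, 1],
}

for name, targets in splits.items():
    # Print the per-class counts and whether both classes are present.
    print(name, dict(class_counts(targets)), has_both_classes(targets))
```

With the real data, the same helpers should accept a column such as `df_train['Target']` directly, since they only need an iterable of labels.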
I loaded my data as follows:
data_lm = TextLMDataBunch.from_csv(PATH, 'full_dataset.csv')
data_clas = TextClasDataBunch.from_df(PATH, df_train, df_valid, df_test, vocab=data_lm.train_ds.vocab, bs=32)
I’m new to both deep learning and fastai, and I don’t know why there is only Category 1 in my data_clas.test_ds.y.
Has anyone else had the same problem?
PS: data_clas.train_ds, data_clas.valid_ds and data_clas.test_ds.x look normal; they all match what I have in the original sets.
Thanks in advance
The test dataset is unlabeled in fastai, so every item in it gets the same fake label (in your version of fastai this is the first label in the training set; it will soon be something called EmptyLabel). If you want to validate on two different datasets, you should create two DataBunch objects, one for each of them.
Oh okay! Will do.
Thanks for the quick reply.
So now I have:
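from fastai.text import *  # fastai v1: provides the text DataBunch classes, text_classifier_learner and nn
# The language-model data supplies the vocabulary shared by both classifier DataBunches
data_lm = TextLMDataBunch.from_csv(PATH, 'full_dataset.csv')
# Train/valid split used for training
data_clas = TextClasDataBunch.from_df(PATH, train_df=df_train, valid_df=df_valid, vocab=data_lm.train_ds.vocab, bs=32)
# Same training set, with the test set placed in the validation slot so it keeps its labels
data_clas_test = TextClasDataBunch.from_df(PATH, train_df=df_train, valid_df=df_test, vocab=data_lm.train_ds.vocab, bs=32)
learn = text_classifier_learner(data_clas, drop_mult=0.5)
learn.loss_func = nn.CrossEntropyLoss()
learn.fit(10)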
And I would like to get all the predictions and targets of the valid_ds of data_clas_test using the model I trained with data_clas.
I don’t know if there is a function to do that. I tried get_preds(), but it only returns the predictions and targets of the “preloaded” train or valid dataset (those of data_clas, not data_clas_test).
I also tried validate(), but I want to get all the predictions for my test set.
For now I found a workaround by going manually through my df_test and predicting line by line, but it’s kind of slow.
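The line-by-line workaround can usually be sped up by scoring several texts per call instead of one. Here is a minimal sketch of that idea, assuming a hypothetical `predict_batch` callable that takes a list of texts and returns one label per text. It is not a fastai function, and `dummy_predict_batch` is only a toy stand-in for a trained model.

```python
from typing import Callable, Iterator, List, Sequence

def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def predict_all(
    texts: Sequence[str],
    predict_batch: Callable[[Sequence[str]], List[int]],
    batch_size: int = 32,
) -> List[int]:
    """Run `predict_batch` over `texts` chunk by chunk, keeping the input order."""
    preds: List[int] = []
    for chunk in batched(texts, batch_size):
        preds.extend(predict_batch(chunk))
    return preds

def dummy_predict_batch(chunk: Sequence[str]) -> List[int]:
    """Hypothetical stand-in for a model: label 1 if the text contains 'good'."""
    return [1 if "good" in text else 0 for text in chunk]

texts = ["good film", "bad film", "really good", "awful"]
print(predict_all(texts, dummy_predict_batch, batch_size=3))
```

With the real data, the texts could be passed as a plain list, for example `df_test['Text'].tolist()`, and the resulting predictions compared against `df_test['Target']`.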