Fastai V1 and multilabel (Ulmfit)

Hi everyone, I’ve been trying to use ULMFiT for the multilabel case with fastai v1. I ran into many problems during my implementation, most of which I overcame, and some I’m still not sure about, so I thought it could be interesting to share them here, since I assume I’m not the only one trying to do this. My data comes in the form of a big CSV file with n_labels label columns followed by 2 columns of text.

My preprocessing splits my data into 3 files: train.csv and val.csv, which have the same shape, and unsup.csv, which contains unlabeled data and only has the 2 text columns.

I start by tokenizing my dataset. To do so, I create some specific rules and rewrite the tokenize method of the TextDataset class, because I want some different behaviour. Then I call TextDataset.from_csv on my train.csv, which tokenizes the training set; I take the resulting vocabulary and use it to tokenize and numericalize val.csv and unsup.csv.
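To make the vocabulary-sharing step concrete, here is a minimal plain-Python sketch of the idea (not the actual fastai API, whose Tokenizer and Vocab classes do more): the vocabulary is built once on the training set, then reused to numericalize the validation and unsupervised sets so that ids stay consistent, with unseen tokens mapped to an unknown token.

```python
# Illustrative sketch only: build a vocab on the training tokens, then reuse
# it to numericalize other splits. The token names mimic fastai's specials.
from collections import Counter

UNK, PAD = "xxunk", "xxpad"

def build_vocab(tokenized_texts, max_vocab=60000, min_freq=2):
    # Count token frequencies over the training corpus only.
    freq = Counter(tok for text in tokenized_texts for tok in text)
    itos = [UNK, PAD] + [t for t, c in freq.most_common(max_vocab) if c >= min_freq]
    return itos

def numericalize(tokenized_texts, itos):
    # Map tokens to ids; out-of-vocabulary tokens fall back to the UNK id (0).
    stoi = {t: i for i, t in enumerate(itos)}
    return [[stoi.get(tok, 0) for tok in text] for text in tokenized_texts]

train_toks = [["the", "cat", "sat"], ["the", "dog", "sat"]]
val_toks = [["the", "bird", "sat"]]  # "bird" never appears in training

itos = build_vocab(train_toks, min_freq=1)
train_ids = numericalize(train_toks, itos)
val_ids = numericalize(val_toks, itos)  # "bird" -> 0 (UNK)
```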

Then I concatenate my unsup ids and train ids to train the language model on. I call TextLMDataBunch.from_id_files on the concatenated training set and my validation set, and create an RNNLearner.language_model to fine-tune my language model. I then save my model and my encoder.

So far, everything worked fine.

Then, I create my classifier. I start by creating a TextClasDataBunch with TextClasDataBunch.from_id_files:
data_clas = TextClasDataBunch.from_id_files(…)
Then I create an RNNLearner.classifier on this data, load the weights from the language model and try to fit one cycle. Of course, it fails: the output size of the classifier is 2 instead of n_labels.

So, where does this come from? In text/learner.py, n_class is defined as follows:
n_class = (len(ds.classes) if (not is_listy(lbl) or (len(lbl) == 1)) else len(lbl))

where ds is data_clas.train_ds, a TextDataset. So I look at data_clas.train_ds.classes, and it does not have the right length (n_labels). One quick fix is just to give it the right classes with the right size:
data_clas.train_ds.classes = true_classes
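To see why that line matters, here is the n_class expression reproduced with plain lists (my own re-implementation of the one-liner above, not fastai code); it shows which branch is taken for single-label versus multilabel data, and why the result depends on ds.classes being propagated correctly:

```python
# Reproducing the n_class logic from text/learner.py with plain Python,
# to show how the classifier head size is picked.
def is_listy(x):
    return isinstance(x, (list, tuple))

def n_class(classes, lbl):
    # classes plays the role of ds.classes; lbl is one example's label.
    return len(classes) if (not is_listy(lbl) or (len(lbl) == 1)) else len(lbl)

# Single-label case: the label is a scalar, so len(ds.classes) is used.
single = n_class(["neg", "pos"], 1)           # -> 2

# Multilabel case: the label is a list of n_labels values, so len(lbl) wins.
multi = n_class(["neg", "pos"], [0, 1, 0, 1, 0])  # -> 5
```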

Edit : One PR to fix this has been merged since I wrote this post. If you’re trying to do multilabel, it should work fine without having to manually set classes.

What happens here is that when you pass the classes argument to TextClasDataBunch.from_id_files, it gets lost when .train_ds is created. This can be fixed by adding classes to the arguments that TextClasDataBunch.from_id_files forwards to TextDataset. (The same bug exists in other methods too.) I proposed a pull request to fix this today.

Once this is done, your classifier outputs things of the right size. Since you indicated multiple labels, fastai uses PyTorch’s BCE loss with logits, which is good, because that was the loss function I had chosen.
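For anyone unfamiliar with this loss, here is a minimal plain-Python sketch of what BCE with logits computes per element (PyTorch’s nn.BCEWithLogitsLoss does the same thing in a numerically stabler way): each of the n_labels outputs is treated as an independent binary problem, unlike the single softmax of multiclass cross-entropy.

```python
# Minimal sketch of binary cross-entropy with logits for one example,
# averaged over its n_labels independent labels. Illustration only.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_with_logits(logits, targets):
    # targets must be floats in [0, 1]; each label is scored independently.
    losses = []
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
    return sum(losses) / len(losses)

loss = bce_with_logits([2.0, -1.5, 0.3], [1.0, 0.0, 1.0])
```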

There was still an issue: this loss wants float tensors for both predictions and targets. In a lot of cases (including mine), the labels are integers, so you get an error. To fix it, I went back to my tokenization process (which also creates the train_lbl.npy file) and changed the label type to float. My network was finally able to train.
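The dtype fix can be sketched like this, assuming (as in my preprocessing) that the labels live in a .npy file such as train_lbl.npy:

```python
# Cast integer multilabel targets to float32 before saving, so that
# BCEWithLogitsLoss receives float targets when the file is loaded back.
import numpy as np

labels = np.array([[0, 1, 0, 1],
                   [1, 0, 0, 1]])          # int labels, shape (n_samples, n_labels)
labels = labels.astype(np.float32)          # the actual fix: ints -> floats
np.save("train_lbl.npy", labels)

reloaded = np.load("train_lbl.npy")         # dtype is now float32
```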

It completed one epoch and then, while evaluating itself, gave me another error: it had the targets as a float tensor, but wanted a long tensor to compute accuracy. That is because the RNNLearner class from text/learner.py is initialized with accuracy as its metric. Two solutions here: change the accuracy function to accept float inputs (which I haven’t tried yet), or just remove this metric. So after doing

learn = RNNLearner.classifier(…)

just do

learn.metrics = []
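If you would rather keep a metric than drop it, here is a sketch of the first solution: a thresholded multilabel accuracy that accepts float targets. This is my own plain-Python version, not a fastai built-in; to actually use it you would need to adapt it to take tensors and assign it via learn.metrics.

```python
# Thresholded multilabel accuracy: a prediction counts as correct per label
# if sigmoid output and float target fall on the same side of the threshold.
def accuracy_thresh(preds, targets, thresh=0.5):
    # preds: sigmoid outputs; targets: 0.0/1.0 floats; lists of lists here.
    correct, total = 0, 0
    for p_row, t_row in zip(preds, targets):
        for p, t in zip(p_row, t_row):
            correct += int((p > thresh) == (t > 0.5))
            total += 1
    return correct / total

acc = accuracy_thresh([[0.9, 0.2, 0.7]], [[1.0, 0.0, 0.0]])  # 2 of 3 labels right
```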

After this final fix, my model was able to train, and it seemed to perform correctly (similar performance to fastai v0.7).

So, to conclude this post that is getting too long: I encountered a few issues while trying to do multilabel classification with ULMFiT in fastai v1. I was able to fix most of them, so I wanted to share my solutions with you. If you have any comments, or any ideas to improve on these solutions, feel free to share :slight_smile: