NLP transfer learning: multi-label classification

I am working on NLP with ULMFiT. I have a language model working, and now I would like to transfer that… to multi-label classification.

This last step is trivial in the notebook when working on multi-class (single-label) classification, and I was wondering if there is any similar work done for multi-label. The two pieces that I still need to find (or create) are a multi-label learner and the data loader.

Thanks!

Pablo


I am now much closer to making things work. Hopefully. I will write some updates here for those interested. Any feedback is obviously welcome!

So the first step is obviously following the notebook on text transfer learning, right up until the Classifier section. Make sure to follow the latest version, running on fastai v1!

It is possible to use text_classifier_learner. Note that models for multi-class (single-label) and multi-label classification differ only in the number of outputs, because the softmax or sigmoid activation is applied by the loss function, not by the model! n_class in text_classifier_learner will take the correct value (the number of labels, not the number of “classes”: this variable should probably be renamed).
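
To illustrate that point, here is a minimal sketch in plain PyTorch (not fastai-specific; the tensors are made up): the same raw logits serve both setups, and only the choice of loss decides whether a softmax or a per-output sigmoid is applied.

    import torch
    import torch.nn as nn

    logits = torch.randn(4, 3)  # raw model outputs: batch of 4, 3 classes/labels

    # Single-label: CrossEntropyLoss applies a (log-)softmax across the outputs
    # and expects one class index per example
    single = nn.CrossEntropyLoss()(logits, torch.tensor([0, 2, 1, 0]))

    # Multi-label: BCEWithLogitsLoss applies a sigmoid to each output and
    # expects one 0/1 target per label
    targets = torch.tensor([[1., 0., 1.],
                            [0., 1., 0.],
                            [1., 1., 0.],
                            [0., 0., 1.]])
    multi = nn.BCEWithLogitsLoss()(logits, targets)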

I update the loss function of the learner returned by text_classifier_learner by adding

    learner.loss_func = BCEWithLogitsFlat()

This is probably not needed [not needed indeed!], but printing the type is not specific enough [it is: print learner.loss_func.func], so I will leave this in until I am sure it is not needed.
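
For reference, the check mentioned in the edit above looks like this (in fastai v1 the learner's loss is a FlattenedLoss, whose .func attribute holds the underlying PyTorch loss):

    # learner.loss_func is a FlattenedLoss; .func holds the wrapped PyTorch loss
    print(learner.loss_func.func)  # expect BCEWithLogitsLoss() for multi-label data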

Next we need a data loader. I am using the data block API for this:

    self.data = (TextList.from_csv(path, 'multi_label.csv',
                                   cols='text', vocab=self.vocab)
                 .random_split_by_pct(valid_pct=0.2)
                 .label_from_df(cols=[0, 1])
                 .databunch(bs=self.batch_size))

My data is in a CSV file where (in this example) the labels are in the first two columns: the column name is the label name, and the cell value is a float (0 or 1 in my case). The column “text” holds the plain, unprocessed text. Don’t forget to pass your vocabulary from the language model here! I am using a random split (you may use a boolean column for this as well).
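
For concreteness, a hypothetical multi_label.csv matching this layout could look like the following (the label names label_a and label_b are made up):

    label_a,label_b,text
    1.0,0.0,"this is the first example document"
    0.0,1.0,"this is the second example document"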

With these ingredients I am able to at least instantiate the classifier:

    classifier = text_classifier_learner(self.data, drop_mult=0.5, metrics=[fbeta])

And this is as far as I have gotten. Now I need to see if training happens, and I will need to look into the metric function, and probably create a custom one. [After the edits this code should be good enough!]
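
A hedged sketch of the next step (fastai v1): 'ft_enc' is a placeholder name for an encoder saved from the fine-tuned language model, as in the lesson notebook.

    # Load the fine-tuned language-model encoder, then train the classifier
    classifier.load_encoder('ft_enc')
    classifier.fit_one_cycle(1, 1e-2)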

Note: I have edited this post to include suggestions by @sgugger, thanks!


Isn’t the loss function automatically BCELossFlat if you use the regular text_classifier? Normally it should be inferred from your data.


Very interesting, thanks!

You are probably right. I created the learner before I started working on the data loader, so I didn’t expect that. Tomorrow I will double-check that you are right :slight_smile:

Perhaps also the number of outputs (n_class or n_labels) could be inferred from the data… :thinking:

Check what data.c and data.loss_func are, but if it’s labelled properly (you can pass label_cls in the data block API if it isn’t) you should have the right values there.
Then the learner created by text_classifier_learner will use those.
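
In other words (a quick sketch, where data is the DataBunch built in the example above):

    print(data.c)          # should be the number of label columns (e.g. 2)
    print(data.loss_func)  # should wrap BCEWithLogitsLoss for multi-label data

    # If the labels were not picked up as multi-label, label_cls can be forced:
    # .label_from_df(cols=[0, 1], label_cls=MultiCategoryList)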


So data.c automatically takes the correct value (if there are 3 label columns then it’s 3), and there is no need to modify text_classifier_learner, which is good! (I’ll edit the post above.)

I am not sure about the loss function, though, because printing its type only returns the superclass FlattenedLoss, which could be either. Is there a better way to check its type?

[Edit: after getting all the pieces working and resetting all the data files, this error is gone. It was probably my fault somehow.]

I do get one error when trying to load a saved data bunch.

Doing:

    self.data.save('tmp_class_data')
    self.data = TextDataBunch.load(self.path, 'tmp_class_data', bs=self.batch_size)

produces the following error:

only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices.

Should I use a different class to load? Or are any of the parameters at save/load wrong?

The final detail I needed to get things working (apart from the error above) is to change the metric function, which defaults to accuracy. For multi-label classification, consider the F1 score, or fbeta from fastai.metrics.

Note! Pass this as a parameter to the learner, and remember the value needs to be a list of metrics! (Like metrics=[fbeta].)
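
A hedged example (fastai v1): fbeta defaults to beta=2 and a 0.2 threshold, so an F1-style metric can be built with functools.partial. The my_f1 name and the data variable are placeholders.

    from functools import partial
    from fastai.text import *  # fbeta comes from fastai.metrics

    # fbeta works on one-hot targets, which is what multi-label data produces;
    # partial turns it into an F1 variant (beta=1)
    my_f1 = partial(fbeta, beta=1)
    classifier = text_classifier_learner(data, drop_mult=0.5, metrics=[my_f1])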

Improvements to be considered:

  1. Infer the default metric in text_classifier_learner, depending on the data provided.
  2. Accept a non-list metrics value as a parameter (it is easy to check whether it is a list, and wrap it in one if not).

You can check the loss function with learn.loss_func.func when learn.loss_func is a FlattenedLoss (but I agree this is not ideal, so I’ll change the representation to make it clearer).
I don’t have the same problem with save/load, but it’s not working properly with the classes; I will take a look.
If your data is for multi-label classification, the default metric is an empty list (just checked, and it works), and you can pass a list or just one function (are you sure you are on the latest version of fastai?)


My version is fastai==1.0.39

The default loss is correct! :slight_smile:

However, I still get an error if I pass the metric function by itself, like:

    metrics=my_f1

But this works fine:

    metrics=[my_f1]

I have also reset all the files (as mentioned in the edit above) and loading is working fine now… sadly, I don’t know what has changed, so I can’t reproduce the error.

The latest version is 1.0.40 :wink:
The bug I had with loading has been fixed in master (so it will be in v1.0.41 when it’s released).


Things move fast!

Although this got me an error which seems intentional on your side: I was downloading a pre-trained model from a URL using the untar_data method, but it seems there is now a check that rejects unknown URLs.

How else should I be sharing a model? (This way was a bit weird, but actually quite convenient when working in a team!)

The error I get:

    if force_download or (fname.exists() and _check_file(fname) != _checks[url]):
    KeyError: 'my_personal_url'

Pushed a fix for that.

Great, thanks!

For me, multi-label text classification works fine with fastai 1.0.28, but TextClasDataBunch.from_df was not working with fastai 1.0.34:

    331         if o is None: return None
--> 332         return MultiCategory(one_hot(o, self.c), [self.classes[p] for p in o], o)
    333 
    334     def analyze_pred(self, pred, thresh:float=0.5):

TypeError: string indices must be integers

Try using the data block API as in my example. There is a from_df option there as well, and it very likely works fine, because from_csv actually calls it after loading the CSV file…
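
Something like this (a hedged sketch; df is assumed to be a pandas DataFrame with the label columns first and a 'text' column, and vocab the language-model vocabulary):

    data = (TextList.from_df(df, path, cols='text', vocab=vocab)
            .random_split_by_pct(valid_pct=0.2)
            .label_from_df(cols=[0, 1])
            .databunch(bs=32))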

Please do share any other insights! :slight_smile:

hey @Pablo

Would you mind sharing a working example?

Thanks!

Hi LarryX!

Thanks for your interest! My code includes many scripts and helper methods because I am integrating it into a larger project. I am also not sure if I am allowed to simply share the plain code like that, but I can absolutely share a more detailed guide of all the steps that will get you a minimal working example, with none of my added stuff. I am leaving for home now, so I will have to do this tomorrow. (Do write again if I forget, which I hope won’t happen!)

But I anticipate that the code is very similar to the one in the Jupyter notebook. The changes to make it work with multi-label classification are actually minimal! (And they are described in the first answer in this thread.)
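
In the meantime, here is a hedged sketch of what such a minimal example might look like, assuming fastai v1 (around 1.0.40). The path, file names, column indices and the 'ft_enc' encoder name are placeholders, and details may differ between versions:

    from fastai.text import *

    path = Path('data')  # placeholder

    # 1. Language model, fine-tuned on your texts (as in the lesson notebook)
    data_lm = TextLMDataBunch.from_csv(path, 'texts.csv', text_cols='text')
    learn_lm = language_model_learner(data_lm, pretrained_model=URLs.WT103,
                                      drop_mult=0.5)
    learn_lm.fit_one_cycle(1)
    learn_lm.save_encoder('ft_enc')

    # 2. Multi-label classifier data: one 0/1 column per label, plus the text
    data_clas = (TextList.from_csv(path, 'multi_label.csv', cols='text',
                                   vocab=data_lm.vocab)
                 .random_split_by_pct(valid_pct=0.2)
                 .label_from_df(cols=[0, 1])
                 .databunch(bs=32))

    # 3. Classifier: data_clas.c and BCEWithLogitsFlat are inferred from the data
    learn = text_classifier_learner(data_clas, drop_mult=0.5, metrics=[fbeta])
    learn.load_encoder('ft_enc')
    learn.fit_one_cycle(1)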