NLP Multi-regression with datablock api?

jmhsi · March 14, 2019, 10:44pm

A quick forum search and a look through the fastai docs didn’t turn up anything, so I’m posting here. I’m trying to use the datablock API to set up NLP for regressing 5 numbers.

My data looks like this and I want to predict a number from 1-5 each for Location/Food/Social/Opportunities/Safety:

For NLP single-regression (e.g. just Safety), the datablock API works fine and creates a FloatList as my target:

I can create a learner from this which has MSELossFlat and everything works as expected.

However, when I try to have multiple regression outputs, the datablock API treats it as a multi-label classification problem, predicting a class from 1-5 for each of Location/Food/Social/Opportunities/Safety:

I went ahead and trained it anyway, and the results were much worse than the NLP single-regression version. I assume that’s because the Multi Category treatment sees Safety_1 and Safety_5 as just different classes, so it loses the ordering: a safety rating of 1 is much closer to a rating of 2 than to a rating of 5.
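To make the ordering point concrete, here is a minimal sketch in plain Python. The two helper functions are hypothetical names made up for illustration, and the 0/1 class error is a deliberate simplification of what a classification loss does (real multi-label training would use something like binary cross-entropy), but the contrast is the same: a regression loss penalizes a far-off guess more, while a class-based view treats every wrong class alike.

```python
def squared_error(pred, true):
    """Regression view: the penalty grows with the distance between ratings."""
    return (pred - true) ** 2


def class_error(pred, true):
    """Simplified class view: any wrong class costs the same."""
    return 0 if pred == true else 1


true_safety = 1
for guess in (1, 2, 5):
    # Print the guess, its squared error, and its 0/1 class error
    print(guess, squared_error(guess, true_safety), class_error(guess, true_safety))
```

With a true rating of 1, the squared error separates a guess of 2 from a guess of 5, whereas the class error gives both the same penalty.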

So at this point, I’m trying to figure out how to get a TextDataBunch with my y: as a rank2 FloatList? Is there something in the datablock api for that?

Hadus · March 15, 2019, 12:04am

https://docs.fast.ai/data_block.html#Step-3:-Label-the-inputs

The first paragraph in Step 3 says that you can always specify label_cls.
The options it gives are CategoryList, MultiCategoryList or FloatList.

FloatList seems like the thing we want to specify.

Try:
`.label_from_df(sub_cols, label_cls=FloatList)`
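For context, a fuller pipeline around that call might look roughly like the sketch below. It assumes fastai v1, a DataFrame `df` whose review text lives in a column named `text` (an assumed name, since the thread does not show it), and a random 80/20 split. The helper `build_regression_databunch` is a hypothetical wrapper, not something from the library.

```python
# Target columns named in the original question
sub_cols = ['Location', 'Food', 'Social', 'Opportunities', 'Safety']


def build_regression_databunch(df, path, text_col='text', bs=32):
    """Sketch: build a text DataBunch whose targets are 5 floats per row.

    Assumes fastai v1 is installed; `text_col` is an assumed column name.
    """
    # fastai v1 imports are kept local so the helper can be defined without fastai
    from fastai.text.data import TextList
    from fastai.data_block import FloatList

    return (TextList.from_df(df, path, cols=text_col)
            .split_by_rand_pct(0.2)
            # label_cls=FloatList overrides the default multi-label guess
            .label_from_df(cols=sub_cols, label_cls=FloatList)
            .databunch(bs=bs))
```

Passing `label_cls=FloatList` explicitly is the only part taken from the thread; the split method and batch size are placeholders to adapt to your own setup.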

jmhsi · March 15, 2019, 12:13am

Ah that was exactly what I was looking for. I did see that part in the docs but I guess I hadn’t realized what it actually was. Thanks!

sgugger · March 15, 2019, 2:18am

And that’s the one case the library guesses wrong (an array of floats is by default considered to be one-hot encoded targets in multi-label), so if you want to add your little stone and document that with a warning, don’t hesitate to submit a PR by editing the relevant part of the data_block.ipynb doc notebook.
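The guess described above can be pictured with a toy heuristic like the one below. This is a hypothetical, simplified illustration written for this thread, not fastai’s actual source. The function name `guess_label_kind` and the exact checks are assumptions. The point is only that a single float looks like regression, while a collection of values per row looks like multi-label unless told otherwise.

```python
from numbers import Integral


def guess_label_kind(first_label):
    """Toy label-type guess based on the first label (illustrative only)."""
    if isinstance(first_label, float):
        return "FloatList"            # one float per row: regression
    if isinstance(first_label, (str, Integral)):
        return "CategoryList"         # one class per row
    if isinstance(first_label, (list, tuple)):
        return "MultiCategoryList"    # several values per row: assumed multi-label
    return "ItemList"


print(guess_label_kind(3.0))
print(guess_label_kind([3.0, 4.0, 2.0, 5.0, 1.0]))
```

Under this kind of rule, a row of five float ratings falls into the multi-label branch, which is why the label class has to be given explicitly.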

jmhsi · March 15, 2019, 2:43am

For a PR, can you point me to an example if there’s a specific format or anything I’m supposed to follow?

sgugger · March 15, 2019, 3:17am

The whole process is documented here. In terms of format, you should blend in the style of the current docs notebook.