NLP transfer learning multi label

This is more or less a skeleton of working code. You can take the first part from the Jupyter notebook.

Of course, any insight if I am not doing something optimally is most welcome!

path = Path(*your path to the folder with your unsupervised documents*) 
model_url = *url to your pre-trained language model, e.g., from fastai in English: URLs.WT103_1*

Then try to load your data for the language model:

    try:
        data = TextLMDataBunch.load(path, 'tmp_lm', bs=batch_size)
    except FileNotFoundError:
        print('Data bunch not found, creating one from data source...')
        data = (TextList.from_folder(path)

Now you instantiate a language model learner:

learner = language_model_learner(data, pretrained_model=model_url, drop_mult=0.3)

Learning rate finder:


Then you train the last layers of the language model:

    try:
        learner.load('fit_head')  # reuse the saved head if it exists
    except FileNotFoundError:
        print('\nTraining language model (last layers)...')
        learner.fit_one_cycle(1, 5e-2, moms=(0.8, 0.7))
        learner.save('fit_head')

Then you train the whole thing. In the Jupyter example they use 10 cycles here, but in my case apparently one is better? I’m still figuring out these details.

    try:
        learner.load('fine_tuned')  # reuse the fine-tuned model if it exists
    except FileNotFoundError:
        print('\nFine-tuning learner...')
        learner.fit_one_cycle(number_rounds, 5e-3, moms=(0.8, 0.7))
        learner.save('fine_tuned')

To test the language model (which is fun):

    text_prompt = 'I wonder what text comes after this'
    n_words = 100
    n_sentences = 2
    print("\n".join(learner.predict(text_prompt, n_words, temperature=0.75)
                    for _ in range(n_sentences)))

Save the language model encoder (the part that the classifier will use):


Next step: we need a classifier. IMPORTANT: we need the vocabulary from the language model!

vocab = data.vocab

Loading its dataset.

    try:
        classifier_data = TextDataBunch.load(path,
    except (FileNotFoundError, IndexError):
        print('Some error message')
        label_cols = [0, 1, 2, 3]  # the columns from which you take the labels in the csv file
        classifier_data = (TextList.from_csv(path,

Then you create the classifier learner:

classifier_learner = text_classifier_learner(classifier_data,

Finally, just train the learner. I have not been completely successful here yet, although it trains, learns and classifies. It just does not do as well as another classifier I have…


classifier_learner.fit_one_cycle(1, 1e-1, moms=(0.8, 0.7))
classifier_learner.save('first_cycle')

classifier_learner.fit_one_cycle(1, 5e-2, moms=(0.8, 0.7))
classifier_learner.save('second_cycle')

classifier_learner.fit_one_cycle(1, slice(1e-2 / (2.6 ** 4), 1e-2), moms=(0.8, 0.7))
classifier_learner.save('third_cycle')

classifier_learner.fit_one_cycle(1, slice(5e-3 / (2.6 ** 4), 5e-3), moms=(0.8, 0.7))
classifier_learner.save('fourth_cycle')

classifier_learner.fit_one_cycle(1, slice(1e-3 / (2.6 ** 4), 1e-3), moms=(0.8, 0.7))
classifier_learner.save('fifth_cycle')
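
For anyone wondering what the `slice(lr / (2.6 ** 4), lr)` arguments do: as far as I understand it, the learning rates are spread geometrically across the layer groups, from the lower bound for the earliest group up to the upper bound for the head. Below is a plain-Python sketch of that spacing. The helper name `spread_learning_rates` is made up for illustration and the five layer groups are an assumption, so check `len(learner.layer_groups)` on your own model.

```python
def spread_learning_rates(lowest, highest, n_groups):
    """Return n_groups learning rates spaced by a constant factor.

    Hypothetical helper for illustration; it mimics how a
    slice(lowest, highest) is assumed to be expanded over layer groups.
    """
    if n_groups == 1:
        return [highest]
    step = (highest / lowest) ** (1 / (n_groups - 1))
    return [lowest * step ** i for i in range(n_groups)]


highest = 1e-2
lowest = highest / (2.6 ** 4)
rates = spread_learning_rates(lowest, highest, n_groups=5)  # 5 groups is an assumption
for group, rate in enumerate(rates):
    print(f"layer group {group}: lr = {rate:.2e}")
```

Under that assumption each group trains with a rate 2.6 times larger than the group before it, which would explain where the `2.6 ** 4` comes from.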

That’s it! Now you may load a saved classifier:


And classify stuff:

prediction = classifier_learner.predict(string)

This is not exactly a working example, but it comes close. I hope it helps!


Great work. I have not tried this yet, but I was wondering: would this work for multi-label regression? I want to build one model that has 5 labels, the labels range from 1.0 to 7.0, and it should be able to output several labels such as: out > [6.2, 0.1, 0.3, 0.6, 1.1]

I don’t see why it shouldn’t work! Make sure to update your loss function though (learner.loss_func). Let us know how it goes!

Question, @sgugger: the loss function is currently inferred by the data type. In my case I have several label columns, so it correctly selects BCE. Does this change if the targets are floats? Would this be desirable?

Your targets are floats behind the scenes, otherwise BCE wouldn’t be happy :wink:
When you want to do a regression with multiple columns, you have to pass label_cls to override the default, and it will then infer a correct default loss function (they are given by the targets).

When I did 10 cycles, the validation loss got worse for 2 cycles and then got better. I’m guessing that maybe this was related to the triangle learning rate algo in “fit_one_cycle”?
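
That seems plausible. As far as I understand it, `fit_one_cycle(10, ...)` runs a single cycle spread over all 10 epochs, so the learning rate keeps rising for roughly the first third of training and the loss can get worse while it is near its peak. A simplified plain-Python sketch of such a schedule follows. The function name, the 30% warm-up and the divisor of 25 are assumptions chosen for illustration, not values read out of the library.

```python
import math


def one_cycle_lr(progress, lr_max, warmup=0.3, div=25.0):
    """Learning rate at `progress` (0.0 to 1.0) through the whole run.

    Simplified sketch: cosine warm-up from lr_max / div to lr_max, then
    cosine decay towards zero. All defaults are illustrative assumptions.
    """
    lr_start = lr_max / div
    if progress < warmup:
        t = progress / warmup
        return lr_start + (lr_max - lr_start) * (1 - math.cos(math.pi * t)) / 2
    t = (progress - warmup) / (1 - warmup)
    return lr_max * (1 + math.cos(math.pi * t)) / 2


n_epochs = 10
schedule = [one_cycle_lr(epoch / n_epochs, lr_max=5e-3) for epoch in range(n_epochs + 1)]
for epoch, lr in enumerate(schedule):
    print(f"start of epoch {epoch}: lr = {lr:.2e}")
```

In this sketch the rate peaks at the start of epoch 3 and only then starts to fall, which would match a validation loss that gets worse for a couple of epochs before improving.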

Hi! I am using a similar approach to classify 4 variables (each with 3 levels [1, 2, 3]).

E.g. the variable weather should only appear as clear or cloudy, but not both at the same time.
As I understood, in case the minimum threshold is reached for both, this might happen.

Is there a way to ensure that each variable (having multiple levels) is classified exactly once?

The same problem might occur if the threshold for one variable is not reached; then the model might predict only 3 instead of 4 variables. Can’t I always choose the level that the model has the highest confidence in?
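
Picking the most confident level per variable is possible if you skip the threshold and take an argmax over each variable's block of outputs. A plain-Python sketch, assuming the 12 outputs are ordered as 4 consecutive blocks of 3 levels (the ordering and the name `pick_levels` are assumptions, so check them against your own classes):

```python
def pick_levels(scores, n_levels=3):
    """Return one level (1-based) per variable from a flat list of scores.

    Assumes the scores are laid out as consecutive blocks of `n_levels`,
    one block per variable. Hypothetical helper for illustration.
    """
    if len(scores) % n_levels != 0:
        raise ValueError("number of scores must be a multiple of n_levels")
    levels = []
    for start in range(0, len(scores), n_levels):
        block = scores[start:start + n_levels]
        best = max(range(n_levels), key=lambda i: block[i])
        levels.append(best + 1)  # levels are numbered 1, 2, 3
    return levels


# Made-up scores for 4 variables with 3 levels each.
scores = [0.7, 0.6, 0.1,   0.2, 0.3, 0.1,   0.1, 0.1, 0.9,   0.4, 0.5, 0.45]
chosen = pick_levels(scores)
print(chosen)
```

This gives exactly one level per variable at prediction time, both when two levels would pass a threshold (first block) and when none would (second block). It does not change training, which still treats all 12 outputs as independent.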

Why don’t you try using a custom loss function so that you can have proper multi-class prediction for each label independently?

Imagine you have label A with three classes (A1, A2, A3) and then labels B and C (which may or may not be present, multi-label style).

Your model would need 5 outputs: 3 would go for A (like in a common classifier) and then one output for B and another for C. Your loss would take the first three for a multi-class-like loss (including a softmax function) and then the last two outputs for a multi-label-like part of the loss (including a sigmoid function).
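
A minimal PyTorch sketch of that idea, assuming the first three outputs belong to A and the last two to B and C, and that the targets arrive as a class index for A plus a 0/1 vector for B and C. The class name `MixedLoss` and this target layout are assumptions; how to get fastai to produce targets in that shape is a separate question.

```python
import torch
import torch.nn as nn


class MixedLoss(nn.Module):
    """Cross entropy for label A (3 classes) plus BCE for labels B and C.

    Illustrative sketch, not a fastai class. Expected shapes:
      preds:     (batch, 5) raw logits
      target_a:  (batch,)   class index 0..2 for A
      target_bc: (batch, 2) floats, 1.0 if B / C is present
    """

    def __init__(self):
        super().__init__()
        self.multi_class = nn.CrossEntropyLoss()   # applies log-softmax itself
        self.multi_label = nn.BCEWithLogitsLoss()  # applies sigmoid itself

    def forward(self, preds, target_a, target_bc):
        loss_a = self.multi_class(preds[:, :3], target_a)
        loss_bc = self.multi_label(preds[:, 3:], target_bc)
        return loss_a + loss_bc


# Made-up logits and targets for a batch of two samples.
preds = torch.tensor([[2.0, 0.1, -1.0, 3.0, -3.0],
                      [0.0, 0.0, 4.0, -2.0, 2.0]])
target_a = torch.tensor([0, 2])
target_bc = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
loss = MixedLoss()(preds, target_a, target_bc)
print(loss.item())
```

Both PyTorch losses take raw logits, so the model itself should not apply a softmax or sigmoid before the loss.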

Does this sound reasonable?

Yes, unfortunately that’s exactly what I am struggling with.
As I understood, the loss function is inferred from my data. Without further specification I obtain my loss function: FlattenedLoss of BCEWithLogitsLoss()
I agree, an argmax function for each label (with multiple classes) would be the best fit.

Unfortunately my model still predicts multiple classes for one label. Is there another loss function I need to specify for such cases?

**Talking about the multi-class-like case: category A with labels (A1, A2, A3), category B with labels (B1, B2, B3), …

In your case I would try writing my own custom loss function. Fortunately it is super easy to override the default loss. Try:

learner.loss_func = MyLoss

Your loss function should be very easy, you just need to separate the elements from your predictions (and labels) and pass them to a normal loss function.

Does this make sense?

Yes, I will try to change the loss function.

From what I understood, the function needs to look at one category at a time and apply argmax to the 3 corresponding output neurons. If this process is applied to all 4 categories I should receive an output like e.g. (A1, B1, C2, D1).

So I think I can find out the order in which my output neurons predict the different labels. But how can I select the several neurons from my output layer that I want to perform the argmax function on?

If I’m not mistaken you can simply slice the output tensor like you would a Numpy array. If your batch is the first dimension, for example, you could do:

preds_for_a = preds[:, :3]

Then you simply need to understand the order of your labels, since your outputs will mean what you decide they mean (and pass to your loss and appropriate sigmoid/softmax activations).

Thanks for your reply. Makes sense. I tried to model the loss function like this:

def myloss(inpA, inpB, inpC, inpD, target):
    # One cross-entropy loss, reused for each category
    ce_loss = nn.CrossEntropyLoss()
    A = ce_loss(inpA, target)
    B = ce_loss(inpB, target)
    C = ce_loss(inpC, target)
    D = ce_loss(inpD, target)
    loss = A + B + C + D
    return loss

with the inputs from my output layer

inpA = preds[:, :3]
inpB = preds[:, 3:6]

I need to slice the target tensor accordingly I assume. Is there an object storing this information already?

I think it makes sense to create 4 different metrics for accuracy, one for each category, correct?

This sounds about right! About the object storing the target information, that probably depends on how you created your batch.

One problem you may find is that multi-class often expects an output for each class (what goes into the softmax, although that actually happens within the loss function if you use the right one) but a class ID as target, not a vector! Like this:

Out: [0.1, 0.0, 0.9]
Target: 2

So in your case you would need your targets to be a concatenation of such indices. I don’t know the best way to do this in Fastai.
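
To make that concrete, here is a small PyTorch sketch for 4 categories with 3 classes each. It assumes the targets are a `(batch, 4)` tensor of class indices, one column per category, which is the concatenation of indices described above. The function name is made up, the block layout of the outputs is an assumption, and getting fastai to build such targets is not covered.

```python
import torch
import torch.nn.functional as F


def per_category_cross_entropy(preds, targets, n_classes=3):
    """Sum of cross-entropy losses, one per category.

    preds:   (batch, n_categories * n_classes) raw logits
    targets: (batch, n_categories) class indices (dtype long)
    Hypothetical helper; the block layout of preds is an assumption.
    """
    n_categories = targets.shape[1]
    total = preds.new_zeros(())
    for c in range(n_categories):
        block = preds[:, c * n_classes:(c + 1) * n_classes]
        total = total + F.cross_entropy(block, targets[:, c])
    return total


torch.manual_seed(0)
preds = torch.randn(8, 12, requires_grad=True)  # 4 categories x 3 classes
targets = torch.randint(0, 3, (8, 4))           # class indices, not one-hot
loss = per_category_cross_entropy(preds, targets)
loss.backward()                                 # gradients reach all 12 outputs

# Per-category predictions: argmax inside each block of 3 outputs.
predicted = preds.detach().view(8, 4, 3).argmax(dim=2)
print(loss.item(), predicted.shape)
```

The same reshaping used for `predicted` could serve as the basis for one accuracy metric per category, by comparing each column with the matching column of `targets`.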

Having an accuracy for each category sounds good!

I’m trying to do something along these lines and have reviewed a bunch of forum posts & the docs, but am struggling.

I have 9 separate outcome columns which are categorical variables (they are all survey questions). The first outcome has two options (0 or 1) and the remaining 8 outcomes have 6 options (-1 through 4). Am I correct that these should be the MultiCategoryList class type? And for label_cls does setting it to a class type automatically apply to all of my outcome columns, or do I need to repeat it for each outcome column?

I’m also wondering: is it not possible to apply negative log-likelihood loss to each outcome and then sum the loss across all 9 outcomes, without needing to implement a custom loss function as proposed above? My sense/hope is that it would essentially do that if I specified my label_cls correctly, although I may also need to change my Series in the source pandas dataframe?

Here is my current code:

data_clas = (TextList.from_df(df, data_path,
                              cols = text_col,
                              vocab = data_lm.vocab)
                 .random_split_by_pct(valid_pct = 0.2, seed = 1)
                 .label_from_df(cols = outcomes,
                                label_cls = MultiCategoryList)
                 .databunch(bs = bs))

# This should be BCEWithLogitsFlat based on
# However based on the documentation this seems to be for binary outcomes, not categorical.

# Number of outputs/classes in the final layer of the model. Returns 9.

Given that it’s choosing BCEWithLogitsFlat as the loss_func, it seems that I am not specifying the label_cls correctly, or something else is going wrong?

A MultiCategoryList is for multi-label data: that means that a sample could have several tags to predict. This isn’t what you want here, and you will probably need to write a custom ItemList to label your targets as well as a custom head for your network, and then write your custom loss function if you want to apply cross entropy for each outcome.
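
For a sense of what the custom head and loss would have to agree on in the 9-outcome case above: the head needs 2 + 8 × 6 = 50 outputs, and the labels -1 through 4 have to be shifted to class indices 0 through 5 before a cross-entropy loss can use them. A plain-Python sketch of that bookkeeping, with made-up helper names and made-up survey answers:

```python
def output_slices(class_counts):
    """Map each outcome to the slice of head outputs that belongs to it."""
    slices, start = [], 0
    for count in class_counts:
        slices.append(slice(start, start + count))
        start += count
    return slices, start  # start is now the total number of outputs


def to_class_index(label, lowest_label):
    """Shift a raw label so that the lowest possible label becomes index 0."""
    return label - lowest_label


# One binary outcome (0 or 1) followed by eight outcomes with labels -1..4.
class_counts = [2] + [6] * 8
lowest_labels = [0] + [-1] * 8

slices, n_outputs = output_slices(class_counts)
print(n_outputs)             # size the final layer would need
print(slices[0], slices[1])  # where outcomes 1 and 2 live in the output

raw_row = [1, -1, 4, 0, 2, 3, -1, 1, 4]  # made-up survey answers
encoded = [to_class_index(v, low) for v, low in zip(raw_row, lowest_labels)]
print(encoded)
```

A custom loss could then loop over `slices`, apply cross entropy to each block of outputs against the matching column of encoded targets, and sum the results.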


To me it seems that MultiCategoryList is a misnomer. A CategoryList is a single categorical outcome, so “MultiCategory” implies multiple categorical outcomes.

Multi-label data, i.e. multiple binary outcomes, would be more appropriately named MultiLabelList, MultiClassificationList, or MultiBinaryList - this last one would be more consistent with CategoryList & FloatList. Any of these would be consistent with the docstring: “Basic ItemList for multi-classification labels.”

Otherwise, for multiple categorical outcomes, what would be the appropriate class name? MultiCategoryList is the only clear name that comes to mind. I ask because I would like to submit this as a PR when it’s ready.

I am working on a multi-label text classification problem.
I would like an official (or at least working) code example of multi-label classification using text_classifier_learner (for the newest fastai version). Would you kindly post the URL?

Hi @nwzjk

Try following this post in the thread.

Note that you can simply follow the Jupyter example for multi-class text classification. The only thing you need to change is the data loader you use (the other changes happen automatically behind the scenes: adapting the number of outputs and changing the loss function), and you have the new one in the post I linked.

Let us know if you need any help!

@Pablo Hello Pablo,
I am looking to do multi-label text classification. Can you help me find what I have done wrong in the link mentioned below: Multi label text classification. I hope you can help me solve the problem.

I’m sorry Anish, I’m a bit swamped at the moment with work and children :confused: