NLP classification with high cardinality outputs

Hello all,

I’ve recently finished the 2020 course and have set out to put together a project to cement what I’ve learned. My goal for this project is to output emoji recommendations to “Emojify” text written by users; see the example below.

+-------+-------------------------------+----------+
| Index | text                          | label    |
+-------+-------------------------------+----------+
| 1     | Wow, I love taco tuesday!     | 🌮, 🇲🇽    |
+-------+-------------------------------+----------+
| 2     | I had a horrible day at work. | 😢, 😡, 👩‍💻 |
+-------+-------------------------------+----------+

Currently, I am having issues in framing the problem. I’ve been following the ULMFiT example in the fastai docs but all my results thus far have been unimpressive.

In order to accomplish this task, I am scraping data from Twitter. I have a sizable dataset (1M Tweets containing emoji) with reasonably balanced classes.

I have only limited experience working on NLP problems, so before I dig further into my code I’d like to ask the forum: what would your strategy be for framing a problem like this? Thus far I’ve tried the following approaches:

  • Treat this like a MultiCategory problem (potentially many labels per observation).

To do this I loosely followed the multi-label (multicat) notebook covered in the course, combining the ULMFiT method with a MultiCategoryBlock (a rough sketch of this setup is included after this list). Doing this yielded very high multi-accuracy, but the model was very conservative in its output: all predictions were generated with incredibly low confidence. Feeding in text like the example above would never result in a prediction, even with a low (0.2) prediction threshold.

  • Treat this like a Non-Multi Category problem.

After failing to produce confident predictions using the MultiCategory method, I simplified my training data to only include one emoji per observation. The issue here is that this feels like a loose fit for my problem. Even with my reasonably large dataset, I’m only producing 5% accuracy. Does anyone have a recommendation for the metric I should be using to evaluate this model? One idea I’ve been toying with is a top-k metric (sketched after this list), but it seems that emoji recommendation is sort of inherently multi-label output, or maybe I’m missing something?

  • Treat this like an NLP Domain-specific problem.

Since I have little experience with NLP I wouldn’t be surprised if I’m scoping this problem in the wrong way. I’ve done some basic research into more advanced NLP tools and am aware of the fastai NLP course. Does this type of problem require a more specific toolset that I am lacking entirely? I plan on taking the NLP course eventually but I am not sure if it is critical to producing decent results at a problem like this.
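For reference, here is roughly what I mean by the multi-label (MultiCategory) setup in the first approach. This is a sketch rather than my exact code: df, the "text"/"label" column names, the label delimiter, and the hyperparameters are placeholders standing in for my scraped Tweet data.

from fastai.text.all import *

# Sketch of the multi-label (MultiCategoryBlock) framing; df and the column
# names are placeholders. In the full ULMFiT recipe you would first fine-tune
# a language model on the Tweets and load its encoder via learn.load_encoder.
emoji_clas = DataBlock(
    blocks=(TextBlock.from_df("text", seq_len=72), MultiCategoryBlock),
    get_x=ColReader("text"),                     # tokenized column produced by TextBlock.from_df
    get_y=ColReader("label", label_delim=","),   # however the emoji list is delimited
    splitter=RandomSplitter(0.2)
).dataloaders(df, bs=64)

learn = text_classifier_learner(emoji_clas, AWD_LSTM, metrics=accuracy_multi).to_fp16()
learn.fine_tune(4)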
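And for the single-label framing in the second approach, the top-k idea I mentioned: count a prediction as correct if the true emoji appears anywhere among the model’s k highest-scoring classes, which feels fairer when several emoji would be acceptable. A sketch only; emoji_clas_single is a placeholder name for single-label DataLoaders, and I’m not sure yet whether this is the right metric.

from fastai.text.all import *
from functools import partial

# Placeholder: emoji_clas_single = single-label DataLoaders built with CategoryBlock.
# top_k_accuracy counts a row as correct if the target is in the top k predictions.
learn_single = text_classifier_learner(
    emoji_clas_single, AWD_LSTM,
    metrics=[accuracy, partial(top_k_accuracy, k=5)]
).to_fp16()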

Any thoughts from anyone with more experience with problems like this would be greatly appreciated. I am using this project as a way to continue my learning so I apologize if my thoughts are messy.

Thank you

Update: I’m making another attempt using the MultiCategory method I laid out (I’ll refer to this as MultiLabel for now).

I drew inspiration from the forum post here about overfitting a multi-label example. That user seems to be facing the same issue I am: their model produces fantastic results in training but doesn’t seem to generalize.

I’ve made some attempts to better gauge how well my model generalizes. I am now using the Hamming loss and score to train my model, which intuitively makes more sense for my application: I would rather have 3 of the 4 correct emoji predicted than 0, so exact-match accuracy is too harsh a metric. See the example below:

thresh = 0.2
# Hamming loss gives partial credit when only some of the emoji are correct.
learn = text_classifier_learner(emoji_clas, AWD_LSTM, metrics=HammingLossMulti(thresh=thresh)).to_fp16()

I’m also using my own homebrew metric to assess the model: I calculate the proportion of the test set that yields non-null predictions as a rough gauge of the model’s confidence. See the code example below.

import pandas as pd

# Class names, in the same order as the prediction columns
labels = learn.dls.vocab[1]

def return_label(row):
    # Collect every emoji whose predicted probability clears the threshold
    result = []
    for idx, val in enumerate(row):
        if val > thresh:
            result.append(labels[idx])
    return "".join(result)

preds, y = learn.get_preds()
df = pd.DataFrame(preds)
df["pred"] = df.apply(return_label, axis=1)
print(f"Of {df.shape[0]} values predicted, {df[df['pred'] != ''].shape[0]} non-null results")

Thus far my best result is only about 1% non-null predictions on the whole test set, with my training results looking like so:

[screenshot of training results]

With such infrequent predictions, I clearly still have lots of work to do.

Let me know if anyone has any ideas

Hey there! I am just reading this whole thread, and I don’t have much to tell you about making multi-label predictions better (I didn’t get much farther than in the thread you linked to) but I do have a possible thought for another way you could frame this problem.

You’re just trying to create emoji “suggestions”, right? One thing you could do is treat it like a pure translation problem: you can do ULMFiT, create extra tokens that represent all the emojis, and then fine-tune a language model that takes a sentence as input and outputs an emoji (or two); see the sketch below. Just a thought, if this multi-label approach isn’t working for you!
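Very roughly, something like this (untested, just a sketch: df, the column names, and the hyperparameters are placeholders, and you’d ideally also make sure the tokenizer keeps each emoji as a single token):

from fastai.text.all import *

# Sketch: frame it as language modelling by appending the emoji to the Tweet,
# so a fine-tuned LM learns to "continue" a sentence with its emoji.
df["lm_text"] = df["text"] + " " + df["label"]

dls_lm = DataBlock(
    blocks=TextBlock.from_df("lm_text", is_lm=True, seq_len=72),
    get_x=ColReader("text"),   # tokenized column produced by TextBlock.from_df
    splitter=RandomSplitter(0.1)
).dataloaders(df, bs=64)

learn_lm = language_model_learner(dls_lm, AWD_LSTM, metrics=[accuracy, Perplexity()]).to_fp16()
learn_lm.fine_tune(4)

# At inference, ask the model to continue the user's text and keep whatever
# emoji appear among the predicted tokens.
print(learn_lm.predict("Wow, I love taco tuesday!", n_words=2, temperature=0.75))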


Thanks for the reply! I’m going to mess around with hyperparameters a bit to see if I can get better results, but that’s a great idea! How did your project end up?

Yeah, alright! It hasn’t gotten good enough to really be useful, I’d say, but I abandoned it fairly quickly to move on to higher-priority things. I’m sure multiclass classification has come a long way in the meantime. Good luck!