NLP classification with high cardinality outputs

Update: I’m making another attempt using the MultiCategory method I laid out (I’ll refer to this as MultiLabel for now).

I drew inspiration from a forum post here discussing overfitting a multi-label example. That user seems to be facing the same issue I am: their model produces fantastic results but doesn't seem to generalize.

I’ve made some attempts to better gauge how well my model generalizes. I am now using Hamming loss as my training metric, which intuitively makes more sense for my application: I would rather predict 3 of the 4 correct emoji than none at all, which makes exact-match accuracy too harsh a metric. See the example below:

thresh = 0.2
learn = text_classifier_learner(emoji_clas, AWD_LSTM, metrics=HammingLossMulti(thresh=thresh)).to_fp16()
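To illustrate why Hamming loss suits this use case, here is a toy sketch (not the fastai metric itself) comparing it against exact-match accuracy on a single multi-label row where 3 of 4 labels are right:

```python
# Toy illustration: Hamming loss gives partial credit per label,
# while exact-match accuracy scores the whole row as all-or-nothing.
import numpy as np

def hamming_loss(y_true, y_pred):
    # Fraction of individual label positions that disagree.
    return np.mean(y_true != y_pred)

def exact_match_accuracy(y_true, y_pred):
    # 1.0 only if every label in the row matches, else 0.0.
    return float(np.all(y_true == y_pred))

y_true = np.array([1, 1, 1, 0])  # three emoji present, one absent
y_pred = np.array([1, 1, 1, 1])  # model got 3 of 4 positions right

print(hamming_loss(y_true, y_pred))          # 0.25
print(exact_match_accuracy(y_true, y_pred))  # 0.0
```

Under exact-match accuracy the prediction above scores zero despite being mostly correct; Hamming loss only penalizes the one wrong position.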

I’m also using my own homebrew metric to assess the model: I calculate the proportion of the test set that yields non-null predictions, as a rough measure of the model’s confidence. See the code example below.

labels = learn.dls.vocab[1]

def return_label(row):
    # Collect every label whose predicted probability clears the threshold.
    result = []
    for idx, val in enumerate(row):
        if val > thresh:
            result.append(labels[idx])
    return "".join(result)

preds, y = learn.get_preds()
df = pd.DataFrame(preds)
df["pred"] = df.apply(return_label, axis=1)
print(f"Of {df.shape[0]} values predicted, {df[df['pred'] != ''].shape[0]} non-null results")
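As a side note, the same decoding can be done without the per-row `DataFrame.apply`, which is noticeably faster on large prediction tensors. A hedged sketch, with `labels`, `thresh`, and `preds` as placeholder stand-ins for the real values above:

```python
# Vectorized alternative to the per-row apply: boolean-mask the label
# vocabulary for each row. `labels`, `thresh`, and `preds` are placeholders;
# in practice they come from learn.dls.vocab[1] and learn.get_preds().
import numpy as np

thresh = 0.2
labels = np.array(["A", "B", "C", "D"])       # placeholder vocab
preds = np.array([[0.10, 0.50, 0.05, 0.30],   # placeholder probabilities
                  [0.10, 0.10, 0.15, 0.05]])

decoded = ["".join(labels[row > thresh]) for row in preds]
print(decoded)  # ['BD', ''] -- second row is a null prediction
```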

Thus far my best result is only about 1% non-null predictions on the whole test set, with my training results looking like so.

(Screenshot of training results, 2021-01-11)

With such infrequent predictions, clearly, I still have lots of work to do.
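One thing I plan to check is how sensitive the non-null rate is to the decision threshold. A hedged sketch of a threshold sweep, using simulated probabilities in place of the real output of `learn.get_preds()`:

```python
# Sweep the decision threshold to see how the non-null prediction rate
# responds. `preds` is simulated here; in practice it would be the
# probability tensor returned by learn.get_preds().
import numpy as np

rng = np.random.default_rng(0)
preds = rng.random((1000, 50)) * 0.3  # placeholder probabilities in [0, 0.3)

for thresh in [0.05, 0.1, 0.2, 0.3]:
    # A row is "non-null" if at least one label clears the threshold.
    non_null = np.any(preds > thresh, axis=1).mean()
    print(f"thresh={thresh}: {non_null:.1%} non-null predictions")
```

If the non-null rate climbs sharply as the threshold drops, the model may be producing uniformly low-confidence outputs rather than confidently wrong ones, which would point toward calibration rather than capacity as the problem.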

Let me know if anyone has any ideas!