I recently tried to submit to a Kaggle competition (an NLP task), but I couldn't get usable predictions from the model I trained. The task is to predict whether a text belongs to category 1 or category 0, but when I call "learn.get_preds" I get a 2D tensor with values ranging from 0 to 1, and they come in pairs. For example, when I predict the class of the text 'Hi', instead of 0 or 1 I get something like 'tensor([0.8364, 0.1636])'. Any idea how I can fix this?
Sounds like the array of probabilities for each class. Just take the argmax to decide the class, or, in the easy binary case you have, just check whether the first element is >= your threshold value.
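As a quick sketch (plain Python lists standing in for the PyTorch tensor, using the numbers from the question), the two options look like this:

```python
# Per-class probabilities as returned by get_preds for one text,
# e.g. tensor([0.8364, 0.1636]) -> [P(class 0), P(class 1)]
probs = [0.8364, 0.1636]

# Option 1: argmax over the classes
pred_class = max(range(len(probs)), key=lambda i: probs[i])  # -> 0

# Option 2 (binary only): threshold the probability of one class
threshold = 0.5
pred_class_thresh = int(probs[1] >= threshold)  # -> 0
```

With a real tensor you would call `probs.argmax().item()` instead of the `max(...)` idiom; the logic is the same.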
Thanks. Is there a 'universal' threshold for these kinds of tasks, or is experimenting a better approach? I've seen people relying on 33%.
In real applications, it mostly depends on whether you consider a false positive more severe than a false negative. For Kaggle, 0.5 is pretty standard, but you can also calculate which threshold gives you the best score on your validation set.
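A minimal sketch of that threshold search, assuming you already have class-1 probabilities and true labels for your validation set (the numbers below are made up for illustration, and accuracy stands in for whatever metric the competition scores):

```python
# Hypothetical validation-set probabilities for class 1 and true labels
val_probs = [0.91, 0.12, 0.55, 0.40, 0.78, 0.05]
val_labels = [1, 0, 1, 0, 1, 0]

def accuracy_at(threshold):
    """Accuracy of thresholded predictions against the true labels."""
    preds = [int(p >= threshold) for p in val_probs]
    return sum(p == y for p, y in zip(preds, val_labels)) / len(val_labels)

# Sweep candidate thresholds and keep the best-scoring one
best = max((t / 100 for t in range(1, 100)), key=accuracy_at)
```

You would then apply `best` to the test-set probabilities before submitting.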
Thanks. Another question: what do you think is the best approach for including categorical data in an NLP predictive model? I thought of combining them into the same text column, but I'm not really sure this will add value to the model. Any ideas?
There is no need to add it to the same text column; you can set the columns to use in fastai. Internally it will end up concatenating them, but it will place a special token between them so the model can notice the separation.
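Roughly, what fastai does when you pass multiple text columns can be sketched like this (the column names and row data are made up; "xxfld" plus a field index is the marker fastai's tokenizer inserts when field marking is enabled, so treat the exact token as an assumption to verify against your fastai version):

```python
# One row of a dataframe with a free-text column plus categorical columns
row = {"title": "Great product", "category": "electronics", "review": "Works as expected"}
cols = ["title", "category", "review"]

# Concatenate the columns, prefixing each with a field-marker token
# so the model can tell where one field ends and the next begins
joined = " ".join(f"xxfld {i} {row[c]}" for i, c in enumerate(cols, start=1))
# joined -> "xxfld 1 Great product xxfld 2 electronics xxfld 3 Works as expected"
```

In practice you just pass the list of column names to the fastai data loader and let it handle this for you.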