Is it possible to create a multi-class text classifier for sentiment analysis, and how well does that work with the ULMFiT model? I want to try one of the SemEval tasks (I can’t find the link at the moment) for multi-class classification.
This idea helped me out a lot. I was able to load multiple labels per sample through a single column, using a delimiter as in your example and the planets lesson. Here’s how I rearranged the labels when working with the Kaggle toxic comments competition.
First, I relabeled the one-hot encoded columns with their actual meaning. Then I joined them into a new column called label, re-saved the DataFrame, and was then able to load it properly with TextDataBunch.
In this case, and probably in lots of others, many samples weren’t labeled at all, so I labeled them as okay. I wonder if there’s any reason not to do this? Loading the data via TextDataBunch raised errors because of NaN values in the label column when I kept them as empty strings. I figure the model might learn just as well how to recognize okay comments. I just have to remember to drop that column before uploading to Kaggle.
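from functools import partial
from fastai.text import TextDataBunch  # fastai v1

# df and path are assumed to be defined already (the competition's training DataFrame and data folder)

def label_it(row, labels):
    # Collect the names of all columns that are set for this row
    label_out = []
    for label in labels:
        if row[label] != '':
            label_out.append(label)
    # Rows with no flags get a placeholder class instead of an empty string
    if len(label_out) == 0:
        return 'okay'
    else:
        return ' '.join(label_out)

labels = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

# Replace the 0/1 values with '' / the column name
relabel = {label: {0: '', 1: label} for label in labels}
for label in labels:
    df[label] = df[label].replace(relabel[label])

# Build the space-delimited label column
label_it_set = partial(label_it, labels=labels)
df['label'] = df.apply(label_it_set, axis=1)
df.to_csv('/data/toxic/train_labeled2.csv')

data = TextDataBunch.from_csv(path, 'train_labeled2.csv', text_cols=['comment_text'], label_cols=['label'], label_delim=' ')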
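For the "drop okay before uploading" step, here is a minimal sketch of the round trip, using plain dicts instead of a DataFrame so it runs on its own. The helper names encode and decode and the two sample rows are made up for illustration; they are not part of fastai or the competition data. It assumes the same label list, space delimiter, and okay placeholder as above.

```python
LABELS = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
DELIM = ' '
NO_LABEL = 'okay'  # placeholder class for comments with no flags set


def encode(row):
    """Join the names of all flagged columns into one delimited string."""
    active = [name for name in LABELS if row[name] == 1]
    return DELIM.join(active) if active else NO_LABEL


def decode(label_str):
    """Split a delimited label string back into 0/1 columns, dropping the placeholder."""
    active = set(label_str.split(DELIM)) - {NO_LABEL}
    return {name: int(name in active) for name in LABELS}


# Two hypothetical rows: one with two flags, one with none
rows = [
    {'toxic': 1, 'severe_toxic': 0, 'obscene': 1, 'threat': 0, 'insult': 0, 'identity_hate': 0},
    {'toxic': 0, 'severe_toxic': 0, 'obscene': 0, 'threat': 0, 'insult': 0, 'identity_hate': 0},
]

encoded = [encode(r) for r in rows]
decoded = [decode(s) for s in encoded]

print(encoded)          # ['toxic obscene', 'okay']
print(decoded == rows)  # True
```

Since none of the label names contain a space, a space is safe as the delimiter here. If a label ever did contain one, a different delimiter would be needed on both the encoding side and in label_delim.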