MultiClass text classification

Nubbinsonfire · November 27, 2018, 11:51am

Hello,

Is it possible to create a multi-class text classifier for sentiment analysis, and how well does it work with the ULMFiT model? I want to try one of the SemEval tasks (I currently can’t find the link) for multi-class classification.

Any advice would be great!

cwerner · November 27, 2018, 8:48pm

I did start a multi-class document classifier with the wiki > custom docs > classification approach introduced in the IMDb lesson.

My corpus is kind of complicated though so my results are only about 60% in the classification stage.

Or are you talking about something different?

Nubbinsonfire · November 28, 2018, 10:08am

I got something kind of working early last night. I don’t know if the library has been updated since then, but I created a text classifier:

data = TextClasDataBunch.from_csv(path, 'train.csv', text_cols='Tweet', label_cols='classes', label_delim=' ')

as the CSV was similar to the planets CSV file for categories.

I had to make a couple more adjustments, including changing the loss function and metrics to:

learn.loss_func = nn.BCEWithLogitsLoss()
learn.metrics = [acc_02]

Otherwise it spat out an error when I tried to train it.
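For readers wondering why those two changes fit this setup: when a row can carry several labels at once, each label is usually treated as its own yes/no decision, which is what a binary cross-entropy on logits and a thresholded accuracy measure. The following is a minimal plain-Python sketch of that idea, not the fastai or PyTorch implementation. The helper names, the example logits, and the assumption that `acc_02` means "accuracy at a 0.2 threshold" (as in the planets lesson) are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over every (sample, label) pair.

    Uses the numerically stable form max(z, 0) - z*t + log(1 + exp(-|z|)).
    """
    total, count = 0.0, 0
    for row_logits, row_targets in zip(logits, targets):
        for z, t in zip(row_logits, row_targets):
            total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
            count += 1
    return total / count

def accuracy_at_threshold(logits, targets, thresh=0.2):
    """Fraction of (sample, label) pairs where sigmoid(logit) > thresh agrees with the target."""
    hits, count = 0, 0
    for row_logits, row_targets in zip(logits, targets):
        for z, t in zip(row_logits, row_targets):
            predicted = 1 if sigmoid(z) > thresh else 0
            hits += int(predicted == t)
            count += 1
    return hits / count

# Hypothetical raw model outputs for 2 samples and 3 labels,
# with multi-hot targets (a sample may have several labels set).
logits = [[2.0, -1.0, 0.5], [-3.0, 0.0, 1.5]]
targets = [[1, 0, 1], [0, 0, 1]]

loss = bce_with_logits(logits, targets)
acc_low = accuracy_at_threshold(logits, targets, thresh=0.2)
acc_half = accuracy_at_threshold(logits, targets, thresh=0.5)
```

A lower threshold such as 0.2 flags more labels as present, which trades precision for recall. Which threshold is best depends on the dataset.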

I will have a play about later today to see how it works!

xbno · February 27, 2019, 6:23am

This idea helped me out a lot. I was able to load multiclass labels through a single column and then use a delimiter, like your example and the planets lesson. Here’s how I shuffled around the labels when working with the Kaggle toxic comments competition.

First, I relabeled the one-hot encoded columns with their actual meaning. Then I joined them in another column called label. I re-saved the df and was then able to use TextDataBunch to load it properly.

In this case, and probably lots of others, there were many samples that weren’t labeled at all, so I labeled them as okay. I wonder if there’s any reason not to do this? Loading the data via TextDataBunch was erroring due to NaN values in the label column if I kept them as empty strings. I figure the model might learn just as well how to determine okay comments. I just have to remember to clip that column before uploading to Kaggle.

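from functools import partial

def label_it(row, labels):
    # Collect the name of every label column that is set for this row.
    label_out = []
    for label in labels:
        if row[label] != '':
            label_out.append(label)
    # Rows with no label at all get the placeholder class 'okay'.
    if not label_out:
        return 'okay'
    return ' '.join(label_out)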

labels = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
# For each column, map 0 to an empty string and 1 to the column's own name.
relabel = {label: {0: '', 1: label} for label in labels}

for label in labels:
    df[label] = df[label].replace(relabel[label])

label_it_set = partial(label_it,labels=labels)
df['label'] = df.apply(label_it_set,axis=1)
df.to_csv('/data/toxic/train_labeled2.csv')

data = TextDataBunch.from_csv(path,'train_labeled2.csv',text_cols=['comment_text'],label_cols=['label'],label_delim=' ')
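The "clip that column" step mentioned above could look roughly like the sketch below: drop the placeholder `okay` column from the predicted probabilities and put the remaining columns back in the competition's order. This is a plain-Python illustration under assumptions. The `select_columns` helper is hypothetical, and the alphabetical class order and the probabilities are made up for the example, so check the actual class order on your own data object.

```python
def select_columns(classes, prob_rows, wanted):
    """Keep only the `wanted` classes, in that order, from each row of probabilities.

    classes   -- class names in the order the model outputs them
    prob_rows -- one list of probabilities per sample, aligned with `classes`
    wanted    -- class names to keep, in the desired output order
    """
    positions = [classes.index(name) for name in wanted]
    return [[row[i] for i in positions] for row in prob_rows]

# Assumed model class order (alphabetical here, purely for illustration).
classes = ['identity_hate', 'insult', 'obscene', 'okay',
           'severe_toxic', 'threat', 'toxic']
# The six original label columns, which excludes the 'okay' placeholder.
labels = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

# Made-up probabilities for a single comment, aligned with `classes`.
probs = [[0.1, 0.2, 0.3, 0.9, 0.4, 0.5, 0.6]]

submission_rows = select_columns(classes, probs, labels)
```

Selecting by name rather than by position avoids a silent column mix-up if the model's class order differs from the submission format.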