Multilabel classifier misses category title

I have a simple dataframe with text in column 1 and labels in column 2, separated by a ‘;’

e.g.:

    my_text                                                              tag_list
    I went to the shops and I couldnt find the dog I was looking for    G421;Z272
    I am really not a fan. He looked odd to me.                         Z241
    What's the answer then?                                             H221;H206
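
For reference, a toy version of the frame can be built like this (the contents are just the example rows above):

    import pandas as pd

    # Free text in one column, ';'-delimited label codes in the other.
    df_c_OPCS4 = pd.DataFrame({
        'my_text': ["I went to the shops and I couldnt find the dog I was looking for",
                    "I am really not a fan. He looked odd to me.",
                    "What's the answer then?"],
        'tag_list': ["G421;Z272", "Z241", "H221;H206"],
    })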

I load this into a dataloader as follows:

    dls_clas = TextDataLoaders.from_df(df=df_c_OPCS4, path='/content/gdrive/MyDrive/Colab_data', 
                                       valid_pct=0.2,
                                       text_vocab=dls_lm.vocab, 
                                       text_col='my_text', 
                                       label_col='tag_list', 
                                       label_delim=";", 
                                       y_names='tag_list',
                                       y_block=MultiCategoryBlock())

However, the output I get when I run show_batch gives me two columns called ‘text’ and ‘None’. I think I should be expecting the second column to be named ‘category’.

This means that downstream functions don’t work, e.g. when I run the text classifier and then learn.show_results() I get an error:

    /usr/local/lib/python3.7/dist-packages/fastai/torch_core.py in show_title(o, ax, ctx, label, color, **kwargs)
        462         ax.set_title(o, color=color)
        463     elif isinstance(ax, pd.Series):
    --> 464         while label in ax: label += '_'
        465         ax = ax.append(pd.Series({label: o}))
        466     return ax

    TypeError: unsupported operand type(s) for +=: 'NoneType' and 'str'

I think the error arises because there is no label to append to the Series, since the category name is missing.

How can I ensure that the category column is labelled when I set up the MultiCategoryBlock in the dataloader?

Hi Sebastian

You are correct. Copying other people’s posts, I tried the following: I take a dual-category dataset, copy it into pandas, and add extra categories. It finds the six categories, predicts two true and four false, and gets them wrong. show_batch shows “None” as the category column, but with the semicolon-delimited labels still in place.

    path = untar_data(URLs.IMDB_SAMPLE)
    df = pd.read_csv(path/"texts.csv")

    for i in range(df.count()[0]):
        if df.iloc[i, 0] == 'positive':
            df.iloc[i, 0] = 'positive;very;happy'
        else:
            df.iloc[i, 0] = df.iloc[i, 0] + ';really;sad'

    dls = TextDataLoaders.from_df(df=df, text_col='text', label_col='label', valid_col='is_valid',
                                  label_delim=";", y_block=MultiCategoryBlock)
    dls.show_batch(max_n=3)

    dls.vocab[1]

    ['happy', 'negative', 'positive', 'really', 'sad', 'very']

    learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, n_out=len(dls.vocab[1]), metrics=[])

    learn.predict(df.iloc[0,1])

    ((#2) ['happy','sad'],
     tensor([ True, False, False, False,  True, False]),
     tensor([0.5182, 0.4725, 0.4987, 0.4711, 0.5009, 0.4948]))

Regards Conwyn

Thanks @Conwyn. That certainly spells out the problem, but I think you get the same error as I do. When I try learn.show_results() on your code I get the same error that I got. I really struggled to find an example of a MultiCategoryBlock with delimiters like mine that gets show_results working and produces batch predictions for multiple labels. I appreciate, however, that learn.predict does give me a multilabel prediction…

I suppose I can live without show_results as long as I can get get_preds working so I can get batch output. At the moment I only get one label as a prediction rather than a combination of labels. If you have any suggestions for the following code to get batch multi-label predictions I’d be mighty grateful:

    real_dlOPCS4 = learn_text.dls.test_dl(df_c_OPCS4['my_text'])
    nameOPCS4, pred_classOPCS4 = learn.get_preds(dl=real_dlOPCS4)
    preds_maxOPCS4 = nameOPCS4.argmax(dim=1)
    namesOPCS4 = [learn.dls.vocab[1][p] for p in preds_maxOPCS4]
    df_c_OPCS4["CategoryOPCS4"] = namesOPCS4

Hi Sebastian
My example returns six values, two greater than 0.5 and four below 0.5, so it appears that fastai is using a threshold of 0.5. This is discussed in chapter 6 of the book.
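
You can check this directly (a minimal sketch, reusing the learn, dls and df from the example above):

    # learn.predict returns (decoded labels, boolean mask, probabilities);
    # the mask should just be the probabilities thresholded at 0.5, and the
    # labels the vocab entries where the mask is True.
    labels, mask, probs = learn.predict(df.iloc[0, 1])
    print((probs > 0.5) == mask)        # expect all True
    print(dls.vocab[1][mask], labels)   # expect the same label set twice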

To add to the confusion, changing the drop_mult keyword improves things:

    learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.125, n_out=len(dls.vocab[1]), metrics=[])

    learn.predict(df.iloc[0,1])

    ((#3) ['happy','positive','very'],
     tensor([ True, False,  True, False, False,  True]),
     tensor([0.5225, 0.4917, 0.5102, 0.4830, 0.4884, 0.5102]))

I understand your comment about show_results. I think the “Custom ItemList | fastai” tutorial suggests that customization is required.

Regards Conwyn

Thanks @Conwyn, that is really helpful. I’ll dig through that customisation documentation. The one thing I don’t quite understand is why my get_preds code only gives me the first predicted label rather than all of the predicted labels for a text input (as learn.predict does). Any idea why that might be? I think the reason might be this line:

    namesOPCS4 = [learn.dls.vocab[1][p] for p in preds_maxOPCS4]

although nothing I do seems to give me anything other than the first predicted label for each text.
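
To spell my suspicion out with a toy example (not my real data):

    import torch

    # argmax collapses each row of probabilities to a single index,
    # whereas thresholding keeps one True/False per possible label.
    probs = torch.tensor([[0.52, 0.47, 0.51, 0.47, 0.50, 0.49]])
    print(probs.argmax(dim=1))   # tensor([0])  -> one label per row
    print(probs > 0.5)           # tensor([[ True, False,  True, False, False, False]])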

I think if I can get the get_preds output working, I can live without the show_results output.

Hi Sebastian

I think show_results is just wrong here. As far as I can tell it has a pandas Series of labels, and if you try to add a duplicate it appends a ‘_’ for uniqueness. learn.predict maps the tensor to True/False and then uses the True/False mask to select from the vocabulary, but show_results appears to use the string category name, which would be a single item, and in our case it is None.

Regards Conwyn

Hi Sebastian

Two ideas:

    for i in range(7):
        ll = learn.predict(df.iloc[i, 1])
        print(f'{ll[0]} {df.iloc[i, 1]}')

    ['negative', 'really'] Un-bleeping-believable! Meg Ryan doesn't even look her usual pert lovable self in this, which normally makes me forgive her shallow ticky acting schtick. Hard to believe she was the producer on this dog. Plus Kevin Kline: what kind of suicide trip has his career been on? Whoosh… Banzai!!! Finally this was directed by the guy who did Big Chill? Must be a replay of Jonestown - hollywood style. Wooofff!

OR

    def quick(z):
        x = []
        for i in z:
            if i < 0.5:    # Threshold value
                x.append(False)
            else:
                x.append(True)
        return tensor(x)

    # Note: apply this to the raw probabilities (nameOPCS4), not to the
    # argmax indices (preds_maxOPCS4), so every label gets a True/False.
    [dls.vocab[1][quick(i)] for i in nameOPCS4]
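
You could also skip the Python-level loop and threshold the whole tensor at once (a sketch, assuming nameOPCS4 holds the probabilities returned by get_preds):

    # One boolean mask per row, each decoded against the vocab.
    masks = nameOPCS4 > 0.5
    names_multi = [dls.vocab[1][m] for m in masks]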

Thanks @Conwyn. I got it working with the for loop. It’s not the fastest way, but it certainly gives me the multilabel predictions for rows of text. I’ll optimise it later. Thanks again!

Hi Sebastian

For my own amusement I was thinking about embeddings. You may remember that in the German supermarket example the embedding discovered geography. So what are embeddings? Are they tricks of the data? I took the MovieLens example with user/movie ratings and followed the book to create 50 latent factors. Next I selected the five largest weights for each movie. Then I chose the 100 movies with the highest weights.

I found the Python package IMDbPY, which allows you to read the IMDb reviews by movie title. I assigned to each movie its five highest latent factors.
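
Roughly how I pulled the factors out (a sketch; learn_collab is shorthand for my collab learner, and the attribute name assumes fastai's EmbeddingDotBias model):

    # Item (movie) embedding: one row of latent factors per movie.
    movie_factors = learn_collab.model.i_weight.weight   # n_movies x 50

    # Values and indices of the five largest factors for each movie.
    top5_vals, top5_idx = movie_factors.topk(5, dim=1)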

I replicated your technique to create a categorization model and then fed the movies into it to find the new latent factors for each review.

Next I took the new latent factors for all the reviews of the movies and produced a histogram.

I am just fine-tuning the language model, but so far it has been quite promising.

My thought was about bootstrapping: imagine I asked you to write a review of your perfect movie. From that we could produce the latent factors and then recommend a movie with similar latent factors.

Regards Conwyn