Text classification for multi-label problem results in wrong target dimensions

I'm trying to build a multi-label dataset for text classification, but the dimensions of the target batch are wrong. It shows (40,2) when it should be (40,8).

Is there something else I need to do in the data block API to let it know this is a multi-label problem? Right now it's determining the number of columns from the distinct values (e.g., 2, because they are all 0 or 1) instead of the number of labels (e.g., 8).
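To make the mismatch concrete, here is a minimal pure-Python sketch of the two ways the target width can be counted. It does not use fastai; the label names and the batch contents are hypothetical stand-ins for the real data.

```python
# Hypothetical label names standing in for the 8 real labels
label_names = [f"label_{i}" for i in range(8)]

# A batch of 40 rows, each a 0/1 vector over the 8 labels
batch = [[(row + col) % 2 for col in range(len(label_names))]
         for row in range(40)]

# Counting distinct cell values treats 0 and 1 as the "classes"
n_distinct_values = len({v for row in batch for v in row})

# Counting label columns gives the intended target width
n_labels = len(label_names)

wrong_shape = (len(batch), n_distinct_values)
expected_shape = (len(batch), n_labels)
print(wrong_shape, expected_shape)  # (40, 2) (40, 8)
```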


SOLVED (but this could be better incorporated into the framework code)

Had to muck with my dataframe to get this to work:

train_df['labels'] = train_df[LABELS_SENT[1:]].apply(lambda x: ' '.join(x.index[x.astype(bool)]), axis=1)
valid_df['labels'] = valid_df[LABELS_SENT[1:]].apply(lambda x: ' '.join(x.index[x.astype(bool)]), axis=1)

train_df['labels'].head()

returns …

0      IsNegative IsSuggestion
1      IsNegative IsSuggestion
2    IsVeryNegative IsNegative
3                 IsSuggestion
4                 IsSuggestion
Name: labels, dtype: object
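For anyone who wants to try the transformation in isolation, here is a small self-contained sketch on a toy DataFrame. The data and the `active_labels` helper name are made up for illustration; only the column names are borrowed from the output above.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the labels shown above
label_cols = ["IsVeryNegative", "IsNegative", "IsSuggestion"]
toy_df = pd.DataFrame({
    "IsVeryNegative": [0, 1, 0],
    "IsNegative":     [1, 1, 0],
    "IsSuggestion":   [1, 0, 1],
    "txt":            ["first text", "second text", "third text"],
})

def active_labels(row):
    # Keep the names of the columns whose value is truthy (1),
    # joined with a space so they can be split again later
    return " ".join(row.index[row.astype(bool).to_numpy()])

toy_df["labels"] = toy_df[label_cols].apply(active_labels, axis=1)
print(toy_df["labels"].tolist())
```

A space works as the delimiter here only because none of the label names contain one; pick a different separator if yours do.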

Then your data block needs to be updated to be something like this:

data_clas = (ItemLists(path=CLS_PATH,
                     train=TextList.from_df(train_df, path=CLS_PATH, col=corpus_cols, processor=cls_processor),
                     valid=TextList.from_df(valid_df, path=CLS_PATH, col=corpus_cols, processor=cls_processor)
                    )
           .label_from_df(col='labels', classes=LABELS_SENT[1:], sep=' ')
           .databunch(bs=bsz)
          )

… and enjoy your multi-label classification training :slight_smile:

For the core fastai dev team, I’d recommend that we simply incorporate what I did into the framework … so that if folks pass in multiple columns into .label_from_df, it would run the code at the top of this post. If you’d like a PR I’m glad to do it, but given this is a major change I leave it to you to lmk before I do anything.


Can you specify what LABELS_SENT is? It’s not clear.

It’s just a list of my labels (e.g., ['is_pos', 'is_neg', 'is_cool', ... ] )

@wgpubs: just noticed this thread of yours

I think we have already been through this particular issue before :slight_smile: – you are correct, it did break with the new datablock API (as a lot of redundant code was refactored)

got fixed yesterday:


This still doesn’t look right when using the approach described in the initial post.

Is there something I’m missing?

cls_processor = [
    TokenizeProcessor(tokenizer=tokenizer, chunksize=chunksize),
    NumericalizeProcessor(vocab=vocab)
]

data_clas = (ItemLists(path=CLS_PATH,
                     train=TextList.from_df(train_df, path=CLS_PATH, cols=corpus_cols, processor=cls_processor),
                     valid=TextList.from_df(valid_df, path=CLS_PATH, cols=corpus_cols, processor=cls_processor)
                    )
             .label_from_df(cols=LABELS_SENT[1:], classes=LABELS_SENT[1:])
             .databunch(bs=bsz)
          )

data_clas.save()

I guess the better question is: What should a DataFrame look like to support a multi-label dataset?

Mine looks something like this:

is_very_pos | is_pos | is_very_neg | is_neg | txt
      1     |   0    |      0      |    1   | I love cats but I hate dogs
      1     |   1    |      0      |    1   | ...
      0     |   0    |      1      |    1   | ...
      0     |   0    |      1      |    1   | ...
...

Looking at the latest codebase, I suspect the call to one_hot is part of the problem, but I'm not sure.

How my DataFrame looks above makes sense to me, but perhaps it doesn’t make sense to fastai … so again, maybe my real question is: “What should the DataFrame look like?”
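For reference, this is the target I would expect each row to end up with, sketched in plain Python. It is not fastai's one_hot or its actual implementation; `multi_hot` is a hypothetical helper, and the class names are taken from the table above.

```python
# Class order fixes the position of each label in the target vector
classes = ["is_very_pos", "is_pos", "is_very_neg", "is_neg"]

def multi_hot(label_string, classes, sep=" "):
    # Split the delimited label string; an empty string means no labels
    present = set(label_string.split(sep)) if label_string else set()
    # One slot per class: 1.0 if the label is present, else 0.0
    return [1.0 if c in present else 0.0 for c in classes]

rows = ["is_very_pos is_neg", "is_very_neg is_neg", ""]
targets = [multi_hot(r, classes) for r in rows]

# Every target has len(classes) entries, regardless of how many labels are set
print(targets[0])  # [1.0, 0.0, 0.0, 1.0]
```

The point is that the width of each target comes from the number of classes (4 here, 8 in my real data), not from how many distinct values appear in the label columns.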
