How to pass Multi Hot Label Array and classes to TextClasDataBunch constructors

I have a dataset for a multilabel classification problem. My labels are stored in an array with one column per label, where each entry is 0 or 1 (a multi-hot array). I also have a list of the label names associated with each column. When I pass the array and the list of label names to a DataBunch constructor, it doesn’t create the DataBunch I want. The constructor calls src.label_from_lists, where src is an ItemLists, and that call expects each item’s labels as a list of label names so that it can build the array itself. At the moment I convert my multi_hot_array back to a list of labels.

Is there a better way to proceed?

PS: The reason I would like another way to proceed is that I’m also performing some transformations on the labels, and those transformations currently operate on the multi-hot array. So I’m currently doing list_of_labels -> multi_hot_array -> transform -> multi_hot_array -> list_of_labels -> DataBunch. It seems ‘natural’ to me to be able to pass a multi_hot_array to a DataBunch. What do you think of that?
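To make the round trip described above concrete, here is a minimal pure-Python sketch of the two conversions. The helper names (`labels_to_multi_hot`, `multi_hot_to_labels`) and the example class names are hypothetical and not part of fastai; the sketch only assumes that the column order of the array matches the order of the class list.

```python
from typing import List, Sequence


def labels_to_multi_hot(label_lists: Sequence[Sequence[str]],
                        classes: Sequence[str]) -> List[List[int]]:
    """Encode each item's list of label names as a 0/1 row, one column per class."""
    index = {c: i for i, c in enumerate(classes)}
    rows = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            row[index[label]] = 1
        rows.append(row)
    return rows


def multi_hot_to_labels(rows: Sequence[Sequence[int]],
                        classes: Sequence[str]) -> List[List[str]]:
    """Decode each 0/1 row back into the names of the classes that are set."""
    return [[c for c, flag in zip(classes, row) if flag] for row in rows]


# Hypothetical example data: three items, three classes.
classes = ["sports", "politics", "tech"]
label_lists = [["sports"], ["politics", "tech"], []]

multi_hot = labels_to_multi_hot(label_lists, classes)
round_trip = multi_hot_to_labels(multi_hot, classes)
```

The decode step is the extra conversion that passing the multi-hot array directly would make unnecessary.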

I think you want the magic argument one_hot=True (to pass to label_from_lists).

That’s great! I had no idea this argument existed. You can’t pass it through the DataBunch constructor to label_from_lists, though. I guess I’ll override TextDataBunch.from_tokens then. Thanks!

You have to learn the data block API, young padawan :wink:
Note that if your labels are arrays, though, it should automatically recognize one-hot encoding by default.

My labels are multi-hot encoded, so I guess that’s why they’re not recognized.

I ended up creating a new DataBunch method, from_tokens2. It is basically the same code as from_tokens, but I added a one_hot argument and passed it to the label_from_lists call.

# Assumes fastai v1, where this star import provides the names used below.
from fastai.text import *

def from_tokens_2(cls, path:PathOrStr, trn_tok:Collection[Collection[str]], trn_lbls:Collection[Union[int,float]],
                  val_tok:Collection[Collection[str]], val_lbls:Collection[Union[int,float]], vocab:Vocab=None,
                  tst_tok:Collection[Collection[str]]=None, classes:Collection[Any]=None, max_vocab:int=60000, min_freq:int=3,
                  one_hot=False, **kwargs) -> DataBunch:
    "Create a `TextDataBunch` from tokens and labels. `kwargs` are passed to the dataloader creation."
    processor = NumericalizeProcessor(vocab=vocab, max_vocab=max_vocab, min_freq=min_freq)
    src = ItemLists(path, TextList(trn_tok, path=path, processor=processor),
                    TextList(val_tok, path=path, processor=processor))
    # The only change from `from_tokens`: forward `one_hot` to `label_from_lists`.
    src = src.label_for_lm() if cls==TextLMDataBunch else src.label_from_lists(trn_lbls, val_lbls, classes=classes, one_hot=one_hot)
    if tst_tok is not None: src.add_test(TextList(tst_tok, path=path))
    return src.databunch(**kwargs)

Attach this new method to the TextDataBunch class

setattr(TextDataBunch, 'from_tokens2', classmethod(from_tokens_2))

Call the new constructor

data_clas = TextClasDataBunch.from_tokens2(..., one_hot=True)
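For anyone wondering why setting the method on TextDataBunch also makes it callable on TextClasDataBunch, here is a small sketch of the same setattr pattern on toy classes. `Bunch` and `ClasBunch` are hypothetical stand-ins, not fastai classes; the point is only that a classmethod attached to a base class is inherited by subclasses, and `cls` is bound to whichever class it is called on.

```python
class Bunch:
    """Toy stand-in for a base DataBunch class (hypothetical, not fastai)."""

    def __init__(self, items, one_hot=False):
        self.items = items
        self.one_hot = one_hot

    @classmethod
    def from_tokens(cls, items):
        # Original constructor: no way to forward `one_hot`.
        return cls(items)


class ClasBunch(Bunch):
    """Toy stand-in for a subclass such as a classifier DataBunch."""


def from_tokens_2(cls, items, one_hot=False):
    # Same as `from_tokens`, but forwards the extra argument.
    return cls(items, one_hot=one_hot)


# Attach the plain function to the base class as a classmethod.
setattr(Bunch, "from_tokens2", classmethod(from_tokens_2))

# The subclass inherits it, and `cls` is the subclass at call time.
data = ClasBunch.from_tokens2(["a", "b"], one_hot=True)
```

Subclassing and overriding from_tokens would avoid patching the library class, at the cost of having to use the subclass everywhere.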

This works perfectly. Thank you again, and the data block API is definitely at the top of the list of things I need to practice with in fastai now :slight_smile: