Reverse one-hot encoding for multi-label classification

I am trying to use an external dataset for multi-label classification.

I have tried various approaches to be able to format it similar to the Pascal dataset used in Lesson 06.

The two most promising attempts are:

  1. using str.cat

[Screenshot: str.cat attempt]

However, when I go through the vocabulary, I get individual characters instead of whole labels.

I have tried various combinations, with and without text preprocessing, using NaN, using str(0), and I still get the same segmented vocabulary.
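One possible cause, sketched here under assumptions because the original screenshot is not available: if each target reaches the vocabulary builder as a single string, iterating over it yields characters, whereas a list of label strings yields whole labels. The frame, the `fname` column, and the animal column names below are hypothetical stand-ins for the real dataset.

```python
import pandas as pd

# Hypothetical one-hot encoded frame; all names are illustrative only.
df = pd.DataFrame({
    "fname": ["a.jpg", "b.jpg", "c.jpg"],
    "cat":   [1, 0, 1],
    "dog":   [0, 1, 1],
    "horse": [0, 0, 0],
})
label_cols = ["cat", "dog", "horse"]

# Replace each flag with the column name (1) or an empty string (0).
named = pd.DataFrame({c: df[c].map({1: c, 0: ""}) for c in label_cols})

# Join the columns row by row with str.cat; empty cells leave stray spaces.
joined = named[label_cols[0]].str.cat(named[label_cols[1:]], sep=" ")

# Iterating over a plain string walks through its characters ...
char_vocab = sorted({ch for labels in joined for ch in labels})

# ... while splitting on whitespace first gives whole labels
# and also discards the stray spaces.
label_lists = joined.str.split()
word_vocab = sorted({w for labels in label_lists for w in labels})

print(char_vocab)
print(word_vocab)
```

If this is what is happening, returning the split list from get_y (rather than the joined string) may be enough to get a label-level vocabulary.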

  2. A for loop

[Screenshot: for loop attempt]

However, it stops after a single item. I have tried various combinations, including empty lists, appends, ranges, and while loops, but I cannot seem to find the right combination to gather all the animals present.
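A loop that stops after one item often returns from inside the loop body or recreates its result list on every pass; that is only a guess here, since the screenshot is not available. A minimal sketch of a row-wise loop that collects every active label, using a hypothetical frame and column names:

```python
import pandas as pd

# Hypothetical one-hot encoded frame; all names are illustrative only.
df = pd.DataFrame({
    "fname": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
    "cat":   [1, 0, 1, 0],
    "dog":   [0, 1, 1, 0],
    "horse": [0, 0, 0, 0],
})
label_cols = ["cat", "dog", "horse"]


def active_labels(row, cols):
    """Return every column name whose flag is 1 in this row."""
    found = []                 # created once per row, before the loop
    for col in cols:
        if row[col] == 1:
            found.append(col)  # keep going; do not return here
    return found               # return only after all columns were checked


# Build one list of labels per row.
df["labels"] = [active_labels(row, label_cols)
                for _, row in df[label_cols].iterrows()]

print(df[["fname", "labels"]])
```

Rows with no active flag end up with an empty list, which a multi-label setup can treat as "no label present".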

I also tried passing the encoded columns directly as get_y, which raised a TypeError.

I would appreciate advice on how best to use data where the multi-label targets are already encoded.
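For reference, fastai's MultiCategoryBlock accepts `encoded=True` together with a `vocab`, which is intended for targets that are already one-hot encoded. The sketch below assumes a hypothetical `fname` column, hypothetical animal columns, and an `images/` folder; it has not been checked against the actual dataset, so treat it as a starting point rather than a confirmed fix.

```python
import pandas as pd

# Hypothetical frame: "fname" and the animal columns are illustrative names.
df = pd.DataFrame({
    "fname": ["a.jpg", "b.jpg", "c.jpg"],
    "cat":   [1, 0, 1],
    "dog":   [0, 1, 1],
    "horse": [0, 0, 0],
})
label_cols = [c for c in df.columns if c != "fname"]


def make_datablock(label_cols, image_dir="images/"):
    """Build a DataBlock that reads already one-hot encoded targets."""
    # Imported lazily so the pandas part above runs without fastai installed.
    from fastai.vision.all import (
        DataBlock, ImageBlock, MultiCategoryBlock, ColReader, RandomSplitter,
    )
    return DataBlock(
        # encoded=True: targets are 0/1 vectors; vocab names each position.
        blocks=(ImageBlock, MultiCategoryBlock(encoded=True, vocab=label_cols)),
        get_x=ColReader("fname", pref=image_dir),
        # A list of columns makes ColReader return one value per column.
        get_y=ColReader(label_cols),
        splitter=RandomSplitter(valid_pct=0.2, seed=42),
    )
```

With real image files in place, `make_datablock(label_cols).dataloaders(df)` would build the DataLoaders; no string labels need to be reconstructed in this variant.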

Thank you!

Maria

I was able to find a way, but will still need to do some cleaning on the text.

If you can think of a way, I would appreciate your suggestion.

Thank you!

Maria