Setting up a TextBlock from a dataframe for multi-label classification

lshust · September 10, 2021, 1:33pm

I am receiving the following error when trying to set up data loaders for the toxic comments dataset, using a dataframe that I created from the .csv file.

AttributeError: 'list' object has no attribute 'truncate'

def get_y(r): return r[["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]]
block = DataBlock(
  blocks=(TextBlock.from_df("comment_text", is_lm=True), MultiCategoryBlock),
  get_x=ColReader("text"),
  get_y=get_y,
  splitter=RandomSplitter(.1)
)

dls = block.dataloaders(train, bs=64)
dls.show_batch()

This error comes from the show_batch method. I have also tried using ColReader for get_y, because I suspect that the way I’m retrieving the dependent variable is the problem.

ColReader(["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"])

This is how I import my data into a dataframe:

And then I do a train-test split:

from sklearn.model_selection import train_test_split

train, val = train_test_split(df, test_size=0.05)
train.shape, val.shape

Any idea how I can change my DataBlock to avoid this error? What am I missing?

Conwyn · September 11, 2021, 7:28pm

Hi Levi
Try changing get_y so that it returns not a [[]] selection but a string such as "toxic;severe_toxic…",
and add label_delim=";" to the DataBlock.

That way the label is a single string containing the name of each column whose value is true.
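A minimal sketch of that idea, written in plain Python so it runs without fastai or pandas. The helper name label_string and the example row are made up for illustration; only the six column names come from the original post. It assumes each label column holds 0 or 1.

```python
# Label columns from the original post.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def label_string(row, labels=LABELS, delim=";"):
    """Join the names of the columns whose flag is 1 into one delimited string."""
    # `row` can be a dict or a pandas Series; both support row[name] lookup.
    return delim.join(name for name in labels if int(row[name]) == 1)

# Hypothetical row, written as a plain dict so the sketch has no dependencies.
example_row = {
    "comment_text": "some comment",
    "toxic": 1, "severe_toxic": 0, "obscene": 1,
    "threat": 0, "insult": 0, "identity_hate": 0,
}

joined = label_string(example_row)
# Splitting on the same delimiter recovers the list of active labels.
# The guard avoids [""] for a row with no active labels.
as_list = joined.split(";") if joined else []
```

If the result is stored in a new column (say "labels", an assumed name), fastai's ColReader accepts a label_delim argument, so ColReader("labels", label_delim=";") would split the string back into a list of labels for get_y.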

Regards Conwyn