Most of the items in show_batch are xxpad strings

I am facing an issue while loading data from a custom CSV file: almost all of the items are xxpad strings. I created a vocab list from one file and was trying to use that vocab on another CSV file for multi-category classification.
The multi-category classification CSV file is in the following format:

text, label_1, label_2, label_3, label_4, label_5, label_6
"hello this is cool", 1, 0, 0, 1, 0, 0

dls_lm is the language model previously trained and loaded.

dls_labels = DataBlock(
    blocks=(TextBlock.from_df('text',vocab=dls_lm.vocab), MultiCategoryBlock),
).dataloaders(train_label, bs=128, seq_len=72)

When I run dls_labels.show_batch(), the first row's first column is a proper (tokenized) string from the CSV, and its y column is the semicolon-separated values from the label_x columns (1 or 0). After that, all the rows are repeated xxpad strings, with the same y value as the first row.

What am I doing wrong here? Please help me understand.

Hey Ted!

The short version is that this is a known issue with show_batch: your dataloaders are probably fine, but since batches are sorted by length, the longest item can be batched together with much shorter items, so show_batch ends up displaying all that padding. Try training, and if it works then most likely everything is fine :slight_smile:
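To make the sorting-and-padding point concrete, here is a minimal sketch in plain Python (no fastai, and `pad_batch` is just an illustrative helper, not a fastai function) of why a batch can look like mostly xxpad: every item in a batch is padded up to the length of the longest item in that batch.

```python
PAD = "xxpad"

def pad_batch(batch):
    """Pad every tokenized item in `batch` to the length of the longest item."""
    max_len = max(len(item) for item in batch)
    return [item + [PAD] * (max_len - len(item)) for item in batch]

# One long item forces heavy padding on all the short ones,
# which is what show_batch then displays.
batch = pad_batch([
    ["hello", "this", "is", "cool"],
    ["hi"],
    ["a", "very", "long", "sequence", "of", "tokens", "goes", "here"],
])
# batch[1] is now ["hi"] followed by seven "xxpad" tokens
```

So a wall of xxpad in show_batch does not by itself mean the labels or texts are wrong.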


Hey orendar,
thanks for the reply. My training was not working as expected, so I was trying to find errors in the previous steps, and the dataloader part looked dubious. Now I am back to square one. I suspect the loading of the label data is wrong here. In my data set, a 1 in column label_x means the text is classified with that particular label, and a 0 means it is not.

Is the way I am loading the DataLoader wrong? In show_batch the y column only has the values 0;1. What is the proper way to read this type of CSV for multi-label classification, given that I have the list of column names (label_x)? That is, based on the value in each column, choose the correct label from the list.

Hey Ted,

If you are doing multilabel text classification, try something like this (I am not sure about the parentheses for the block, and you can add your own metrics):
dls_clas = TextDataLoaders.from_df(df=df, valid_pct=0.2, seed=42, text_col='text', text_vocab=vocab, bs=bs, y_block=MultiCategoryBlock(), label_col=labels)
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, n_out=len(labels), metrics=[])


Hey, thanks, but it is still taking the values inside the columns instead of the labels from the list, producing a dls like the following:

data text2, 0;1
data text1, 0;1

Is it mandatory to put the category name in the column values instead of a 1/0 flag? I tried another dummy CSV file which has the category name in the column itself instead of the 1/0 flag, and that worked.

Hey Ted,

I don’t understand your last comment. Could you please try my method (with “labels” being a list of your label columns) and paste the full code together with the problem you’re trying to solve? I work a lot with multilabel text, so hopefully I can provide some assistance there :slight_smile:

Hey Orendar, I had control over the data source, so I reformatted my CSV data as:
"text", "labels"
"text_body", "semicolon-separated labels, i.e. label1;label2;label3"
I was able to create the dataloader and do the classification without any hiccups, with some parameter changes in the code you provided.

dls_labels = TextDataLoaders.from_df(

Earlier my data was in the following format, where a 1 in a label column represents whether that label is associated with the text. In this format I was not able to get it working with the code you provided:

"text", "label1", "label2", "label3"
"text body", 1, 0, 0

Hey Ted,

Glad you managed to get it to work - feel free to mark this issue as resolved if you’re happy with the outcome :slight_smile:

Otherwise if you ever want to debug the previous version of your data, feel free to tag me - pretty sure I managed to train models on dataframes in that format before.
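For reference, fastai's DataBlock API can consume that one-hot format directly via MultiCategoryBlock(encoded=True, vocab=labels) together with get_y=ColReader(labels); conceptually it just maps each 1 flag back to its column name. A pure-Python sketch of that decoding idea (no fastai required, and `decode_one_hot` is an illustrative helper, not a fastai function):

```python
labels = ["label1", "label2", "label3"]

def decode_one_hot(row, vocab):
    """Map a row of 0/1 flags back to the label names whose flag is 1."""
    return [name for name, flag in zip(vocab, row) if flag == 1]

decode_one_hot([1, 0, 1], labels)  # -> ["label1", "label3"]
```

So the flag-column format is usable as-is; the block just needs to be told the labels are already encoded.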

Just hit the same issue today, and here is how I was able to look into the entire sample (link to the post I commented on):