Most of the items in show_batch are xxpad strings

tedks · September 15, 2020, 8:20am

I am facing an issue while loading data from a custom CSV file: almost all the items are xxpad strings. I have created a vocab list from one file and I am trying to use that vocab on another CSV file for multi-category classification.
The multi-category classification CSV file is in the following format:

text, label_1, label_2, label_3, label_4, label_5, label_6
"hello this is cool", 1, 0, 0, 1, 0, 0

dls_lm holds the DataLoaders of the language model that was previously trained and loaded.

dls_labels = DataBlock(
    blocks=(TextBlock.from_df('text',vocab=dls_lm.vocab), MultiCategoryBlock),
    get_x=ColReader("text"),
    get_y=ColReader([1,2,3,4,5,6]),
    splitter=RandomSplitter(0.2)
).dataloaders(train_label, bs=128, seq_len=72)

When I run dls_labels.show_batch(), the first column of the first row is a proper (tokenized) string from the CSV, and its y column holds the semicolon-separated values from the label_x columns (1 or 0). After that, every row consists of repeated xxpad strings, with the same y value as the first row.

What am I doing wrong here? Please help me understand.

orendar · September 15, 2020, 8:45am

Hey Ted!

The short version is that it’s a known issue with show_batch - your dataloaders are probably fine, but since batches are sorted by length, the longest item may be batched together with much shorter items, and the shorter ones are padded up to its length, so show_batch ends up displaying all this padding. Try training, and if it works then most likely everything is fine.
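To make the padding effect concrete, here is a minimal sketch in plain Python. It is not fastai's collation code: pad_batch and pad_fraction are hypothetical helpers, the token lists are made up, and padding is appended at the end purely for illustration. It only shows how one long item forces every shorter item in the same batch to be mostly xxpad.

```python
PAD = "xxpad"

def pad_batch(token_lists, pad_token=PAD):
    """Pad every token list to the length of the longest one in the batch."""
    longest = max(len(tokens) for tokens in token_lists)
    return [tokens + [pad_token] * (longest - len(tokens)) for tokens in token_lists]

def pad_fraction(padded, pad_token=PAD):
    """Fraction of all tokens in the padded batch that are padding."""
    total = sum(len(row) for row in padded)
    pads = sum(tok == pad_token for row in padded for tok in row)
    return pads / total

# One long document batched with two short ones (made-up data).
batch = [
    ["xxbos"] + ["word"] * 199,                # 200 tokens
    ["xxbos", "hello", "this", "is", "cool"],  # 5 tokens
    ["xxbos", "short", "text"],                # 3 tokens
]
padded = pad_batch(batch)

# Every row now has the length of the longest item.
print([len(row) for row in padded])
# Share of the batch that is padding.
print(round(pad_fraction(padded), 3))
```

When only the first few dozen tokens of each row are displayed, rows like the two short ones above look like almost pure padding even though the underlying data is intact.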

tedks · September 15, 2020, 9:06am

Hey orendar,
Thanks for the reply. My training was not working as expected, so I was trying to find errors in the previous steps, and the dataloader part looked dubious. Now I am back to square one. I suspect the label data is being loaded incorrectly. In my dataset, a “1” in a label_x column means the text has been classified with that label, and a 0 means it has not.

I think the way I am building the DataLoaders is wrong, because in show_batch the y column only has the values (0;1). What is the proper way to read this type of CSV for multi-label classification, given that I have the list of column names (label_x)? That is, the correct label should be chosen from the list based on the value in each column.
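The difference between the two readings of the label columns can be sketched in plain Python. This is only an illustration of the symptom, not fastai code: the row is made up to match the sample CSV line, and values_as_categories and flags_to_names are hypothetical helper names.

```python
label_cols = ["label_1", "label_2", "label_3", "label_4", "label_5", "label_6"]

# One row of the CSV, as a dict (made-up example matching the sample line).
row = {
    "text": "hello this is cool",
    "label_1": 1, "label_2": 0, "label_3": 0,
    "label_4": 1, "label_5": 0, "label_6": 0,
}

def values_as_categories(row, label_cols):
    """Treat each cell value as a category name: only '0' and '1' can ever appear."""
    return sorted({str(row[col]) for col in label_cols})

def flags_to_names(row, label_cols):
    """Treat each cell as a flag: keep the column name wherever the flag is 1."""
    return [col for col in label_cols if int(row[col]) == 1]

print(";".join(values_as_categories(row, label_cols)))  # the cell values themselves
print(";".join(flags_to_names(row, label_cols)))        # the intended label names
```

The first reading yields the same two "categories" for nearly every row, which is consistent with a y column that only ever shows 0;1. The second reading is what the 1/0 layout is meant to express.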

orendar · September 15, 2020, 11:43am

Hey Ted,

If you are doing multilabel text classification, try something like this (I am not sure about the parentheses for the block and you can add your own metrics):
dls_clas = TextDataLoaders.from_df(df=df, valid_pct=0.2, seed=42, text_col='text', text_vocab=vocab, bs=bs, y_names=labels, y_block=MultiCategoryBlock(), label_col=labels)
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, n_out=len(labels), metrics=[])

tedks · September 15, 2020, 2:26pm

Hey, thanks. But it is still taking the values inside the columns instead of the labels from the list, producing a dls like the following:

data text2, 0;1
data text1, 0;1

Is it mandatory to put the category name in the column values instead of a 1/0 flag? I tried another dummy CSV file that has the category name in the column itself instead of a 1/0 flag, and that worked.

orendar · September 16, 2020, 2:23pm

Hey Ted,

I don’t understand your last comment. Could you please try my method (with “labels” being a list of your label columns) and paste the full code together with the problem you’re trying to solve? I work a lot with multilabel text, so hopefully I can provide some assistance there.

tedks · September 18, 2020, 12:41pm

Hey Orendar, I had control over the data source, so I changed my CSV data to the following format:
"text", "labels"
"text_body", "semicolon-separated labels, i.e. label1;label2;label3"
I was able to create the dataloaders and do the classification without any hiccups, with some parameter changes to the code you provided.

dls_labels = TextDataLoaders.from_df(
    df=labeled_data, 
    valid_pct=0.2, 
    text_vocab=dls_lm.vocab,
    seed=42, 
    text_col='text', 
    label_col='labels', 
    label_delim=";", 
    y_block=MultiCategoryBlock, 
)

Earlier my data was in the following format, where a 1 in a label column indicates that the label is associated with that text. In this format I was not able to get it working with the code you provided:

"text", "label1", "label2", "label3"
"text body", 1, 0, 0
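For anyone who cannot regenerate the data at the source, the same conversion from the one-column-per-label layout to the "text", "labels" layout can be sketched with the standard library. This is a minimal sketch under assumptions: wide_to_delimited is a hypothetical helper, the text column is assumed to be named text, every other column is assumed to be a 1/0 flag, and the sample rows are made up.

```python
import csv
import io

def wide_to_delimited(src, dst, text_col="text", delim=";"):
    """Rewrite a CSV with one 1/0 column per label as a two-column CSV
    whose 'labels' column joins the names of the active labels with delim."""
    # skipinitialspace handles the space after each comma in the sample format.
    reader = csv.DictReader(src, skipinitialspace=True)
    label_cols = [col for col in reader.fieldnames if col != text_col]
    writer = csv.writer(dst, lineterminator="\n")
    writer.writerow([text_col, "labels"])
    for row in reader:
        active = [col for col in label_cols if row[col].strip() == "1"]
        writer.writerow([row[text_col], delim.join(active)])

# Made-up input in the earlier one-column-per-label format.
wide = io.StringIO(
    '"text", "label1", "label2", "label3"\n'
    '"text body", 1, 0, 0\n'
    '"another body", 0, 1, 1\n'
)
out = io.StringIO()
wide_to_delimited(wide, out)
print(out.getvalue())
```

With real files, the two StringIO objects would be replaced by file handles opened with newline="", and the resulting file can then be loaded with label_col='labels' and label_delim=";".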

orendar · September 18, 2020, 2:55pm

Hey Ted,

Glad you managed to get it to work - feel free to mark this issue as resolved if you’re happy with the outcome.

Otherwise if you ever want to debug the previous version of your data, feel free to tag me - pretty sure I managed to train models on dataframes in that format before.
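For the one-column-per-flag format, one approach that should work is to tell MultiCategoryBlock that the targets are already one-hot encoded and to pass the column names as its vocab. The following is a sketch only, assuming fastai v2 and a DataFrame whose text column is named text. build_multilabel_dls is a hypothetical helper name, label_cols and lm_vocab are supplied by the caller, and it has not been run against the data from this thread.

```python
def build_multilabel_dls(df, label_cols, lm_vocab, bs=128, seq_len=72, valid_pct=0.2):
    """Build classification DataLoaders from a DataFrame with a 'text' column
    and one 1/0 column per label (column names listed in label_cols)."""
    # Imported here so the helper can be defined without fastai installed;
    # in a notebook this would normally be a top-level import.
    from fastai.text.all import (
        ColReader, DataBlock, MultiCategoryBlock, RandomSplitter, TextBlock,
    )

    block = DataBlock(
        blocks=(
            # Reuse the language model vocab so token ids line up with the encoder.
            TextBlock.from_df("text", vocab=lm_vocab, seq_len=seq_len),
            # encoded=True: targets are already 1/0 flags; vocab gives them names.
            MultiCategoryBlock(encoded=True, vocab=list(label_cols)),
        ),
        get_x=ColReader("text"),
        # Read the flag columns by name rather than by position.
        get_y=ColReader(list(label_cols)),
        splitter=RandomSplitter(valid_pct),
    )
    return block.dataloaders(df, bs=bs)
```

A call would look like build_multilabel_dls(train_label, ["label_1", "label_2", "label_3", "label_4", "label_5", "label_6"], dls_lm.vocab). If this works as intended, show_batch should display label names in the y column instead of 0;1.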

zerotosingularity · September 24, 2020, 7:33pm

Just hit the same issue today, and here is how I was able to look into the entire sample (link to the post I commented on):