Lesson 10 NLP, Creating the Classifier DataLoaders: xxpad-only results

Lesson #10, Part 1, 2020
Creating the Classifier DataLoaders

I have checked everything against the original notebook and the video; everything is the same.
When trying to build the DataLoaders for sentiment analysis with this DataBlock:


dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab), CategoryBlock),
    get_y=parent_label,
    get_items=partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)


And then reviewing the batch with

dls_clas.show_batch(max_n=5)

The first element seems fine, but starting with the second element I only see this:

xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad … (the same xxpad token repeated across the entire row)

Hey Luis,

This is a common problem when your inputs have varying sizes and thus different amounts of padding - see here for the solution :slight_smile:
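To make the mechanism concrete, here is a toy, fastai-free sketch (the function name `pad_batch` is my own, not fastai API) of pad-to-longest batching. A classifier batch pads every sequence up to the longest one in the batch, so a short review sitting next to a very long one ends up being mostly `xxpad`:

```python
def pad_batch(seqs, pad_tok='xxpad'):
    """Pad each token list to the length of the longest sequence in the
    batch, putting the padding at the start (fastai's classifier
    dataloader pads at the beginning by default, if I recall correctly)."""
    max_len = max(len(s) for s in seqs)
    return [[pad_tok] * (max_len - len(s)) + s for s in seqs]

batch = pad_batch([['a', 'b', 'c', 'd', 'e', 'f'], ['x', 'y']])
# The short sequence is now mostly padding:
# ['xxpad', 'xxpad', 'xxpad', 'xxpad', 'x', 'y']
```

So a row that shows only `xxpad` is usually just the (padded) start of a short sample being displayed, not corrupt data.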


Thank you, but the problem persists. Something seems wrong when building the dls_clas DataBlock.

When running the next step in chapter 10 of the book to build the text classifier learner:

learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, metrics=accuracy).to_fp16()

I get:

train_loss: nan, valid_loss: nan, accuracy: 0.5

@leromerom : Are we running into the same problem? URLs.IMDB => xxpad xxpad xxpad xxpad xxpad xxpad xxpad

In particular, if you execute:

from fastai.text.all import *
dls = TextDataLoaders.from_folder(untar_data(URLs.IMDB), valid='test')
dls.show_batch()

do you get the same xxpad xxpad xxpad ... problem ?

Maybe not exactly the same. I am running the lesson 10 NLP notebook. I can run all the exercises with no problems up to creating the language model and sampling from it, but when I try to build the classification model for sentiment analysis by creating the dls_clas DataBlock, I run into this problem.

Could you give a little more details? :slight_smile:

Hey,
Please see the top of this thread, where I show the details. Thanks!

Sorry, I thought it was an additional question.

You should be able to solve that by:

dls.show_batch(max_n=10, trunc_at=3000)

You can change the trunc_at value to see more or less of each sample.

More details
here as indicated in the first comment.

Thank you, but it doesn't work. I believe the problem is not in displaying, but rather in building the dls_clas DataBlock, since training fails as well.

The issue persists. I hacked together a function to display text without padding:

def show_batch_text(dls, max_n=10, ctxs=None, trunc_at=150, unpad=True, **kwargs):
    b = dls.one_batch()
    x, y, samples = dls._pre_show_batch(b, max_n=max_n)
    if ctxs is None: ctxs = get_empty_df(min(len(samples), max_n))
    # next line removes padding
    if unpad: samples = L((TitledStr(s[0].replace('xxpad', '').strip()),*s[1:]) for s in samples)
    if trunc_at is not None: samples = L((s[0].truncate(trunc_at),*s[1:]) for s in samples)
    for i in range_of(samples[0]):
        ctxs = [b.show(ctx=c) for b,c,_ in zip(samples.itemgot(i),ctxs, range(max_n))]
    display_df(pd.DataFrame(ctxs))
    return ctxs
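As an aside, the unpad-then-truncate idea in the helper above can be illustrated with a tiny fastai-free function (the name `unpad_and_truncate` is hypothetical, for illustration only):

```python
def unpad_and_truncate(text, trunc_at=150, pad_tok='xxpad'):
    """Drop every padding token from a decoded string, then truncate
    the cleaned text to trunc_at characters for display."""
    cleaned = ' '.join(t for t in text.split() if t != pad_tok)
    return cleaned[:trunc_at]

unpad_and_truncate('xxpad xxpad xxpad a great movie')
# -> 'a great movie'
```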

It would be nice to add an option to remove padding from decoded text. I am not sure where the best place for it is; maybe in dls._decode_batch, or in show_batch[TensorText] (similar to what I did above).
I wonder if I'm missing something and there is a reason not to do so?
