Maybe not exactly the same. I am running the lesson 10 NLP notebook. I can run all the exercises with no problems up to the creation of the language model and sampling from it, but when I try to build the classification model for sentiment analysis, creating the dls_clas DataBlock, I run into this problem.
Thank you, but it doesn't work. I believe the problem is not in displaying, but rather in building the dls_clas DataBlock, since training also fails.
The issue persists. I hacked together a function to display text without padding:
```python
from fastai.text.all import *

def show_batch_text(dls, max_n=10, ctxs=None, trunc_at=150, unpad=True, **kwargs):
    b = dls.one_batch()
    x, y, samples = dls._pre_show_batch(b, max_n=max_n)
    if ctxs is None: ctxs = get_empty_df(min(len(samples), max_n))
    # next line removes padding from the decoded text
    if unpad: samples = L((TitledStr(s[0].replace('xxpad', '').strip()), *s[1:]) for s in samples)
    if trunc_at is not None: samples = L((s[0].truncate(trunc_at), *s[1:]) for s in samples)
    for i in range_of(samples[0]):
        ctxs = [b.show(ctx=c) for b, c, _ in zip(samples.itemgot(i), ctxs, range(max_n))]
    display_df(pd.DataFrame(ctxs))
    return ctxs
```
It would be nice to add an option to remove padding from decoded text. I'm not sure where the best place to do it is; maybe in dls._decode_batch or in show_batch[TensorText] (similar to what I did above).
I wonder if I'm missing something and there is a reason not to do so?