[Text] The reasons why I want to show pad in `show_batch`

Below are my personal opinions. I would like to make a PR, but before that, what are your opinions?

  1. show_batch doesn't tell me what my model will actually see in a batch.
     [screenshot of show_batch output]
     In reality there is a variable number of pads behind the sentences.
     • I can't use this beautiful visual to teach newbies what is actually passed to the model.
     • I can't rely on show_batch 100% to confirm that my data loading is correct.
  2. If the pad is hidden to prevent showing too many words, we could just use trunc_at, which limits the number of words shown per cell (see the short example after this list).
  3. If your PAD is not xxpad, it still appears.
     [screenshot]
     And it seems that we can't change PAD (?), so this point may be moot (?).
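To make point 2 concrete, here is a minimal sketch, assuming fastai v2 and the IMDB sample data (the column names come from that sample CSV). trunc_at already caps how many tokens are shown per cell, so showing pads need not flood the table:

from fastai.text.all import *

# Build a small text DataLoaders just to have something to call show_batch on
path = untar_data(URLs.IMDB_SAMPLE)
dls = TextDataLoaders.from_csv(path, 'texts.csv', text_col='text', label_col='label')

dls.show_batch(max_n=4, trunc_at=72)  # show at most 72 tokens per text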

PS. Here is where this happens:

class Numericalize(Transform):
    ...
    def decodes(self, o): return L(self.vocab[o_] for o_ in o if self.vocab[o_] != PAD)
    # !!! here: any decoded token equal to PAD is silently dropped !!!
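(For reference, PAD here is the module-level special-token constant defined in fastai.text.core, which is part of why it is not easy to change per DataBlock. A quick check, assuming fastai v2:)

from fastai.text.core import PAD
print(PAD)  # -> 'xxpad', fastai's default padding token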

In practice, since the first batch contains the longest texts in classification, you'd see only xxpad on all your texts when inspecting the IMDB dataset.
I’d suggest using another transform where you change this or just monkey-patching Numericalize.
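For anyone who wants the pads back right away, here is a minimal sketch of the "another transform" route, assuming fastai v2 (the class name NumericalizeKeepPad is made up for illustration):

from fastai.text.all import *

class NumericalizeKeepPad(Numericalize):
    "Like Numericalize, but decoded texts keep the pad tokens"
    def decodes(self, o): return L(self.vocab[o_] for o_ in o)  # no PAD filtering

Using this in place of Numericalize when building your datasets makes show_batch decode the pads as well.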

Thanks for your reply!

So we don't show the pad because, when using SortedDL, the text in the 1st sample is super long, so the later samples (maybe from the 7th on) contain lots of pads?

Could we modify it so that something like pad_token can be passed to TextBlock through kwargs and on to Numericalize.__init__, and have Numericalize.decodes discard the pad only when pad_token is not None? Then:

  1. We can more easily choose whether or not to show the pad.
  2. Even if the pad is not xxpad, we can still choose not to show it (e.g. pad_token='[PAD]'). A rough sketch of what I mean follows below.
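Here is a rough sketch of that idea, written as it would sit inside fastai.text.data where Transform, L and PAD are already in scope (the pad_token argument and the kwargs forwarding are only a proposal, not fastai's current API):

class Numericalize(Transform):
    def __init__(self, vocab=None, pad_token=PAD, **kwargs):  # pad_token is the proposed new argument; **kwargs stands for the existing ones
        self.pad_token = pad_token
        ...  # rest of the existing __init__ unchanged
    def decodes(self, o):
        # drop the pad only when a pad_token is set; pad_token=None shows everything
        if self.pad_token is None: return L(self.vocab[o_] for o_ in o)
        return L(self.vocab[o_] for o_ in o if self.vocab[o_] != self.pad_token)

TextBlock would then simply forward pad_token to Numericalize, so you could pass pad_token=None to see the pads, or pad_token='[PAD]' to hide a custom pad.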

Sure, that’s a reasonable behavior. Go ahead with a PR if you want to add this change.
