Hi all, this is my first post, so I'll try to format it correctly.
There is a line of code in `masked_concat_pool` in Learner.py that just doesn't make sense to me.
Currently it looks like this:

```python
avg_pool = output.masked_fill(mask[:, :, None], 0).mean(dim=1)
avg_pool *= output.shape[1] / lens[:, None]
```

The average pool is first computed using `mean`, and then scaled up in inverse proportion to the (unpadded) sequence length. This seems to imply that the shorter the sequence, the more its mean gets scaled up. I wonder whether this is a trick to improve performance, or a bug because someone assumed the first line used `sum`?

I believe the code wasn't like this in the course, and it also wasn't like this back when fastai was still using PyTorch's adaptive 1d pooling?

Sorry, I was being silly; my brain couldn't put `masked_fill` and `mean` together, even though I understood each of them on its own.
Basically, the second line of code is intended to compensate for the fact that all the padding positions were filled with zeros, which would otherwise drag the mean towards zero.
However, this bring me one more question, for this code to work, we must assume the output of LSTM are indeed a zero mean distribution? otherwise we can’t fill them with 0. But at this stage the LSTM output were not normalized yet, can we really assume it is zero mean?

No, no zero-mean assumption is needed: the mean is just the sum divided by the number of elements. Filling the padding with zeros doesn't change the sum, and the second line corrects the divisor from the padded length to the real length, so together they give the mean over only the real tokens.
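To see it with numbers, here's a toy sketch in plain Python (lists instead of tensors, and the values are made up):

```python
# Toy example: a "sequence" of 3 real values, padded with zeros to length 5
# (as if the pad positions had been masked_fill'd with 0).
values = [2.0, 4.0, 6.0]          # real tokens
padded = values + [0.0, 0.0]      # padding positions set to 0

total_len, real_len = len(padded), len(values)

# Mean over the whole padded sequence (what .mean(dim=1) computes):
padded_mean = sum(padded) / total_len           # 12 / 5 = 2.4
# Rescale by total length / real length (the second line's "trick"):
rescaled = padded_mean * total_len / real_len   # 2.4 * 5 / 3 = 4.0

true_mean = sum(values) / real_len              # 4.0
assert abs(rescaled - true_mean) < 1e-9
```

The zeros inflate the denominator of the first mean, and the rescale undoes exactly that inflation, no matter what the real values are.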

Ahh my god, you are right! It IS indeed the sum divided by the number of elements.
Not trying to nitpick, and I realize this is super minor stuff, but why then isn't the code just sum-then-divide? I recall the code used to be like that. Compared to "mean, then multiply by the total length, then divide by the number of tokens", the sum method would save two operations?