Use PackedSequence in RNNcore to improve pooling accuracy (and RNN speed)

Hey,

In the current implementation, the RNN output pooling calculates the average over the entire tensor, padding included. This makes predictions inconsistent when predicting batches of texts, because the amount of padding changes from batch to batch.
This PR replaces the previous padding with PyTorch's PackedSequence to pack and pad the texts in the RNN core. This change also improves the RNN speed.
We also use the sequence lengths to compute the average pooling while ignoring the padding.
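Roughly, the idea looks like this (a simplified sketch of the approach, not the actual fastai code; `SimpleEncoder` and the exact shapes are just illustrative):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class SimpleEncoder(nn.Module):
    "Illustrative encoder: pack the padded batch before the LSTM, then pool using the true lengths."
    def __init__(self, vocab_sz, emb_sz, hid_sz, pad_idx=1):
        super().__init__()
        self.emb = nn.Embedding(vocab_sz, emb_sz, padding_idx=pad_idx)
        self.rnn = nn.LSTM(emb_sz, hid_sz, batch_first=True)

    def forward(self, x, lengths):
        # x: (bs, seq_len) padded token ids; lengths: (bs,) true length of each text
        emb = self.emb(x)
        packed = pack_padded_sequence(emb, lengths.cpu(), batch_first=True,
                                      enforce_sorted=False)
        packed_out, _ = self.rnn(packed)             # the LSTM never sees the padding
        out, _ = pad_packed_sequence(packed_out, batch_first=True)  # (bs, max_len, hid), zero-padded
        # average pooling that ignores the padding: sum the real timesteps, divide by the true length
        return out.sum(dim=1) / lengths.unsqueeze(1).float()
```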

Note: with the current implementation of MultiBatchRNN, it is not recommended to predict a batch containing texts whose lengths differ by more than 1*bptt.

Here is the commit with my changes:

I’d appreciate your feedback!
Rani

p.s.
@jeremy @sgugger - this is a follow-up to my previous PR, this time without breaking the current implementation :slight_smile:

Thanks! What we actually need is a Jupyter notebook showing your solution, how it's used, and tests. We'll then integrate it into the library and credit you in the CHANGES doc. You can share the notebook using the 'gist it' extension. Hope that works for you. :slight_smile:

It actually works with the imdb notebook without any changes, except for updating it to fastai v1. I can upload the updated imdb notebook :slight_smile:

I would love to see that - it would be really helpful to see how well it runs with this change.

Yes please. We would mostly need the full IMDB notebook to check it reaches the same accuracy (94.7-94.8%). Here is the draft that was in dev to get the hyper-parameter values; the rest should be the same, just with the full dataset instead of the sample.


Sorry for the late response. Here is the IMDB notebook with all the adaptations for fastai v1.
I reached an accuracy of 92%, but I only did one cycle of training for the language model (it's quite slow and I'm cheap :slight_smile: ). I believe we can reach the same accuracy with more training.

The notebook:

My changes:

Bump. I’d love to get your feedback on that :slight_smile:

Whoops, sorry, forgot to do my due diligence on this one. Will try this out, probably tomorrow, and report back.


So I tried it on the full imdb dataset. It does seem to help a tiny bit with the final accuracy (though that could be random), but it's actually slower than the current implementation. Shouldn't it be faster?

Happy to hear that it helps a bit (but mainly that it works :slight_smile: )
The LSTM layer should skip the padding, so it should be a bit faster. Do you know which part makes it slower? Is it the PyTorch packing?

I have no idea, I just ran your code versus the current one on the same machine and found yours slower.

Significantly slower?
I'll look into it again; maybe it's something with the PyTorch packing and padding.
Thanks!

Not significantly, like 10s to 1 minute slower depending on the number of unfrozen layers. I thought the whole point was to get some speed-up with this; that's why I found it strange.

Actually, that wasn't the point, imo. The reason I suggested this change was that the prediction results in production were unstable when predicting a batch of inputs together - the output for a single input varied depending on the amount of padding needed.
I hoped it'd speed things up as well :frowning:

If I understand you correctly, the question you are trying to address is: if I changed the amount of padding for the same piece of text, would I get varying predictions? I love your idea of dividing by the length as opposed to a blind average.
If that's the case, the real test is whether your predictions remain consistent when you change the padding on a single line of text you are predicting on (and, comparatively, whether the current fastai model generates inconsistent predictions).
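For instance, something along these lines (a hypothetical helper, assuming a model that takes padded ids and lengths as in the earlier sketch):

```python
import torch

def check_padding_consistency(model, token_ids, pad_idx=1, extra_pads=(0, 5, 50)):
    "Predict the same text with different amounts of trailing padding and compare the outputs."
    model.eval()
    outputs = []
    with torch.no_grad():
        for n_pad in extra_pads:
            x = torch.tensor([token_ids + [pad_idx] * n_pad])  # batch of one, padded
            lengths = torch.tensor([len(token_ids)])            # true length, padding excluded
            outputs.append(model(x, lengths))
    # with padding-aware pooling these should all match; with a blind average they may not
    return [torch.allclose(outputs[0], o, atol=1e-5) for o in outputs[1:]]
```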
Now I haven’t read your code, I’m not sure why you wanted to do the padding in RNN core instead of PackedSequence, but if that is what is slowing the machine down, potentially, you can revert to the original version?

You're right (if I understand you correctly :slight_smile: ): changing the amount of padding affects the predictions. It's only relevant if you want to predict a batch of inputs, because otherwise you don't need to pad the input.

I think you're right again :), we can probably revert to the original version without the PackedSequence and keep only the new pooling. I'll have to look into it, though.
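If we go that route, the pooling-only version might look roughly like this (just a sketch under that assumption; `masked_avg_pool` is an illustrative name, not the PR code):

```python
import torch

def masked_avg_pool(outputs, lengths):
    "Average pooling over the real timesteps only; outputs: (bs, seq_len, hid), lengths: (bs,)."
    # mask[i, t] is 1 while t is a real timestep of sequence i, 0 over the padding
    mask = (torch.arange(outputs.size(1), device=outputs.device)[None, :]
            < lengths[:, None]).unsqueeze(-1).float()
    return (outputs * mask).sum(dim=1) / lengths[:, None].float()
```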