Optimising class LanguageModelLoader()

I went for a walk to think about some observations that have been nagging me:

  • As I understand it, the current fastai chops through the ragged arrays in a more discontinuous way than the LanguageModelLoader I propose, with better accuracy as a result.
  • I ran an experiment where each batch started at the beginning of a ragged array, hoping it would improve accuracy. It didn’t. My guess is that the better alignment during training reduces the model’s generalisation!

This makes me wonder whether the current fastai implicitly performs a sort of inter-batch cut-out of tokens, thus making the RNN more robust to alternative variations of a sentence.

I could implement inter- and intra-batch cut-out relatively easily to see if that can explain the difference in accuracy.
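For the intra-batch part I imagine something along these lines (just a sketch, the names and the unk/pad id are made up):

```python
import torch

def token_cutout(x, p=0.1, unk_id=0):
    """Return a copy of x where each token is replaced by unk_id with probability p,
    so the RNN cannot rely on the exact alignment of tokens within the batch."""
    mask = torch.rand_like(x, dtype=torch.float) < p
    return torch.where(mask, torch.full_like(x, unk_id), x)

# toy usage on a batch of token ids
x = torch.randint(1, 100, (4, 10))
x_cut = token_cutout(x, p=0.15)
```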

Right now I am running 10 epochs with the current fastai and my LanguageModelLoader, with bptt=130 and p_btt=0, to get a new baseline. That will take about 11 hours. Depending on the results I could run a cut-out experiment after that.

On the batch dimension, yes, but on the sequence dimension it’s continuous. In your implementation each batch is a continuous chunk of text, but we don’t care about that in an RNN; what is important, to take advantage of the hidden state, is to have the tokens continue across batches within each row.
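Concretely, the standard layout gives you that continuity: concatenate everything, reshape to (bs, n), and slice windows of length bptt along the sequence dimension. This is only a minimal sketch of the idea, not the actual fastai code:

```python
import numpy as np
import torch

def batchify(token_ids, bs):
    """Trim to a multiple of bs and reshape so that row i of slice k+1
    continues exactly where row i of slice k ended."""
    n = len(token_ids) // bs
    data = np.asarray(token_ids[: n * bs]).reshape(bs, n)
    return torch.from_numpy(data)

def iter_batches(data, bptt):
    """Yield (input, target) pairs; the hidden state carried over per row stays
    valid because each row is one uninterrupted stream of text."""
    for i in range(0, data.size(1) - 1, bptt):
        seq_len = min(bptt, data.size(1) - 1 - i)
        x = data[:, i : i + seq_len]
        y = data[:, i + 1 : i + 1 + seq_len]
        yield x, y

# toy usage
tokens = list(range(100))
data = batchify(tokens, bs=4)
for x, y in iter_batches(data, bptt=10):
    pass  # feed x to the RNN and keep the hidden state between iterations
```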

That could be done (see the sketch after this list):
-it would require 2*bs ints to hold the offsets into the ragged arrays instead of 2 ints
-it would probably be a bit slower, because indexing an array is more expensive than addressing a local int
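A rough sketch of the bookkeeping I have in mind (all names are made up, this is not the actual loader): each of the bs rows keeps its own rag index and offset, and advances them independently when filling the next batch.

```python
import numpy as np

def fill_row(rags, rag_idx, offset, out_row):
    """Copy tokens into out_row, continuing from (rag_idx, offset) and jumping
    to the next ragged array whenever the current one is exhausted."""
    filled = 0
    while filled < len(out_row) and rag_idx < len(rags):
        rag = rags[rag_idx]
        take = min(len(out_row) - filled, len(rag) - offset)
        out_row[filled : filled + take] = rag[offset : offset + take]
        filled += take
        offset += take
        if offset == len(rag):              # move on to the next ragged array
            rag_idx, offset = rag_idx + 1, 0
    return rag_idx, offset

# per-row state: 2*bs ints instead of the 2 ints needed for a single cursor
bs, seq_len = 4, 10
rags = [np.arange(i * 7, i * 7 + 7) for i in range(20)]  # toy "documents"
rag_idx = np.array([0, 5, 10, 15])    # each row starts in its own region of the corpus
offset = np.zeros(bs, dtype=int)
batch = np.zeros((bs, seq_len), dtype=int)
for r in range(bs):
    rag_idx[r], offset[r] = fill_row(rags, rag_idx[r], offset[r], batch[r])
```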

I completely missed that pytorch/fastai maintains the internal state of the RNN across batches per row - so embarrassing. I’ll adjust my approach.

@Kaspar great writing, I see you are putting quite some effort into optimising the memory consumption and speed. I love this kind of fiddling myself, and it makes a lot of sense for things that you use in production. When you train, though, readability beats performance: I don’t mind an additional hour of training if I can avoid running the training (20h) again because of a bug in the way I’ve trained the LM.

What I find interesting in your idea is the ability to train on the whole of Wikipedia. I know that Jeremy reported that this does not improve the downstream classification tasks, but given the reports from BERT (they attribute their performance to the large dataset they used), it may be helpful to experiment with it a bit more on different tasks (NER, NLI) and network types (QRNNs, Transformers). So being able to train the model on an arbitrarily large amount of data is quite interesting.

Have you considered using memory-mapped numpy arrays to keep the readability of the code and to be able to handle larger datasets? Here is a pytorch thread about it. This won’t give you the performance improvements, but it will let us train on the whole of Wikipedia. And if you think about it a bit longer, you might find a way to get the performance improvement as well.
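Something like this is what I have in mind (file name and sizes are just placeholders): the token ids are written to disk once, and opening them with mmap_mode means only the slices that are actually read get paged in.

```python
import numpy as np

path = "wiki_tokens.npy"                       # hypothetical file of token ids
ids = np.arange(1_000_000, dtype=np.int32)     # stand-in for a tokenised corpus
np.save(path, ids)

tokens = np.load(path, mmap_mode="r")          # nothing is loaded eagerly
batch = np.array(tokens[10_000:10_070])        # only this slice is read from disk
```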

Regarding your question about why it is important to load a larger batch first: it is because of PyTorch’s GPU memory management; it reduces memory fragmentation and lets you avoid OOM exceptions.
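A tiny sketch of the idea (a hypothetical helper, not the actual fastai sampler): if the very first batch is the largest one, the allocator grabs its biggest block up front and later, smaller batches can reuse it instead of fragmenting GPU memory.

```python
import torch

def order_batches(batches):
    """Put the largest batch first, leave the rest in their original order."""
    biggest = max(range(len(batches)), key=lambda i: batches[i].numel())
    return [batches[biggest]] + [b for i, b in enumerate(batches) if i != biggest]

# toy usage: variable-length batches of token ids
batches = [torch.zeros(32, n, dtype=torch.long) for n in (40, 70, 55)]
batches = order_batches(batches)   # the (32, 70) batch now comes first
```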

I hope this helps. Keep up the good work and the effort to make fast.ai better!

I can also confirm that for classification tasks, a small portion of Wikipedia was good enough (22m, sampled from the full dump). I am looking forward to seeing the great work of Kasper (and your LM) integrated into fastai. I am still getting OOM for larger wiki texts (GPU Optimizations Central).
One quick question @piotr.czapla: am I correct that the current fastai (v1.0.39) does not support BiLSTM models?

It is hard to figure out how to train a multi-layered BiLSTM on language modelling in an efficient way, so I don’t think it will be supported anytime soon. In ulmfit-multilingual I’ve implemented training the backward and forward LMs at once, following the ELMo paper; it uses a huge amount of GPU memory, but it should work with the current fastai version.

You mean by setting bidir = True and testing the classifier via ulmfit.train_clas (https://github.com/n-waves/ulmfit-multilingual/blob/master/tests/test_end_to_end.py#L117)?

Yep, it was working fine when I was testing on a GPU with 32GB of RAM; now I’m down to 12GB, so I’ve postponed the work on bidir until I get the XNLI baseline done.


Thanks, that was a tricky one.
I have adjusted the loading of batches to ensure continuity per row across batches, and simplified the code. I am now fiddling to achieve the same accuracy as the current version of fastai.

I have previously used memory-mapped files, but I do not believe they are required here, because we just run through the data one small batch at a time.

Kasper, I meant keeping the existing data loader code with minimal modifications to use memory-mapped numpy. Your code is super smart but hard to read compared to Sylvain’s.

If you use memory-mapped numpy you may achieve the same objective and keep the readability unchanged.

I agree that indexing through the ragged arrays still looks complicated, and I am not sure whether the code, or part of it, will be submitted as a PR. That will be up to you, Sylvain and the rest of the community.

However, I have really learned a lot about Python, PyTorch and RNNs by going to this level of detail.

I am reposting my previous post from here.

Fine-tuning the LM takes an extraordinarily long time with decreased efficiency. Please check out this notebook; I think it was run nearly 20 days ago. If you scroll down to the first epoch (after lr_find), you can see it took a little over 2 hours, and after running for 11 epochs, which took about 28 hours, the accuracy was about 0.583.

However, during my latest run (about 10 days ago), a single epoch took over 12 hours and the full 11 epochs took over 5 days! And the accuracy was nearly 13 points lower (0.449). I used the same dataset with the exact same code. Everything was run in one session (i.e., no closing the notebook and loading the data again).

Everything was run on a system equipped with a V100 (16GB of video RAM) and 376GB of memory, with the latest dev version of FastAI (I always do git pull; pip install -e .[dev] before I start my work). Any idea what’s going on?

I have no idea. I do know that sgugger changed the tensor layout 1-2 weeks ago so that the batch dimension is now first, but that should not change the accuracy, and I cannot imagine that it would change performance that much.

I am close to having a new version of MyLanguageModel, so when I am finished we could test it out to see if that makes a difference. I would not expect so, unless memory fragmentation is the root cause of the performance issues you are seeing.

How is your language model different from the one provided by FastAI?

Could you elaborate more on that?

Perhaps @sgugger can shed some light on this. I could use some pointers on where/how to debug this problem.

I do not think that the LanguageModelLoader is the root cause, because sgugger already made a major reduction in peak memory.

My version allocates memory for a tensor storage area and then fills this storage without allocating new memory.
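Roughly like this (just a sketch with made-up names, not the actual loader code): the buffer for a batch is allocated once and then refilled in place for every batch, instead of building new tensors each time.

```python
import torch

bs, bptt = 64, 70
buffer = torch.empty(bs, bptt, dtype=torch.long)   # allocated once, reused for every batch

def fill_batch(buffer, get_row_tokens):
    """Overwrite the buffer row by row; no new allocation per batch."""
    for r in range(buffer.size(0)):
        buffer[r].copy_(get_row_tokens(r, buffer.size(1)))
    return buffer

# toy usage: each row gets fresh tokens copied into the same storage
batch = fill_batch(buffer, lambda r, n: torch.randint(0, 100, (n,)))
```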

I’m curious whether the memory reduction is the cause of the performance issues (both in terms of runtime and accuracy) that I’m having, given that the system I’m on has adequate onboard memory. Unfortunately, I’m much more of a FastAI user than a dev, so it’s not immediately clear to me where the problem could lie.

I had my system admins reboot the system, as it had been up for nearly a month, but the reboot didn’t help with the issue.

nvidia-smi revealed that the CUDA driver the system was using was version 10.0, and I installed PyTorch using conda install -yc pytorch pytorch torchvision cuda100. I’m curious whether version 10.0 has anything to do with it (although, if I recall correctly, the earlier version of the code that I ran also used CUDA 10.0).

I suggest using a limited amount of data to test. This thread (GPU Optimizations Central) can help with memory tracking.

Another change a couple of weeks ago was that sgugger implemented a reduction in peak memory usage.