Optimising class LanguageModelLoader()

(Kaspar Lund) #1

The purpose of this topic is to optimise the memory footprint of the class LanguageModelLoader while maintaining or improving the accuracy of ULMFIT.

I ran into memory issues in LanguageModelLoader because I set out to create a notebook that processes an entire Wikipedia for any language using sentencepiece + fastai. There have been so many obstacles in making this work for the entire English Wikipedia that I still haven’t used it for prediction on IMDb. However, the memory footprint has now been reduced to 1% of the fastai version I started with in early December. The entire English Wikipedia can now be processed, given enough GPU days/weeks.

I am very nervous about making changes that could reduce the excellent results of ULMFIT, and I hope that the community can help ensure/verify that this contribution preserves or improves the accuracy/convergence. I do believe there is a good chance that accuracy or convergence could be improved, because it is now so much easier to randomize 1) the sequence length of the batches and 2) the token at which a batch/sequence begins. The latter will require a little fine-tuning.

Let’s get to it:
@sgugger already optimised it considerably by generating batches on the fly instead of batchifying the entire ragged array of ids. This proposal https://github.com/kasparlund/nlp/blob/master/languagemodelloader.py goes a step further and simplifies the “def iter”.
@piotr.czapla has also made optimisations, using weak references in the calling training loop I think.

The memory footprint has been minimised by not copying source data except when a batch has to be filled, and by minimising allocation in general:

  1. by using an index array with in-place shuffle and a direction of iteration (forwards vs backwards)
  2. by preallocating storage for the batch and reusing this storage in every iteration. The cost is a loop in “def fill_buffer” that moves through the ragged array batch by batch. The loop has been optimised to preserve speed by reducing the number of book-keeping parameters and using local variables where it paid off (a local “i” is about twice as fast as “self.i”)
  3. the batch storage is a np.long array for now, because pytorch cannot (at this moment) use slices in assignment. It can however use a numpy array as storage and build views of torch slices in order to deliver x and y to the calling training loop without copying the storage. I hope to switch to a LongTensor when pytorch implements slicing in assignment like self.buffer[ibuf:ibuf+rl] = rag[r0:r1] with self.buffer as a LongTensor
  4. The source data can use the smallest np.dtype for the size of the vocab: int16 for a 32000 vocab and possibly uint16 for a 64000 vocab; I have yet to test the latter. This saves a lot of memory and is now possible because @sgugger has removed the Intprocessor TextDataBunch.from_ids constructor.
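As a rough illustration of points 2 and 4, here is a minimal sketch, not the code from the linked languagemodelloader.py. The helper `smallest_int_dtype` and the signature of `fill_buffer` are assumptions made for this example; only the idea (one preallocated buffer that is refilled from the ragged array, with ids stored in the narrowest dtype) comes from the post.

```python
import numpy as np

def smallest_int_dtype(vocab_size):
    """Pick the narrowest integer dtype that can hold ids 0..vocab_size-1 (hypothetical helper)."""
    for dt in (np.uint8, np.int16, np.uint16, np.int32, np.int64):
        if vocab_size - 1 <= np.iinfo(dt).max:
            return dt
    raise ValueError("vocab too large")

def fill_buffer(buffer, rags, rag_idx, offset):
    """Copy tokens from the ragged list `rags` into the preallocated `buffer`.

    Starts at rags[rag_idx][offset] and wraps to the first rag when the data
    runs out. Returns the (rag_idx, offset) where the next call should resume.
    """
    ibuf, n = 0, len(buffer)
    while ibuf < n:
        rag = rags[rag_idx]
        take = min(len(rag) - offset, n - ibuf)
        buffer[ibuf:ibuf + take] = rag[offset:offset + take]  # the only copy of source data
        ibuf += take
        offset += take
        if offset == len(rag):                                # rag exhausted, move to the next
            rag_idx, offset = (rag_idx + 1) % len(rags), 0
    return rag_idx, offset

dtype = smallest_int_dtype(32000)                             # int16 for a 32000 vocab
rags = [np.array(r, dtype=dtype) for r in ([1, 2, 3], [4, 5], [6, 7, 8, 9])]
buffer = np.zeros(4, dtype=np.int64)                          # allocated once, reused every batch

pos = fill_buffer(buffer, rags, 0, 0)
first = buffer.copy()                                         # copied here only for inspection
pos = fill_buffer(buffer, rags, *pos)
second = buffer.copy()
```

Because `torch.from_numpy` shares memory with the array it wraps, a buffer like this can be handed to the training loop as a tensor without another copy.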

The iter loop has been simplified by using a circular indexer: “When the index exceeds the length of self.idx, it wraps around to start at the head, or at the end if the indexing is set to move backwards”. This results in the following advantages:

  1. shuffle and index direction are hidden from the iter loop and do not require copying data.
  2. CircularIndex just wraps around to fetch more data if needed. We can therefore use a uniform distribution of the sequence length in the batch. This is simpler to describe in papers and to experiment with than the current asymmetrical distribution. All data in the source are used because the number of batches is math.ceil’ed
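The wrap-around idea can be sketched as follows. This is an illustrative reimplementation under my own assumptions (the constructor arguments and the `shuffle` method are hypothetical), not the CircularIndex class from the linked file.

```python
import random

class CircularIndex:
    """Index into a (possibly shuffled) permutation, wrapping around at either end."""

    def __init__(self, length, forward=True):
        self.idx = list(range(length))   # positions of the rags, never the rags themselves
        self.forward = forward

    def __len__(self):
        return len(self.idx)

    def __getitem__(self, i):
        n = len(self.idx)
        # Forward: 0, 1, ..., n-1, 0, ...  Backward: n-1, n-2, ..., 0, n-1, ...
        pos = i % n if self.forward else (n - 1 - i) % n
        return self.idx[pos]

    def shuffle(self, rng=random):
        rng.shuffle(self.idx)            # in place, so the source data is not copied

fwd = CircularIndex(3)
bwd = CircularIndex(3, forward=False)
forward_order = [fwd[i] for i in range(5)]
backward_order = [bwd[i] for i in range(4)]
```

The iteration loop only ever asks for `index[i]` with a growing `i`, so it does not need to know whether the data is shuffled or read backwards.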

The proposed version does not pass an extra-long sequence in the first batch. Is this important?

The function “def usedGB_RAM” I created using psutil has been useful on Linux but not on Windows. I think it would be better to use the utils that @stas created.

Concerning accuracy, I made a run with nTrainToks, nValidToks = int(5e5),int(1e5) with the current vs the proposed version. bptt and the randomization range p_bptt have not been optimised for the proposed LanguageModelLoader. I am doing that as I write and can see that it has a big influence on the first 2 epochs.

Current fastai version takes Total time: 6:33:34 with the following convergence:

epoch train_loss valid_loss accuracy
1 4.770611 4.915845 0.236449
2 4.228181 4.462797 0.265781
3 4.023810 4.332295 0.275488
4 3.928257 4.261923 0.281257
5 3.844547 4.202448 0.287016
6 3.752758 4.150086 0.292125
7 3.671401 4.109411 0.297047
8 3.572032 4.090855 0.300316
9 3.487761 4.091266 0.301088
10 3.408029 4.103933 0.300383

The new version takes Total time: 5:24:28 with the following convergence

epoch train_loss valid_loss accuracy
1 4.922954 5.067150 0.222395
2 4.436327 4.680810 0.245361
3 4.267771 4.560742 0.255486
4 4.165852 4.486516 0.261657
5 4.074946 4.428182 0.266517
6 3.983813 4.383028 0.272105
7 3.875283 4.337162 0.276499
8 3.794235 4.310738 0.279891
9 3.725225 4.300746 0.281603
10 3.704621 4.301591 0.281501

GPU Optimizations Central
(Stas Bekman) #2

@Kaspar, FYI, I moved your thread into a category that’s visible to all users, so that it will get more exposure and hopefully lead to a better outcome.


I may have misunderstood something in your code, but I get the impression that you’re taking a chunk of texts of size bs * seq_len then spitting this resized as bs x seq_len to make a batch, but this is very different from the LanguageModelLoader: on your next batch, the text of the first row won’t be what’s after the text in the first row of the first batch, so we’re not taking advantage of the hidden state.

The approach in the current LanguageModelLoader is to have all the (maybe shuffled) texts concatenated then resizing to bs x n and then reading those in order (the only difference with my new version is that all of this is lazily computed to avoid huge memory use).

That’s why there is this loss of convergence I think.
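To make the difference concrete, here is a minimal sketch of the "concatenate, resize to bs x n, read in order" scheme described above. It is plain Python on a toy stream, not the fastai implementation, and the function names are made up for this example.

```python
def batchify(stream, bs):
    """Split one long token stream into `bs` rows of equal length (the tail is dropped)."""
    n = len(stream) // bs
    return [stream[r * n:(r + 1) * n] for r in range(bs)]

def iter_batches(rows, bptt):
    """Yield (x, y) batches; row r of each batch continues row r of the previous batch."""
    n = len(rows[0])
    for start in range(0, n - 1, bptt):
        seq_len = min(bptt, n - 1 - start)
        x = [row[start:start + seq_len] for row in rows]
        y = [row[start + 1:start + 1 + seq_len] for row in rows]  # targets: next token
        yield x, y

stream = list(range(20))            # stand-in for the concatenated token ids
rows = batchify(stream, bs=2)       # row 0 holds 0..9, row 1 holds 10..19
batches = list(iter_batches(rows, bptt=3))
```

In the first batch row 0 is `[0, 1, 2]` and in the second it is `[3, 4, 5]`, so the hidden state carried over for row 0 matches the text that follows. Cutting a contiguous chunk of bs * seq_len tokens per batch would instead put `[6, 7, 8]` in row 0 of the second batch.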

(Kaspar Lund) #4

Yes, the way we run through the rags is different, and that is one of the reasons I did not proceed to a PR.

In the following I will call them your method and my method, because I do not have names for them. That is a bit unfortunate, because we both share the same goal.

I believe that you run through the rags in the following way, where:

  • r1-18 are shuffled according to the argument in LanguageModelLoader.init.
  • “|” marks the borders of a batch
  • the sequence length can change from batch to batch according to p_bptt

batch 1 | batch 2 | batch 3 | batch 4 | batch 5 (diagram of the rags in each batch not preserved)

I run through the rags in the following way, where:

  • “|batch1-n” marks the beginning of each batch
  • batch5 may wrap around to the head

batch 1: (diagram not preserved)

The two methods chop up the rags in different ways. Your method may randomize the text more than mine does, but I am not sure this is essential because randomization is already handled by the shuffle.

An alternative way to think about it: if shuffle = False and the rags represented a document with chapters, each of 5*batchsize, then the difference would be:

  • your method would iterate through all chapters simultaneously in 5 batches
  • my method would iterate through the chapters consecutively in 5 batches

Difference in convergence:

  • That is a problem, and I still do not understand why the above differences could be the cause. Am I missing something in the way the batch is processed?
  • I am using sentencepiece’s unigram model and I am investigating whether the difference in convergence arises because there are more unigram pieces in a sentence than word pieces. So far I can see that bptt needs to be higher when using sentencepiece. I am also looking at the influence of the width of the uniform randomization of sequence lengths, “p_bptt”:

The following is for:

  • nTrainToks, nValidToks = int(5e5),int(1e5)
  • learn.fit_one_cycle(2, 2e-3, moms=(0.8,0.7))

I guess a bptt=130 would give better results for both iteration methods when using sentencepiece. p_bptt of 0%, 5%, 15% and 15% give the same results.
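For readers wondering what a uniform randomization of the sequence length might look like, here is a small sketch. It assumes that p_bptt is a fraction giving the half-width of the range around bptt; that reading, and the function name `sample_seq_len`, are assumptions and not taken from the linked code.

```python
import random

def sample_seq_len(bptt, p_bptt, rng=random):
    """Draw a sequence length uniformly from bptt +/- p_bptt * bptt (assumed meaning of p_bptt)."""
    half_width = int(bptt * p_bptt)
    return rng.randint(bptt - half_width, bptt + half_width)  # both ends inclusive

rng = random.Random(0)                       # seeded only to make the example repeatable
lengths = [sample_seq_len(130, 0.15, rng) for _ in range(200)]
fixed = sample_seq_len(130, 0.0, rng)        # p_bptt = 0 always returns bptt
```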

(Kaspar Lund) #5

I went for a walk to think about some observations that have been nagging me:

  • As I understand it, the current fastai chops through the rags in a more discontinuous way than the LanguageModelLoader I propose, with better accuracy as a result
  • I made an experiment where each batch started at the beginning of a rag, hoping it would improve accuracy. It didn’t. I guess that the better alignment during training reduces the model’s generalisation!

This makes me wonder whether the current fastai implicitly makes a sort of inter-batch cut-out of tokens, thus making the RNN more robust to alternative variations of a sentence.

I could implement inter- and intra-batch cut-out relatively easily to see if that can explain the difference in accuracy?

Right now I am running 10 epochs with the current fastai and my LanguageModelLoader with bptt=130 and p_bptt=0 to get a new baseline. That will take about 11 hours. Depending on the results, I could run a cut-out experiment after that.


On the batch dimension, yes, but on the sequence dimension it’s continuous. In your implementation the batches are a continuous chunk of text but we don’t care about that in an RNN, what is important to take advantage of the hidden state is to have the tokens come continuously across batches.

(Kaspar Lund) #7

That could be done:
- It would require 2*bs ints to hold the offsets into the rags instead of 2 ints
- It would probably be a bit slower, because it is more expensive to index an array than to address a local int
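One possible way to get those 2*bs ints, sketched under my own assumptions (the function name and the use of `bisect` are not from the linked code): compute, for every row of the batch, which rag it starts in and at which offset, without ever building the concatenated array.

```python
from bisect import bisect_right
from itertools import accumulate

def row_start_positions(rag_lengths, bs):
    """Return one (rag_index, offset) pair per batch row.

    Row r starts at token r * n of the virtual concatenation of all rags,
    where n = total_tokens // bs. No concatenated copy is built.
    """
    ends = list(accumulate(rag_lengths))       # cumulative end position of each rag
    n = ends[-1] // bs
    positions = []
    for r in range(bs):
        tok = r * n
        rag = bisect_right(ends, tok)          # first rag whose end lies beyond tok
        start_of_rag = ends[rag] - rag_lengths[rag]
        positions.append((rag, tok - start_of_rag))
    return positions

# Four rags of 3, 2, 4 and 3 tokens (12 in total) read with bs=3, so 4 tokens per row.
starts = row_start_positions([3, 2, 4, 3], bs=3)
```

Each row then advances its own pair as batches are filled, which is where the extra array indexing cost mentioned above comes from.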

(Kaspar Lund) #8

I completely missed that pytorch/fastai maintains the internal state of the RNN across batches per row. So embarrassing. I’ll adjust my approach.

(Piotr Czapla) #9

@Kaspar great writing, I see you are putting quite some effort into optimising the memory consumption and speed. I love this kind of fiddling myself, and it makes a lot of sense for things that you use in production. When you train, readability beats performance. I don’t mind an additional hour of training if I can avoid running the training (20h) again because of a bug in the way I’ve trained the LM.

What I find interesting in your idea is the ability to train on the whole Wikipedia. I know that Jeremy reported that this does not improve the downstream classification tasks. But given the reports from BERT (they attribute their performance to the large dataset they used), it may be helpful to experiment with it a bit more on different tasks (NER, NLI) and network types (QRNNs, Transformers). So being able to train the model on an arbitrarily large amount of data is quite interesting.

Have you considered using memory-mapped numpy arrays to keep the readability of the code and to be able to handle larger data sets? Here is a pytorch thread about it. This won’t give you the performance improvements, but it will let us train on the whole Wikipedia. And if you think about it a bit longer you might find a way to get the performance improvement as well.
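A minimal sketch of the memory-mapped idea, using `np.memmap` on a temporary file. The file name, the toy data and the batch shape are placeholders chosen for this example; nothing here is taken from the fastai data loader.

```python
import os
import tempfile
import numpy as np

# Write the token ids to disk once (here: a toy array standing in for tokenised text).
ids = np.arange(1000, dtype=np.int16)
path = os.path.join(tempfile.mkdtemp(), "ids.bin")
ids.tofile(path)

# Map the file instead of loading it; pages are read from disk only when touched.
tokens = np.memmap(path, dtype=np.int16, mode="r")

bs = 4
n = len(tokens) // bs
rows = tokens[:bs * n].reshape(bs, n)                # still a view backed by the file

# Only this small slice is copied into regular memory, widened for the embedding lookup.
batch = np.asarray(rows[:, 0:10], dtype=np.int64)
```

The loader code can keep indexing `rows` as if it were an ordinary in-memory array, which is the readability argument made above.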

Regarding your question about why it is important to load a larger batch first: it is because of pytorch’s GPU memory management. It reduces memory fragmentation and lets you avoid OOM exceptions.

I hope this helps, and keep up the good work and effort to make fast.ai better!

(Abu Fadl) #10

I can also confirm that for classification tasks, a small portion of Wikipedia was good enough (22m, sampled from full). I am looking forward to seeing the great work of Kaspar (and your LM) integrated into fastai. Still getting OOM for larger wiki texts (GPU Optimizations Central).
One quick question @piotr.czapla: am I correct that current fastai (v 1.0.39) does not read bilstm models?

(Piotr Czapla) #11

It is hard to figure out how to train a multi-layered BiLSTM on language modeling efficiently, so I don’t think it will be supported anytime soon. In ulmfit-multilingual I’ve implemented training backward and forward LMs at once, following the ELMo paper. It uses a huge amount of GPU memory, but it should work with the current fastai version.

(Abu Fadl) #12

You mean by setting bidir = True and testing classifier via ulmfit.train_clas (https://github.com/n-waves/ulmfit-multilingual/blob/master/tests/test_end_to_end.py#L117)?

(Piotr Czapla) #13

Yep, it was working fine when I was testing on a GPU with 32GB of RAM. Now I’m down to 12GB, so I’ve postponed the work on bidir until I get the XNLI baseline done.

(Kaspar Lund) #14

Thx, that was a tricky one.
I have adjusted the loading of batches to ensure continuity per row across batches, and simplified the code. I am fiddling to achieve the same accuracy as the current version of fastai.

(Kaspar Lund) #15

I have previously used memory-mapped files, but I do not believe they are required here, because we just run through the data one small batch at a time.

(Piotr Czapla) #16

Kaspar, I meant keeping the existing code of the data loader with minimal modifications to use memory-mapped numpy. Your code is super smart but hard to read compared to Sylvain’s.

If you use mmapped numpy you may reach the same objective and keep the readability unchanged.

(Kaspar Lund) #17

I agree that indexing through the ragged arrays still looks complicated, and I am not sure whether the code or part of it will be submitted as a PR. That will be up to you, Sylvain and the rest of the community.

However, I have really learned a lot about python, pytorch and RNNs by going to this level of detail.

(Sudarshan) #18

I am reposting my previous post from here.

Fine-tuning the LM takes an extraordinarily long time with decreased efficiency. Please check out this notebook. I think this was run nearly 20 days ago. If you scroll down to the first epoch (after lr_find), you can see it took a little over 2 hours. And after running for 11 epochs, which took about 28 hours, the accuracy is about 0.583.

However, during my latest run (about 10 days ago), a single epoch took over 12 hours and the full 11 epochs took over 5 days! And the accuracy was nearly 13 points lower (0.449). I used the same dataset with the same exact code. Everything was run in one session (ie, no closing the notebook and loading in the data again).

Everything was run on a system equipped with a V100 (16G of video RAM) and 376G of memory with the latest dev version of FastAI (I always do git pull; pip install -e .[dev] before I start my work). Any idea what’s going on?

(Kaspar Lund) #19

I have no idea. I do know that sgugger changed the tensor layout 1-2 weeks ago so that the batch column is now first, but that should not change the accuracy, and I cannot imagine that it would change performance that much.

I am close to having a new version of MyLanguageModel, so when I am finished we could test it out to see if that makes a difference. I would not expect so, unless memory fragmentation is the root cause of what you see concerning performance.

(Sudarshan) #20

How is your language model different than the one provided by FastAI?

Could you elaborate more on that?

Perhaps @sgugger can shed some light on this. I could use some pointers on where/how to debug this problem.