The purpose of this topic is to optimise the memory footprint of the class LanguageModelLoader while maintaining or improving the accuracy of ULMFIT.
I ran into the memory issues in LanguageModelLoader because I went on to create a notebook to process an entire wikipedia for any language using sentencepiece+fastai. There have been so many obstacles in making this work for the entire english wikipedia that I still haven't used it for prediction on imdb. However, the memory footprint has now been reduced to 1% of the fastai version I started with in early december. The entire english wikipedia can now be processed given enough GPU days/weeks.
I am very nervous about making changes that could reduce the excellent results of ULMFIT, and I hope that the community can help ensure/verify that this contribution preserves or improves the accuracy/convergence. I do believe that there is a good chance that accuracy or convergence could be improved, because it is now 1) so much easier to randomize the sequence length of the batches and 2) easier to control which token a batch/sequence should begin at - the latter will require a little fine tuning.
Let's get to it:
@sgugger already optimised it considerably by generating batches on the fly instead of batchifying the entire ragged array of ids. This proposal https://github.com/kasparlund/nlp/blob/master/languagemodelloader.py goes a step further and simplifies `__iter__`.
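For orientation, here is a minimal sketch of that shape (names and details are illustrative, not the actual code in the linked file):

```python
import math
import numpy as np

class LanguageModelLoader:
    "Sketch of the simplified shape; the real code is in the linked proposal."
    def __init__(self, ids, bs=64, bptt=70):
        self.ids, self.bs, self.bptt = ids, bs, bptt
        self.n_batches = math.ceil(sum(map(len, ids)) / (bs * bptt))

    def __iter__(self):
        # no up-front batchification of the whole ragged array: each step
        # refills one reusable buffer and yields x,y views into it
        buf = np.zeros((self.bs, self.bptt + 1), dtype=np.int64)
        for _ in range(self.n_batches):
            ...  # fill buf from self.ids (see the fill-loop sketch below)
            yield buf[:, :-1], buf[:, 1:]
```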
@piotr.czapla has also made optimisations using weak references in the calling training loop, I think.
The memory footprint has been minimised by avoiding copies of the source data except when a batch has to be filled, and by minimising allocation in general:
- by using an index array with in-place shuffle and a configurable direction of iteration (forwards vs backwards)
- by preallocating storage for the batch and reusing this storage in every iteration. The cost is a loop in `fill_buffer` that moves through the ragged array batch by batch. The loop has been optimised to preserve speed by reducing the number of book-keeping parameters and using local variables where it paid off (a local `i` is about twice as fast as `self.i`) - see the sketch after this list
- the batch storage is an np.long array for now, because pytorch cannot (at this moment) use slices in assignment. It can, however, use a numpy array as storage and build torch views of slices of it, in order to deliver the x and y to the calling training loop without copying the storage. I hope to switch to a LongTensor when pytorch implements slicing in assignment, i.e. `self.buffer[ibuf:ibuf+rl] = rag[r0:r1]` with `self.buffer` as a LongTensor
- The source data can use the smallest np.dtype that fits the vocab size: int16 for a 32000 vocab and possibly uint16 for a 64000 vocab (I have yet to test the latter). This saves a lot of memory and is now possible because @sgugger has removed the Intprocessor from the TextDataBunch.from_ids constructor.
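Putting the three points above together, here is a hedged sketch of the reusable buffer and fill loop (`BatchBuffer`, `row_starts` and the exact bookkeeping are my illustration, not the proposal's code):

```python
import numpy as np
import torch

class BatchBuffer:
    "Sketch: one reusable buffer filled from a ragged list of id-arrays."
    def __init__(self, bs, max_seq_len):
        # allocated once, reused every iteration; int64 because the buffer is
        # handed to torch, while the source rags can stay int16 for a 32k vocab
        self.buffer = np.zeros((bs, max_seq_len + 1), dtype=np.int64)

    def batch(self, rags, row_starts, seq_len):
        "rags: list of non-empty np arrays; row_starts: (rag, offset) per row."
        for row in range(self.buffer.shape[0]):
            r, i = row_starts[row]
            ibuf = 0                           # locals beat self.* in the hot loop
            while ibuf < seq_len + 1:          # seq_len tokens for x plus one for y
                rag = rags[r]
                n = min(len(rag) - i, seq_len + 1 - ibuf)
                self.buffer[row, ibuf:ibuf + n] = rag[i:i + n]
                ibuf, i = ibuf + n, i + n
                if i == len(rag): r, i = (r + 1) % len(rags), 0
        x = torch.from_numpy(self.buffer[:, :seq_len])       # views into the buffer,
        y = torch.from_numpy(self.buffer[:, 1:seq_len + 1])  # no copying of storage
        return x, y
```

The x/y pair are torch views of the numpy storage, so nothing is copied on the way out; only the slice assignments inside the loop touch the source data.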
The iter loop has been simplified by using a CircularIndex: when the index exceeds the length of self.idx, it wraps around to start at the head (or at the end, if the indexing is set to move backwards). This has the following advantages:
- shuffle and index direction are hidden from the iter loop and do not require copying data.
- CircularIndex just wraps around to fetch more data if needed. We can therefore use a uniform distribution for the sequence length of the batches. This is simpler to describe in papers and to experiment with than the current asymmetrical distribution. All of the source data is used because the number of batches is math.ceil'ed - see the sketch below.
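A hedged sketch of what such an indexer could look like (names illustrative; the real one is in the linked file):

```python
import numpy as np

class CircularIndex:
    "Sketch: a shuffled permutation that wraps around instead of running out."
    def __init__(self, n, forward=True):
        self.idx, self.forward = np.arange(n), forward

    def shuffle(self):
        np.random.shuffle(self.idx)    # in-place: the source data is never copied

    def __getitem__(self, i):
        n = len(self.idx)
        # past the end we wrap to the head; moving backwards we walk the
        # same permutation from the end instead
        return self.idx[i % n if self.forward else n - 1 - (i % n)]
```

With the wraparound in place, the number of batches can simply be math.ceil'ed and the tail of the dataset still gets consumed.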
The proposed version does not pass an extra-long sequence in the first batch. Is this important?
The function `usedGB_RAM` I created using psutil has been useful on linux but not on windows. I think it would be better to use the utils that @stas created.
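For reference, the psutil version is roughly this one-liner (the fastai memory utils would replace it):

```python
import os, psutil

def usedGB_RAM():
    "Resident memory of the current process in GB; reliable on linux, less so on windows."
    return psutil.Process(os.getpid()).memory_info().rss / 2**30
```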
Concerning accuracy, I made a run with nTrainToks, nValidToks = int(5e5), int(1e5) comparing the current vs the proposed version. bptt and the randomization range p_bptt have not been optimised for the proposed LanguageModelLoader. I am doing that as I write and can see that it has a big influence on the first 2 epochs.
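To make the tuning concrete: assuming p_bptt is the relative width of the randomization around bptt (my reading, not necessarily the proposal's exact formula), the uniform draw would be something like:

```python
import numpy as np

def sample_seq_len(bptt=70, p_bptt=0.5):
    # uniform over [bptt*(1 - p_bptt/2), bptt*(1 + p_bptt/2)], replacing
    # fastai's current asymmetrical scheme around bptt
    lo, hi = int(bptt * (1 - p_bptt / 2)), int(bptt * (1 + p_bptt / 2))
    return np.random.randint(lo, hi + 1)
```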
The current fastai version takes Total time: 6:33:34 with the following convergence:
The new version takes Total time: 5:24:28 with the following convergence: