Training LM on 3GB data, using 3GB graphics card


TL;DR: Can I get some tips on training a language model on at least 1GB of data, formatted like wikitext-103, on a graphics card with 3GB of memory?

As part of my master's thesis I’m building a Swedish language model, on top of which I’ll later implement my primary architecture (I’m building a summarizer using a ULMFiT backbone), but I’ve run into some issues.

The main problem is that the graphics card I can use is a 780 Ti, which has 3GB of memory. I mined the Swedish Wikipedia into the same format as wikitext-103, which comes to around 600k articles, or 3GB. I’m fairly happy with the data mining, and I’d like to use at least 1GB of sv-wiki in order to create a good enough model, but this has proved hard to achieve.

Through extensive trial and error I found I can’t train on more than about 5k articles at once (20MB) using:
data = (TextList.from_folder(path_to_one_batch, vocab=vocab).split_by_folder().label_for_lm().databunch(bs=16))
learner = language_model_learner(data, AWD_LSTM, ... ),
since that seems to allocate too much memory (even with a batch size below 8). To work around this, I split the articles into 120 pre-batches, one per folder (0-119), with 5000 articles per pre-batch, each article in its own text file (article0.txt, article1.txt, …), in a train/test/validate folder structure so that the fastai library can load them seamlessly. Then I iterated through each pre-batch and created a DataBunch, which I saved to pkl for easy loading during training. My core idea is that the order “shouldn’t” really matter (worst case it’ll become soft reboot discriminate learning, or something like that) as long as I train on enough tokens and keep training the same learner.
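The pre-batching step described above could be sketched roughly like this. It is only an illustration: the function name `shard_articles` and the flat folder layout are assumptions (the train/test/validate split inside each folder is left out), not the code actually used in this thread.

```python
from pathlib import Path
import shutil


def shard_articles(article_paths, out_root, shard_size=5000):
    """Copy article files into numbered shard folders (0, 1, 2, ...).

    Each folder receives at most `shard_size` files. Returns the list of
    shard folders in the order they were created.
    """
    out_root = Path(out_root)
    shard_dirs = []
    for index, src in enumerate(sorted(Path(p) for p in article_paths)):
        # Integer division maps article index -> shard number.
        shard_dir = out_root / str(index // shard_size)
        if not shard_dirs or shard_dirs[-1] != shard_dir:
            shard_dir.mkdir(parents=True, exist_ok=True)
            shard_dirs.append(shard_dir)
        shutil.copy(src, shard_dir / src.name)
    return shard_dirs
```

With 600k articles and `shard_size=5000` this would give the 120 folders mentioned above.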

Now I couldn’t simply replace the data in the learner since this caused some sort of CUDA error, so as a workaround I save the encoder of the learner, create a new learner, and then load the encoder of the previously trained learner. In theory, this should keep training the same model from iteration to iteration. Said and done, it executes anyway!

So I just came back after about 17 hours of training, and something is obviously amiss, see picture below. I notice that I achieve a perplexity of about exp(~2.5) ~= 13 after only one epoch, “world record”, yes! …sort of, something is obviously not working, and it’s time to get some help.
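For reference, the loss-to-perplexity conversion used above can be checked with a few lines. This assumes the reported loss is the mean per-token cross-entropy in nats (natural log); the helper name `perplexity` is just for illustration.

```python
import math


def perplexity(cross_entropy_loss):
    """Perplexity is exp of the mean per-token cross-entropy (in nats)."""
    return math.exp(cross_entropy_loss)


print(round(perplexity(2.5), 2))  # 12.18
```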

Long thread, I’m sorry; thank you for your time.


EDIT: Typos


You could upload the files to Google Drive and run the training on Google Colab. I did this with the Norwegian Wikipedia and was able to use a batch size of 128 on a Tesla T4. Each epoch took 30 minutes with fp16 and a vocab size that is a multiple of 8. Make sure you save the model before the kernel restarts.

To get files from Drive in Colab:

from google.colab import drive
drive.mount('/content/drive')  # your Drive files then appear under /content/drive
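On the "vocab size as a multiple of 8" tip: as far as I know, fp16 matrix multiplies on Tensor Core GPUs like the T4 run most efficiently when the dimensions are multiples of 8. A small helper to pad a vocab size could look like this (the name `round_up_to_multiple` is made up for illustration; in practice you would pad the vocab with dummy tokens or pick a max vocab size that already fits):

```python
def round_up_to_multiple(n, multiple=8):
    """Return the smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple


print(round_up_to_multiple(60003))  # 60008
print(round_up_to_multiple(60000))  # 60000 (already a multiple of 8)
```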

I increased the batch size to 12 instead of 120 (which juuuust fit the graphics card, minmaxed), then applied a very low learning rate and let it train for a very long time. This seemed to work fairly well. I’ve done a 20h test which sadly overfit quite badly, but some test predictions showed that the grammar seemed correct, so it’s working! I’m running another training session now which will last over the weekend, ~60 hours.

@gustav I don’t want to do such large-scale training on Google Colab since I feel it might get cancelled for a lot of different reasons. But I checked it out and found some other great experimental uses for it, so thank you for the tip!

The GPU memory limits the size of the batch or the model you can use for LM training because that goes on the GPU. If you build the DataBunch as usual for your data set, do you get GPU-related errors, such as GPU OOM errors? That would be surprising to me since that code does not use the GPU until you start to train. You might get RAM errors on the machine if RAM is limited and you are using spaCy (the default tokenizer in fastai).

So, first question, can you build the data in a single databunch? If so, then you can always scale down to a batch size of just 1 and see if that fits on the GPU (and accumulate gradients across many 1-size batches to take your gradient step.) If it does not, you need a smaller model.
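To illustrate what I mean by accumulating gradients, here is a framework-free toy sketch on a one-parameter model (y_hat = w * x with MSE loss). The function names are made up for this example. In PyTorch the same idea amounts to calling backward() on several small batches before a single optimizer step.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error for the model y_hat = w * x."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_step(w, xs, ys, lr, micro_bs=1):
    """One optimizer step built from several small micro-batches."""
    total = len(xs)
    grad_sum = 0.0
    for start in range(0, total, micro_bs):
        mx = xs[start:start + micro_bs]
        my = ys[start:start + micro_bs]
        # Weight each micro-batch gradient by its size so that the
        # accumulated sum matches the full-batch mean after dividing.
        grad_sum += grad_mse(w, mx, my) * len(mx)
    return w - lr * grad_sum / total
```

The step taken this way matches a single full-batch step (up to floating-point rounding), but only one micro-batch needs to be in memory at a time.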

FWIW, I can load up the standard AWD_LSTM with 3 GB of GPU memory using the IMDB sample data, unfreeze it, and train. Peak memory usage is low. But with NLP, YMMV, so getting a handle on the exact nature of your problem will help a lot in figuring out a strategy to get around it.

Can you share your data or code pipeline so I can have a look and/or try to replicate your problem with a public dataset to make collaboration easier?

Well, both. First I got some RAM errors due to the dataset being 3 GB. I even tried it on a system with 16 GB of RAM and it still didn’t fit, which sort of makes sense considering the amount of data. Wikitext is about 600 MB, so the dataset is unrealistically large.

When I split the data up into 1GB batches I could fit it into a DataBunch (I used this to get a representative vocab for the whole dataset), but it was far too large for the GPU: it crashes when trying to fit or run lr_find.

Much appreciated! But I managed to crawl over this bump in the road on my own anyway, so the thread is now somewhat irrelevant. I plan on sharing my Swedish Wikipedia dataset and my approach to training on it, but not this version since it’s rather crude (yet working). Next version maybe!