Developer chat

Thanks for implementing this. I am trying to use the get_transformer_lm and build a Wiki103 model from scratch. It does not appear to be training well with fit_one_cycle which is kind of surprising to me. I am wondering if there is something about the self-attention that might not be right or if the training params are critical and one_cycle won’t work well. Any ideas/insight? I can move on to trying TransformerXL, but I figured I would get some kind of reasonable perplexity and accuracy with the self-attention model right out of the original paper

Hi, Stas –

I think this is great, but I’m wondering if only checking actual available RAM is the best approach. I say this because, as I’ve been going through the lessons, I found that sometimes my GPU RAM got tied up even though nothing important was happening in the notebook – somehow the Python process got stuck (and not necessarily because CUDA OOM exceptions occurred beforehand).

I found that I could use nvidia-smi, find the process ID that was taking up RAM, and then kill {process_id} at the terminal to free up the resources. Of course, this would reset my kernel, but I was trying to do that anyway in the notebook and it wasn’t working.

Just a thought. What do you think?

EDIT: I should say that as far as the application you’re describing, I think your solution is necessary for those who can’t even run cells that require more GPU RAM than they have. I’m more speaking to the circumstance where one might have the GPU RAM available if they killed processes that weren’t actually doing anything useful.

You can definitely suggest a PR to add smart _default_meta

  1. PyTorch caches RAM, so nvidia-smi isn’t showing you the real picture. My code empties the cache, so you get an actual reflection of the real available memory (to a degree, due to fragmentation, so realistically it’s perhaps 80-90% of what you see). Note the “no_cache” in the function name:
  2. If your memory is really tied up and can’t be flushed, then you won’t be able to run the cell that requires X amount of free memory anyway.

The only caveat there is potentially an unreleased learner, that requires gc.collect() as well due to circular references. Probably should add it to the function as well.
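Why a lingering object with circular references needs gc.collect() can be shown in plain Python. This is a minimal sketch, assuming CPython's reference counting; `Holder` is a made-up stand-in for an object that refers back to itself, not a fastai class.

```python
import gc
import weakref


class Holder:
    """Made-up stand-in for an object that holds a reference to itself."""

    def __init__(self):
        self.me = self  # circular reference: the refcount never drops to zero


def cycle_needs_collect():
    """Return (alive after del, alive after gc.collect()) for a cyclic object."""
    was_enabled = gc.isenabled()
    gc.disable()  # keep the automatic collector from running mid-demo
    try:
        obj = Holder()
        ref = weakref.ref(obj)
        del obj
        alive_after_del = ref() is not None      # still alive: the cycle keeps it
        gc.collect()
        alive_after_collect = ref() is not None  # reclaimed by the cycle collector
    finally:
        if was_enabled:
            gc.enable()
    return alive_after_del, alive_after_collect


result = cycle_needs_collect()
```

If the object owns GPU tensors, that memory stays allocated until the cycle is collected, which is why `del` alone may not be enough.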

Ideally, you should be able to free the memory and continue with your code w/o needing to restart. We are not 100% there yet, but are moving in that direction.

If you notice any situations where memory gets leaked and can’t be reclaimed please report those as it’ll help this effort.

You will need to run:

import gc, torch

to be able to see the true representation via nvidia-smi, since it is not aware of pytorch cached memory. But the easiest way is to use: which will automatically profile the memory usage for you and that way you can quickly see where the leaks are if any. You won’t need to watch nvidia-smi any longer.
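A minimal sketch of the manual cleanup discussed in this thread, assuming the usual pairing of gc.collect() with torch.cuda.empty_cache(). The helper name `free_gpu_memory` is made up for illustration, and the torch import is guarded so the snippet also runs where PyTorch is not installed.

```python
import gc


def free_gpu_memory():
    """Collect garbage first, then release PyTorch's cached GPU memory if possible.

    Returns the number of unreachable objects found by the garbage collector.
    """
    collected = gc.collect()  # break reference cycles so tensors can be freed
    try:
        import torch
    except ImportError:
        return collected      # no PyTorch available: nothing more to release
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # hand cached blocks back to the driver
    return collected


n_collected = free_gpu_memory()
```

After a call like this, the number reported by nvidia-smi should be closer to what is actually in use, though fragmentation can still hide some memory.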


I see, I see.

Oh, interesting. Sounds like I should be playing with gc.collect() to see if that clears up my memory.

I think that’s great. I was also experimenting with batch sizes to take full advantage of available GPU RAM, but ideally, we would have the Learner or a peripheral do that dynamically, which I think you may already be working on in ipyexperiments. :+1:

Will do!



language_model_learner and text_classifier_learner have both been changed to look like create_cnn. Why? There are now more language model architectures in fastai, and hopefully soon, more pretrained models. So you should now use:

learn = language_model_learner(data_lm, AWD_LSTM)

for the old behavior of language_model_learner and

learn = text_classifier_learner(data_clas, AWD_LSTM)

for the old behavior of text_classifier_learner (see the test example).

The good thing is you can type Transformer or TransformerXL instead of AWD_LSTM and it just works :slight_smile: Well, almost: you have to add pretrained=False because there are no pretrained models for those yet. You can still pass drop_mult in both cases, and if you want to modify the defaults, you can pass a config dictionary with all the values of the kwargs (the default configs are in the dictionaries awd_lstm_lm_config, awd_lstm_clas_config, transformer_lm_config, transformer_clas_config, transformer_XL_config, transformer_XL_config). There is an example of how to change their values in the tests.
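Changing a default config amounts to copying the dictionary and overriding a few values. This is a rough sketch only: `example_default_config` and its keys are hypothetical placeholders, not fastai's real defaults, and `override_config` is a made-up helper.

```python
# Hypothetical defaults: the real dictionaries ship with fastai and hold other keys and values.
example_default_config = {"n_layers": 12, "n_heads": 12, "d_model": 768}


def override_config(defaults, **changes):
    """Return a copy of `defaults` with `changes` applied, rejecting unknown keys."""
    unknown = set(changes) - set(defaults)
    if unknown:
        raise KeyError(f"unknown config keys: {sorted(unknown)}")
    config = defaults.copy()  # leave the shared defaults untouched
    config.update(changes)
    return config


config = override_config(example_default_config, n_layers=6)
# Assumption: the result would then be passed to the learner along the lines of
# language_model_learner(data_lm, Transformer, config=config, pretrained=False).
```

Copying first keeps the module-level defaults intact for any other learner created in the same session.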

The bad thing is that you won’t be able to load directly your old models for classification (language models are fine). I had to add another layer to make it work across architectures. Normally, this function should allow you to load an old model in a new learner:

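from collections import OrderedDict
import torch
from fastai.basic_train import Learner
from fastai.core import PathOrStr

def load_old_to_new(learn:Learner, path_to_model:PathOrStr):
    wgts = torch.load(path_to_model)
    if 'model' in wgts: wgts = wgts['model']  # unwrap checkpoints that nest the weights under 'model'
    wgts0 = OrderedDict({k[2:]:v for k,v in wgts.items() if k.startswith('0.')})  # first module, prefix dropped
    wgts1 = OrderedDict({k[2:]:v for k,v in wgts.items() if k.startswith('1.')})  # second module, prefix dropped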
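The prefix split in the function above can be tried without fastai or PyTorch. This is a minimal sketch with a plain dictionary standing in for the state dict; `split_by_prefix` and the key names are made up for illustration.

```python
from collections import OrderedDict


def split_by_prefix(weights, prefix):
    """Keep the entries whose key starts with `prefix`, with the prefix removed."""
    return OrderedDict(
        (key[len(prefix):], value)
        for key, value in weights.items()
        if key.startswith(prefix)
    )


# Hypothetical keys; a real checkpoint would map parameter names to tensors.
old_weights = OrderedDict([
    ("0.encoder.weight", "enc_w"),
    ("0.rnns.0.weight_hh", "rnn_w"),
    ("1.layers.0.weight", "head_w"),
])

encoder_weights = split_by_prefix(old_weights, "0.")
head_weights = split_by_prefix(old_weights, "1.")
```

Each resulting dictionary has keys relative to its own submodule, which is the form a submodule's load_state_dict expects.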

What is the layer you had to add?

I wonder how breaking would be to take the flatten/dropout layers out of the classifier heads and into their own layer. This would make it easy to turn text models into siamese networks and vice versa…

Running two RNNs on the same GPU now :slight_smile:
Certainly not possible before your improvements to GPU memory usage - thx

The added layer is a module that takes any type of encoder and feeds it a sentence. This was previously a module that subclassed the AWD_LSTM main module, so I had to change that to make it support transformers.

and models.tabular both have an export function, respectively data.export() to save the databunch and learn.export() to save the model. Great, but the default file name is the same: export.pkl

To keep one export() from overwriting the file written by the other, we could rename the default export files, no?


Gotcha. Thanks

How come we reset the RNN model in on_epoch_begin while training instead of restoring the hidden states captured in epoch_end?

Thanks for the example in tests so that we can run with our own params! I was worried that was lost till I re-read the post.

For Transformer and TransformerXL, after a lot of vetting against the paper author's source code, I figured out that our RNNLearner defaults of alpha=2. and beta=1. don’t make sense in the Transformer context (they are regularization for the AWD-LSTM). So, when setting up to train from scratch, I had to set alpha=0., beta=0. to make any progress.

If it helps others, this is the code base I was comparing to (written by one of the authors)

Yes, especially the alpha. The beta can be put to 1, and I found it seems to help a little bit, but transformers don’t like Activations Regularization at all!
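To make the two penalties concrete, here is a simplified pure-Python sketch of activation regularization (weighted by alpha) and temporal activation regularization (weighted by beta). It assumes both are mean-squared terms over the last layer's outputs and, unlike the actual trainer, applies both to the same activations; `ar_tar_penalty` is a made-up name.

```python
def ar_tar_penalty(activations, alpha=2.0, beta=1.0):
    """Extra loss from AR and TAR for one sequence.

    `activations` is a list of time steps, each a list of floats.
    AR  = alpha * mean(h_t ** 2)
    TAR = beta  * mean((h_{t+1} - h_t) ** 2)
    """
    flat = [a for step in activations for a in step]
    ar = alpha * sum(a * a for a in flat) / len(flat)
    diffs = [
        (nxt - cur) ** 2
        for cur_step, next_step in zip(activations, activations[1:])
        for cur, nxt in zip(cur_step, next_step)
    ]
    tar = beta * sum(diffs) / len(diffs) if diffs else 0.0
    return ar + tar


steps = [[1.0, 2.0], [3.0, 4.0]]
with_defaults = ar_tar_penalty(steps)                   # AR = 15.0, TAR = 4.0
switched_off = ar_tar_penalty(steps, alpha=0.0, beta=0.0)
```

With alpha=0. and beta=0. the penalty vanishes, so the loss is the plain language-model loss.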


Because we shuffle all the texts, so there is a discontinuity in texts at that point.

That's what I thought. I just made a routine to ensure continuity across batches, so I could make a PR?

After that we could have a look at saving state before running the validation dataset and restore state after

That was fast :smiley:


Hi. In the lesson6-rossmann.ipynb, the learn.summary() does not display well in my jupyter notebook (see screenshot below). It looks like the new line character \n is not taken into account (I’m using Windows 10 but I do not think this is the problem).

That question has been asked already, you have to type print(learn.summary()) now. The function doesn’t print it for you, it returns the raw text now (for best practices we try to avoid functions that print things).
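The difference is easy to reproduce with any function that returns multi-line text. A small sketch, where `fake_summary` is a hypothetical stand-in rather than the real learn.summary():

```python
def fake_summary():
    """Hypothetical stand-in for a method that returns its report as a string."""
    return "Layer (type)    Output Shape\nLinear          [64, 10]"


text = fake_summary()
as_cell_output = repr(text)  # a notebook shows a bare string this way, with an escaped \n
print(text)                  # print renders the line break
```

A notebook displays the last expression of a cell through its repr, which is why the newline shows up as a literal \n unless the string is printed.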
