I’m trying to train a text classifier on an 8GB Nvidia RTX card and I can’t seem to unfreeze all the layers and retrain.
I’ve looked at the OOM debugger and tried using gpu_mem_restore_ctx
to clean up any potential memory leaks, but I’m still getting errors.
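Roughly how I’ve been wrapping the fit call (just a sketch, not my exact code; the arguments are placeholders, and I believe the context manager comes from fastai.utils.mem):

from fastai.utils.mem import gpu_mem_restore_ctx

with gpu_mem_restore_ctx():  # on an exception, drops the stored traceback so the GPU tensors it holds can be freed
    learn.fit_one_cycle(1, 1e-3)  # placeholder arguments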
For more clarity: I am using a language model encoder that I trained with bs=32 (it trained fine without any errors). Using that encoder, when I set my classifier to bs=8 and bptt=20, I can unfreeze and train down to about layer 3 (watching nvidia-smi, each step I unfreeze takes more and more memory), but after I unfreeze() all the layers I get an OOM error.
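For reference, this is roughly how I build the data and learners (a sketch rather than my exact code; the file name and the encoder name are placeholders):

data_lm = TextLMDataBunch.from_csv(path, 'texts.csv', bs=32)        # LM data, ~30K records
learn_lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn_lm.fit_one_cycle(1, 1e-2)
learn_lm.save_encoder('ft_enc')                                      # placeholder name

data_clas = TextClasDataBunch.from_csv(path, 'texts.csv', vocab=data_lm.train_ds.vocab, bs=8)  # ~1500 records
learn = text_classifier_learner(data_clas, AWD_LSTM, bptt=20, drop_mult=0.5)
learn.load_encoder('ft_enc')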
Are these parameters still too much for an 8GB card to handle? I have some pseudo code below to illustrate when the issue occurs.
learn.freeze_to(-1)
learn.fit_one_cycle(1, **kwargs) # Trains fine
learn.unfreeze()
learn.fit_one_cycle(1, **kwargs) # Out of memory on the first epoch
learn.freeze_to(-1)
learn.fit_one_cycle(1, **kwargs) # Trains fine
learn.freeze_to(-2)
learn.fit_one_cycle(1, **kwargs) # Trains, with more CUDA memory usage
learn.freeze_to(-3)
learn.fit_one_cycle(1, **kwargs) # Trains, with even more CUDA memory usage
learn.unfreeze()
learn.fit_one_cycle(3, **kwargs) # Out of memory on the 1st epoch
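The same memory growth can be checked from inside the script rather than by watching nvidia-smi, with something like this sketch using PyTorch’s own counters:

import torch

def report_gpu(stage):
    # current and peak CUDA memory allocated by PyTorch on the default device, in MiB
    print(f'{stage}: {torch.cuda.memory_allocated() / 2**20:.0f} MiB allocated, '
          f'{torch.cuda.max_memory_allocated() / 2**20:.0f} MiB peak')

learn.freeze_to(-2)
learn.fit_one_cycle(1, **kwargs)
report_gpu('after freeze_to(-2)')  # the allocated/peak numbers climb at each unfreezing stage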
I think I’m mostly confused as to why I can train my LM with bs=32 on around 30K records, but when training this classifier on around 1,500 records I can’t get past unfreezing all the layers.
Any thoughts would be appreciated.
As a side note, I have fastai installed in editable mode and have the latest changes pulled.
Thanks,
Andrew