Given the way the language-model input data is structured, with bs x rows (of the transposed table) = len(data) for any batch size, is adjusting the bptt value the only real way to reduce the model's per-batch memory demand, short of going back to preprocessing the input text?
Could you please provide some exact numbers, for example “for 4 GB of GPU memory I use bs = 32 and bptt = 70”?
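To make the question concrete, here is a minimal pure-Python sketch (not fastai's actual internals, and the helper names `batchify`/`iter_batches` are just illustrative) of how the token stream is reshaped into bs columns and then sliced into bptt-row chunks. Each batch the GPU holds is roughly a bs x bptt block of token ids plus the activations computed from it, which is why bptt (and bs) are the knobs for per-batch memory:

```python
def batchify(ids, bs):
    """Trim the token stream and split it into bs parallel columns."""
    n = len(ids) // bs
    cols = [ids[c * n:(c + 1) * n] for c in range(bs)]
    # Row r holds token r of every column, giving shape (n, bs).
    return [list(row) for row in zip(*cols)]

def iter_batches(data, bptt):
    """Yield (input, target) slices of at most bptt rows each."""
    for i in range(0, len(data) - 1, bptt):
        seq_len = min(bptt, len(data) - 1 - i)
        yield data[i:i + seq_len], data[i + 1:i + 1 + seq_len]

stream = list(range(1000))
data = batchify(stream, bs=8)                      # 125 rows of 8 columns
x, y = next(iter_batches(data, bptt=70))
print(len(x), len(x[0]))                           # 70 8 -> one 70 x 8 batch
```

So for a fixed bs, shrinking bptt directly shrinks the block of tokens (and the unrolled RNN activations) held in memory at once, without touching the preprocessed data.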
I ran into similar problems on a laptop with 4 GB of GPU memory.
I’m currently running an Ancient Greek language model that seems to be at the limit of my 6 GB GPU. I should note that this is running on Windows, so I’m guessing the workable values could be somewhat higher on Linux.
I’m running the default (400, 1150, 3) language model with the following values:
bs = 24
bptt = 25
len(itos) = 120002 (I had to increase this since Ancient Greek is quite a morphologically rich language; my 354-document corpus has about 155k unique tokens).
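That large vocabulary matters too. A quick back-of-the-envelope sketch (assuming the 400-dim embedding from the (400, 1150, 3) config above, with fp32 weights; `lm_batch_tokens` and `embedding_params` are just illustrative helpers):

```python
def lm_batch_tokens(bs, bptt):
    """Token ids held per batch in a bs x bptt language-model batch."""
    return bs * bptt

def embedding_params(vocab_size, emb_size=400):
    """Parameters in the input embedding matrix (tied with the softmax)."""
    return vocab_size * emb_size

# Settings from this post
print(lm_batch_tokens(24, 25))        # 600 token ids per batch
n = embedding_params(120002)
print(n)                              # 48000800 params in the embedding alone
print(n * 4 / 2**20)                  # ~183 MB at 4 bytes per fp32 weight
```

So with len(itos) = 120002 the embedding/softmax layer alone is roughly 48M parameters, and the optimizer keeps extra state per parameter on top of that, which is likely a big share of what is pinning the 6 GB card regardless of bs and bptt.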