How to fit ULMFiT in GPU Memory

Hi all,
I read the recent paper on ULMFiT and before that .
Amazing amazing contribution and work!

I’ve been trying to recreate the results on a custom dataset, but unless I reduce the batch size to 1 and shrink the network, I can’t fit the model in GPU memory. Here are some technical details:

  • Size: 110M tokens
  • Vocabulary: 2M unique tokens
  • Max words: Tried both most frequent 30k and 60k
  • I have written a generator to load data on demand, and I am doing all data transformations in parallel on the CPU


  • CPU: Xeon 12 cores
  • Memory: 64GB
  • GPU: Nvidia 1080Ti / 11172MiB


  • Embeddings: 400
  • 3 Bidirectional LSTM layers (merge mode = concatenation )
  • The last layer is a Dense layer with softmax activation, with sparse categorical cross-entropy as the loss

What I’ve tried:

  • Reducing Batch Size, Max words, RNN and Embedding size
    The working network I’m left with has the top 10k words, an embedding size of 50, 2 RNN layers of 128 units, and a batch size of 32, which takes forever (1 day and still going, not even 1 epoch completed).

The primary bottleneck is the last layer, where the number of classes is so large (30k or 60k). This makes me really curious how @jeremy was able to run everything in one go.
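To put rough numbers on that last layer, here is a back-of-the-envelope sketch in plain Python. The sequence length of 70, the 1150 units per LSTM direction, and the assumption that the model predicts a token at every time step are illustrative guesses, not settings from my actual runs; the function names are made up for this post.

```python
MIB = 1024 ** 2

def logits_bytes(batch_size, seq_len, vocab_size, bytes_per_float=4):
    """Bytes for ONE float32 tensor of shape (batch, seq_len, vocab)."""
    return batch_size * seq_len * vocab_size * bytes_per_float

def dense_weight_bytes(input_size, vocab_size, bytes_per_float=4):
    """Bytes for the final Dense layer's weight matrix plus its bias."""
    return (input_size * vocab_size + vocab_size) * bytes_per_float

# Hypothetical settings (assumptions, see the note above).
batch_size, seq_len = 32, 70
units_per_direction = 1150
# Concatenating the two directions doubles the input width of the Dense layer.
dense_input = 2 * units_per_direction

for vocab_size in (30_000, 60_000):
    act = logits_bytes(batch_size, seq_len, vocab_size) / MIB
    w = dense_weight_bytes(dense_input, vocab_size) / MIB
    print(f"vocab={vocab_size}: one logits tensor ~{act:.0f} MiB, "
          f"Dense weights ~{w:.0f} MiB")
```

These are sizes for a single copy. During training the framework usually keeps several tensors of the logits size around (logits, softmax output, gradients), and the optimizer may keep extra copies of the weights, so the real footprint of this layer is a multiple of what is printed.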

Once again, thank you for taking the time to read!!


I think there is something else going wrong here. My own ULMFiT use and fitting was done with 32GB of RAM and a 1080Ti. While I had to play with the batch size for the 1080Ti, I’ve not had to reduce the sizes that much.
Maybe old calculations are not being detached properly, or something along those lines?
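To illustrate what I mean by detaching, here is a minimal sketch assuming a PyTorch-style training loop (which may not match your setup). The tiny sizes, the random data, and the helper name `detach_hidden` are all made up for illustration. The idea is that the hidden state carried from one batch to the next is cut off from the autograd graph, so the graph of earlier batches can be freed instead of piling up.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, emb_size, hidden_size = 100, 16, 32  # tiny hypothetical sizes

embed = nn.Embedding(vocab_size, emb_size)
lstm = nn.LSTM(emb_size, hidden_size, batch_first=True)
head = nn.Linear(hidden_size, vocab_size)
params = list(embed.parameters()) + list(lstm.parameters()) + list(head.parameters())
opt = torch.optim.SGD(params, lr=0.1)
loss_fn = nn.CrossEntropyLoss()

def detach_hidden(hidden):
    """Return the (h, c) state with its autograd history cut off."""
    return tuple(h.detach() for h in hidden)

hidden = None
for step in range(3):
    # Random token ids stand in for real data: shape (batch, seq_len).
    x = torch.randint(0, vocab_size, (4, 10))
    y = torch.randint(0, vocab_size, (4, 10))

    # Keep the state's values, but drop the graph of previous batches.
    if hidden is not None:
        hidden = detach_hidden(hidden)

    out, hidden = lstm(embed(x), hidden)
    loss = loss_fn(head(out).reshape(-1, vocab_size), y.reshape(-1))

    opt.zero_grad()
    loss.backward()
    opt.step()
```

If the state is passed along without the detach step, backpropagation tries to reach into earlier batches, which either fails or (when the graph is retained) makes memory grow with every batch.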

Best regards



Thank you Thomas for the reply and link!