I read the recent paper on ULMFiT and, before that, watched this video: https://www.youtube.com/watch?v=h5Tz7gZT9Fo&t=3209s
Amazing amazing contribution and work!
I’ve been trying to recreate the results using a custom dataset. However, unless I reduce the batch size to 1 and shrink the network, I am not able to fit the model in GPU memory. Here are some technical details:
- Size: 110M tokens
- Vocabulary: 2M unique tokens
- Max words: Tried both most frequent 30k and 60k
- I have written a generator to load data on demand and am doing all data transformations in parallel on the CPU
- CPU: Xeon 12 cores
- Memory: 64GB
- GPU: Nvidia 1080Ti / 11172MiB
- Embeddings: 400
- 3 Bidirectional LSTM layers (merge mode = concatenation)
- Last layer is a Dense layer with softmax activation, with sparse categorical cross-entropy as the loss
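As a rough sanity check on the architecture listed above, here is a minimal sketch that counts trainable parameters in plain Python. It assumes a standard LSTM cell with four gates and one bias vector per gate, and that the Dense layer sits on top of the concatenated forward and backward outputs. The function names are hypothetical, and the RNN width of the full-size network is not stated in this post, so `rnn_units` has to be supplied by whoever runs it.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias vector.
    return 4 * (units * (input_dim + units) + units)


def model_params(vocab_size, embed_dim, rnn_units, num_layers):
    # Embedding table: one vector per vocabulary entry.
    total = vocab_size * embed_dim
    input_dim = embed_dim
    for _ in range(num_layers):
        # A bidirectional layer holds two independent LSTMs (forward and backward).
        total += 2 * lstm_params(input_dim, rnn_units)
        # With merge mode = concatenation, the next layer sees 2 * rnn_units features.
        input_dim = 2 * rnn_units
    # Dense softmax layer: weights plus one bias per class.
    total += (input_dim + 1) * vocab_size
    return total


if __name__ == "__main__":
    # The reduced network described in this post: 10k words, 50-dim embeddings,
    # 2 bidirectional layers of 128 units.
    print(model_params(vocab_size=10_000, embed_dim=50, rnn_units=128, num_layers=2))
```

Under these assumptions the embedding and Dense layers scale linearly with the vocabulary size, while the LSTM layers do not depend on it at all.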
What I’ve tried:
- Reducing Batch Size, Max words, RNN and Embedding size
The working network I’m left with uses the top 10k words, an embedding size of 50, 2 RNN layers of 128 units, and a batch size of 32, and it takes forever (1 day and still going, with not even 1 epoch completed).
The primary bottleneck is the last layer, where the number of classes is so large (30k, 60k). This makes me really curious how @jeremy was able to run everything in one go.
Once again, thank you for taking the time to read!!
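To put a number on the last-layer cost, here is a back-of-the-envelope sketch of how much memory one copy of the softmax output takes. It assumes the model predicts a token at every timestep and stores float32 values. The sequence length of 70 is a made-up illustrative value, not something stated in this post, and `softmax_output_bytes` is a hypothetical helper.

```python
def softmax_output_bytes(batch_size, seq_len, vocab_size, bytes_per_float=4):
    # One probability per (sample, timestep, class), stored as float32 by default.
    return batch_size * seq_len * vocab_size * bytes_per_float


if __name__ == "__main__":
    for vocab in (10_000, 30_000, 60_000):
        size = softmax_output_bytes(batch_size=32, seq_len=70, vocab_size=vocab)
        print(f"vocab={vocab:>6}: {size / 2**30:.2f} GiB per copy of the output tensor")
```

This is only one tensor. During training the framework typically also keeps the pre-softmax logits and their gradients, so the real footprint of the output layer is likely a small multiple of this figure.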