Epoch training time

#1

I’m trying to train an NLP model. I have a training set of around 20,000 documents, averaging 45 words per document, with around 5,000 documents reserved for validation. I actually have around 5× this amount of data, but I wanted to start smaller for speed.

The time for each epoch seems to vary at random: it will often stay fixed at around 20 minutes for perhaps 5–6 epochs, then start increasing for no apparent reason. I’m currently seeing a 1 hr 30 min training time per epoch. Nothing has changed; it’s the exact same run (i.e., I’m using `learn_lm.fit_one_cycle(20, 1e-2, moms=(0.8,0.7))`).

I’m working on an AWS EC2 p2.xlarge instance; the CPU stays at 50% the entire time (regardless of epoch training time), and both network and disk usage are low.

What causes the per-epoch training time to vary so drastically, and is there anything I can do to speed up the process? Unfortunately I cannot use pre-trained models for this, so I have to start from scratch.


#2

Never mind. It turns out AWS gave me a bum instance with no visible GPU :expressionless: I set up a new instance, and it trains monumentally faster :expressionless: 3 days wasted!
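For anyone who hits the same thing: before kicking off a long run, it’s worth confirming the instance actually exposes a GPU. Here’s a minimal, stdlib-only sketch (the `gpu_visible` helper is just illustrative; with PyTorch/fastai installed you can equally check `torch.cuda.is_available()`):

```python
import shutil
import subprocess

def gpu_visible():
    """Best-effort check that this machine exposes an NVIDIA GPU.

    Returns False if nvidia-smi is missing or reports no devices.
    """
    if shutil.which("nvidia-smi") is None:
        return False  # no NVIDIA driver tooling on this box
    try:
        out = subprocess.run(
            ["nvidia-smi", "-L"],  # prints one line per GPU, e.g. "GPU 0: ..."
            capture_output=True, text=True, timeout=10,
        )
    except (subprocess.SubprocessError, OSError):
        return False
    return out.returncode == 0 and "GPU" in out.stdout

print("GPU visible:", gpu_visible())
```

On a healthy p2.xlarge, `nvidia-smi -L` should list a GPU; if this returns False, training silently falls back to the CPU, which would match the flat ~50% CPU usage and huge epoch times above.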
