I’m trying to train an NLP model. I have a training set of around 20,000 documents, averaging 45 words per document, and around 5,000 documents reserved for validation. I actually have around 5x this amount of data, but I wanted to start smaller for speed.
The time for each epoch to run seems to be random: it will often stay fixed at around 20 minutes for perhaps 5 or 6 epochs, then start increasing for no apparent reason. I’m currently seeing 1 hr 30 min of training time per epoch. Nothing has changed; it’s the exact same run, i.e. I’m using
learn_lm.fit_one_cycle(20, 1e-2, moms=(0.8,0.7)).
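For reference, here is a minimal way I could log wall-clock time per epoch to pinpoint exactly when the slowdown starts. This is just a sketch assuming fastai v1 (which the moms= signature suggests); EpochTimer is an illustrative name of my own, not a fastai built-in:

import time
from fastai.callback import Callback

class EpochTimer(Callback):
    "Print wall-clock seconds per epoch (hypothetical helper, not part of fastai)."
    def on_epoch_begin(self, **kwargs):
        self.start = time.time()
    def on_epoch_end(self, **kwargs):
        print(f"epoch took {time.time() - self.start:.1f}s")

learn_lm.fit_one_cycle(20, 1e-2, moms=(0.8,0.7), callbacks=[EpochTimer()])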
I’m working on an AWS EC2 p2.xlarge instance; the CPU stays at 50% the entire time (regardless of epoch training time), and both network and disk usage are low.
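To rule out the job silently falling back to the CPU, I can run a quick sanity check (a sketch using plain PyTorch calls, since fastai sits on top of torch):

import torch

print(torch.cuda.is_available())                 # should be True on a p2.xlarge
print(torch.cuda.get_device_name(0))             # p2.xlarge instances come with a Tesla K80
print(next(learn_lm.model.parameters()).device)  # expect cuda:0, not cpu

Watching nvidia-smi in a second terminal during both the fast and slow epochs would also show whether GPU utilization drops when the epoch time jumps.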
What causes the per-epoch training time to vary so drastically, and is there anything I can do to speed up the process? Unfortunately, I cannot use pre-trained models for this, so I have to train from scratch.