I’m training a neural machine translation model on tens of millions of lines of data. It is fine-tuning on top of a pretrained model, and one epoch would take roughly 35 hours on 8 Voltas. Is it still a good idea to let the model run for at least one epoch?
Of course this depends on the train / valid loss, but the improvement in valid loss also depends on the learning rate, which declines for the rest of the epoch after warmup. The valid loss would flatten out by virtue of that alone, so I expect the remaining gain within the epoch to be modest. I also think there would be some improvement once training cycles into the second epoch, because the learning rate warms up again, but it is very costly to wait until the end of an epoch just for the learning rate to go back up. There are two questions I’m not sure about:
1 - is it generally good practice to have the model see all the data once, i.e. at least one epoch?
2 - is it possibly ok to run the learning rate schedule over sub-epochs, say one-tenth of an epoch each, when the epoch is very long and the data is similar throughout?
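To make question 2 concrete, here is a rough sketch of the kind of sub-epoch schedule I have in mind. It is only an illustration under assumptions: the function name `sub_epoch_lr`, the linear warmup, the cosine decay, and all default values (`peak_lr`, `min_lr`, `warmup_frac`, `cycles_per_epoch`) are made up for the example and are not what my actual training setup uses.

```python
import math


def sub_epoch_lr(step, steps_per_epoch, cycles_per_epoch=10,
                 peak_lr=5e-4, min_lr=1e-6, warmup_frac=0.1):
    """Learning rate at a global step, restarting the schedule every sub-epoch.

    All names and defaults are hypothetical, chosen only for illustration.
    """
    # Length of one sub-epoch cycle in optimizer steps.
    cycle_len = max(1, steps_per_epoch // cycles_per_epoch)
    warmup_steps = max(1, int(cycle_len * warmup_frac))
    # Position inside the current cycle; wraps to 0 at each restart.
    pos = step % cycle_len
    if pos < warmup_steps:
        # Linear warmup from min_lr up to peak_lr.
        return min_lr + (peak_lr - min_lr) * pos / warmup_steps
    # Cosine decay from peak_lr back towards min_lr (assumed decay shape).
    decay_steps = max(1, cycle_len - warmup_steps)
    progress = (pos - warmup_steps) / decay_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Print the rate at a few points of a toy 1000-step epoch.
    for s in (0, 10, 50, 99, 100, 110):
        print(s, sub_epoch_lr(s, steps_per_epoch=1000))
```

With a 35-hour epoch and ten cycles, the learning rate would warm up again roughly every 3.5 hours instead of once per epoch.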
While I’m asking about an NLP model, I think this question applies to any kind of deep learning model where the dataset is huge.
Any thoughts appreciated. Thank you!