Model training in Databricks hangs

I’m trying to train a tabular learner in Databricks and am running into issues with the fit_one_cycle method.

If I specify only one training epoch, training usually completes. If I specify more than one epoch, training hangs with no error given. Sometimes it hangs even on the first epoch.

I’m running my code on a Standard NC12 worker with the 5.4 ML runtime.

Any help would be much appreciated!

It’s hard to help without seeing any code :wink:

@sgugger, @Andrew_Fowler
I have experienced the same problem with a text classifier learner after partially unfreezing it, as in the lines of code below. The issue mostly arises when using OverSamplingCallback(learn).

from fastai.callbacks import SaveModelCallback, OverSamplingCallback, ReduceLROnPlateauCallback

learn.freeze_to(-2)
learn.fit_one_cycle(8, lr,
                    callbacks=[SaveModelCallback(learn),
                               OverSamplingCallback(learn),
                               ReduceLROnPlateauCallback(learn, factor=0.8)])

PS: Training was done on a V100.