SaveModelCallback Not Working for Collaborative Filtering

I haven’t tested it with anything else, but this callback isn’t working when trying to fit a collab learner. The callback is initialized at training start, but none of its methods are called, and the model is neither saved during training nor is the best one loaded at the end. Any ideas why?

How and when are you setting up the callback? On the Learner or on fit?

On fit:

def collab_train_eval(dls, use_nn, y_range, layers, epochs, max_lr, wd):
    learn = collab_learner(dls, use_nn=use_nn, y_range=y_range, layers=layers,
                           metrics=[rmse, mae])
    learn.fit_one_cycle(epochs, max_lr, wd=wd, callbacks=[
        SaveModelCallback(monitor='valid_loss',
                          fname='optim/temp_best_collab')
    ])
    learn.fit_one_cycle(epochs, max_lr, wd=wd)  # wd passed by keyword, not positionally
    return learn.validate()
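One thing worth double-checking (a sketch of a possible cause, not a confirmed diagnosis): fastai v2 renamed the per-fit callback argument from v1’s `callbacks` to `cbs`. The stub below is a stand-in with a v2-style keyword surface, not real fastai code, and just shows how a v1-style call never hands the callback to the training loop:

```python
# Stdlib-only stand-in for fastai v2's Learner.fit_one_cycle keyword surface.
# The real method has more parameters; only the ones relevant here are modeled.
def fit_one_cycle(n_epoch, lr_max=1e-3, wd=None, cbs=None):
    # In fastai v2, per-fit callbacks are read only from `cbs`.
    return list(cbs) if cbs is not None else []

# v2-style call: the callback reaches the training loop.
assert fit_one_cycle(5, 1e-3, cbs=['SaveModelCallback']) == ['SaveModelCallback']

# v1-style call: `callbacks` is not an accepted keyword on a v2-style signature.
try:
    fit_one_cycle(5, 1e-3, callbacks=['SaveModelCallback'])
except TypeError as e:
    print(e)  # fit_one_cycle() got an unexpected keyword argument 'callbacks'
```

If the code above is running against fastai v2, switching `callbacks=[...]` to `cbs=[...]` would be the first thing to try.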

Also, slightly unrelated, but I know you’ve been working a lot with tabular data: dropout doesn’t work on the tabular_learner in fastai2. I tried submitting a PR but was blocked, and I haven’t gotten around to going through the process again, but I figured I’d let you know in case you’ve seen any weird behaviour :slight_smile:


Hmm… odd. I’ll take a look at both this weekend :slight_smile:


Hi everyone, hope all is well and you’re having a jolly weekend!

I am doing lesson 8 on Google Colab and the notebook is failing at this point, when fine-tuning the model. It’s not completing the 10 epochs.

   learn.unfreeze()
   learn.fit_one_cycle(10, 2e-3)

   80.00% [8/10 8:02:46<2:00:41]

   epoch  train_loss  valid_loss  accuracy  perplexity  time
   0      3.893891    3.804772    0.313039  44.915024   1:00:20
   1      3.864590    3.758198    0.318095  42.871124   1:00:13
   .......
   7      3.567819    3.620880    0.334902  37.370434   1:00:20

   37.50% [1972/5258 21:36<36:00 3.5383]

This has failed about six times now!!

Things tried so far:

  1. Reduced the batch size to 64; instead of failing after 2-3 epochs, it now fails after 6 epochs, but each epoch takes longer.
  2. Created a keep-alive, as it fails if my screen saver comes on for longer than 15-30 minutes.
  3. I will try reducing the batch size to 32.
  4. I have tried creating a simple callback with the following:

learn.unfreeze()
learn.fit_one_cycle(10, 2e-3, callbacks=[SaveModelCallback(learn, every='epoch')])

This gives the following error:

TypeError: argument of type 'LMLearner' is not iterable

learn.unfreeze()
learn.fit_one_cycle(10, 2e-3, callbacks=[SaveModelCallback(learn, every='epoch', monitor = 'accuracy')])

This gives an error as well:

TypeError: __init__() got multiple values for argument 'monitor'

Q. What is the correct way to configure the simplest SaveModelCallback for this NLP model?

Any information greatly appreciated.
Cheers mrfabulous1 :grinning: :grinning:
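Both errors above are consistent with v1-style calls hitting a fastai v2 signature, where `SaveModelCallback` no longer takes the learner as its first argument. A stdlib-only sketch, using a deliberately simplified stand-in for the v2 signature (the real class has more parameters; only the ones relevant to the two errors are modeled), reproduces them:

```python
class LMLearner:
    """Stand-in for the user's language-model learner object."""
    pass

class SaveModelCallback:
    """Simplified stand-in for fastai v2's SaveModelCallback signature."""
    def __init__(self, monitor='valid_loss', fname='model', every_epoch=False, **kwargs):
        self.monitor, self.fname, self.every_epoch = monitor, fname, every_epoch
    def before_fit(self):
        # Tracker callbacks inspect the monitored metric *name*; if a learner
        # was accidentally bound to `monitor`, the string test blows up.
        if 'loss' in self.monitor:
            pass

learn = LMLearner()

# Call #1: `learn` silently lands in `monitor`; it fails once training starts
# and the callback tries to use `monitor` as a string.
cb = SaveModelCallback(learn, every='epoch')
try:
    cb.before_fit()
except TypeError as e:
    print(e)  # argument of type 'LMLearner' is not iterable

# Call #2: `learn` fills `monitor` positionally, then the keyword collides.
try:
    SaveModelCallback(learn, every='epoch', monitor='accuracy')
except TypeError as e:
    print(e)  # got multiple values for argument 'monitor'
```

If this is fastai v2, the working equivalent is likely `learn.fit_one_cycle(10, 2e-3, cbs=[SaveModelCallback(monitor='accuracy', every_epoch=True)])`: v2 drops the learner argument, replaces `every='epoch'` with `every_epoch=True`, and takes the callback list via `cbs` rather than `callbacks`.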

Did you ever get a chance to look into this?


I’m afraid I did not. I’d open an issue with a reproducer on the GitHub repo :slight_smile:

Okay no worries – will do :slight_smile: