SaveModelCallback() random fail: No such file or directory (but usually works fine)

Unet / ResNet doing image segmentation. Using Optuna for hyperparameter tuning.

import os
from fastai.vision.all import *  # unet_learner, resnet34, FocalLossFlat, SaveModelCallback

code_path = os.getcwd()
model_dir = "segmentation_model"

src_learner = unet_learner(
    src_dataloader,
    resnet34,
    n_out=256,
    path=code_path,
    model_dir=model_dir,
    loss_func=FocalLossFlat(axis=1, gamma=focal_gamma),
)

with src_learner.no_bar():
    src_learner.fine_tune(
        epochs=20,
        base_lr=src_lr,
        freeze_epochs=freeze_epochs,
        cbs=SaveModelCallback(
            # saves the best model for this trial as model_<trial_number>.pth in model_dir
            fname="model_" + str(trial_number),
            with_opt=True,
        ),
    )

trial_number is the Optuna trial number: an integer that increments by 1 with each trial.
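
For context, all of this runs inside an Optuna objective, roughly like the sketch below. train_one_trial() is a hypothetical wrapper standing in for the learner code above, and the suggest_* ranges and n_trials are just illustrative, not my actual search space.

import optuna

def objective(trial):
    trial_number = trial.number  # 0, 1, 2, ... one per trial
    src_lr = trial.suggest_float("src_lr", 1e-4, 1e-2, log=True)
    freeze_epochs = trial.suggest_int("freeze_epochs", 0, 3)
    focal_gamma = trial.suggest_float("focal_gamma", 0.5, 3.0)
    try:
        # train_one_trial is a hypothetical wrapper around the learner code above:
        # it builds src_dataloader, creates src_learner, runs fine_tune(),
        # and returns the validation loss for Optuna to minimize
        return train_one_trial(trial_number, src_lr, freeze_epochs, focal_gamma)
    except Exception:
        return float("nan")  # a failed trial returns NaN so it gets skipped (see below)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=200)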

Here’s the problem: this usually works just fine. It trains the model, saves the best one, then goes to the next trial, adjusts the parameters and does it all again. But occasionally it goes on the fritz and starts throwing errors such as this:

[Errno 2] No such file or directory: '/home/florin/code/src/segmentation_model/model_99.pth'

That folder exists and already contains all the models from the other, successful trials. The permissions are right; I run everything under that user account, and the other trials would not succeed otherwise.

When the error occurs, there is actually no such file in the models folder. It looks like SaveModelCallback() tries to open a new file to save a new model, and for whatever reason it fails.

The error, AFAICT, is random. My error handling catches it and returns a NaN to Optuna so the trial is skipped. Usually, after many failed trials it somehow recovers and starts cranking out models again.

There’s no discernible pattern to the trial numbers that may cause this.

The code is literally the same - it’s just Optuna training model after model in a loop, slightly changing the parameters each time.

The disks are not full at all; I have many hundreds of GB available. I’ve tried a magnetic drive and an SSD, and the error is the same.

What could possibly cause this?

I will try to investigate basic things, such as the open files limit. The dataloaders do open a whole lot of image files, so maybe it’s related.

florin@media:~$ ulimit -a
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) 0
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 127808
max locked memory           (kbytes, -l) 4098720
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1024
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 127808
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited
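
In case it does turn out to be the open-files limit (the soft limit above is only 1024, and the dataloaders touch a lot of image files), this is how I’d raise the soft limit from inside the script before building the dataloaders. Just a sketch; an unprivileged process can only raise the soft limit up to the hard limit.

import resource

# Inspect the per-process open-files limits (Linux).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open files: soft={soft}, hard={hard}")

# Raise the soft limit to the hard limit for this process only.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))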

I’ve found a pattern: all failed trials have freeze_epochs = 0, and all successful trials have freeze_epochs >= 1.

If I remove the callback, then fine_tune(freeze_epochs=0) works fine.

This looks like a bug to me. The default is unet_learner.fine_tune(freeze_epochs=1), so I guess most users stick with that or try larger values. There might be an assumption in the fast.ai code (maybe in the callback) that at least one frozen epoch runs, because of that widely used default.
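
To make that hypothesis concrete, here is my mental model of the interaction, written as a stripped-down sketch. This is my own paraphrase, not the actual fastai source; the only learner API it assumes is the save()/load() pair. As far as I can tell from the docs, fine_tune() runs two fits (a frozen phase of freeze_epochs, then the unfrozen phase) and passes cbs to both, so with freeze_epochs=0 the first fit runs zero epochs:

# My paraphrase of the suspected behaviour -- NOT the actual fastai code.
class SaveBestSketch:
    def __init__(self, fname):
        self.fname, self.best = fname, float("inf")

    def after_epoch(self, learn, valid_loss):
        # Save a checkpoint only when the monitored metric improves.
        if valid_loss < self.best:
            self.best = valid_loss
            learn.save(self.fname)

    def after_fit(self, learn):
        # Reload the "best" checkpoint at the end of every fit.
        # If the frozen phase ran zero epochs, after_epoch never fired,
        # no .pth file was written, and this load raises
        # "No such file or directory".
        learn.load(self.fname)

If that is roughly what happens, it would explain the pattern: with freeze_epochs >= 1 at least one epoch of the frozen fit saves a checkpoint before the end-of-fit load, and with freeze_epochs = 0 there is nothing to load.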

I added this to my code after I instantiate the learner but before I run fine_tune():

if freeze_epochs == 0:
    # Pre-create the checkpoint so the file exists before training starts.
    src_learner.save("model_" + str(trial_number), with_opt=True)

With these lines in place, the crash does not occur anymore, because the file is there before training begins. But is it a good idea? Does it have any side effects that I may not be thinking of (I’m not super-familiar with fast.ai)?
