I'm doing image segmentation with a U-Net on a ResNet backbone (fastai), using Optuna for hyperparameter tuning.
import os
from fastai.vision.all import (
    unet_learner, resnet34, FocalLossFlat, SaveModelCallback,
)

code_path = os.getcwd()
model_dir = "segmentation_model"

src_learner = unet_learner(
    src_dataloader,
    resnet34,
    n_out=256,
    path=code_path,
    model_dir=model_dir,
    loss_func=FocalLossFlat(axis=1, gamma=focal_gamma),
)

with src_learner.no_bar():
    src_learner.fine_tune(
        epochs=20,
        base_lr=src_lr,
        freeze_epochs=freeze_epochs,
        cbs=SaveModelCallback(
            fname="model_" + str(trial_number),
            with_opt=True,
        ),
    )
trial_number is the Optuna trial number, an integer that increments by 1 each time.
Here’s the problem: this usually works just fine. It trains the model, saves the best model, then goes to the next trial, adjusts the parameters, and does it all again. But occasionally it goes on the fritz and starts throwing errors such as this:
[Errno 2] No such file or directory: '/home/florin/code/src/segmentation_model/model_99.pth'
That folder exists and already contains all the models from the other, successful trials. The permissions are right; I run everything as that user account, and the other trials would not succeed otherwise.
When the error occurs, there is actually no such file in the models folder. It looks like SaveModelCallback() tries to open a new file to save a new model and, for whatever reason, fails.
The error, as far as I can tell, is random. My error handling catches it and returns NaN to Optuna, so the trial is skipped. Usually, after many failed trials, it somehow recovers and starts cranking out models again.
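The error handling looks roughly like this (a sketch; `objective_wrapper` and `train_one_trial` are placeholder names, not my exact code):

```python
import math


def objective_wrapper(train_one_trial, trial_number):
    # Run one training trial; if SaveModelCallback (or anything else)
    # raises FileNotFoundError, report NaN so Optuna skips the trial.
    try:
        return train_one_trial(trial_number)
    except FileNotFoundError as e:
        print(f"Trial {trial_number} failed: {e}")
        return float("nan")


# Stand-in trial that mimics the intermittent failure
def fake_trial(n):
    raise FileNotFoundError(
        2, "No such file or directory", f"segmentation_model/model_{n}.pth"
    )


result = objective_wrapper(fake_trial, 99)
print(result)  # nan
```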
There’s no discernible pattern to the trial numbers that may cause this.
The code is literally the same - it’s just Optuna training model after model in a loop, slightly changing the parameters each time.
The disks are not full at all; I have many hundreds of GB available. I’ve tried both a magnetic drive and an SSD, and the error is the same.
What could possibly cause this?
I will try to investigate basic things, such as the open files limit. The dataloaders do open a whole lot of image files, so maybe it’s related.
florin@media:~$ ulimit -a
real-time non-blocking time (microseconds, -R) unlimited
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 127808
max locked memory (kbytes, -l) 4098720
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 127808
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
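To watch the descriptor count from inside the training process, something like this should work (a sketch; counting entries in /proc/self/fd is Linux-specific):

```python
import os
import resource

# Soft/hard limits on open file descriptors for this process
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# Number of descriptors currently open (Linux-specific: /proc/self/fd)
open_fds = len(os.listdir("/proc/self/fd"))

print(f"open fds: {open_fds} / soft limit {soft} (hard limit {hard})")
```

Logging this once per trial would show whether the descriptor count creeps up toward the 1024 soft limit across trials.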