So I am training a model with one-cycle for 1 epoch for a Kaggle competition (Google doodle). My dataset consists of 70K samples for each of 340 classes (NUM_CLASS). I am using a batch size of 800 (as large as GPU memory allows). The code is a modified version of @radek 's Fast.ai starter pack.
In my first try I set the dataloader's
num_workers=8 to utilize multiprocessing, but got SIGKILLs. There are many related issues on the PyTorch forums: https://discuss.pytorch.org/t/runtimeerror-dataloader-worker-pid-26317-is-killed-by-signal-aborted/16879 and probably here as well.
I've tried changing
create_func and my data-loading process but still couldn't get past this problem. For now I've reduced
num_workers to 4 and am using a smaller batch size of 200. The training ETA is 10 h and it's still running smoothly.
My real question is this:
I am seeing that CPU memory usage doesn't fluctuate around a mean value but instead increases linearly. Shouldn't CPU memory usage theoretically be bounded by roughly
max(size(item)) * (batch size) * (num_workers), since each worker only needs to hold the batch it is prefetching? How can CPU memory usage keep growing linearly as training continues?
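To check whether resident memory really grows per batch (rather than the OS just caching files), something like the following stdlib-only sketch could be dropped into the training loop. This is an assumption on my part, not from the starter pack; `dataloader` in the commented usage is a hypothetical placeholder for whatever loader you iterate over.

```python
# Minimal sketch (assumes Linux or macOS, stdlib only) for logging the
# process's peak resident set size, to confirm whether RSS grows per batch.
import resource
import sys

def peak_rss_mb():
    """Return this process's peak resident set size in MiB.

    ru_maxrss is reported in kilobytes on Linux but in bytes on macOS,
    so normalise before converting to MiB.
    """
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        rss //= 1024  # bytes -> KiB on macOS
    return rss / 1024  # KiB -> MiB

# Hypothetical usage inside a training loop:
# for i, batch in enumerate(dataloader):
#     ...training step...
#     if i % 100 == 0:
#         print(f"batch {i}: peak RSS {peak_rss_mb():.1f} MiB")
```

If the printed value climbs steadily across batches (rather than plateauing after the workers' prefetch buffers fill), that points at something accumulating per iteration rather than at the dataloader's steady-state footprint.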
Here is my script if it helps to understand what I am running (48 lines):