I'm working on the Amazon dataset on my own machine with a Titan Xp. I have trained the network on the images, first with an image size of 64, then 128, and finally 256. When training with image size 256, Jupyter Notebook gives the following message during the first epoch:
The kernel appears to have died. It will restart automatically.
I also checked whether the issue can be reproduced with the dogscats dataset, and it can (for this I had to copy the dataset multiple times into the same folder). It occurred after running:
sz = 224
model = resnet34
data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(model, sz))
learn = ConvLearner.pretrained(model, data, precompute=True)
I found out that the cause of the freeze is that the CPU RAM fills up completely, and so does the swap. CPU memory is consumed during the batch training process: the usage increases (roughly linearly) over the batches, so it looks like something from every batch is kept in memory during training and the space used by earlier batches is never freed. It also seems that the memory usage only grows this much during the first epoch.
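For anyone who wants to reproduce the measurement: something like the snippet below (a rough sketch using psutil; the interval and sample count are arbitrary) can be started in a background thread before calling learn.fit() to watch the resident memory of the notebook process grow.

import threading
import time
import psutil

def log_ram_usage(interval=5, samples=120):
    # print the resident memory of the notebook process every `interval` seconds
    proc = psutil.Process()
    for _ in range(samples):
        rss_gb = proc.memory_info().rss / 1024 ** 3
        print(f"RSS: {rss_gb:.2f} GB")
        time.sleep(interval)

# start logging, then run learn.fit(...) in the next cell
threading.Thread(target=log_ram_usage, daemon=True).start()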
I checked whether the GPU is being used, and it is.
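(As a quick sanity check from Python, something along these lines can confirm that PyTorch sees the card; this is just illustrative, not part of the problem itself:)

import torch

print(torch.cuda.is_available())      # should print True
print(torch.cuda.get_device_name(0))  # name of the GPU, e.g. the Titan Xp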
I have seen this issue mentioned, less explicitly, in several other topics on the forum, but no solution was given anywhere. When googling, I could not find reports of the same issue when using PyTorch on its own. So I hope somebody can point out whether I made a mistake or whether this is an issue in the fastai library (or in PyTorch).