Running out of memory on big image sets


I am running on an AWS p3.2xlarge (64GB RAM). I have a training dataset of 65000 images, all at least 600x600, so the dataset is > 25GB. I am resizing them with augmentation down to 320x320. What I notice is that while the augmentation is happening, memory utilization keeps increasing until the process crashes. I suspect this is because the memory used by the augmented images is not being freed. I have looked in the source code but can't yet figure out where to fix this.

Can anyone help?

In ImageClassifierData, set num_workers to 0. It's a problem with multithreading, I guess.

What’s your current batch size? Maybe try lowering it and see what happens?

Have you tried reducing your batch size?
For the initial cycles of training you can also reduce the image size expected by the model. This will crop the image, so you lose some of it, but it will train much faster and use less memory. As your model starts to overfit, increase the expected image size again, but that is when you may find you need to reduce the batch size.
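As a rough back-of-the-envelope check (a sketch assuming 3-channel float32 tensors; the real pipeline's overhead will be larger), per-batch memory scales linearly with batch size but quadratically with image side length, which is why shrinking the image size helps so much:

```python
def batch_bytes(bs, sz, channels=3, dtype_bytes=4):
    """Approximate memory for one batch of square images stored as float32."""
    return bs * channels * sz * sz * dtype_bytes

MB = 1024 ** 2
# Halving the image side cuts per-batch memory 4x; halving bs cuts it 2x.
print(batch_bytes(64, 320) / MB)  # 75.0  MB at bs=64, 320x320
print(batch_bytes(64, 160) / MB)  # 18.75 MB at bs=64, 160x160
print(batch_bytes(32, 320) / MB)  # 37.5  MB at bs=32, 320x320
```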


I don’t think it’s batch-size related; the augmentation is handled by a ThreadPoolExecutor with num_workers == n.

There is always an initial part of an epoch where the augmentation is done and the GPU sits waiting for it, so I increased num_workers. I was using 30 (because there are 32 cores on that instance); the default is 8.
I suspect the default of 8 was chosen for a reason, because with it memory utilization stays constant at about 5%, at the expense of a slower epoch duration (all else being equal).
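The growth pattern is consistent with workers producing augmented batches faster than the GPU consumes them, so completed-but-unconsumed results pile up. A minimal illustration (not the fastai code itself; `augment` and `bounded_map` here are hypothetical stand-ins) of capping the number of in-flight jobs per worker:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def augment(i):
    # stand-in for a real augmentation: returns a decoded-image buffer
    return bytearray(1024)

def bounded_map(fn, items, workers=8, prefetch=2):
    """Yield results while keeping at most workers * prefetch jobs in flight,
    so finished-but-unconsumed results cannot accumulate without bound."""
    items = iter(items)
    with ThreadPoolExecutor(max_workers=workers) as ex:
        pending = [ex.submit(fn, i) for i in islice(items, workers * prefetch)]
        while pending:
            fut = pending.pop(0)
            nxt = next(items, None)
            if nxt is not None:
                pending.append(ex.submit(fn, nxt))  # refill the window
            yield fut.result()

total = sum(len(buf) for buf in bounded_map(augment, range(100)))
print(total)  # 102400
```

With more workers the window (`workers * prefetch`) grows, which is one plausible reason higher num_workers raises peak memory even when throughput barely improves.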


If I remember correctly, data augmentation is done on the CPU (most likely via the opencv library), so it is definitely not batch-size related.
The easiest way to ask for help in these cases is to gist the stack trace and paste the link here. Otherwise we will just be guessing.

Could you share some of the relevant code, please? It would be easier to understand what’s going on.

@nextM what exactly is the error message? Is it something like OS error: out of memory?


This is the source code for the initial training of the FC layer, before unfreezing:

learn = ConvLearner.pretrained(f_model, data, ps=ps, xtra_fc=xtra_fc,
                               precompute=False, metrics=metrics), 3, cycle_len=1)

Key to this is the data loader:

return ImageClassifierData.from_csv(path, f'{train_class}/train', labels_file_multi_dev, bs, tfms,
                                        suffix='', val_idxs=val_idxs,
                                        test_name=f'{train_class}/test', num_workers=30)

The initial augmentation starts, then data starts to be fed to the GPU. However, as more augmentation is always occurring, the memory usage jumps up by around 500MB every few seconds.
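One way to confirm the growth is to log the process's peak resident set size every few batches. A sketch using only the standard library (Unix-only; note `ru_maxrss` is reported in KB on Linux but in bytes on macOS):

```python
import resource
import sys

def peak_rss_mb():
    """Peak resident set size of the current process, in MB."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is KB on Linux, bytes on macOS
    divisor = 1024 ** 2 if sys.platform == "darwin" else 1024
    return rss / divisor

# Call this every few batches: a value that climbs steadily without
# plateauing suggests augmented batches are being retained, not freed.
print(f"peak RSS: {peak_rss_mb():.1f} MB")
```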

This time around, I can’t reproduce it. 2 differences:

  1. CPU utilisation never went above 800%, before it was at 2000%
  2. Memory usage maxed out at 50GB

It makes sense that this process would use a lot of RAM, but I would expect it to release some of it, since these are randomly generated images (unless they are being cached so they can be resent to the model?). So it would be great to figure this out and get the RAM usage down. For example, if you had 2 GPUs on your own box with 64GB RAM, this would prevent the 2nd GPU from being used.

I think this could have been my actual issue: it looks like a dimension-ordering bug in the latest version.

The RAM usage is high, but the stack trace was not from a memory error. I don’t get this crash on the fastai version pulled on 12 March.

@nextM did you resolve the issue?
Did I understand correctly that with num_workers=8 there was low memory usage, and with num_workers=30 memory exploded?
If that is the case, try to find something in between. For my experiments I modified the source dataloader iterator, but I am not sure yet whether that applies here.