Memory Error (not CUDA out-of-memory error)

I am using an Azure Data Science Virtual Machine which has a P100 GPU.

This is what I am doing:

  • Load Densenet201 with precompute=True
  • bs = 64 (I also tried 400, 300, 200, and 100); GPU memory used is 4 GB of the 16 GB available
  • sz = 399

I am getting a MemoryError even though I am using just 30% of the total GPU memory available to me. Can anyone help me with this?

Stack Trace:
MemoryError Traceback (most recent call last)
in ()
----> 1 learn = ConvLearner.pretrained(arch, data, precompute=True, ps=0.5, xtra_fc=[10000, 10000])

~/fastai/courses/dl1/fastai/ in pretrained(cls, f, data, ps, xtra_fc, xtra_cut, precompute, **kwargs)
96 def pretrained(cls, f, data, ps=None, xtra_fc=None, xtra_cut=0, precompute=False, **kwargs):
97 models = ConvnetBuilder(f, data.c, data.is_multi, data.is_reg, ps=ps, xtra_fc=xtra_fc, xtra_cut=xtra_cut)
---> 98 return cls(data, models, precompute, **kwargs)
100 @property

~/fastai/courses/dl1/fastai/ in init(self, data, models, precompute, **kwargs)
89 elif self.metrics is None:
90 self.metrics = [accuracy_thresh(0.5)] if else [accuracy]
---> 91 if precompute: self.save_fc1()
92 self.freeze()
93 self.precompute = precompute

~/fastai/courses/dl1/fastai/ in save_fc1(self)
141 m=self.models.top_model
142 if len(self.activations[0])!=len(
---> 143 predict_to_bcolz(m,, act)
144 if len(self.activations[1])!=len(
145 predict_to_bcolz(m,, val_act)

~/fastai/courses/dl1/fastai/ in predict_to_bcolz(m, gen, arr, workers)
11 lock=threading.Lock()
12 m.eval()
---> 13 for x,*_ in tqdm(gen):
14 y = to_np(m(VV(x)).data)
15 with lock:

~/.conda/envs/fastai/lib/python3.6/site-packages/tqdm/ in iter(self)
953 """, fp_write=getattr(self.fp, 'write', sys.stderr.write))
---> 955 for obj in iterable:
956 yield obj
957 # Update and possibly print the progressbar.

~/fastai/courses/dl1/fastai/ in next(self)
306 if self.i>=len(self.dl): raise StopIteration
307 self.i+=1
---> 308 return next(
310 @property

~/fastai/courses/dl1/fastai/ in iter(self)
74 with ThreadPoolExecutor(max_workers=self.num_workers) as e:
75 for batch in, iter(self.batch_sampler)):
---> 76 yield get_tensor(batch, self.pin_memory)

~/fastai/courses/dl1/fastai/ in get_tensor(batch, pin)
35 return {k: get_tensor(sample, pin) for k, sample in batch.items()}
36 elif isinstance(batch, collections.Sequence):
---> 37 return [get_tensor(sample, pin) for sample in batch]
38 raise TypeError("batch must contain numbers, dicts or lists; found {}"
39 .format(type(batch)))

~/fastai/courses/dl1/fastai/ in (.0)
35 return {k: get_tensor(sample, pin) for k, sample in batch.items()}
36 elif isinstance(batch, collections.Sequence):
---> 37 return [get_tensor(sample, pin) for sample in batch]
38 raise TypeError("batch must contain numbers, dicts or lists; found {}"
39 .format(type(batch)))

~/fastai/courses/dl1/fastai/ in get_tensor(batch, pin)
29 def get_tensor(batch, pin):
30 if isinstance(batch, (np.ndarray, np.generic)):
---> 31 batch = T(batch).contiguous()
32 return batch.pin_memory() if pin else batch
33 elif isinstance(batch, string_classes): return batch

~/fastai/courses/dl1/fastai/ in T(a)
11 if torch.is_tensor(a): res = a
12 else:
---> 13 a = np.array(np.ascontiguousarray(a))
14 if a.dtype in (np.int8, np.int16, np.int32, np.int64):
15 res = torch.LongTensor(a.astype(np.int64))


Thanks in advance!


Figured out the problem. RAM is getting used up completely.

Yes! Reduce the batch size to 16 or so, with a reduced learning rate.

Thanks Divyansh. That doesn't seem to be the problem, though.

The dataset is around 200 GB, and somehow fastai loads the entire dataset into memory, which is causing the error.

Has anyone figured out how to fix this?


fastai uses PyTorch, which uses generators to load the data, so I'd guess it never loads the complete dataset; it creates batches on the fly.
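For reference, lazy batch generation looks roughly like this sketch (illustrative only, not fastai's or PyTorch's actual loader code): the generator builds each batch only when the consumer asks for it, so only one batch is resident at a time.

```python
def lazy_batches(samples, bs):
    """Yield lists of at most `bs` samples, building each batch on demand."""
    batch = []
    for s in samples:
        batch.append(s)
        if len(batch) == bs:
            yield batch
            batch = []
    if batch:           # final short batch
        yield batch

batches = list(lazy_batches(range(10), bs=4))
print(batches)  # -> [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```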

I thought the same, but the RAM fills up after a few iterations; I'm not sure what causes that.


The same thing also happens on Colab…

There is some kind of memory leak caused by using ThreadPoolExecutor in DataLoader (fastai/ A temporary fix is to disable multi-threaded execution altogether. For example, the following diff fixes the issue but makes data loading slower:

     def __iter__(self):
-        with ThreadPoolExecutor(max_workers=self.num_workers) as e:
-            for batch in, iter(self.batch_sampler)):
-                yield get_tensor(batch, self.pin_memory)
+        for batch in map(self.get_batch, iter(self.batch_sampler)):
+            yield get_tensor(batch, self.pin_memory)


In Python 3.6, is not lazy on the input side. So in

 with ThreadPoolExecutor(max_workers=self.num_workers) as e:
     for batch in, iter(self.batch_sampler)):
         yield get_tensor(batch, self.pin_memory)

the call, iter(self.batch_sampler)) eagerly submits self.get_batch() to the pool for every element of iter(self.batch_sampler), and the results accumulate internally as the workers finish, regardless of how fast the returned generator is consumed. In our case that means all batches get processed and kept in memory simultaneously, even though we only want one batch to be generated at a time.


Thanks for the workaround Farhan.

@jeremy is there a way to use multiple workers but still fix the memory leak issue?

I hacked around this problem by handling the batches a chunk at a time. Fixed it for me - let me know if anyone sees any issues. I haven't tested it carefully for edge cases (e.g. fewer rows than num_workers*10), so there may be odd bugs still…


GPU on-board RAM, or regular motherboard RAM? I ask because I am having serious issues with CUDA out of memory too, where it seems to eat all the available GPU RAM regardless of which GPU (tested on both 4 GB and 8 GB cards), as reported by nvidia-smi. If the issue is motherboard RAM, then that is a much, much easier fix.

I had an issue with my CPU RAM: it was loading all the batches into memory. Jeremy suggested a workaround for that; you can check the messages above.

For the GPU CUDA out-of-memory error, you can try reducing the number of images per batch. Reduce it down to 4 or 8, maybe, and see if it works.

Yeah, I tried 64, 32, 24, and then 16. I can't understand why 16 would still fail: these images are smaller than 320 x 200, so that's like less than a second of video of a DOS game. How could a modern GPU possibly have difficulties? The only thing I can think of is that the model itself is that large, but even then, how could it be on the order of gigabytes?

Whatever the cause, I guess a batch of even 16 non-HD-resolution images is too much for a GPU with only 4 GB, at least on my setup; 6 GB should be the minimum. The issue persists with 32 GB of RAM on the 4 GB card, with usage hovering around 3 GB before the test phase. I think the weights and model are taking up the space. I've spent two days on it and I can't figure out the cause. Luckily, a 1080 card didn't have the issue once RAM was boosted to 32 GB.

Hi Jeremy, how did you chunk the batches? It's worth me trying this before I try to mess with the

--Oh, was it the bs parameter from "from_paths"?

I reduced the batch size to 16. It worked for me.

I have the same problem.
It works fine when I use small data, but trying 70,000 records of wiki gives me a memory error.
I tried bs=2, but it still doesn't work.
I tried installing cuDNN 7.4, but it still doesn't work.
I tried a small chunksize, but it still doesn't work.

fastai: v1.0.28
torch: v0.4
cuda: 9.2
gpu: 1080 Ti
gpu driver: 415
RAM: 16 GB

I also updated to torch v1.0.0, CUDA 10, and cuDNN 7.4.5, but it still doesn't work.

data_lm = TextLMDataBunch.from_folder("./dataset",bs=8)

~/.local/lib/python3.6/site-packages/fastai/ in array(a, *args, **kwargs)
    244     if not isinstance(a, collections.Sized) and not getattr(a,'__array_interface__',False):
    245         a = list(a)
--> 246     return np.array(a, *args, **kwargs)
    248 class EmptyLabel(ItemBase):


In my last project I tried vision with an 11 GB dataset and it worked fine. Now I'm trying to train a language model on 1.7 GB of wiki data, and it gives me a memory error whenever the dataset is larger than 200 MB. How do I fix this problem?

I am experiencing this issue in a Kaggle notebook with a batch size of 1 on segmentation. I’ve set num_workers=0.

Training runs, and memory accumulates as the epoch continues, until it crashes at or before the end of the epoch and restarts my kernel.
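One way to confirm it is host RAM (not GPU memory) accumulating across iterations is to trace the process's Python-level allocations each batch. This sketch uses the stdlib tracemalloc module, with a dummy loop standing in for the real training loop; if the "held" number climbs every iteration like it does here, something is retaining references to old batches.

```python
import tracemalloc

tracemalloc.start()
retained = []  # stand-in for whatever is accidentally holding on to batches
for i in range(5):
    retained.append(bytearray(10 * 1024 * 1024))  # fake 10 MiB batch
    current, peak = tracemalloc.get_traced_memory()
    # `current` grows by roughly 10 MiB per iteration here, because each
    # fake batch stays referenced instead of being freed
    print(f"iter {i}: ~{current / 2**20:.0f} MiB held")
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
```

In the real loop you would call tracemalloc.get_traced_memory() once per batch; a flat number means batches are being freed, a climbing one means a leak.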