I’ve been working on the Carvana notebook, swapping out resnet34 for vgg16, and haven’t been able to train a model on the 1024 images. My notebook kernel died, and I got out-of-memory error messages in syslog on a 30GB Paperspace machine, so I tried a 100GB GCP instance and still had no luck. While training, I opened a second shell to the server and ran free -m periodically to check the available memory, and found that available memory continuously decreases.
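(As an aside, rather than re-running free -m by hand, a small poller can log this for you. A rough sketch, assuming Linux, that reads MemAvailable from /proc/meminfo; available_mb is just a name I made up:)

```python
import time

def available_mb():
    """Return MemAvailable from /proc/meminfo, in MB (Linux only)."""
    with open('/proc/meminfo') as f:
        for line in f:
            if line.startswith('MemAvailable:'):
                return int(line.split()[1]) // 1024  # the value is reported in kB
    return None

# take a few readings, roughly like running free -m repeatedly
for _ in range(3):
    print(available_mb(), 'MB available')
    time.sleep(1)
```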
Has anybody else run into these issues?
What batch size are you using?
@Caleb and what are the resolutions of the images like? (I haven’t spent time with that notebook yet)
I’ve tried a batch size of 1 and 1 worker on the dataloader. GPU memory doesn’t seem to be an issue; it stays constant when I check.
Does the resnet34 version train OK for you on the 30GB or 100GB machine?
It looks like it’s leaking memory as well. This is while the 1024 images are running.
With resnet34, memory usage actually capped out at about 70GB about halfway through 510 mini-batches of 8 images using 2 workers, and then dropped to under 20GB by the end.
vgg16 is actually making it through a complete epoch with the same batch size and workers, exhibiting the same pattern of memory usage.
This previous forum post identifies the issue: ThreadPoolExecutor greedily pulls batches for each iteration of self.sampler into memory in Python 3.6, whereas Python 3.5 pulled them in lazily.
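That greedy behaviour is easy to reproduce with just the standard library: Executor.map submits a future for every item of its input up front, so the whole sampler is drawn into memory before the first result is even consumed. A minimal demonstration (the sampler here is a stand-in, not fastai code):

```python
from concurrent.futures import ThreadPoolExecutor

pulled = []

def sampler():
    # stand-in for self.batch_sampler: record each item as it is pulled
    for i in range(1000):
        pulled.append(i)
        yield i

with ThreadPoolExecutor(max_workers=2) as e:
    results = e.map(lambda x: x * 2, sampler())
    next(results)   # consume just one result...

print(len(pulled))  # ...but the executor drained the entire iterable
```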
I am using an Azure Data Science Virtual Machine which has a P100 GPU.
This is what I am doing:
Load Densenet201 with precompute=True
bs = 64 (I tried 400, 300, 200, and 100); the GPU memory used is 4GB of the 16GB available
sz = 399
I am getting a MemoryError even though I am using just 30% of the total GPU memory available to me. Can anyone help me with this?
MemoryError Traceback (most recent call last)
----> 1 learn = ConvLearner.pr…
There are two workarounds:
Set num_workers to 0, which then runs batches in a single thread. This resulted in a max of 3GB of memory consumed in the scenario above.
Use the dataloader iterator from Pytorch as described
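For contrast, the num_workers=0 path goes through the builtin map, which is lazy: items are pulled from the sampler only as batches are consumed, which keeps memory bounded. A quick stand-in sketch:

```python
pulled = []

def sampler():
    # stand-in for the batch sampler: record each item as it is pulled
    for i in range(1000):
        pulled.append(i)
        yield i

# builtin map is lazy: nothing is pulled until a result is requested
batches = map(lambda x: x * 2, sampler())
next(batches)
print(len(pulled))  # only a single item was drawn from the sampler
```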
I’m continuing to research ways to make a permanent fix. Any ideas on how to do that or things to look into would be much appreciated.
That’s an interesting point. I’ll see what I can find out about this. Have you tried swapping the ThreadPoolExecutor for a ProcessPoolExecutor in the fastai source? (I don’t know if that behaves differently.)
I hacked around this problem by handling the batches a chunk at a time. Fixed it for me - let me know if anyone sees any issues. I haven’t tested it carefully for edge cases (e.g. fewer rows than num_workers*10) so there may be odd bugs still…
Makes sense - that fixed it.
    def get_batch(self, indices):
        res = self.np_collate([self.dataset[i] for i in indices])
        if self.transpose:   res[0] = res[0].T
        if self.transpose_y: res[1] = res[1].T
        return res

    def __iter__(self):
        if self.num_workers == 0:
            for batch in map(self.get_batch, iter(self.batch_sampler)):
                yield get_tensor(batch, self.pin_memory)
        else:
            with ThreadPoolExecutor(max_workers=self.num_workers) as e:
                # avoid py3.6 issue where queue is infinite and can result in memory exhaustion
                for c in chunk_iter(iter(self.batch_sampler), self.num_workers*10):
                    for batch in e.map(self.get_batch, c):
                        yield get_tensor(batch, self.pin_memory)
def save(fn, a): pickle.dump(a, open(fn,'wb'))
def load(fn): return pickle.load(open(fn,'rb'))
def load2(fn): return pickle.load(open(fn,'rb'), encoding='iso-8859-1')
def load_array(fname): return bcolz.open(fname)[:]
def chunk_iter(iterable, chunk_size):
    '''A generator that yields chunks of iterable, chunk_size at a time.'''
    while True:
        chunk = []
        try:
            for _ in range(chunk_size): chunk.append(next(iterable))
            yield chunk
        except StopIteration:
            if chunk: yield chunk
            break
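To sanity-check the chunking behaviour, here is a self-contained run of a fixed-up chunk_iter over a small range; note the trailing partial chunk is kept:

```python
def chunk_iter(iterable, chunk_size):
    '''Yield lists of up to chunk_size items from iterable.'''
    while True:
        chunk = []
        try:
            for _ in range(chunk_size):
                chunk.append(next(iterable))
            yield chunk
        except StopIteration:
            if chunk:
                yield chunk
            break

print(list(chunk_iter(iter(range(10)), 4)))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```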