I’ve been working on the carvana notebook, swapping resnet34 out for vgg16, and haven’t been able to train a model with 1024 images. My notebook kernel died and I got out-of-memory error messages in syslog on a 30GB Paperspace machine, so I tried a 100GB GCP instance and still had no luck. While training, I opened a new shell to the server and ran free -m periodically to check the available memory, and found that available memory continuously decreases.
I’ve tried a batch size of 1 and a single worker on the dataloader. GPU memory doesn’t seem to be an issue; it stays constant when I check with nvidia-smi.
With resnet34 it actually capped out at about 70GB of memory usage about halfway through 510 mini-batches of 8 images using 2 workers, and then memory usage dropped to under 20GB by the end.
vgg16 actually makes it through a complete epoch with the same batch size and number of workers, and exhibits the same pattern of memory usage.
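If anyone wants to watch this from inside the training process instead of running free -m by hand, here is a small sketch using psutil (an extra dependency, not part of fastai) that logs host RAM alongside the GPU memory PyTorch has allocated:

import psutil
import torch

def log_memory(step):
    # available host RAM in MB, roughly what `free -m` reports as "available"
    avail_mb = psutil.virtual_memory().available // (1024 * 1024)
    # GPU memory currently allocated by PyTorch tensors, in MB
    gpu_mb = torch.cuda.memory_allocated() // (1024 * 1024) if torch.cuda.is_available() else 0
    print(f"step {step}: host available {avail_mb} MB, gpu allocated {gpu_mb} MB")

# e.g. call log_memory(i) every few hundred mini-batches inside the training loop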
This previous forum post identifies this issue as ThreadPoolExecutor greedily pulling batches for each iteration of self.sampler into memory in Python 3.6 vs. pulling them in lazily in Python 3.5.
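As a rough illustration of the greedy behaviour (this is not the actual fastai code, just a sketch assuming the loader does something like Executor.map over the sampler): Executor.map submits every work item up front, so everything the sampler yields is queued in memory as pending tasks even though results are consumed one at a time.

from concurrent.futures import ThreadPoolExecutor
import time

def fake_load(idxs):
    # stand-in for reading and augmenting one mini-batch of images
    time.sleep(0.001)
    return [i * 2 for i in idxs]

sampler = ([i, i + 1] for i in range(0, 20000, 2))  # generator of index batches

with ThreadPoolExecutor(max_workers=4) as ex:
    # Executor.map drains `sampler` immediately and queues every batch as a
    # pending task, so memory grows with the epoch size even though results
    # are consumed lazily below.
    for batch in ex.map(fake_load, sampler):
        pass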
There are two workarounds:
Set num_workers to 0, which then runs batches in a single thread. This resulted in a maximum of 3GB of memory being consumed in the scenario above.
Use the DataLoader iterator from PyTorch as described here (a rough sketch of both workarounds follows below).
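Here is a minimal sketch of both workarounds, assuming your dataset follows the torch.utils.data.Dataset protocol; the CarvanaMasks class below is a hypothetical stand-in, and the exact wiring into fastai’s ModelData will depend on your own code:

import torch
from torch.utils.data import Dataset, DataLoader

class CarvanaMasks(Dataset):
    # hypothetical stand-in dataset; swap in your real image/mask loading
    def __init__(self, n=512):
        self.n = n
    def __len__(self):
        return self.n
    def __getitem__(self, idx):
        x = torch.randn(3, 1024, 1024)  # stand-in for an input image
        y = torch.zeros(1, 1024, 1024)  # stand-in for a mask
        return x, y

ds = CarvanaMasks()

# Workaround 1: no worker threads at all (num_workers=0).
single_threaded = DataLoader(ds, batch_size=1, num_workers=0)

# Workaround 2: PyTorch's own DataLoader with worker processes, which pulls
# batches as they are needed rather than queueing the whole sampler up front.
lazy_workers = DataLoader(ds, batch_size=1, num_workers=2)

for xb, yb in single_threaded:
    break  # just demonstrating iteration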
I’m continuing to research ways to make a permanent fix. Any ideas on how to do that or things to look into would be much appreciated.
That’s an interesting point. I’ll see what I can find out about this. Have you tried swapping the ThreadPoolExecutor for a ProcessPoolExecutor in the fastai source? (I don’t know if that behaves differently.)
I hacked around this problem by handling the batches a chunk at a time. It’s fixed for me - let me know if anyone sees any issues. I haven’t tested it carefully for edge cases (e.g. fewer rows than num_workers*10), so there may still be odd bugs…
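The patch itself isn’t included in this post, so as a rough, hypothetical sketch of the general idea (submit only num_workers*10 items to the executor at a time instead of the whole epoch), something along these lines:

from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def chunked_map(fn, iterable, num_workers=4, chunk_size=None):
    # Submit only num_workers*10 items at a time, so at most one chunk of
    # pending batches is ever held in memory.
    chunk_size = chunk_size or num_workers * 10
    it = iter(iterable)
    with ThreadPoolExecutor(max_workers=num_workers) as ex:
        while True:
            chunk = list(islice(it, chunk_size))
            if not chunk:
                break
            yield from ex.map(fn, chunk)

# usage: for batch in chunked_map(load_batch, sampler, num_workers=2): ...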
There is a new tool called cstl ( GitHub - fuzihaofzh/cstl: The C++ Standard Template Library (STL) for Python ) that can solve this problem. It wraps C++ STL containers to avoid the issue, and it supports multiple types, including nested maps, lists, and sets, which numpy and PyTorch do not support.
Here is a simple example showing how it solves the problem:
from torch.utils.data import Dataset, DataLoader
import numpy as np
import torch
import cstl
from tqdm.auto import tqdm


class DataIter(Dataset):
    def __init__(self):
        cnt = 24000000
        self.cnt = cnt
        #self.data = np.array([x for x in range(cnt)])           # Good
        #self.data = [x for x in range(cnt)]                      # Leaky
        #self.data = cstl.MapIntInt({i: i for i in range(cnt)})   # Good
        self.data = cstl.VecInt(range(cnt))                       # Good

    def __len__(self):
        return self.cnt

    def __getitem__(self, idx):
        data = self.data[idx]
        data = np.array([int(data)], dtype=np.int64)
        return torch.tensor(data)


train_data = DataIter()
train_loader = DataLoader(train_data, batch_size=300,
                          shuffle=True,
                          drop_last=True,
                          pin_memory=False,
                          num_workers=18)

for i, item in tqdm(enumerate(train_loader)):
    torch.cuda.empty_cache()
    if i % 1000 == 0:
        print(i)