CPU RAM Usage Keeps Growing While Training with One Cycle

kcturgutlu · November 20, 2018, 2:50am

Hi,

So I am training a model with one cycle for 1 epoch for a Kaggle competition (Google doodle). My dataset consists of 70K * 340 (NUM CLASS) samples. I am using a batch size of 800 (as much as the GPU memory allows). The code is a modified version of @radek's Fast.ai starter pack.

In my first try I set the dataloader’s num_workers=8 to utilize multiprocessing, but got SIGKILLs. There are many related issues on the PyTorch forums: https://discuss.pytorch.org/t/runtimeerror-dataloader-worker-pid-26317-is-killed-by-signal-aborted/16879 and probably here as well.

I’ve tried changing create_func and my data loading process but still couldn’t get past this problem. For now I’ve reduced num_workers to 4 and am using a smaller batch size of 200. The ETA for training is 10 h, and so far it is running smoothly.

My real question is this:

I am seeing that CPU memory usage doesn’t fluctuate around a mean value but instead increases linearly. Shouldn’t CPU memory usage theoretically be at most max(size(item)) * (batch size)? How can CPU memory usage go up linearly as training continues?
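As a rough sanity check of that bound, here is a minimal sketch of the arithmetic. It assumes each item is a 256 x 256 x 3 float32 tensor (as the create_func in the script suggests) and that each worker keeps about two batches prefetched; the prefetch depth is an assumption, not something measured in this thread.

```python
def expected_loader_bytes(sz, channels, bs, num_workers, prefetch_per_worker=2):
    """Return (bytes per item, bytes per batch, bytes for all prefetched batches).

    prefetch_per_worker=2 is an assumed prefetch depth, not a measured value.
    """
    bytes_per_float32 = 4
    item_bytes = sz * sz * channels * bytes_per_float32
    batch_bytes = item_bytes * bs
    in_flight_bytes = batch_bytes * num_workers * prefetch_per_worker
    return item_bytes, batch_bytes, in_flight_bytes

MiB = 1024 ** 2
item_bytes, batch_bytes, in_flight_bytes = expected_loader_bytes(
    sz=256, channels=3, bs=200, num_workers=4
)
print(item_bytes / MiB)       # 0.75
print(batch_bytes / MiB)      # 150.0
print(in_flight_bytes / MiB)  # 1200.0
```

Under these assumptions the batches held by the loader account for roughly 1.2 GiB, so growth of tens of GB would have to come from something being retained across iterations.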

Here is my script if it helps to understand what I am running (48 lines):

github.com/KeremTurgutlu/quickdraw/blob/master/MobileNet.py

import pandas as pd
import numpy as np
from pathlib import Path
from fastai import *
from fastai.vision import *
from utils import *

NUM_VAL = 50 * 340

PATH = Path('../data/quickdraw/')

bs = 200
sz = 256
test_df = pd.read_csv(PATH/"test_simplified.csv")

def create_func(item):
    with open(item) as f: item = f.read()
    arr = list2drawing(eval(item), size=sz, lw=6, time_color=True)
    img = cv2.cvtColor(arr, cv2.COLOR_GRAY2RGB)
    tensor = torch.from_numpy(img).float()

This file has been truncated.

Thanks

radek · November 20, 2018, 7:11am

You mean CPU memory as in your computer’s RAM? You are right, this shouldn’t be happening, I have not seen this myself (though I have not tried the arch you are using).

Not sure that this will solve the issue, but what I would try would be getting a new pytorch nightly and potentially moving to a new version of fastai. This is the commit hash I was on when creating the starter pack: af068ecf9ba98c5c3383b59bb2a7d44b01337297.

kcturgutlu · November 20, 2018, 7:39am

Thanks for the reply. The code is running on a fresh instance that I started just today, so both pytorch and fastai are the newest versions.

Yes, I meant RAM. For example, for the first couple of hours it was around 17 GB; now, after 7 hours, it’s around 38 GB.

radek · November 20, 2018, 8:17am

That is a very big jump. I’ll try to keep an eye on the RAM myself when I train a model using this again (most likely today or tomorrow).

There is some chance the pid kills and RAM usage are not related. If you were getting the SIGKILLs and other people were not, chances are this is related to the version of the pytorch nightly that you are using. In case it helps, I am on:

cuda92                    1.0                           0    pytorch
pytorch-nightly           1.0.0.dev20181114 py3.7_cuda9.2.148_cudnn7.1.4_0  [cuda92]  pytorch

(This is the second nightly I tried, as I was getting some weird issues with another version. I guess this is to be expected with nightlies.)

This might be a bit of a far-fetched thought, but my reasoning is as follows: for the RAM usage to grow, something must not be getting cleaned up, or is being stored. An unusual thing that you are doing is having a super long epoch (50 million samples!). It might be a small thing that normally is not even noticed but here is exacerbated by the number of iterations.

I might be seeing things, but it seems that RAM usage climbs very slowly through the epoch: when I train with epochs of 200k samples, it increases by ~2 GB, then when it goes to validation all of this seems to be cleaned up…

I don’t know what could be getting saved. It would probably make sense to consider not only the training loop but also fastprogress; maybe something is up with the exponentially weighted average.
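One way to look for "something that is stored" is to compare allocation snapshots taken some iterations apart. Below is a minimal sketch using the standard library's tracemalloc; the leaky `fake_training_step` is a hypothetical stand-in for a real training step, and `top_growth` is an illustrative helper name. Note that tracemalloc only sees allocations made through Python's own allocator, so memory held by native tensor storage may not show up here.

```python
import tracemalloc

def top_growth(before, after, limit=5):
    """Return up to `limit` source lines whose allocated size grew between snapshots."""
    stats = after.compare_to(before, "lineno")  # sorted by |size_diff|, largest first
    return [s for s in stats if s.size_diff > 0][:limit]

leaked = []  # hypothetical: stands in for whatever keeps references alive

def fake_training_step(i):
    # Deliberately retain a buffer on every step to simulate a leak.
    leaked.append(bytearray(10_000))

tracemalloc.start()
before = tracemalloc.take_snapshot()
for i in range(1_000):
    fake_training_step(i)
after = tracemalloc.take_snapshot()
tracemalloc.stop()

growth = top_growth(before, after)
for stat in growth:
    print(stat)  # file:line, size delta and allocation count delta
```

In a real run the two snapshots would be taken a few hundred batches apart; a line whose size keeps growing between snapshots is a candidate for the leak.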

These are quite far-fetched ideas. I don’t think I will have the time to troubleshoot this, nor am I sure that there is much value in it: even epoch sizes of 200k are probably on the large side, not something users are likely to encounter. What I can do, and what would probably be a better use of our time, is share the code I use to sample epochs of 200k. Let me know if that is something you would be willing to try.

kcturgutlu · November 20, 2018, 10:34am

Thanks for the detailed explanation. Yeah, I would love to try your approach. It makes a lot of sense actually; storing data might be an issue when the epoch size is huge. I suppose the error is most probably an out-of-memory condition that causes the workers to be killed. I hope to finish full training in a couple of hours with the MobileNet model; it seems to converge very well.

Do you know if it’s possible to save checkpoints during a single epoch, using callbacks or any other option?

Thanks again!

radek · November 20, 2018, 10:54am

You could try running something in one of the batch callbacks (for instance, on_batch_begin), and you could use the SaveModelCallback from here https://github.com/fastai/fastai/blob/master/fastai/callbacks/tracker.py for inspiration (but instead of saving when some metric improves, you could run it every set number of iterations).
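The "save every set number of iterations" idea can be sketched without any framework. The class below is a hypothetical helper, not part of fastai: `SaveEveryNBatches`, `save_fn` and `every` are illustrative names. In fastai the same counter would presumably sit inside a callback's batch hook with something like the learner's save method passed as `save_fn`; that wiring is an assumption and is not shown.

```python
class SaveEveryNBatches:
    """Hypothetical helper: call save_fn(name) once every `every` batches."""

    def __init__(self, save_fn, every):
        if every <= 0:
            raise ValueError("every must be a positive integer")
        self.save_fn = save_fn
        self.every = every
        self.seen = 0  # number of batches processed so far

    def on_batch_end(self, **kwargs):
        # Extra keyword arguments are accepted and ignored, so the hook can be
        # called with whatever state a training loop chooses to pass along.
        self.seen += 1
        if self.seen % self.every == 0:
            self.save_fn(f"checkpoint_{self.seen}")

# Usage with a stand-in save function that only records the names.
saved = []
cb = SaveEveryNBatches(saved.append, every=3)
for _ in range(10):
    cb.on_batch_end()
print(saved)  # ['checkpoint_3', 'checkpoint_6', 'checkpoint_9']
```

Writing to a new name each time keeps every checkpoint; reusing a single name would keep only the latest and save disk space.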

This is a sampler that I have been using for quite some time now. The new data block API makes it super simple to use a custom sampler, which I think is super nice, because it opens a way to easily experiment with things like hard negative mining or some other way of constructing the batches. That seems like the next natural step on top of one cycle, and I don’t think it has been explored to a great extent yet. I think the fastai crew did something along those lines for some of the record-time ImageNet training, but I might be misremembering things.

Anyhow, here is the sampler:

import numpy as np
from torch.utils.data import Sampler

def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]

class RandomSamplerWithEpochSize(Sampler):
    """Yields epochs of specified sizes. Iterates over all examples in a data_source in random
    order. Ensures (nearly) all examples have been trained on before beginning the next iteration
    over the data_source - drops the last epoch that would likely be smaller than epoch_size.
    """
    # Note: requires len(data_source) > epoch_size, otherwise no epochs remain after the drop.
    def __init__(self, data_source, epoch_size):
        self.n = len(data_source)
        self.epoch_size = epoch_size
        self._epochs = []  # pre-shuffled index chunks, one per upcoming epoch

    def __iter__(self):
        return iter(self.next_epoch)

    @property
    def next_epoch(self):
        # Reshuffle and re-chunk only once every queued epoch has been consumed.
        if len(self._epochs) == 0: self.generate_epochs()
        return self._epochs.pop()

    def generate_epochs(self):
        idxs = list(range(self.n))
        np.random.shuffle(idxs)
        # Drop the last chunk, which is usually shorter than epoch_size.
        self._epochs = list(chunks(idxs, self.epoch_size))[:-1]

    def __len__(self):
        return self.epoch_size

and this is how I construct the data_bunch:

train_dl = DataLoader(
    label_lists.train,
    num_workers=12,
    batch_sampler=BatchSampler(
        RandomSamplerWithEpochSize(label_lists.train, 200_000), batch_size=bs, drop_last=True
    )
)
valid_dl = DataLoader(label_lists.valid, batch_size=2*bs, shuffle=False, num_workers=12)
test_dl = DataLoader(label_lists.test, batch_size=2*bs, shuffle=False, num_workers=12)

data_bunch = ImageDataBunch(train_dl, valid_dl, test_dl)
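To see what the sampler does to the index stream, here is a small pure-Python sketch of the same chunking logic, with no torch dependency. `epoch_stream` is an illustrative name rather than part of the code above, and it uses the standard library's random module instead of numpy; it also hands out chunks in order instead of popping from the end, which does not matter for shuffled indices.

```python
import random

def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]

def epoch_stream(n, epoch_size, seed=0):
    """Yield index lists of length epoch_size, reshuffling once a pass is used up."""
    if n <= epoch_size:
        # With a single chunk, dropping the last one would leave nothing to yield.
        raise ValueError("n must be larger than epoch_size")
    rng = random.Random(seed)
    while True:
        idxs = list(range(n))
        rng.shuffle(idxs)
        # Drop the final chunk, as the sampler does.
        for epoch in list(chunks(idxs, epoch_size))[:-1]:
            yield epoch

stream = epoch_stream(n=10, epoch_size=3)
first_pass = [next(stream) for _ in range(3)]
seen = [i for epoch in first_pass for i in epoch]
print(len(seen), len(set(seen)))  # 9 9
```

With 10 items and an epoch size of 3, one pass yields three disjoint epochs covering 9 items; the leftover item is skipped until the next reshuffle.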

kcturgutlu · November 20, 2018, 11:27am

Using this sampler together with the SaveModelCallback will be nice, thanks!

shub.chat · November 25, 2018, 4:09am

Were you able to solve this? I am getting SIGKILLs when working with a train size of >10 million samples. I know there are open issues on the PyTorch forums that appear to be about shared memory allocation when using num_workers > 0.
I tried increasing shmmax but am still not able to use multiprocessing when the number of iterations in an epoch is large.

kcturgutlu · November 27, 2018, 9:51am

In my case the issue was memory, so reducing batch size and using num_workers=4 helped.

devforfu · December 3, 2018, 4:09pm

Yeah, I have the same issue with this dataset as well; I have been fighting it for a few weeks already.

Also, it seems that using a plain PyTorch training loop (nightly version) shows the same memory leak. I am not sure why the amount of used RAM increases linearly during a single training epoch. Shouldn’t the samples be released after a batch is completed?

Kaushik · May 13, 2021, 2:51am

I have the same issue as well. No idea why this is happening; I’ve been stuck with this problem for a while now.

avinab · October 2, 2022, 10:41pm

Did anyone find a solution to this?

My RAM usage increases from 64 GB to 250 GB over 25 training epochs.