Kernel dies when using the fastai library due to a CPU memory issue

I am also working on replicating the planet classification approach, but on Kaggle's Google Landmark Recognition Challenge (using softmax instead of sigmoid). It fails at the get_data(sz=batch_size) step even for a batch size of 16: it fills up all the RAM and crashes (then restarts) the notebook. Is there a way to lazily load one batch at a time? The dataset is about 336 GB and I am working with 56 GB of RAM + a K80 GPU on Azure's NC6 VM.
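For reference, here is a minimal sketch (plain PyTorch, not the fastai API) of what I mean by lazy loading: a Dataset that only opens an image when it is indexed, so the DataLoader pulls one batch at a time from disk. The folder, label, and transform names are placeholders.

```python
import os
from PIL import Image
from torch.utils.data import Dataset, DataLoader

class LazyImageDataset(Dataset):
    """Reads each image from disk only when it is requested, so memory
    use stays proportional to one batch rather than the whole dataset."""
    def __init__(self, image_dir, labels, transform=None):
        self.paths = sorted(os.path.join(image_dir, f) for f in os.listdir(image_dir))
        self.labels = labels            # assumed: one label per file, same order as self.paths
        self.transform = transform

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        img = Image.open(self.paths[idx]).convert('RGB')
        if self.transform is not None:
            img = self.transform(img)
        return img, self.labels[idx]

# Batches are assembled on the fly; nothing is preloaded:
# dataset = LazyImageDataset('train-jpg', train_labels, transform=my_tfms)
# loader = DataLoader(dataset, batch_size=16, shuffle=True, num_workers=4)
```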

It's possible that ThreadPoolExecutor is being used elsewhere too. I faced the issue in lesson1.ipynb. I used tracemalloc to drill down to the exact location that was allocating memory: it was the np_collate() function, called from the ThreadPoolExecutor's run() function. In my case it happened when creating and saving batches; it's very likely the same thing happens when loading batches for training.
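In case it helps anyone else reproduce this, here is roughly how to use tracemalloc for this kind of hunt (a sketch; where you take the snapshots depends on your notebook):

```python
import tracemalloc

tracemalloc.start(25)                    # keep 25 frames so the traceback reaches np_collate
before = tracemalloc.take_snapshot()

# ... run the step that blows up memory, e.g. the batch creation / get_data call ...

after = tracemalloc.take_snapshot()
for stat in after.compare_to(before, 'traceback')[:10]:
    print(stat)                          # biggest allocation growth first, with file:line info
    for line in stat.traceback.format():
        print('   ', line)
```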

Farhan's fix works in my case, but the training time increases many-fold. It would be really helpful if someone could fix this issue properly.

I think there are two separate issues here. Farhan's workaround is for issue 184, which manifests in DataLoader.

I believe the issue I'm experiencing with the Landmark dataset is the same one Bart and Deepak reported (at least all the symptoms are the same). It manifests before DataLoader is used. I've pinpointed it to dataset.py:138 in the label handling and am looking to fix it.

Submitted a PR that should let us handle 4 times as much training data. Still not enough, so let me know if you can think of a more complete fix!

I did a temporary fix for the ThreadPoolExecutor problem while keeping multithreading.
I just copied the _DataLoaderIter object from PyTorch and added a few lines. I hope that a better solution can be found soon.
Did anybody find out why Python changed this behavior?

Here is the link to the fix: dataloader.py and dataloaderiter.py
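The linked files are the real fix; purely to illustrate the general idea (not the actual code), the trick is to cap how many batches are in flight in the executor at once, so finished results can be consumed and freed instead of all piling up in memory. A minimal sketch of that pattern:

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor

def bounded_map(fn, items, max_workers=4, max_inflight=8):
    """Like Executor.map, but never keeps more than max_inflight pending
    results alive, so consumed batches can be garbage collected."""
    items = iter(items)
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        inflight = deque()
        for item in items:                        # prime the queue
            inflight.append(ex.submit(fn, item))
            if len(inflight) >= max_inflight:
                break
        while inflight:
            yield inflight.popleft().result()     # block on the oldest batch
            for item in items:                    # top the queue back up by one
                inflight.append(ex.submit(fn, item))
                break
```

Here fn would be whatever builds one collated batch; the point is just that at most max_inflight results ever exist at the same time.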

Cheers, Johannes


I'm seeing a similar issue when trying to run the Quick Dogs vs. Cats notebook. The kernel just dies when I run the following line of code:

learn = ConvLearner.pretrained(resnet50, data)

Hey William,

The same thing happened to me. I am using Paperspace, and when I try to do the last step, TTA with bs = 256, it always dies.
Have you fixed the problem?

Yeah, I know. Making predictions seems surprisingly memory intensive :-/. I haven't had an in-depth look at it; there might well be an easy mitigation like the one in the PR above. So far I have only worked around the issue by predicting in smaller batches, which is a pain.
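For what it's worth, the workaround is nothing clever, roughly this (make_loader_for is a stand-in for however you build a loader over a subset of the test files, and predict_dl is the fastai 0.7-style predict-on-a-dataloader call):

```python
import numpy as np

def predict_in_chunks(learn, test_files, chunk_size=5000):
    """Predict over the test set a few thousand files at a time and stitch
    the results together, instead of one giant predict call."""
    all_preds = []
    for start in range(0, len(test_files), chunk_size):
        chunk = test_files[start:start + chunk_size]
        chunk_dl = make_loader_for(chunk)        # hypothetical helper: loader over just this chunk
        all_preds.append(learn.predict_dl(chunk_dl))
        del chunk_dl                             # let the chunk be freed before building the next one
    return np.concatenate(all_preds)
```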

Yeah. During validation my kernel just dies. There are around 20,000 images of about 1 MB each, which might be causing the problem. Any idea how I can stop the kernel from dying during validation? I'm running a GPU+ instance on Paperspace (8 GB VRAM with 30 GB RAM).

Farhan’s fix worked for me — thanks Farhan!

My dataset is ~190k images from this competition: https://www.kaggle.com/c/imaterialist-challenge-furniture-2018. I resized my images to 500px using data.resize(). These temporary images consume < 20 GB on my disk.
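In case it helps, the resize call I mean is the fastai 0.7-style one; 'tmp' is just the folder name I used for the resized copies:

```python
# write 500px copies of every image into 'tmp' once, up front;
# training then reads the small files instead of the full-size originals
data = get_data(sz)
data = data.resize(500, 'tmp')
```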

Before Farhan’s fix:

With ConvLearner.pretrained(resnet34) and a batch size of 48, GPU memory usage was 30%, but all 32 GB of RAM were eaten up and virtual memory usage grew to 80 GB before the kernel crashed.

After Farhan’s fix:

With ConvLearner.pretrained(resnet34) and a batch size of 48, GPU memory usage stays at 30%, but only 3 GB of RAM is used and virtual memory holds steady at 20 GB (no kernel crashes).

Currently attempting your fix and it seems to run like a charm…

Just checked against the fastai repo: right now, when using 0 workers it does not load everything into memory, but with multithreading it does. If the fix works I could submit a pull request @jeremy
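Until then, that observation also gives a blunt workaround: turn the worker threads off when building the data (much slower, but memory stays flat). Assuming the usual fastai 0.7 constructors, which expose num_workers, something like:

```python
# single-threaded loading: slower, but batches are built one at a time
tfms = tfms_from_model(arch, sz)
data = ImageClassifierData.from_paths(PATH, bs=bs, tfms=tfms, num_workers=0)
```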
cheers Johannes

Okay, so it was zipping along fine during training and seemed to make use of multiple threads without keeping every batch in memory, but then it failed at the end of the learn.fit() call with the following error:

Hmm, can you share a full stack trace or provide a minimal code example that fails? I will try to fix it later.
cheers, Johannes

Never mind, Johannes, it looks like this was unrelated! Apologies. I will update later if I run into any trouble!

@j.laute's fix worked for me as well - cheers, well done. I would support this being PRed into master.


Hi there,
I’ve been facing the same issue too!

So, I tried doing the TTA separately for the 'test-jpg' and 'test-jpg-additional' folders, created separate result CSVs, and finally merged them using
result = pd.concat([test_df, addn_df])

And it worked.
Hope this solves it for you too.
(Just be watchful of the filenames while creating addn_df, and use index=False during .to_csv().)
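Roughly what that looks like in code (make_submission_df stands in for whatever you use to turn probabilities plus filenames into a dataframe; the rest follows the usual fastai 0.7 TTA pattern):

```python
import numpy as np
import pandas as pd

# pass 1: data object built with test_name='test-jpg'
log_preds, _ = learn.TTA(is_test=True)
probs = np.mean(np.exp(log_preds), 0)
test_df = make_submission_df(probs, data.test_ds.fnames)

# pass 2: rebuild the data (and point the learner at it) with test_name='test-jpg-additional'
log_preds, _ = learn.TTA(is_test=True)
probs = np.mean(np.exp(log_preds), 0)
addn_df = make_submission_df(probs, data.test_ds.fnames)

result = pd.concat([test_df, addn_df])
result.to_csv('submission.csv', index=False)   # index=False, as noted above
```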

Thanks for the advice, I will look into it!

I think it's causing problems in other places too, such as the functions in metrics.py.