Kernel dies when using fastai library due to CPU memory issue


(Johannes Laute) #21

Hmm, can you share a full stack trace or provide a minimal code example that fails? I will try to fix it later.
Cheers, Johannes


(Florian Peter) #22

Never mind, Johannes, looks like this was unrelated! Apologies. I’ll update later if I run into any trouble!


(WILLIAM PRIDE) #23

@j.laute’s fix worked for me as well - cheers, well done. I’d support this being PRed into master.


(Dhruv Jawalkar) #24

Hi there,
I’ve been facing the same issue too!

So, I’ve tried to do the TTA separately for the ‘test-jpg’ and ‘test-jpg-additional’ folders, created separate result CSVs, and finally merged them using
‘result = pd.concat([test_df, addn_df])’

And it worked.
Hope this solves it for you too.
(Just be watchful of the filenames while creating addn_df, and pass ‘index=False’ to .to_csv())
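
For anyone following along, here is a minimal sketch of that merge step. The prediction step is omitted and the filenames/tags below are made-up placeholders; only pd.concat and index=False come from the post above.

```python
# Minimal sketch: merge the two result CSVs produced by the separate TTA runs.
# test_df / addn_df stand in for whatever was built from 'test-jpg' and
# 'test-jpg-additional'; the rows here are dummy placeholders.
import pandas as pd

test_df = pd.DataFrame({'image_name': ['test_0', 'test_1'],
                        'tags': ['clear primary', 'cloudy']})
addn_df = pd.DataFrame({'image_name': ['file_0', 'file_1'],
                        'tags': ['haze water', 'clear']})

# Concatenate the two frames and write a single submission file.
# index=False keeps the pandas index out of the CSV, as noted above.
result = pd.concat([test_df, addn_df], ignore_index=True)
result.to_csv('submission.csv', index=False)
```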


(YIYU SHI) #25

Thanks for the advice, I will look into it!


(heisenburgzero) #26

I think it’s raising problems in other places too, such as the functions in metrics.py.


(Johannes Laute) #27

Please share some code to reproduce if possible. Maybe we need to call to_tensor or something in the DataLoaderIter; I will investigate this weekend.

Cheers, Johannes


(WILLIAM PRIDE) #28

@j.laute I’m running into the same issue reported by @heisenburgzero. I’ve uploaded my notebook here and a Gist of the stack trace I get here. Let me know if you have any thoughts.


(WILLIAM PRIDE) #29

Update: resolved this (at least locally) by re-creating the calls to get_tensor that look to have been lost in @j.laute’s original fix. Demonstrated here.
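
Roughly, the idea is that the worker threads hand back plain numpy batches and the numpy-to-torch conversion happens in the main thread as each batch is yielded. The helper below is a simplified sketch of that conversion, not the exact code from the notebook; fastai’s real get_tensor also handles pinned memory and half precision.

```python
# Simplified sketch of a get_tensor-style helper: recursively convert the
# numpy batches returned by the loader threads into torch tensors in the
# main thread, instead of allocating tensors inside the thread pool.
import collections.abc
import numpy as np
import torch

def get_tensor(batch, pin_memory=False):
    if isinstance(batch, np.ndarray):
        t = torch.from_numpy(batch).contiguous()
        return t.pin_memory() if pin_memory else t
    if isinstance(batch, str):
        return batch
    if isinstance(batch, collections.abc.Mapping):
        return {k: get_tensor(v, pin_memory) for k, v in batch.items()}
    if isinstance(batch, collections.abc.Sequence):
        return [get_tensor(v, pin_memory) for v in batch]
    raise TypeError(f"unsupported batch element type: {type(batch)}")
```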


(Caleb Everett) #30

Any suggestions on how to make a permanent fix? Is there anything that can be done to the ThreadPoolExecutor construction in dataloader.py?

https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor-example
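
For what it’s worth, the pattern from that docs page looks roughly like the snippet below: construct the executor as a context manager with a bounded max_workers and feed it work in chunks, so its threads are joined when the block exits and the queue of pending results stays small. This is a generic illustration, not the actual dataloader.py code; chunk_iter and load_batch are placeholder names.

```python
# Generic illustration (not fastai's actual code): a bounded thread pool whose
# pending-work queue is kept small by submitting the input in chunks.
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def chunk_iter(iterable, chunk_size):
    """Yield successive lists of at most chunk_size items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk

def load_batch(idxs):
    # placeholder for the real work (e.g. reading and collating images)
    return [i * i for i in idxs]

num_workers = 4
work = ([i, i + 1] for i in range(0, 100, 2))

with ThreadPoolExecutor(max_workers=num_workers) as ex:
    for chunk in chunk_iter(work, num_workers * 10):
        for batch in ex.map(load_batch, chunk):
            pass  # consume each batch as it is produced
# leaving the 'with' block shuts the pool down and joins its threads
```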


(Johannes Laute) #31

Amazing! Will change that as well. I haven’t had time yet to submit a PR to master; hopefully tomorrow.

Cheers, Johannes


(Jeremy Howard) #32

This should now be fixed in git. Let me know if you see any further problems.


#33

I am still facing the same issue after the new update. Can anyone confirm whether it works for them now?


#34

I have the error as well. The code in dataloader.py appears to have been updated, but the fix is apparently not working for us.


#35

Git pull fixed the issues for me. Thank you!


#36

It has not been completely solved, although it is better. (Thanks for improving the code!)

Fitting the data now uses less RAM (although still a lot). After an epoch the memory gets freed, but when swap was needed, only part of that memory is freed. So when too many epochs are run, the kernel might still die.

For the Amazon dataset and the code I had, the fix is good enough, but I think that in some cases it won’t solve the issue completely.


(Nick Johnson) #37

I’m getting this same error when running ImageClassifierData.from_csv against the Yelp dataset (which is marked as “extra large”). I’ve verified using top that it is memory-related.

This seems odd, since I wouldn’t expect this code to actually load the images into memory.

Note: I tried a git pull and a conda env update.


(Nick Johnson) #38

Ignore that. I had a bug in my data processing code. :frowning:


(Sufiyan Adhikari) #39

Similar problem here. How did you fix it, @harveynick?
I am loading data from a CSV for the Landmark Recognition Challenge and still can’t get it to work.
The kernel keeps dying. I am using Google Cloud with 26 GB RAM and a single K80 GPU.


(Ankit) #40

@wnurmi, I actually made the changes mentioned in your PR, but the kernel is still dying. I am working on the same Google image recognition challenge, and the kernel dies while using ImageClassifierData.from_csv. Did any solution work for you? If so, could you let me know what changes should be made and how?