Have you done conda env update? What’s your machine/OS/etc? What do you see if you display learner.activations before fitting?
Also, does git status show you have any local changes? If so, try reverting them.
I have a GTX 1070 running on Ubuntu 16.04, CUDA 8.0.
No, I didn’t run conda env update.
I discarded any local changes before doing the git pull.
Below is my output from learner.activations:
[carray((23000, 1024), float32)
nbytes: 89.84 MB; cbytes: 93.01 MB; ratio: 0.97
cparams := cparams(clevel=5, shuffle=1, cname='blosclz')
rootdir := 'data/dogscats/tmp/x_act_resnet34_0_224.bc'
mode := 'a'
[[ 2.44967175 2.99363661 4.33913231 ..., 0.13268621 2.84894609
0.40972656]
[ 4.21488714 5.84484339 11.52380562 ..., 0.25312638 1.17774951
0.04299501]
[ 0.79310578 3.97656846 9.34537125 ..., 0.56532079 1.67562604
0.56701189]
...,
[ 2.92073274 2.12234163 5.58415794 ..., 0.93900502 0.07328516
0.24477248]
[ 1.85395586 11.13914871 1.54042947 ..., 2.02693963 0.58828402
0.43189242]
[ 0.94521332 1.10611272 4.41981649 ..., 0.04808487 1.97100222
0.54948312]], carray((2000, 1024), float32)
nbytes: 7.81 MB; cbytes: 8.09 MB; ratio: 0.97
cparams := cparams(clevel=5, shuffle=1, cname='blosclz')
rootdir := 'data/dogscats/tmp/x_act_val_resnet34_0_224.bc'
mode := 'a'
[[ 1.29020107 2.32586789 6.74245548 ..., 0.70285255 1.60787117
0.0493584 ]
[ 4.4050312 4.15314722 10.62083817 ..., 0.51982337 1.14450645
0.43660945]
[ 2.00788641 1.18459272 5.34251404 ..., 0.11724961 1.58721673
0.09959023]
...,
[ 1.81696892 2.15313721 8.69116592 ..., 0.94192082 0.56452471
0.07494064]
[ 1.2544719 3.25818467 4.1094799 ..., 0.60879725 0.04835358
0.73797786]
[ 5.00699234 2.93570065 4.85688496 ..., 0.19603917 1.10524523
0.09501588]], carray((0, 1024), float32)
nbytes: 0; cbytes: 4.00 KB; ratio: 0.00
cparams := cparams(clevel=5, shuffle=1, cname='blosclz')
rootdir := 'data/dogscats/tmp/x_act_test_resnet34_0_224.bc'
mode := 'a'
[]]
I had the same problem with that line, learn.fit(1e-2, 3, cycle_len=1), running on my home DL rig (Ubuntu + GTX 1080 Ti): it would not even start.
It turned out that I was running an older version of the fast.ai GitHub repo. As Jeremy pointed out in the thread, once I updated to the latest version, it worked fine.
I have a little workaround: as long as I just point to PyTorch’s DataLoader class (instead of the custom one), everything works as expected on my machine, without any memory issues, with the most recent version of the repo. I wouldn’t want to hold anyone up debugging an issue that only appears to be happening on my machine. So all is good on my end!
Thanks - I suspect someone else will have the same problem, but we’ll see…
You’re not alone
I’ll try and look deeper into it tomorrow, following your tips.
Does decreasing num_workers when creating data also decrease your memory use?
num_workers=0 generates the error "max_workers must be greater than 0" in Cell #35:
ims = np.stack([get_augs() for i in range(6)])
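For context, "max_workers must be greater than 0" is the message Python's concurrent.futures executors raise when they are constructed with max_workers=0. That suggests, though nothing in this thread confirms it, that the custom loader passes num_workers straight through to an executor. A minimal sketch of the behaviour and of one possible guard (map_with_workers is a hypothetical helper, not part of the fastai library):

```python
from concurrent.futures import ThreadPoolExecutor


def map_with_workers(fn, items, num_workers):
    """Hypothetical helper: fall back to a plain loop when num_workers is 0."""
    if num_workers <= 0:
        # No pool at all: run everything in the calling thread.
        return [fn(x) for x in items]
    with ThreadPoolExecutor(max_workers=num_workers) as ex:
        # ex.map preserves the order of the inputs.
        return list(ex.map(fn, items))


# Constructing an executor with zero workers raises ValueError.
try:
    ThreadPoolExecutor(max_workers=0)
except ValueError as err:
    message = str(err)
    print(message)

print(map_with_workers(lambda x: x * 2, [1, 2, 3], num_workers=0))
print(map_with_workers(lambda x: x * 2, [1, 2, 3], num_workers=2))
```

With a guard like this, num_workers=0 would mean "load in the main process" rather than an error, which is how PyTorch's own DataLoader treats that value.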
Trying num_workers=1: apart from being a lot slower (my CPU is an i5-4690K with 4 cores, so capable of 4 workers?), it still slowly but surely filled up the 16 GB of RAM before 50% of the first epoch, and used another 19 GB of swap before finishing, in Cell #41 learn.fit(1e-2, 3, cycle_len=1).
Once finished, the 16 GB of RAM are still fully used, and the swap has 7 GB used.
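To compare settings such as different num_workers values with numbers rather than by watching a system monitor, one option is to log the notebook process's peak resident memory before and after a cell. This is only a sketch using the standard-library resource module (Unix only); peak_rss_mb is a hypothetical helper name, and it measures the calling process only, not any worker processes it spawns.

```python
import resource
import sys


def peak_rss_mb():
    """Hypothetical helper: peak resident set size of this process, in MB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


before = peak_rss_mb()
buf = bytes([1]) * (50 * 1024 * 1024)  # stand-in for real work: roughly 50 MB
after = peak_rss_mb()
print(f"peak RSS before: {before:.1f} MB, after: {after:.1f} MB")
```

Because this is a peak value it never goes down, so it shows how high memory climbed during a fit, not whether it was released afterwards.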
Can you try downloading the latest version of Anaconda and installing that from scratch, and doing conda env update there, to see if there’s some module issue going on?
azerty will try
I’m not alone anymore! If you’re looking for a temporary workaround so you can run the notebooks, the following minor changes worked for me:
Comment out this line in dataset.py:
#from .dataloader import DataLoader
Change this line in torch_imports.py (add DataLoader to the torch.utils.data imports):
from torch.utils.data import DataLoader, Dataset, TensorDataset
Thanks @jamesrequa
Your 2 lines did the trick: both the Lesson 1 and Lesson 2 notebooks run fine without loading over 10 GB of RAM, and with no swap.
There’s a single error message, "Widget Javascript not detected. It may not be installed or enabled properly.", that keeps popping up at every epoch cell, but I can’t see an actual difference in display compared with the saved notebooks pre-mods.
Can relate. I tried reading some of them, and they’re full of mysteries and cryptic info.
Hey Everyone,
Was everybody else able to resolve their issues? I break the Jupyter notebook every time at some step because of a MemoryError or a BrokenProcessPool error, and the only thing left at that point is to start from scratch, train the models again, and repeat. It’s kind of frustrating. I think the major problem behind these errors is the limited memory of the g2.2xlarge instance, whose Tesla K520 GRID GPU has 4096 MiB. I can’t figure out the issue, but I hope someone else can help here; it gets overloaded very quickly with GPU training.
Right now, it seems g2.2xlarge is one of the instances that Amazon is going to retire very soon, since it is nowhere to be found in their listings for Accelerated Computing; maybe not many people are using it. But for now I want to make it work.
I tried changing num_workers from the default of 8 to 4 and to 0, which doesn’t work because of an AssertionError, and changing the batch size from 64 to 32 to 2. These sometimes work but make the training extremely slow, which is not what I want right now. Nothing seems to solve the process pool errors that I encounter at some point or other. Another thing that sometimes works is skipping straight to the step I want to perform and leaving the other steps out, but since there’s a sequence @jeremy specifies at the end of the notebook, that’s not good enough for a state-of-the-art model.
I got this while running learn.fit(lr, 3, cycle_len=1, cycle_mult=2). I’m clueless about this error, but it mentions both the process pool and a queue error, so I think it’s happening because memory is full.
Note: I only reached this step because I rebooted my computer, skipped the intermediate steps, and ran just the data augmentation and this part successfully; batch sizes and other options were normal for ImageClassifierData.
9%|▉ | 34/360 [01:05<10:30, 1.93s/it, loss=0.0872]
Process Process-75:
MemoryError
Traceback (most recent call last):
File "/home/hannan/.conda/envs/fastai/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/home/hannan/.conda/envs/fastai/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/hannan/.conda/envs/fastai/lib/python3.6/concurrent/futures/process.py", line 181, in _process_worker
result=r))
File "/home/hannan/.conda/envs/fastai/lib/python3.6/multiprocessing/queues.py", line 341, in put
obj = _ForkingPickler.dumps(obj)
File "/home/hannan/.conda/envs/fastai/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
10%|▉ | 35/360 [01:08<10:33, 1.95s/it, loss=0.0862]
Exception in thread Thread-24:
Traceback (most recent call last):
File "/home/hannan/.conda/envs/fastai/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/hannan/.conda/envs/fastai/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/home/hannan/.conda/envs/fastai/lib/python3.6/concurrent/futures/process.py", line 295, in _queue_management_worker
shutdown_worker()
File "/home/hannan/.conda/envs/fastai/lib/python3.6/concurrent/futures/process.py", line 253, in shutdown_worker
call_queue.put_nowait(None)
File "/home/hannan/.conda/envs/fastai/lib/python3.6/multiprocessing/queues.py", line 129, in put_nowait
return self.put(obj, False)
File "/home/hannan/.conda/envs/fastai/lib/python3.6/multiprocessing/queues.py", line 83, in put
raise Full
queue.Full
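A possible reading of the traceback above: the worker process dies with MemoryError while pickling a result, and the queue.Full in the second traceback is a follow-on failure, raised when put_nowait is called on a bounded queue that nothing is draining any more. That interpretation is an assumption, not something confirmed in this thread. The queue.Full behaviour itself can be reproduced with the standard library alone:

```python
import queue

# A bounded queue with room for one item, standing in for the executor's call queue.
q = queue.Queue(maxsize=1)
q.put_nowait("pending work item")  # fills the only slot

try:
    # No consumer has taken the first item, so a non-blocking put cannot succeed.
    q.put_nowait(None)
    raised = False
except queue.Full:
    raised = True

print("queue.Full raised:", raised)
```

If that reading is right, the MemoryError is the failure to chase and the queue.Full is just noise from the shutdown path.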
Try doing a git pull. I’ve just changed how this runs.
I was facing the same issue and, after a lot of tries, somehow reached this article.
@jamesrequa: Your trick did solve the problem. Thanks!
@jeremy: I am running the latest version of the code from git, but the problem is still there. I have 8 GB of RAM along with the GPU, but the Python notebook takes up the whole CPU and hangs the system.
Which article?
@ecdrid I meant this forum post in which we are talking
Sorry for the confusion.
I’ve experienced the same problem, after a fresh Anaconda install and with everything up to date, on Ubuntu 17.10. The trick of changing the DataLoader stopped the RAM issue, but after a while I get an error about incompatible variable types.
I’ll write here to get notified in case of a solution!
@Mirko I’m in the exact same boat as you. Looks like it happened during the prediction on the validation set phase.
Edit: I see now it is during the accuracy calculation after one cycle of training. Probably something to do with the ‘metrics now require tensors’ commit