queue.Full Error When Running Lesson1

Actually it’s happening in Paperspace as well.
Trying num_workers=4 and/or bs=32 hasn’t helped so far.
I keep getting AssertionErrors.

I did a pull yesterday, but will try again…


AssertionError                            Traceback (most recent call last)
<ipython-input> in <module>()
      1 arch=resnet34
      2 data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz), num_workers=4)
----> 3 learn = ConvLearner.pretrained(arch, data, precompute=True)
      4 learn.fit(0.01, 3)

~/fastai/courses/dl1/fastai/conv_learner.py in pretrained(self, f, data, ps, xtra_fc, xtra_cut, **kwargs)
     90     def pretrained(self, f, data, ps=None, xtra_fc=None, xtra_cut=0, **kwargs):
     91         models = ConvnetBuilder(f, data.c, data.is_multi, data.is_reg, ps=ps, xtra_fc=xtra_fc, xtra_cut=xtra_cut)
---> 92         return self(data, models, **kwargs)
     93
     94     @property

~/fastai/courses/dl1/fastai/conv_learner.py in __init__(self, data, models, precompute, **kwargs)
     83         elif self.metrics is None:
     84             self.metrics = [accuracy_multi] if self.data.is_multi else [accuracy]
---> 85         if precompute: self.save_fc1()
     86         self.freeze()
     87         self.precompute = precompute

~/fastai/courses/dl1/fastai/conv_learner.py in save_fc1(self)
    130         self.fc_data = ImageClassifierData.from_arrays(self.data.path,
    131             (act, self.data.trn_y), (val_act, self.data.val_y), self.data.bs, classes=self.data.classes,
--> 132             test=test_act if self.data.test_dl else None, num_workers=8)
    133
    134     def freeze(self): self.freeze_to(-self.models.n_fc)

~/fastai/courses/dl1/fastai/dataset.py in from_arrays(self, path, trn, val, bs, tfms, classes, num_workers, test)
    289     @classmethod
    290     def from_arrays(self, path, trn, val, bs=64, tfms=(None,None), classes=None, num_workers=4, test=None):
--> 291         datasets = self.get_ds(ArraysIndexDataset, trn, val, tfms, test=test)
    292         return self(path, datasets, bs, num_workers, classes=classes)
    293

~/fastai/courses/dl1/fastai/dataset.py in get_ds(self, fn, trn, val, tfms, test, **kwargs)
    274         res = [
    275             fn(trn[0], trn[1], tfms[0], **kwargs), # train
--> 276             fn(val[0], val[1], tfms[1], **kwargs), # val
    277             fn(trn[0], trn[1], tfms[1], **kwargs), # fix
    278             fn(val[0], val[1], tfms[0], **kwargs)  # aug

~/fastai/courses/dl1/fastai/dataset.py in __init__(self, x, y, transform)
    165     def __init__(self, x, y, transform):
    166         self.x,self.y = x,y
--> 167         assert(len(x)==len(y))
    168         super().__init__(transform)
    169     def get_x(self, i):

AssertionError:

That’s an unrelated issue. Looks like you need to delete the data/dogscats/tmp folder.
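For anyone unsure what deleting that folder involves, here is a minimal sketch (the path below is hypothetical; substitute wherever your dataset actually lives — the point is just to remove the tmp subfolder so fastai regenerates the precomputed activations):

```python
import os
import shutil
import tempfile

# Hypothetical dataset location; replace with your own, e.g. data/dogscats.
path = os.path.join(tempfile.gettempdir(), "data", "dogscats")
tmp_dir = os.path.join(path, "tmp")
os.makedirs(tmp_dir, exist_ok=True)  # stand-in for an existing activation cache

# Delete the cached activations; fastai will rebuild them on the next run.
shutil.rmtree(tmp_dir, ignore_errors=True)
print(os.path.isdir(tmp_dir))  # False once the cache is gone
```

The same thing from a shell would just be `rm -rf data/dogscats/tmp`.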


Thanks, removing the /tmp folder got me back to the original error:

OSError: [Errno 12] Cannot allocate memory

Current parameters in Paperspace, basic tier:

data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz), num_workers=4, bs=16)

You may need to restart jupyter. Also, do a git pull since I just totally changed all this code.

Did run across this problem on an AWS p2.xlarge (Oregon) with resnet50 at the default bs=28, but did not try reducing it to 14, so I don’t know if that will solve the problem…

@neovaldivia Your data got corrupted, so it failed the check that x and y are the same length (line 167).
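To illustrate why that assertion fires (a hypothetical sketch, not fastai’s actual caching code): the tmp folder holds precomputed activations, and if they were written during an earlier or interrupted run, their length no longer matches the labels of the current dataset:

```python
# Stale cached activations, e.g. from an interrupted precompute run...
cached_activations = [[0.0] * 512 for _ in range(900)]   # 900 rows on disk
# ...paired with labels from the current, full dataset.
labels = [0] * 1000                                       # 1000 labels

try:
    # Same check as dataset.py line 167: assert(len(x)==len(y))
    assert len(cached_activations) == len(labels)
    mismatched = False
except AssertionError:
    mismatched = True

print(mismatched)  # True: stale cache and fresh labels disagree
```

Deleting the tmp folder forces the activations to be recomputed from the current data, which restores the invariant.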

removing the tmp folder got that issue fixed. thanks.

Although with this new code push I’m getting an error when importing fastai libraries.
ImportError: dlopen: cannot load any more object with static TLS

Apparently, it’s the order in which the lib’s are imported…but not sure

Once you’ve SSHed in, try this to see if it is better, @neovaldivia:
$ cd fastai
$ git pull
$ conda env update
$ source activate fastai
$ jupyter notebook

I also first assumed it was the batch size, but later realized that since this parameter is passed to tfms_from_model, it must be related to the transformations in some way. Thanks for clarifying. More documentation for the fastai library would definitely help with questions like this. I will check out GitHub this weekend.

I am getting similar errors with the queue.Full error; going to try smaller batch sizes and the num_workers option, since maybe I am limited by memory. Here’s the post
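For context on what queue.Full means (a generic stdlib sketch, not fastai’s loader code): data-loading workers push batches into a bounded queue, and if the producers outrun the consumer the queue overflows:

```python
import queue

# A small bounded buffer, like a data loader's batch queue.
batch_queue = queue.Queue(maxsize=2)
batch_queue.put("batch0")
batch_queue.put("batch1")

try:
    # A third worker tries to enqueue without waiting for free space.
    batch_queue.put("batch2", block=False)
    overflowed = False
except queue.Full:
    overflowed = True

print(overflowed)  # True: producers outpaced the consumer
```

Lowering num_workers means fewer producers, and a smaller bs means smaller items, which is why both reduce pressure on this queue and on memory.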

Going to try these parameters too, to cope with the memory

@jeremy
Guys, happy to report the latest changes are stable on my local machine, and I noticed memory is now being managed better, without having to use the swap space.
A snapshot of the CPU and memory resources halfway through running lesson1:

I didn’t need to use any other hyperparameters except sz for the latest test.
Thanks for fixing this @jeremy !


And today I learned from @naruto79 about nmon :slight_smile:


Did another test with hyperparameter bs=128; it seems to be stable and improves the execution time! And of course it will use more GPU memory.

no luck…but thank you.

I actually switched to AWS. I was using Paperspace but couldn’t invest any more time in that bug. =)

No issues in AWS with num_channels = 4

I am getting the same ImportError on the Paperspace console.

I did git pull and conda env update -f environment.yml but am still getting the error.
See the error below when importing fastai.

The error occurs on this command in the cell:
from fastai.transforms import *
Error:
ImportError                               Traceback (most recent call last)
<ipython-input> in <module>()

~/fastai/courses/dl1/fastai/torch_imports.py in <module>()
      1 import os
----> 2 import torch, torchvision, torchtext
      3 from torch import nn, cuda, backends, FloatTensor, LongTensor, optim
      4 import torch.nn.functional as F
      5 from torch.autograd import Variable

~/anaconda3/lib/python3.6/site-packages/torch/__init__.py in <module>()
     51 sys.setdlopenflags(_dl_flags.RTLD_GLOBAL | _dl_flags.RTLD_NOW)
     52
---> 53 from torch._C import *
     54
     55 __all__ += [name for name in dir(_C)

I also did the below, as per @wgpubs:
git pull
conda env update -f environment.yml
restart your terminal
Still not getting rid of it.


Did you source activate fastai? Does it work OK for you on AWS using the fastai AMI?

Nothing I did, even source activate fastai, made this bug go away, so I abandoned Paperspace instead of wasting even more time on it.

Yes, I did source activate fastai and the error still does not go away. I have not tried AWS using the fastai AMI.

I am also running into a permission error while running the below:
os.makedirs('/cache/tmp', exist_ok=True)
!ln -fs /cache/tmp {PATH}

[Errno 13] Permission denied: '/cache'
Lots of issues; I guess I should also move to AWS or Crestle.
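One possible workaround (my assumption, not an official fix): the PermissionError happens because /cache is owned by root, so point the cache at a directory you own instead and symlink that. The location below is hypothetical:

```python
import os
import tempfile

# Hypothetical user-writable cache location replacing '/cache/tmp'.
cache_dir = os.path.join(tempfile.gettempdir(), "fastai_cache", "tmp")
os.makedirs(cache_dir, exist_ok=True)  # no PermissionError in a dir you own

print(os.path.isdir(cache_dir))  # True
```

Then run the same !ln -fs symlink command against this path instead of /cache/tmp.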

This is working for me in Crestle; looks like I am good now :slight_smile: