queue.Full Error When Running Lesson1

@jeremy I assume num_workers is the sz variable in the code? It seems that if I change it from 224 to 64, I don't hit this problem. I will also do an update later. Thanks.


@jeremy I just did a package update for both conda and pip. It doesn't seem to help much, but lowering sz does help in my case. The package update does seem to bump up the speed, though!

It's not a batch size - that's an image size. The batch size is bs, which is passed to ImageClassifierData.
Reducing the image size will do the trick as well, but probably not in a way you want it to.


Ah, got it - I need to pass num_workers to
ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz), bs=bs, num_workers=num_workers)
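For reference, here is roughly what the full lesson-1 setup looks like with these knobs set explicitly - just a sketch using the kinds of values discussed in this thread, assuming the usual lesson-1 notebook imports and the dogscats PATH:

# Sketch of the lesson-1 setup with the memory-related knobs set explicitly.
# sz is the image size used by the transforms, bs the batch size, and
# num_workers the number of data-loading worker processes.
from fastai.imports import *
from fastai.transforms import *
from fastai.conv_learner import *
from fastai.model import *
from fastai.dataset import *

PATH = 'data/dogscats/'
arch = resnet34
sz = 224          # image side length passed to the transforms
bs = 16           # lower this if you run out of memory
num_workers = 2   # fewer workers means fewer forked loader processes

data = ImageClassifierData.from_paths(
    PATH, tfms=tfms_from_model(arch, sz), bs=bs, num_workers=num_workers)
learn = ConvLearner.pretrained(arch, data, precompute=True)
learn.fit(0.01, 3)

This also makes it explicit that sz only controls the transforms, while bs and num_workers go to the data object.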

Thanks !

@jeremy it seems like I need to use num_workers=2. Maybe my CPU is on the lower end, I guess - it's a Core i3. Just curious, what does num_workers do?


It would be helpful if you could show all your hyperparameters: sz, bs, etc.

Also check this post by @abdulhannanali - this way you can hide long error messages.


My current hyperparameters:
sz=224
bs=128
num_workers=2

Thanks @sermakarevich for the details, will do that next time.

OSError Traceback (most recent call last)
in ()
----> 1 log_preds,y = learn.TTA()
2 accuracy(log_preds,y)

/opt/project/fastai/courses/dl1/fastai/learner.py in TTA(self, n_aug, is_test)
111 preds1,targs = predict_with_targs(self.model, dl1)
112 preds1 = [preds1]*math.ceil(n_aug/4)
--> 113 preds2 = [predict_with_targs(self.model, dl2)[0] for i in range(n_aug)]
114 return np.stack(preds1+preds2).mean(0), targs
115

/opt/project/fastai/courses/dl1/fastai/learner.py in <listcomp>(.0)
111 preds1,targs = predict_with_targs(self.model, dl1)
112 preds1 = [preds1]*math.ceil(n_aug/4)
--> 113 preds2 = [predict_with_targs(self.model, dl2)[0] for i in range(n_aug)]
114 return np.stack(preds1+preds2).mean(0), targs
115

/opt/project/fastai/courses/dl1/fastai/model.py in predict_with_targs(m, dl)
121 if hasattr(m, 'reset'): m.reset()
122 preda,targa = zip(*[(get_prediction(m(*VV(x))),y)
--> 123 for *x,y in iter(dl)])
124 return to_np(torch.cat(preda)), to_np(torch.cat(targa))
125

/opt/project/fastai/courses/dl1/fastai/model.py in <listcomp>(.0)
120 m.eval()
121 if hasattr(m, 'reset'): m.reset()
--> 122 preda,targa = zip(*[(get_prediction(m(*VV(x))),y)
123 for *x,y in iter(dl)])
124 return to_np(torch.cat(preda)), to_np(torch.cat(targa))

/opt/project/fastai/courses/dl1/fastai/dataset.py in __next__(self)
226 if self.i>=len(self.dl): raise StopIteration
227 self.i+=1
--> 228 return next(self.it)
229
230 @property

/opt/project/fastai/courses/dl1/fastai/dataloader.py in __iter__(self)
75 def __iter__(self):
76 with ProcessPoolExecutor(max_workers=self.num_workers) as e:
--> 77 for batch in e.map(self.get_batch, iter(self.batch_sampler)):
78 yield get_tensor(batch, self.pin_memory)
79

~/.conda/envs/tf-gpu/lib/python3.6/concurrent/futures/process.py in map(self, fn, timeout, chunksize, *iterables)
482 results = super().map(partial(_process_chunk, fn),
483 _get_chunks(*iterables, chunksize=chunksize),
--> 484 timeout=timeout)
485 return itertools.chain.from_iterable(results)
486

~/.conda/envs/tf-gpu/lib/python3.6/concurrent/futures/_base.py in map(self, fn, timeout, chunksize, *iterables)
546 end_time = timeout + time.time()
547
--> 548 fs = [self.submit(fn, *args) for args in zip(*iterables)]
549
550 # Yield must be hidden in closure so that the futures are submitted
551

~/.conda/envs/tf-gpu/lib/python3.6/concurrent/futures/_base.py in <listcomp>(.0)
546 end_time = timeout + time.time()
547
--> 548 fs = [self.submit(fn, *args) for args in zip(*iterables)]
549
550 # Yield must be hidden in closure so that the futures are submitted

~/.conda/envs/tf-gpu/lib/python3.6/concurrent/futures/process.py in submit(self, fn, *args, **kwargs)
452 self._result_queue.put(None)
453
--> 454 self._start_queue_management_thread()
455 return f
456 submit.__doc__ = _base.Executor.submit.__doc__

~/.conda/envs/tf-gpu/lib/python3.6/concurrent/futures/process.py in _start_queue_management_thread(self)
413 if self._queue_management_thread is None:
414 # Start the processes so that their sentinels are known.
--> 415 self._adjust_process_count()
416 self._queue_management_thread = threading.Thread(
417 target=_queue_management_worker,

~/.conda/envs/tf-gpu/lib/python3.6/concurrent/futures/process.py in _adjust_process_count(self)
432 args=(self._call_queue,
433 self._result_queue))
--> 434 p.start()
435 self._processes[p.pid] = p
436

~/.conda/envs/tf-gpu/lib/python3.6/multiprocessing/process.py in start(self)
103 'daemonic processes are not allowed to have children'
104 _cleanup()
--> 105 self._popen = self._Popen(self)
106 self._sentinel = self._popen.sentinel
107 _children.add(self)

~/.conda/envs/tf-gpu/lib/python3.6/multiprocessing/context.py in _Popen(process_obj)
221 @staticmethod
222 def _Popen(process_obj):
--> 223 return _default_context.get_context().Process._Popen(process_obj)
224
225 class DefaultContext(BaseContext):

~/.conda/envs/tf-gpu/lib/python3.6/multiprocessing/context.py in _Popen(process_obj)
275 def _Popen(process_obj):
276 from .popen_fork import Popen
--> 277 return Popen(process_obj)
278
279 class SpawnProcess(process.BaseProcess):

~/.conda/envs/tf-gpu/lib/python3.6/multiprocessing/popen_fork.py in __init__(self, process_obj)
18 sys.stderr.flush()
19 self.returncode = None
--> 20 self._launch(process_obj)
21
22 def duplicate_for_child(self, fd):

~/.conda/envs/tf-gpu/lib/python3.6/multiprocessing/popen_fork.py in _launch(self, process_obj)
65 code = 1
66 parent_r, child_w = os.pipe()
--> 67 self.pid = os.fork()
68 if self.pid == 0:
69 try:

OSError: [Errno 12] Cannot allocate memory
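
For anyone wondering what num_workers actually does here: the dataloader.py frame in that traceback is the relevant one. The fastai DataLoader hands batch loading to a ProcessPoolExecutor with num_workers worker processes, and each worker is forked from the (already large) notebook process via os.fork(), which is exactly where the Errno 12 is raised. A stripped-down sketch of the same pattern (not the actual fastai code):

# Minimal illustration of the pattern in fastai's dataloader.py:
# num_workers processes are forked to build batches in parallel.
from concurrent.futures import ProcessPoolExecutor

def get_batch(idxs):
    # stand-in for loading and transforming a batch of images
    return [i * 2 for i in idxs]

if __name__ == '__main__':
    num_workers = 2   # fewer workers means fewer forked processes
    batch_sampler = [[0, 1], [2, 3], [4, 5]]
    with ProcessPoolExecutor(max_workers=num_workers) as e:
        for batch in e.map(get_batch, batch_sampler):
            print(batch)

So lowering num_workers reduces the number of forks that have to succeed, and adding RAM or swap gives each fork more room - which matches what helped above.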


bs might be too big depending on your model config (whether you use precompute or not). Can you try 16?


@jeremy @sermakarevich
Here's an update on my local environment, using the following hyperparameters:
sz=224
bs=16
num_workers=4

I managed to run lesson1, but I needed to increase my swap to 64GB on top of 16GB of physical RAM (80GB of memory in total) to run this. I will experiment with slowly increasing bs and see whether I hit the same issue later. Thanks again, guys!
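
In case it helps the experiments: you can check how much RAM and swap headroom the machine actually has before bumping bs back up. A small sketch, assuming the psutil package is installed (it is not part of the course environment by default):

# Print available RAM and swap before increasing bs/num_workers again.
import psutil

vm = psutil.virtual_memory()
sm = psutil.swap_memory()
print(f"RAM : {vm.available / 2**30:.1f} GiB free of {vm.total / 2**30:.1f} GiB")
print(f"Swap: {sm.free / 2**30:.1f} GiB free of {sm.total / 2**30:.1f} GiB")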


Actually it’s happening in Paperspace as well.
Trying with num_workers=4 and/or bs = 32 hasn’t helped so far.
I keep getting Assertion errors

I did a pull yesterday, but will try again…


AssertionError Traceback (most recent call last)
in ()
1 arch=resnet34
2 data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz),num_workers=4)
----> 3 learn = ConvLearner.pretrained(arch, data, precompute=True)
4 learn.fit(0.01, 3)

~/fastai/courses/dl1/fastai/conv_learner.py in pretrained(self, f, data, ps, xtra_fc, xtra_cut, **kwargs)
90 def pretrained(self, f, data, ps=None, xtra_fc=None, xtra_cut=0, **kwargs):
91 models = ConvnetBuilder(f, data.c, data.is_multi, data.is_reg, ps=ps, xtra_fc=xtra_fc, xtra_cut=xtra_cut)
--> 92 return self(data, models, **kwargs)
93
94 @property

~/fastai/courses/dl1/fastai/conv_learner.py in __init__(self, data, models, precompute, **kwargs)
83 elif self.metrics is None:
84 self.metrics = [accuracy_multi] if self.data.is_multi else [accuracy]
--> 85 if precompute: self.save_fc1()
86 self.freeze()
87 self.precompute = precompute

~/fastai/courses/dl1/fastai/conv_learner.py in save_fc1(self)
130 self.fc_data = ImageClassifierData.from_arrays(self.data.path,
131 (act, self.data.trn_y), (val_act, self.data.val_y), self.data.bs, classes=self.data.classes,
--> 132 test = test_act if self.data.test_dl else None, num_workers=8)
133
134 def freeze(self): self.freeze_to(-self.models.n_fc)

~/fastai/courses/dl1/fastai/dataset.py in from_arrays(self, path, trn, val, bs, tfms, classes, num_workers, test)
289 @classmethod
290 def from_arrays(self, path, trn, val, bs=64, tfms=(None,None), classes=None, num_workers=4, test=None):
--> 291 datasets = self.get_ds(ArraysIndexDataset, trn, val, tfms, test=test)
292 return self(path, datasets, bs, num_workers, classes=classes)
293

~/fastai/courses/dl1/fastai/dataset.py in get_ds(self, fn, trn, val, tfms, test, **kwargs)
274 res = [
275 fn(trn[0], trn[1], tfms[0], **kwargs), # train
--> 276 fn(val[0], val[1], tfms[1], **kwargs), # val
277 fn(trn[0], trn[1], tfms[1], **kwargs), # fix
278 fn(val[0], val[1], tfms[0], **kwargs) # aug

~/fastai/courses/dl1/fastai/dataset.py in __init__(self, x, y, transform)
165 def __init__(self, x, y, transform):
166 self.x,self.y=x,y
--> 167 assert(len(x)==len(y))
168 super().__init__(transform)
169 def get_x(self, i):

AssertionError:

That’s an unrelated issue. Looks like you need to delete the data/dogscats/tmp folder.
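If you would rather do that from inside the notebook than the shell, a one-off like this should work (path assumed from the lesson-1 setup; adjust it if your PATH is different):

# Delete the cached precomputed activations so they get rebuilt from scratch.
import shutil
shutil.rmtree('data/dogscats/tmp', ignore_errors=True)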


Thanks, removing the tmp folder got me back to the original error:

OSError: [Errno 12] Cannot allocate memory

current parameters in Paperspace, basic tier:

data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz),num_workers=4,bs=16)

You may need to restart jupyter. Also, do a git pull since I just totally changed all this code.

I ran across this problem on AWS p2.xlarge (Oregon) with resnet50 and the default bs=28, but I did not try reducing it to 14 and don't know whether that would solve the problem…

@neovaldivia Your data got corrupted, so it failed to pass the test: x and y are not the same length (line 167).

Removing the tmp folder got that issue fixed, thanks.

Although with this new code push I'm getting an error when importing the fastai libraries:
ImportError: dlopen: cannot load any more object with static TLS

Apparently it's the order in which the libs are imported… but I'm not sure.
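
For what it's worth, the workaround people usually suggest for that static TLS error is indeed about import order: load the heavy compiled libraries before everything else in the notebook. A sketch, assuming cv2/torch are the libraries involved on your machine:

# Reported workaround (unverified here): import the compiled libraries first
# so their shared objects are loaded before anything else claims static TLS.
import cv2    # noqa: F401
import torch  # noqa: F401

from fastai.imports import *
from fastai.conv_learner import *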

Once you've SSHed in, try this to see if it works better, @neovaldivia:
$ cd fastai
$ git pull
$ conda env update
$ source activate fastai
$ jupyter notebook

I also assumed at first that it was the batch size, but later worked out that since this parameter is passed to tfms_from_model, it must be related to the transformations in some way. Thanks for clarifying. More documentation for the fastai library would definitely help with such concerns. I will check out the GitHub repo this weekend.

I am getting similar errors with the queue.Full error. I'm going to try smaller batch sizes and the num_workers option, because maybe I am limited by memory. Here's the post.

Going to try these parameters too, to cope with the memory