Hi all,
I am getting the following error when trying to run my fastai model inside a Docker container. There are limits on the amount of memory the container can use. See the error below:
-Tools presence detection task started.
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/algorithm/process.py", line 255, in <module>
    Surgtoolloc_det().process()
  File "/home/algorithm/.local/lib/python3.10/site-packages/evalutils/evalutils.py", line 183, in process
    self.process_cases()
  File "/home/algorithm/.local/lib/python3.10/site-packages/evalutils/evalutils.py", line 191, in process_cases
    self._case_results.append(self.process_case(idx=idx, case=case))
  File "/opt/algorithm/process.py", line 157, in process_case
    scored_candidates = self.predict(case['path']) #video file > load evalutils.py
  File "/opt/algorithm/process.py", line 207, in predict
    tta_res.append(learn.get_preds(dl=learn.dls.test_dl(fs)))
  File "/opt/conda/lib/python3.10/site-packages/fastai/learner.py", line 290, in get_preds
    self._do_epoch_validate(dl=dl)
  File "/opt/conda/lib/python3.10/site-packages/fastai/learner.py", line 236, in _do_epoch_validate
    with torch.no_grad(): self._with_events(self.all_batches, 'validate', CancelValidException)
  File "/opt/conda/lib/python3.10/site-packages/fastai/learner.py", line 193, in _with_events
    try: self(f'before_{event_type}'); f()
  File "/opt/conda/lib/python3.10/site-packages/fastai/learner.py", line 199, in all_batches
    for o in enumerate(self.dl): self.one_batch(*o)
  File "/opt/conda/lib/python3.10/site-packages/fastai/learner.py", line 227, in one_batch
    self._with_events(self._do_one_batch, 'batch', CancelBatchException)
  File "/opt/conda/lib/python3.10/site-packages/fastai/learner.py", line 193, in _with_events
    try: self(f'before_{event_type}'); f()
  File "/opt/conda/lib/python3.10/site-packages/fastai/learner.py", line 205, in _do_one_batch
    self.pred = self.model(*self.xb)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/fastai/vision/learner.py", line 177, in forward
    def forward(self,x): return self.model.forward_features(x) if self.needs_pool else self.model(x)
  File "/home/algorithm/.local/lib/python3.10/site-packages/timm/models/convnext.py", line 353, in forward_features
    x = self.stages(x)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/algorithm/.local/lib/python3.10/site-packages/timm/models/convnext.py", line 210, in forward
    x = self.blocks(x)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/algorithm/.local/lib/python3.10/site-packages/timm/models/convnext.py", line 148, in forward
    x = self.mlp(x)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/algorithm/.local/lib/python3.10/site-packages/timm/models/layers/mlp.py", line 29, in forward
    x = self.drop1(x)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1172, in __getattr__
    def __getattr__(self, name: str) -> Union[Tensor, 'Module']:
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 371) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.
I have already set n_workers=0 and bs=1, but I am still getting the error. Those are the only fixes I could think of. Any ideas or tips on how to resolve this issue?
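For reference, this is roughly how I applied those settings inside predict() (a minimal sketch based on the test_dl call in the traceback above; fs is the list of frames already used there, and num_workers is the keyword that fastai's DataLoader takes for the worker count):

# Sketch: build the test DataLoader with batch size 1 and no worker
# processes, so batches are produced in the main process instead of
# being passed between workers through shared memory (/dev/shm).
test_dl = learn.dls.test_dl(fs, bs=1, num_workers=0)
tta_res.append(learn.get_preds(dl=test_dl))

As far as I understand, with num_workers=0 there should be no DataLoader worker processes left to hit the shared-memory limit, which is why the error surprises me.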
Thanks in advance.
Best regards,
Bilal