Tips to reduce memory requirements for a fastai model during inference inside Docker

Hi All,

I am getting the following error when trying to run a fastai model inside a Docker container. There are limits on the amount of memory I am allowed to use. See the error below:

-Tools presence detection task started.
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/algorithm/process.py", line 255, in <module>
    Surgtoolloc_det().process()
  File "/home/algorithm/.local/lib/python3.10/site-packages/evalutils/evalutils.py", line 183, in process
    self.process_cases()
  File "/home/algorithm/.local/lib/python3.10/site-packages/evalutils/evalutils.py", line 191, in process_cases
    self._case_results.append(self.process_case(idx=idx, case=case))
  File "/opt/algorithm/process.py", line 157, in process_case
    scored_candidates = self.predict(case['path']) #video file > load evalutils.py
  File "/opt/algorithm/process.py", line 207, in predict
    tta_res.append(learn.get_preds(dl=learn.dls.test_dl(fs)))
  File "/opt/conda/lib/python3.10/site-packages/fastai/learner.py", line 290, in get_preds
    self._do_epoch_validate(dl=dl)
  File "/opt/conda/lib/python3.10/site-packages/fastai/learner.py", line 236, in _do_epoch_validate
    with torch.no_grad(): self._with_events(self.all_batches, 'validate', CancelValidException)
  File "/opt/conda/lib/python3.10/site-packages/fastai/learner.py", line 193, in _with_events
    try: self(f'before_{event_type}');  f()
  File "/opt/conda/lib/python3.10/site-packages/fastai/learner.py", line 199, in all_batches
    for o in enumerate(self.dl): self.one_batch(*o)
  File "/opt/conda/lib/python3.10/site-packages/fastai/learner.py", line 227, in one_batch
    self._with_events(self._do_one_batch, 'batch', CancelBatchException)
  File "/opt/conda/lib/python3.10/site-packages/fastai/learner.py", line 193, in _with_events
    try: self(f'before_{event_type}');  f()
  File "/opt/conda/lib/python3.10/site-packages/fastai/learner.py", line 205, in _do_one_batch
    self.pred = self.model(*self.xb)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/fastai/vision/learner.py", line 177, in forward
    def forward(self,x): return self.model.forward_features(x) if self.needs_pool else self.model(x)
  File "/home/algorithm/.local/lib/python3.10/site-packages/timm/models/convnext.py", line 353, in forward_features
    x = self.stages(x)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/algorithm/.local/lib/python3.10/site-packages/timm/models/convnext.py", line 210, in forward
    x = self.blocks(x)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/algorithm/.local/lib/python3.10/site-packages/timm/models/convnext.py", line 148, in forward
    x = self.mlp(x)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/algorithm/.local/lib/python3.10/site-packages/timm/models/layers/mlp.py", line 29, in forward
    x = self.drop1(x)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1172, in __getattr__
    def __getattr__(self, name: str) -> Union[Tensor, 'Module']:
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 371) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.

I have set n_workers=0 and bs=1, but I am still getting the error. Those are the only options I could think of. Any ideas or tips to resolve this issue?
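
For reference, the place I would expect these settings to take effect is where the test DataLoader is built in predict() (line 207 of the traceback). A minimal sketch of what I mean; fs is the frame list from the traceback, and bs/num_workers are forwarded to the underlying DataLoader by test_dl:

# Minimal sketch: pass the settings where the test DataLoader is created, so that
# no worker processes (and hence no /dev/shm) are used and batches stay small.
dl = learn.dls.test_dl(fs, bs=1, num_workers=0)  # fs: list of video frames
preds, _ = learn.get_preds(dl=dl)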

Thanks in advance

Best Regards,
Bilal

How are you launching the Docker image? Generally, you should be able to raise the shared memory size of the container when launching it.

E.g. you can pass:

--shm-size=24G

I am using the following command:

# Do not change any of the parameters to docker run, these are fixed
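# Note: --shm-size="256m" below caps /dev/shm inside the container; that is the limit the DataLoader workers are hitting.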
docker run --rm \
        --memory="${MEM_LIMIT}" \
        --memory-swap="${MEM_LIMIT}" \
        --network="none" \
        --cap-drop="ALL" \
        --security-opt="no-new-privileges" \
        --shm-size="256m" \
        --pids-limit="256" \
        -v $SCRIPTPATH/test/:/input/ \
        -v surgtoolloc_trial-output-$VOLUME_SUFFIX:/output/ \
        surgtoolloc_trial

You are right that increasing the shm-size resolves this error. But is there any other way to reduce the memory needs of the model, in case I am not allowed to make these changes?

I greatly appreciate your tip. Thanks.
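
If the docker run parameters really are fixed, one thing you could try is keeping the DataLoader in the main process and feeding the frames to get_preds in small chunks, so only a few decoded frames and their activations are held at once. A minimal sketch; the helper name and chunk_size are illustrative and not from your code, and fs is the frame list from the traceback:

import torch

def predict_in_chunks(learn, fs, chunk_size=8):
    # Illustrative helper: run get_preds over small slices of the frame list.
    # num_workers=0 keeps loading in the main process, so /dev/shm is not touched.
    all_preds = []
    for i in range(0, len(fs), chunk_size):
        dl = learn.dls.test_dl(fs[i:i + chunk_size], bs=1, num_workers=0)
        preds, _ = learn.get_preds(dl=dl)
        all_preds.append(preds)
    return torch.cat(all_preds)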