Tips to reduce memory requirements for a fastai model during inference inside Docker

Hi All,

I am getting the following error when trying to run a fastai model inside a Docker container. There are limits on the amount of memory I am allowed to use. See the error below:

-Tools presence detection task started.
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/algorithm/process.py", line 255, in <module>
    Surgtoolloc_det().process()
  File "/home/algorithm/.local/lib/python3.10/site-packages/evalutils/evalutils.py", line 183, in process
    self.process_cases()
  File "/home/algorithm/.local/lib/python3.10/site-packages/evalutils/evalutils.py", line 191, in process_cases
    self._case_results.append(self.process_case(idx=idx, case=case))
  File "/opt/algorithm/process.py", line 157, in process_case
    scored_candidates = self.predict(case['path']) #video file > load evalutils.py
  File "/opt/algorithm/process.py", line 207, in predict
    tta_res.append(learn.get_preds(dl=learn.dls.test_dl(fs)))
  File "/opt/conda/lib/python3.10/site-packages/fastai/learner.py", line 290, in get_preds
    self._do_epoch_validate(dl=dl)
  File "/opt/conda/lib/python3.10/site-packages/fastai/learner.py", line 236, in _do_epoch_validate
    with torch.no_grad(): self._with_events(self.all_batches, 'validate', CancelValidException)
  File "/opt/conda/lib/python3.10/site-packages/fastai/learner.py", line 193, in _with_events
    try: self(f'before_{event_type}');  f()
  File "/opt/conda/lib/python3.10/site-packages/fastai/learner.py", line 199, in all_batches
    for o in enumerate(self.dl): self.one_batch(*o)
  File "/opt/conda/lib/python3.10/site-packages/fastai/learner.py", line 227, in one_batch
    self._with_events(self._do_one_batch, 'batch', CancelBatchException)
  File "/opt/conda/lib/python3.10/site-packages/fastai/learner.py", line 193, in _with_events
    try: self(f'before_{event_type}');  f()
  File "/opt/conda/lib/python3.10/site-packages/fastai/learner.py", line 205, in _do_one_batch
    self.pred = self.model(*self.xb)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/fastai/vision/learner.py", line 177, in forward
    def forward(self,x): return self.model.forward_features(x) if self.needs_pool else self.model(x)
  File "/home/algorithm/.local/lib/python3.10/site-packages/timm/models/convnext.py", line 353, in forward_features
    x = self.stages(x)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/algorithm/.local/lib/python3.10/site-packages/timm/models/convnext.py", line 210, in forward
    x = self.blocks(x)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/algorithm/.local/lib/python3.10/site-packages/timm/models/convnext.py", line 148, in forward
    x = self.mlp(x)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/algorithm/.local/lib/python3.10/site-packages/timm/models/layers/mlp.py", line 29, in forward
    x = self.drop1(x)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1172, in __getattr__
    def __getattr__(self, name: str) -> Union[Tensor, 'Module']:
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 371) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.

I have set n_workers=0 and bs=1, but I am still getting the error. Those are the only options I could think of. Any ideas or tips to resolve this issue?
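
For reference, the place I would expect these settings to take effect is where the test DataLoader is built in predict() (line 207 of the traceback). A minimal sketch of what I mean; fs is the frame list from the traceback, and bs/num_workers are forwarded to the underlying DataLoader by test_dl:

# Minimal sketch: pass the settings where the test DataLoader is created, so that
# no worker processes (and hence no /dev/shm) are used and batches stay small.
dl = learn.dls.test_dl(fs, bs=1, num_workers=0)  # fs: list of video frames
preds, _ = learn.get_preds(dl=dl)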

Thanks in advance

Best Regards,
Bilal

How are you launching the Docker image? Generally, you should be able to raise the shared memory size of the container when launching it.

E.g. you can pass:

--shm-size=24G

I am using the following command:

# Do not change any of the parameters to docker run, these are fixed
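# Note: --shm-size="256m" below caps /dev/shm inside the container; that is the limit the DataLoader workers are hitting.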
docker run --rm \
        --memory="${MEM_LIMIT}" \
        --memory-swap="${MEM_LIMIT}" \
        --network="none" \
        --cap-drop="ALL" \
        --security-opt="no-new-privileges" \
        --shm-size="256m" \
        --pids-limit="256" \
        -v $SCRIPTPATH/test/:/input/ \
        -v surgtoolloc_trial-output-$VOLUME_SUFFIX:/output/ \
        surgtoolloc_trial

You are right that increasing the shm-size resolves this error. But is there any other way to reduce the memory needs of the model, in case I am not allowed to make these changes?

I greatly appreciate your tip. Thanks.
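
If the docker run parameters really are fixed, one thing you could try is keeping the DataLoader in the main process and feeding the frames to get_preds in small chunks, so only a few decoded frames and their activations are held at once. A minimal sketch; the helper name and chunk_size are illustrative and not from your code, and fs is the frame list from the traceback:

import torch

def predict_in_chunks(learn, fs, chunk_size=8):
    # Illustrative helper: run get_preds over small slices of the frame list.
    # num_workers=0 keeps loading in the main process, so /dev/shm is not touched.
    all_preds = []
    for i in range(0, len(fs), chunk_size):
        dl = learn.dls.test_dl(fs[i:i + chunk_size], bs=1, num_workers=0)
        preds, _ = learn.get_preds(dl=dl)
        all_preds.append(preds)
    return torch.cat(all_preds)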