Learn.get_preds() running out of RAM at completion

I believe I am running out of memory (62.6 GB) during get_preds. I'm not sure if it is just a spike in memory usage at the end, but does anyone have an idea how to fix this?

I got the error [enforce fail at CPUAllocator.cpp:56]. When I looked it up on Google, it appears to be a problem with CPU memory, and sure enough I could see RAM filling up right at the end. I am using the predictions with nmslib to build a KNN index for similar images, so I would like to keep all of the data. I'm not sure why it punted to the CPU, but my GPUs have less than 51 GB, so I don't see that as a way to solve it.

I think my hacky fix would be to reduce the number of files I am predicting against.
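For a rough sense of scale, here is a back-of-the-envelope sketch with made-up numbers (not measurements from my run): the predictions are stored as float32, and the torch.cat(...).cpu() at the end of get_preds (see the traceback below) has to allocate one big contiguous tensor while the per-batch results are still held in memory, so the peak can be roughly twice the size of the final output.

# Illustrative numbers only -- plug in your own dataset size and output dimension.
n_images   = 1_000_000    # items being predicted
output_dim = 5_000        # model output size per item (e.g. number of classes)
bytes_each = 4            # float32

preds_gb = n_images * output_dim * bytes_each / 1e9
print(f"final prediction tensor: ~{preds_gb:.0f} GB")        # ~20 GB
print(f"peak during torch.cat:   ~{2 * preds_gb:.0f} GB")    # per-batch list + concatenated copy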

Screenshot of running out of memory at the end:

Full error below:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-24-9aa6c3114742> in <module>
----> 1 predictions = learn.get_preds()

~/anaconda3/lib/python3.7/site-packages/fastai/basic_train.py in get_preds(self, ds_type, with_loss, n_batch, pbar)
    335         lf = self.loss_func if with_loss else None
    336         return get_preds(self.model, self.dl(ds_type), cb_handler=CallbackHandler(self.callbacks),
--> 337                          activ=_loss_func2activ(self.loss_func), loss_func=lf, n_batch=n_batch, pbar=pbar)
    338 
    339     def pred_batch(self, ds_type:DatasetType=DatasetType.Valid, batch:Tuple=None, reconstruct:bool=False) -> List[Tensor]:

~/anaconda3/lib/python3.7/site-packages/fastai/basic_train.py in get_preds(model, dl, pbar, cb_handler, activ, loss_func, n_batch)
     42     "Tuple of predictions and targets, and optional losses (if `loss_func`) using `dl`, max batches `n_batch`."
     43     res = [torch.cat(o).cpu() for o in
---> 44            zip(*validate(model, dl, cb_handler=cb_handler, pbar=pbar, average=False, n_batch=n_batch))]
     45     if loss_func is not None:
     46         with NoneReduceOnCPU(loss_func) as lf: res.append(lf(res[0], res[1]))

~/anaconda3/lib/python3.7/site-packages/fastai/basic_train.py in <listcomp>(.0)
     41               activ:nn.Module=None, loss_func:OptLossFunc=None, n_batch:Optional[int]=None) -> List[Tensor]:
     42     "Tuple of predictions and targets, and optional losses (if `loss_func`) using `dl`, max batches `n_batch`."
---> 43     res = [torch.cat(o).cpu() for o in
     44            zip(*validate(model, dl, cb_handler=cb_handler, pbar=pbar, average=False, n_batch=n_batch))]
     45     if loss_func is not None:

RuntimeError: [enforce fail at CPUAllocator.cpp:56] posix_memalign(&data, gAlignment, nbytes) == 0. 12 vs 0

Thanks for sharing this.
It seems I have the same issue.
I can see the CPU RAM usage increasing during the prediction (getting relatively close to its limit), but it crashes shortly after the predictions are done (progress bar at 100%).
I'm running it in a Kaggle kernel, so it just crashes without showing me the error details, though.

I am also running into this error when using a Kaggle kernel. Has anybody found a solution to this issue?

The RAM exhaustion might have a more fundamental cause than the internals of get_preds().
It might be that simply storing the full results of get_preds() takes a lot of memory (a large dataset in terms of bytes).
I eventually resorted to applying my results analysis per batch, without ever storing everything together, something along these lines:

from tqdm import tqdm

val_batch_iter = iter(data.valid_dl)
for n in tqdm(range(len(data.valid_dl))):      # number of batches, including any partial last batch
    batch = next(val_batch_iter)
    preds_tup = learn.pred_batch(batch=batch)  # predictions for this batch only

Inside the loop I do whatever I need with preds_tup and only keep the (much smaller) final results that accumulate along the way.
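Since you mentioned nmslib, the same idea applied to building the KNN index incrementally might look roughly like this. It's an untested sketch: I'm assuming fastai v1, an nmslib HNSW index with cosine similarity, and that pred_batch returns one activation vector per image.

import nmslib
import numpy as np
from tqdm import tqdm

index = nmslib.init(method='hnsw', space='cosinesimil')

offset = 0
for batch in tqdm(data.valid_dl):              # one mini-batch at a time
    preds = learn.pred_batch(batch=batch)      # per-image activations for this batch (assumed)
    vecs = preds.detach().cpu().numpy().astype(np.float32)
    index.addDataPointBatch(vecs, np.arange(offset, offset + len(vecs)))
    offset += len(vecs)                        # only this batch is ever held in RAM

index.createIndex({'post': 2}, print_progress=True)
# neighbour_ids, dists = index.knnQuery(some_query_vector, k=10)

This way only one batch of predictions is in RAM at a time; the index itself is the only thing that grows.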

Hope this can help in your case too.


I tried this, but I get ‘DynamicUnet’ object has no attribute ‘pred_batch’.
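That error suggests pred_batch is being called on the model (learn.model) rather than on the Learner; in the traceback further up you can see pred_batch defined on the Learner in fastai's basic_train.py. If that is what is happening, the call would need to go through the Learner object, something like:

preds_tup = learn.pred_batch(batch=batch)     # Learner.pred_batch
# not: learn.model.pred_batch(batch=batch)    # DynamicUnet (the model) has no pred_batch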