Problem with ClassificationInterpretation and ImageClassifierCleaner

bdubreu · April 9, 2020, 1:32pm

Hi there !

ClassificationInterpretation and ImageClassifierCleaner make the kernel crash all the time. The problem is I am not getting any error… The kernel just crashes while the learner iterates through the files…

Any idea why that might be the case ?

edit:
note that the model fine_tunes without any problem, which I believe removes the possibility that a file might be corrupt. Also, I tried changing the batch_size to no avail. Finally, show_results() works without problem as well… very curious… maybe my hardware is screwed somehow ?

bdubreu · April 14, 2020, 10:34am

I’ve kept investigating this a little. I’ve taken 1/10th of the data, tried ClassificationInterpretation and ImageClassifierCleaner and they both worked. So I did that for the rest of the data, 1/10th at a time. Everything worked perfectly…
@sgugger sorry for the at-mentioning, but I believe it is possible that something causes OOM errors with those two things (what they have in common is that they make predictions on the entire dataset; since the transforms happen on the GPU before prediction (I believe), maybe this is where things crash ?)

edit: also, I tried to use the datablock.dataloaders() method with the ‘sample’ and ‘n’ parameters to see if I could take only a few items in the folders without having to reorganize them. Neither seemed to work. Maybe I got their intended use wrong, but in that case maybe something to do just that should exist (i.e., you have two folders with your classes, but would like at first to run a quick model on only 10% of the available files for each class).
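A minimal sketch of the "10% of each class folder" idea, assuming the class label is the name of each file's parent folder. The helper name `sample_per_class` and its parameters are made up for illustration and are not part of fastai; it uses only the standard library.

```python
import random
from collections import defaultdict
from pathlib import Path

def sample_per_class(paths, frac=0.1, seed=0):
    """Return roughly `frac` of the paths from each class folder (at least one per class)."""
    by_class = defaultdict(list)
    for p in paths:
        # Assumption: the parent folder name is the class label.
        by_class[Path(p).parent.name].append(Path(p))
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    picked = []
    for name in sorted(by_class):
        files = sorted(by_class[name])
        k = max(1, round(len(files) * frac))
        picked.extend(rng.sample(files, k))
    return picked

# Example with made-up file names: 50 "cat" files and 30 "dog" files.
paths = [f"data/cat/{i}.jpg" for i in range(50)] + [f"data/dog/{i}.jpg" for i in range(30)]
subset = sample_per_class(paths, frac=0.1)
print(len(subset))
```

In a DataBlock, something like `get_items=lambda p: sample_per_class(get_image_files(p), 0.1)` might be a way to plug this in, though that combination is untested here.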

machinethink · April 14, 2020, 11:05am

If the kernel crashes, it means Python has crashed. Is there an error message in the shell window that you launched Jupyter from?

sgugger · April 14, 2020, 11:31am

If you have lots and lots of pictures, it’s possible your kernel crashes because it runs out of RAM. You should try applying it to a smaller dataloader.
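A rough sketch of the "smaller dataloader" approach, along the lines of the 1/10th-at-a-time experiment described above. The helpers `chunked` and `run_in_chunks` and the `interpret` callback are hypothetical names, not fastai functions; the idea is only that each slice is processed and reduced to a small result before the next one starts.

```python
def chunked(items, n_chunks):
    """Yield `n_chunks` contiguous slices that together cover `items` exactly once."""
    items = list(items)
    size, extra = divmod(len(items), n_chunks)
    start = 0
    for i in range(n_chunks):
        # The first `extra` slices get one additional item.
        end = start + size + (1 if i < extra else 0)
        yield items[start:end]
        start = end

def run_in_chunks(items, interpret, n_chunks=10):
    """Call `interpret(chunk)` on each non-empty slice and collect the results.

    `interpret` should return something small (e.g. a confusion matrix),
    so that only one slice's worth of predictions is alive at a time.
    """
    return [interpret(chunk) for chunk in chunked(items, n_chunks) if chunk]

# Stand-in callback: just count the items in each slice.
sizes = run_in_chunks(range(23), len, n_chunks=10)
print(sizes)
```

With fastai, the callback could perhaps build a loader for each slice with `learn.dls.test_dl(chunk, with_labels=True)` and pass it as `dl=` to `from_learner`; that part is an assumption and has not been run here.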

aaerox · May 17, 2020, 4:13pm

This definitely seems to be worse in fastai2 - I suspect due to it storing inputs now too. Altering the function not to store inputs, as I don’t need this, uses orders of magnitude less memory:

@classmethod
def from_learner(cls, learn, ds_idx=1, dl=None, act=None):
    "Construct interpretation object from a learner"
    if dl is None: dl = learn.dls[ds_idx]
    preds, targs, decoded, losses = learn.get_preds(dl=dl, with_input=False, with_loss=True, with_decoded=True, act=None)
    return cls(dl, [], preds, targs, decoded, losses)

mbjoseph · October 5, 2020, 9:56pm

Thanks @aaerox - that worked for me.

To add a little context: I was noticing that while ClassificationInterpretation.from_learner was running, my RAM usage was slowly increasing until it maxed out and the kernel died.
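To confirm this kind of growth from inside the notebook rather than from a system monitor, one option is to read the process's peak resident memory before and after the suspect call. This is a small sketch using the standard-library `resource` module (Unix only); `peak_rss_mb` is a made-up helper name, and the list of byte arrays is just a stand-in for the memory-hungry call.

```python
import resource
import sys

def peak_rss_mb():
    """Peak resident set size of this process, in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor

before = peak_rss_mb()
# Stand-in for the real call, e.g. building the interpretation object.
data = [bytearray(1024 * 1024) for _ in range(50)]
after = peak_rss_mb()
print(f"peak RSS grew by about {after - before:.0f} MiB")
```

Because this reports a peak, it never goes down; it is useful for checking how high a single call pushed memory, not for tracking memory being freed.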

sipu · November 11, 2020, 1:55pm

While using ClassificationInterpretation, my notebook crashed due to high RAM consumption. How can I solve this issue?

rbunn80130 · December 2, 2020, 3:11am

I’m confused. If this doesn’t happen during training why would it happen when doing an interpretation on what is probably only 20% of the training data? Does a bug report need to be submitted or is this in a queue already? The reason I ask is I am using the latest version and it appears to have an issue with RAM usage.
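One possible explanation, if the suspicion above about stored inputs is right: training only ever holds one batch of images at a time, whereas keeping every input means the whole set sits in RAM at once. A back-of-the-envelope sketch, assuming 3-channel 224x224 float32 inputs (these sizes are assumptions, and the function name is made up):

```python
def stored_input_gb(n_images, size=224, channels=3, bytes_per_value=4):
    """Memory needed to hold every input as a float32 tensor, in GB (10**9 bytes)."""
    return n_images * channels * size * size * bytes_per_value / 1e9

# 64 images is roughly one batch; the larger counts stand for a full validation set.
for n in (64, 10_000, 50_000):
    print(f"{n:>6} images -> {stored_input_gb(n):.2f} GB")
```

Under these assumptions a batch costs a few tens of MB while tens of thousands of stored inputs cost tens of GB, and this is only a lower bound: larger images or temporary copies made while concatenating would push it higher.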

aviopene · February 11, 2021, 11:03am

This still happens in fastai 2.2.5. I can only complete ClassificationInterpretation.from_learner() with 20k images (I’m on a cluster node with 128 GB of RAM). With just a few more images, the process gets killed by the OOM reaper. I’ll try @aaerox’s modification, then I’ll open a pull request (if it’s not there yet).

jimmiemunyi · February 11, 2021, 2:42pm

You all should go through this thread, it talks about the same issue:

hushitz · February 11, 2021, 4:28pm

@jimmiemunyi - correct. ImageClassifierCleaner is now fixed in fast.ai repo, but ClassificationInterpretation is outstanding.

@muellerzr is working on a “proper” fix for the Interpretation class adding several nice features, but in the meantime you can patch your own code using the snippets here to fix the problem: Learn.get_preds() memory inefficiency - quick fix

dokuboyejo · August 8, 2021, 4:16am

The issue with Interpretation is still happening in fastai v2.4.
@muellerzr any idea when the fix will be applied to the main repo?

dokuboyejo · August 8, 2021, 5:22pm

Just to add, ClassificationInterpretation.from_learner(learn) in v2.4 and v2.5 currently fails with TypeError: __init__() missing 1 required positional argument: 'losses'

muellerzr · August 8, 2021, 9:34pm

Please provide a reproducer so we know how you are setting things up

dokuboyejo · August 9, 2021, 9:10am

@muellerzr I’m currently using a private Paperspace environment (8 CPU, 30GB RAM) with a dataset of ~53,000 images. The notebook looks like this:

learn_segmented = cnn_learner(data_segmented, torchvision.models.densenet121, metrics=[error_rate, accuracy], model_dir=model_path)
interp_segmented = ClassificationInterpretation.from_learner(learn_segmented)

muellerzr · August 9, 2021, 11:08am

I could not recreate the issue with the pets dataset. Please make sure your data is labelled and that everything looks correct. What I used:

from fastai.vision.all import *

set_seed(99, True)
path = untar_data(URLs.PETS)/'images'
dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2,
    label_func=lambda x: x[0].isupper(), item_tfms=Resize(224))

learn = cnn_learner(dls, resnet34, metrics=error_rate)
interp = ClassificationInterpretation.from_learner(learn)

dokuboyejo · August 9, 2021, 12:15pm

So, from class view, I believe the labeling doesn’t have an issue

Initially, the kernel just crashed during interpretation. However, after excluding the inputs with with_input=False based on the fix above, the error TypeError: __init__() missing 1 required positional argument: 'losses' started to happen.
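A toy illustration of how switching to with_input=False alone can produce exactly this TypeError, assuming the constructor takes the six arguments that the patch earlier in the thread passes (dl, inputs, preds, targs, decoded, losses). The class `Interp` and the function `fake_get_preds` are stand-ins written for this sketch, not fastai code.

```python
class Interp:
    """Toy stand-in for an interpretation class with a six-argument constructor."""
    def __init__(self, dl, inputs, preds, targs, decoded, losses):
        self.dl, self.inputs, self.preds = dl, inputs, preds
        self.targs, self.decoded, self.losses = targs, decoded, losses

def fake_get_preds(with_input):
    """Hypothetical: returns one value fewer when the inputs are not requested."""
    out = ("preds", "targs", "decoded", "losses")
    return (("inputs",) + out) if with_input else out

# Five unpacked values plus `dl` match the constructor.
ok = Interp("dl", *fake_get_preds(with_input=True))

# Four unpacked values shift everything left by one, so `losses` is never filled.
try:
    Interp("dl", *fake_get_preds(with_input=False))
    error = None
except TypeError as e:
    error = str(e)
print(error)

# Passing an explicit placeholder for the inputs restores the alignment.
fixed = Interp("dl", [], *fake_get_preds(with_input=False))
```

This is why the patched versions in this thread pass something in the inputs position (an empty list, or the dataset items) instead of only flipping the flag.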

muellerzr · August 9, 2021, 12:49pm

Unless you’re using that version I posted, you will have issues. I also don’t know if that version works anymore

dokuboyejo · August 9, 2021, 1:03pm

@muellerzr Thanks, which version do you recommend I try?

dokuboyejo · August 14, 2021, 10:43pm

@muellerzr I have now tried v2.5.1 with the fix below, and it worked OK:

@classmethod
def from_learner(cls, learn, ds_idx=1, dl=None, act=None):
    "Construct interpretation object from a learner"
    if dl is None: dl = learn.dls[ds_idx].new(shuffled=False, drop_last=False)
    # return cls(dl, *learn.get_preds(dl=dl, with_input=True, with_loss=True, with_decoded=True, act=None))
    return cls(dl, dl.dataset.items, *learn.get_preds(dl=dl, with_input=False, with_loss=True, with_decoded=True, act=None))

# new Interpretation.plot_top_losses in fastai/interpret.py
def plot_top_losses(self, k, largest=True, **kwargs):
    losses,idx = self.top_losses(k, largest)
    if not isinstance(self.inputs, tuple): self.inputs = (self.inputs,)
    if isinstance(self.inputs[0], Tensor): inps = tuple(o[idx] for o in self.inputs)
    # else: inps = self.dl.create_batch(self.dl.before_batch([tuple(o[i] for o in self.inputs) for i in idx]))
    else: inps = (first(to_cpu(self.dl.after_batch(to_device(first(self.dl.create_batches(idx)))))),)
    b = inps + tuple(o[idx] for o in (self.targs if is_listy(self.targs) else (self.targs,)))
    x,y,its = self.dl._pre_show_batch(b, max_n=k)
    b_out = inps + tuple(o[idx] for o in (self.decoded if is_listy(self.decoded) else (self.decoded,)))
    x1,y1,outs = self.dl._pre_show_batch(b_out, max_n=k)
    if its is not None:
        plot_top_losses(x, y, its, outs.itemgot(slice(len(inps), None)), self.preds[idx], losses, **kwargs)
    #TODO: figure out if this is needed
    #its None means that a batch knows how to show itself as a whole, so we pass x, x1
    #else: show_results(x, x1, its, ctxs=ctxs, max_n=max_n, **kwargs)