Classification Interpretation: Not enough memory (RAM)

I am training a classifier on a subset of the Quick Draw data (340,000 samples). I am using this line to get the most confused classes:

interp.most_confused()

However, this command raises an error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-26-06171ec66e30> in <module>
----> 1 interp.most_confused()

~/code/fastai_v1/repo/fastai/vision/learner.py in most_confused(self, min_val)
    117     def most_confused(self, min_val:int=1)->Collection[Tuple[str,str,int]]:
    118         "Sorted descending list of largest non-diagonal entries of confusion matrix"
--> 119         cm = self.confusion_matrix()
    120         np.fill_diagonal(cm, 0)
    121         res = [(self.data.classes[i],self.data.classes[j],cm[i,j])

~/code/fastai_v1/repo/fastai/vision/learner.py in confusion_matrix(self)
     92         "Confusion matrix as an `np.ndarray`."
     93         x=torch.arange(0,self.data.c)
---> 94         cm = ((self.pred_class==x[:,None]) & (self.y_true==x[:,None,None])).sum(2)
     95         return to_np(cm)
     96 

RuntimeError: $ Torch: not enough memory: you tried to allocate 36GB. Buy new RAM! at /opt/conda/conda-bld/pytorch-nightly_1539863931710/work/aten/src/TH/THGeneral.cpp:204

The reason is that this line creates a matrix with shape (340, 340000) due to broadcasting:

self.pred_class==x[:,None]

And then it creates another tensor with the same number of elements, with shape (340, 1, 340000):

self.y_true==x[:,None,None]
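
As a rough check on the numbers, the sketch below works out the element counts implied by those two shapes. It assumes one byte per element in the comparison results, which is an assumption about the tensor dtype and not something shown in the traceback. Under that assumption, combining the two masks with `&` broadcasts them to shape (340, 340, 340000), which may account for the 36GB figure in the error.

```python
# Sizes taken from the post: 340 classes, 340,000 samples.
n_classes, n_samples = 340, 340_000

# Shapes of the two comparison results.
pred_mask_shape = (n_classes, n_samples)        # pred_class == x[:, None]
true_mask_shape = (n_classes, 1, n_samples)     # y_true == x[:, None, None]

# `&` broadcasts the two masks to (n_classes, n_classes, n_samples).
combined_elems = n_classes * n_classes * n_samples

# Assumption: one byte per element.
combined_gib = combined_elems / 2**30
print(f"{combined_elems:,} elements, about {combined_gib:.1f} GiB")
```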

So my question is: could we compute the confusion matrix iteratively instead of by broadcasting? The current version of the ClassificationInterpretation class does not seem to scale well.
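
One way this could look is sketched below, using NumPy and not the fastai code itself. The function name `confusion_matrix_by_class` is hypothetical, and this is not necessarily how the library ended up solving it. The idea is to fill one row per true class, so the largest temporary is a mask of length N instead of a C x C x N tensor.

```python
import numpy as np

def confusion_matrix_by_class(y_true, y_pred, n_classes):
    """Build the confusion matrix one true class at a time.

    Rows index the true class, columns the predicted class.
    """
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    for i in range(n_classes):
        # Predictions for the samples whose true label is class i.
        preds_for_i = y_pred[y_true == i]
        # Count how often each class was predicted for those samples.
        cm[i] = np.bincount(preds_for_i, minlength=n_classes)
    return cm

# Tiny made-up example: 3 classes, 6 samples.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 0, 2])
cm = confusion_matrix_by_class(y_true, y_pred, n_classes=3)
print(cm)
```

The trade-off is a Python-level loop over classes, which is slower per element than a single broadcast but keeps peak memory proportional to the number of samples.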

And, as a general question, how do you usually compute metrics for huge datasets? Or is it unreasonable to carry out this kind of analysis on big datasets?
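
For the general question, one common pattern is to accumulate counts batch by batch and derive the metrics from the counts at the end. The sketch below is a minimal NumPy illustration under that assumption; `update_confusion` and the `batches` list are hypothetical stand-ins for whatever produces labels and predictions in a real pipeline.

```python
import numpy as np

def update_confusion(cm, y_true_batch, y_pred_batch):
    """Add one batch of (true, predicted) label pairs to `cm` in place."""
    # np.add.at accumulates correctly even when index pairs repeat.
    np.add.at(cm, (y_true_batch, y_pred_batch), 1)

n_classes = 3
cm = np.zeros((n_classes, n_classes), dtype=np.int64)

# Hypothetical stream of batches; in practice these would come from a data loader.
batches = [
    (np.array([0, 0, 1]), np.array([0, 1, 1])),
    (np.array([1, 2, 2]), np.array([1, 0, 2])),
]
for y_true_batch, y_pred_batch in batches:
    update_confusion(cm, y_true_batch, y_pred_batch)

# Metrics derived from the accumulated counts only.
accuracy = np.trace(cm) / cm.sum()
recall = np.diag(cm) / cm.sum(axis=1)
```

Only the C x C count matrix is kept between batches, so memory use does not grow with the dataset size.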

Hi Ilia,

Same error here! It is a bit odd that even matrices this large would require 36GB, isn’t it?

Update: I tried breaking the code up into separate lines.
It turned out that the problem occurs when applying .sum(2) to the tensor.

It is either a PyTorch memory leak or something that I don't yet understand.

I created an issue: https://github.com/pytorch/pytorch/issues/13296

I implemented a fix locally and will open a pull request soon.

Cross-post to the discussion "Iterative computations of confusion matrix".
pull request: https://github.com/fastai/fastai/pull/1022

Hi Vitaliy, that’s great! :tada:

For future reference, an iterative computation of the confusion matrix was introduced in PR #1022.