CNN Interpretation with CAM, wrong shapes using ResNet-50

Redevil · June 23, 2021, 5:28pm

Hi everybody,
I was following the CNN Interpretation with CAM tutorial, but trying to get activation maps with ResNet-50 rather than ResNet-34 as in the tutorial. The images have shape 32x32.
When I go through this step:
cam_map = torch.einsum('ck,kij->cij', learn.model[1][-1].weight, act)
I get this error:
RuntimeError: einsum(): operands do not broadcast with remapped shapes [original->remapped]: [10, 512]->[10, 1, 1, 512] [2048, 1, 1]->[1, 1, 1, 2048]
How can I calculate the correct shapes in this case?

zhihenx · June 25, 2021, 7:39am

Facing the same problem

NathanHub · June 25, 2021, 12:41pm

This is because, with ResNet-50, the convolutional part outputs 2048 channels, so act has shape [2048, 1, 1]. The einsum contracts over the index k, so it needs act to have as many channels as the final layer has input features, i.e. 512 (the weight has shape [10, 512]). There are two ways I can think of to solve the problem.

1. Move the hook to another convolutional layer, one that has 512 output channels. This will probably not be as accurate as using the final layer, since you now rely on lower-level features to plot the CAM.
2. Don’t use einsum at all; instead, plot the average of the activations captured by the hook. The main advantage of the einsum method is that you get one cam_map per class, so you can plot them separately (i.e. which part of the image makes the model think it belongs to class X). Averaging instead gives a single image showing which part of the image made the model predict what it predicted.
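As a rough sketch of the first option, here is how a forward hook on an intermediate layer can be lined up with the final layer's weight. The two-layer body, the channel counts (8 and 32) and the random head_weight are made-up stand-ins, not ResNet-50 or the tutorial's learner; only register_forward_hook and torch.einsum are real PyTorch calls.

```python
import torch
import torch.nn as nn

# Toy stand-in for a CNN body (hypothetical sizes, NOT ResNet-50).
body = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1),   # intermediate layer, 8 channels
    nn.ReLU(),
    nn.Conv2d(8, 32, kernel_size=3, stride=2, padding=1),  # last layer, 32 channels
)

stored = {}

def save_output(module, inputs, output):
    # Keep a detached copy of the hooked layer's output.
    stored["act"] = output.detach()

# Hook the intermediate layer instead of the last one.
handle = body[0].register_forward_hook(save_output)
with torch.no_grad():
    body(torch.randn(1, 3, 32, 32))  # one random 32x32 "image"
handle.remove()

act = stored["act"][0]            # drop the batch dimension -> (8, 16, 16)
head_weight = torch.randn(10, 8)  # stand-in for learn.model[1][-1].weight: (classes, features)

# The contraction over k works because head_weight.shape[1] == act.shape[0].
cam_map = torch.einsum("ck,kij->cij", head_weight, act)
print(act.shape, cam_map.shape)
```

In the real model the same check applies: the hooked layer's channel count has to equal the second dimension of the final layer's weight (512 in the error above).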

I would recommend the second option. To do it, just replace cam_map[1] in the plotting function with act.mean(0).
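A minimal sketch of what act.mean(0) produces, using random tensors in place of real hooked activations. The 4x4 grid is an assumed size for illustration; the [2048, 1, 1] shape is the one from the error message.

```python
import torch

# Hypothetical stand-in for the hooked activations of one image:
# 2048 channels on a 4x4 grid (random values, not real model outputs).
act = torch.randn(2048, 4, 4)

# Average over the channel dimension: a single heat map, not one per class.
avg_map = act.mean(0)
print(avg_map.shape)  # torch.Size([4, 4])

# With the activation shape from the error message, the map has one cell only.
act_32px = torch.randn(2048, 1, 1)
single_cell = act_32px.mean(0)
print(single_cell.shape)  # torch.Size([1, 1])
```

In the plotting call, avg_map would take the place of cam_map[1]. Note that with the 1x1 activations shown in the error, the averaged map is a single value, so larger input images would probably be needed to get a map with any spatial detail.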

Hope it helps!