Research question / idea

Here’s a question, perhaps not for deep learning, but let’s say you trained a traditional neural network on MNIST data, then took the hidden layers, converted them into something like a heatmap or rich image, and trained people to match the hidden-layer image with the output.

My hypothesis is that people would be able to learn to recognize the abstract hidden-layer image and its correlation to the output.

My question, though, is whether they would be able to identify the images that mess up the neural network. Also, would a random noise image that fools a deep learner fool a human looking “through the hidden layers”, or would they see noise?

“heatmap or rich image and trained people to identify the hidden layer with the output.” Are you talking about the activation maps? Well, there are numerous activation maps in any layer, many more than 10, so identifying correlations will be a slow process. And since the digits share so many common elements, it will take even longer. This can’t be done for many more classes than 10, as neurons do not perfectly separate information into distinct features. There is co-adaptation, and there are instances where multiple neurons work together in different combinations, making the correlation almost impossible to figure out manually. Look at the experiments where they cut the network at a particular layer and attach an SVM/classifier to determine the class just from that layer. The accuracy for the early layers is very low, and the later layers reach around 70-80%.
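To make the "cut the network at a layer and attach a classifier" idea concrete, here is a rough sketch. Everything in it is an assumption for illustration: the network is a tiny MLP with random, untrained weights, the data is synthetic two-class noise rather than MNIST, and a nearest-centroid classifier stands in for the SVM. The accuracies it prints say nothing about a real network; the point is only the procedure of probing each layer separately.

```python
import random

def relu_layer(x, W):
    # One fully connected ReLU layer; W is a list of weight rows.
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in W]

def layer_activations(x, layers):
    # Activation vector after each layer, in order.
    outs = []
    for W in layers:
        x = relu_layer(x, W)
        outs.append(x)
    return outs

def centroid_probe_accuracy(train, test):
    # train/test: lists of (feature_vector, label).
    # Fit one mean vector per class, classify by nearest centroid.
    sums, counts = {}, {}
    for f, y in train:
        if y not in sums:
            sums[y], counts[y] = [0.0] * len(f), 0
        sums[y] = [s + v for s, v in zip(sums[y], f)]
        counts[y] += 1
    centroids = {y: [s / counts[y] for s in sums[y]] for y in sums}

    def predict(f):
        return min(centroids,
                   key=lambda y: sum((a - b) ** 2 for a, b in zip(f, centroids[y])))

    return sum(predict(f) == y for f, y in test) / len(test)

# Hypothetical setup: 8 inputs, two hidden layers, random weights.
random.seed(0)
n_in, widths = 8, [6, 4]
layers, prev = [], n_in
for width in widths:
    layers.append([[random.gauss(0, 1) for _ in range(prev)] for _ in range(width)])
    prev = width

def make_samples(n):
    # Synthetic two-class data: class 1 is centred at +1, class 0 at -1.
    samples = []
    for _ in range(n):
        y = random.randrange(2)
        samples.append(([random.gauss(1.0 if y else -1.0, 1.0) for _ in range(n_in)], y))
    return samples

train, test = make_samples(200), make_samples(100)

# Probe each layer on its own: how well can the class be read from it?
accs = []
for k in range(len(layers)):
    tr = [(layer_activations(x, layers)[k], y) for x, y in train]
    te = [(layer_activations(x, layers)[k], y) for x, y in test]
    accs.append(centroid_probe_accuracy(tr, te))
print(accs)
```

With a trained MNIST network you would swap in the real weights and images and compare the per-layer numbers; a human reading heatmaps is essentially trying to act as this probe.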

As for your second note about noise: yes, they will fool the human. The primary thing an adversarial attack does is distort the information in layer 1; once it does this, all subsequent layers report incorrect features, since they are based on inferences from layer 1. So your activation map of the n-th layer will indeed show features of a 9 when it’s a 0. The noise or attack distorts the edges at the pixel level, so layer 1 features such as the smooth curves of a 0 never get detected.
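A toy illustration of how a small pixel-level change can shift a first-layer response, assuming a single linear unit with made-up weights. The step used is the fast gradient sign method (FGSM), which is just one common attack and not necessarily the one meant above; `w`, `x` and `eps` are hypothetical values.

```python
def sign(v):
    # -1, 0 or +1
    return (v > 0) - (v < 0)

def linear_score(w, x):
    # Pre-activation of one first-layer unit: w . x
    return sum(wi * xi for wi, xi in zip(w, x))

def fgsm_linear(w, x, eps):
    # The gradient of w . x with respect to x is w, so stepping each
    # input by eps against sign(w) lowers the unit's response the most
    # for a given maximum per-pixel change.
    return [xi - eps * sign(wi) for wi, xi in zip(w, x)]

w = [0.5, -1.0, 0.25, 2.0]   # hypothetical weights of one unit
x = [1.0, 0.0, 1.0, 0.5]     # hypothetical input "pixels"
eps = 0.1                    # maximum change allowed per pixel

x_adv = fgsm_linear(w, x, eps)
clean = linear_score(w, x)
attacked = linear_score(w, x_adv)
max_pixel_change = max(abs(a - b) for a, b in zip(x, x_adv))

print(clean, attacked, max_pixel_change)
```

No pixel moves by more than `eps`, yet the unit's response drops by `eps` times the sum of the absolute weights. With hundreds of input pixels that sum can get large, which is why a perturbation that looks like faint noise can still change what the first layer reports.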
