Does anyone know the current research consensus on the idea of adversarial examples being caused by excessive linearity?
This idea was presented in 2014 here:
In a nutshell, the idea is that modern neural nets (e.g. ones built from ReLU units) are piecewise linear before the final sigmoid/softmax layer. As a result, the network effectively partitions input space into subregions, and within each subregion the model's pre-softmax response is a linear function of the input.
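For anyone who wants to see the piecewise-linear claim concretely, here's a minimal sketch. It assumes a made-up two-layer ReLU network with random weights (the sizes and the names `W1`, `W2`, `local_affine_map` are my own, not from either paper). It reads off the affine map implied by the ReLU on/off pattern at a point and checks that the logits match that map at a nearby point with the same pattern.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny ReLU network: 4 inputs -> 8 hidden units -> 3 logits.
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)
W2, b2 = rng.normal(size=(3, 8)), rng.normal(size=3)

def hidden_pre(x):
    """Hidden-layer pre-activations (before the ReLU)."""
    return W1 @ x + b1

def logits(x):
    """Pre-softmax outputs of the network."""
    return W2 @ np.maximum(hidden_pre(x), 0.0) + b2

def local_affine_map(x):
    """Return (A, c) with logits(z) == A @ z + c for every z that
    has the same ReLU on/off pattern as x."""
    mask = (hidden_pre(x) > 0).astype(float)
    A = W2 @ (mask[:, None] * W1)
    c = W2 @ (mask * b1) + b2
    return A, c

x = rng.normal(size=4)
A, c = local_affine_map(x)

# Halve the step until the nearby point shares x's activation pattern,
# i.e. until it lies in the same linear subregion.
direction = rng.normal(size=4)
step = 1.0
while np.any((hidden_pre(x + step * direction) > 0) != (hidden_pre(x) > 0)):
    step *= 0.5
x_near = x + step * direction

same_at_x = np.allclose(logits(x), A @ x + c)
same_nearby = np.allclose(logits(x_near), A @ x_near + c)
print(same_at_x, same_nearby)
```

Once the step is large enough to flip a ReLU, a different `(A, c)` applies, which is the "piecewise" part.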
This suggests that adversarial examples are a result of models extrapolating linearly from small changes in pixel values to unreasonably extreme outputs. The authors show that the logit values of a network respond linearly to varying levels of adversarial noise.
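As I understand the linear extrapolation argument, it is easiest to see on a single linear unit. A rough sketch, assuming random Gaussian weights and inputs (purely illustrative, not data from the paper; `sign_perturbation_shift` is a name I made up): perturbing every input coordinate by `eps` in the direction of `sign(w)` shifts the output `w @ x` by exactly `eps * sum(|w|)`, so the shift is linear in `eps` and grows with the input dimension.

```python
import numpy as np

def sign_perturbation_shift(n, eps, rng):
    """Shift in w @ x when each of the n inputs moves by eps along sign(w)."""
    w = rng.normal(size=n)   # illustrative random weights
    x = rng.normal(size=n)   # illustrative random input
    x_adv = x + eps * np.sign(w)
    observed = w @ x_adv - w @ x
    predicted = eps * np.abs(w).sum()
    return observed, predicted

rng = np.random.default_rng(0)
results = []
for n in (10, 1000, 100000):
    observed, predicted = sign_perturbation_shift(n, eps=0.01, rng=rng)
    results.append((n, observed, predicted))
    print(n, observed, predicted)
```

The per-coordinate change stays at 0.01 throughout, but the output shift scales roughly with `n`, which is why high-dimensional inputs like images are the worrying case under this hypothesis.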
This idea is further elaborated on here:
See the church window plots.
So, tl;dr: the idea is that excessive linearity in neural networks makes them vulnerable to adversarial noise that exploits this linearity.
However, there’s research from DeepMind:
DeepMind finds that training on adversarial examples (to reduce model vulnerability to those examples) causes the network to behave more linearly. They frame adversarial training as effectively regularizing the curvature of the model’s decision boundaries.
DeepMind notes that their results contradict the first paper I linked, but they don’t give an alternative explanation for the evidence suggesting that the problem is linearity.
Does anyone know of research that has followed up on this? Here we have two contradictory understandings of the same phenomenon. I’m surprised I can’t find anything directly testing the two hypotheses, but maybe I’m looking in the wrong place.