Debunking the Neural Network as "Black Box" View

In the first lecture, Jeremy mentions, in his list of misconceptions about Deep Learning, the idea that Neural Networks are Black Boxes. This resonated with me deeply, particularly because I am currently pursuing a business opportunity where the client is leery of using Deep Learning “because it is difficult to know why a Neural Net made a particular decision, or which data features were most important in making that decision.”

In doing a quick bit of research on the topic, I came across this excellent article outlining the “Information Bottleneck Theory” of Neural Network decision making. I also came across the paper "Opening the black box of neural nets: case studies in stop/top discrimination" out of the Physics Dept. at Harvard.

The paper is a long one, but the aim of the approach seems to be to make “contour maps of [a] network’s classification function.” In light of all this, I was just wondering if anyone had any insights as to where Jeremy was headed with this line of thought, as well as knowledge about features of the new library that might help one better understand a model’s decision-making process.


I’ve moved this to the advanced category.

Same here - the black-box view has always bothered me, mainly because I like understanding how things work, and secondarily because it seems to encourage a “magical thinking” mindset. (On the other hand, viewing deep learning as a general-purpose tool makes it easier to think of potential applications for it.)

@r2d2 came up with a PCA approach for inspecting the final features of a model, similar to the convolution visualizations that Jeremy showed in the lecture. I have been trying it out on my data, and it’s very cool!
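For anyone who wants to try the idea, here is a minimal sketch of PCA over final-layer features. It is not @r2d2's code: the `features` array is a random stand-in (an assumption) for activations you would extract from the layer just before the classifier, and `pca_project` is a hypothetical helper name.

```python
import numpy as np

def pca_project(features, n_components=2):
    """Project (n_samples, n_features) activations onto their top principal components."""
    centered = features - features.mean(axis=0)   # PCA works on zero-mean columns
    # SVD of the centered data: the rows of vt are the principal directions
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    explained = singular_values**2 / np.sum(singular_values**2)
    return centered @ components.T, explained[:n_components]

# Stand-in for real activations: 200 samples of a 64-dim feature vector (assumption)
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 64))
features[:100, 0] += 5.0   # pretend the first 100 samples belong to a different class

projected, explained = pca_project(features)
print(projected.shape)     # (200, 2)
print(explained)           # fraction of variance captured by each component
```

Plotting `projected[:, 0]` against `projected[:, 1]`, colored by class label, is usually the quickest way to see whether the final features separate your classes.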

On the information bottleneck theory: I wonder how it applies to the transfer-learning type of training which we’re doing in lesson 1? The Imagenet pretrained model must have reached the “compression phase” of training (stage D in the last figure in the article), since general features of real-world photographs have been compressed and can generalize to many different image datasets. But when we do the training on a specific dataset, does it run through all the stages again (fitting, then the phase change to compression), perhaps just in the final layers and on a smaller scale? What about when we unfreeze the rest of the layers? Lots to think about!


My advice: read this book. It will be well worth your time. It covers both model-specific and model-agnostic methods.


The PCA approach is fascinating, thanks for sharing it! As to the bottleneck theory, I would agree that the pretrained model must have reached the “compression phase”. I would guess it also reached the (final stage) for the total dataset it was initially trained on.

In the approach we have been using (training a final layer on new data), it would seem that we are “starting” with Stage E training. When we unfreeze the rest of the layers and engage in additional training, it would seem that we are “going back in time” to somewhere in the Stage B - E steps. I would hazard a guess, though, that the emphasis is on Stage D in the “unfrozen” steps of transfer learning.

You might find this post and repo interesting. It describes SHAP values and gave me a compelling reason to look at them for model interpretation. I really like that you can get a high-level interpretation over your dataset as well as an interpretation at the individual sample level. If you look at the repo, it looks like they are in the process of adding support for PyTorch, but in the meantime it looks like you could use Kernel SHAP?
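To make the idea concrete, here is a small sketch of the Shapley values that SHAP estimates, computed exactly by brute force for a tiny hypothetical model. This is not the `shap` library's API: Kernel SHAP approximates these numbers rather than enumerating every coalition, and `shapley_values`, `toy_model`, and the all-zeros baseline are assumptions made up for illustration.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley values for one sample by enumerating every feature coalition.

    Features outside a coalition are replaced by their baseline value.
    Cost grows as 2**n_features, so this is only practical for tiny inputs.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

# Hypothetical model with an interaction between features 0 and 1
def toy_model(v):
    return 2.0 * v[0] + v[0] * v[1] - 3.0 * v[2]

x = [1.0, 2.0, 0.5]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
print(sum(phi), toy_model(x) - toy_model(baseline))  # attributions add up to the prediction gap
```

The per-sample attributions always sum to the difference between the prediction and the baseline prediction, which is what makes them usable both for a single decision and, averaged over many samples, for a dataset-level view.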


I think this is an interesting topic, and interpretable ML, especially in the deep-learning space, is a big area of interest. However, to play devil’s advocate, I think your customer is rightfully wary of deep learning / neural network architectures. There are a lot of interesting papers out there that look at some of the non-intuitive deficiencies of CNN-based architectures, which we tend to feel we understand fairly well.

For example, we all tend to believe that CNNs are very good at handling translation in images due to the convolutional nature of the networks. Yet this paper is eye-opening in showing how far modern architectures can still be from invariant to small pixel transformations.

Similarly, the single-pixel attack paper from last year was pretty interesting.

There is no doubt that deep-learning can be very powerful. But, when business decisions / personnel decisions need to be made based on recommendations… you need a very clear way to be able to reason about why the action you are taking makes sense. Image classification is a space where being wrong, or making silly misclassifications as outlined in the papers above, is not going to make or break the end application. A user is not going to be terribly unhappy if only 90% of his/her pictures are correctly tagged with their ID. However, being wrong 10% of the time in a self-driving system is a massive problem.

I think DL has a lot of uses in many application spaces today, but in mission-critical applications, or ones where you can’t afford a glitch in neural network “logic” to provide nonsensical results, you should rightfully be wary of DL and look towards utilizing simpler and more interpretable models. You’d be amazed at just how well things like logistic regression can perform in the real world. For example:

This is pretty much the opposite of what I’ve been seeing for the last few years. Generally DL models will be simpler because you don’t need as much feature engineering, and they will generally be more interpretable because they’re more accurate. Logistic regression is just a neural net with no hidden layers. There are not that many problems where zero is the appropriate number of hidden layers.
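To make the "no hidden layers" point concrete, here is a minimal sketch, assuming toy linearly separable data: a single linear layer followed by a sigmoid, trained with gradient descent on cross-entropy, which is exactly logistic regression. The function names, learning rate, and epoch count are illustrative choices, not anything from the course library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=500):
    """One linear layer + sigmoid, trained with gradient descent on cross-entropy."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)            # forward pass: inputs go straight to the output unit
        grad = p - y                      # gradient of cross-entropy w.r.t. the logit
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

# Toy, linearly separable data (an assumption for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w, b = train_logistic(X, y)
accuracy = ((sigmoid(X @ w + b) > 0.5) == (y == 1)).mean()
print(f"weights={w}, bias={b:.3f}, accuracy={accuracy:.2f}")
```

Adding a hidden layer between `X` and the output unit is the only structural change needed to turn this into a conventional neural net, and the learned weights `w` remain directly readable only in the zero-hidden-layer case.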

You’re not going to get a 90% accurate object detection system in your self-driving car using logistic regression.


I don’t quite understand how a more accurate DL model makes it more interpretable, but perhaps we mean different things when we use that term? The paper I linked above, titled "Why do deep convolutional networks generalize so poorly to small image transformations?", was published in May 2018; it evaluates pretty recent architectures and still shows very disparate model outputs for perceptually identical images. To me, explaining “why” is very difficult beyond stating that the combination of convolutions/multiplications involved in the architecture just results in disparate numeric answers. So that feels like a lack of interpretability in the model.

I’m not trying to knock DL or discourage folks from it (I wouldn’t be here otherwise!), but I think a certain amount of thought needs to go into whether it is the best solution and it is also good to retain some skepticism and be aware of the kinds of things that can go wrong with them. To the original point in my post, most people (including myself) feel that CNNs have excellent insensitivity to translation of objects in an image, and yet the paper above clearly shows that that intuition is a bit flawed and can result in very surprising outcomes. Being aware of these types of potential gotchas is always a good thing imho.
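One mechanism commonly cited for this kind of sensitivity is strided subsampling: convolution on its own commutes with translation, but keeping only every second activation does not. The toy below is a hedged illustration of that mechanism in 1-D, not code or results from the paper; the kernel, the signals, and the helper names are all made up for the example.

```python
def conv1d(signal, kernel):
    """Valid-mode 1-D cross-correlation, as used in CNN layers."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def strided_features(signal, kernel, stride):
    """Convolve, keep every `stride`-th activation, then global max-pool."""
    activations = conv1d(signal, kernel)
    return max(activations[::stride])

kernel = [-1.0, 2.0, -1.0]                 # responds to a narrow peak
signal = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]    # peak at index 3
shifted = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]   # same peak, moved by one position

for stride in (1, 2):
    a = strided_features(signal, kernel, stride)
    b = strided_features(shifted, kernel, stride)
    print(f"stride={stride}: original={a}, shifted={b}")
```

With stride 1 the pooled response is 2.0 for both inputs. With stride 2 it stays at 2.0 for the original but drops to 0.0 for the shifted signal, because the subsampling grid no longer lands on the peak response.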

Fair enough, but I wasn’t trying to say that logistic regression was suited for self-driving cars. I was trying to (perhaps inarticulately) point out that deep learning is not the be-all and end-all of ML, and in certain domains/problems, simpler, more interpretable models combined with domain knowledge + feature engineering can still provide better or at least similar performance while still being far easier to interpret… the Electronic Health Records paper from Google seems to be a decent example of that.

I don’t find these papers that use synthetic data that interesting. In practice CNNs generalize very well. There are a lot of great resources explaining the myriad of methods used for model interpretation - take a look at some of the recent workshops on the topic if you’re interested.

I don’t think such a straw-man argument is that useful to debate. If anyone actually makes that claim, feel free to discuss it with them.

While I agree regarding papers like the single pixel attack, or some of the ones showing images that look like white noise being classified with high confidence, I don’t think the paper I linked falls in the same category. I’ve applied a few different state-of-the-art object detection + classification architectures in a couple of applications on live video over several months and I have first-hand seen the type of issue described in that paper where two frames that look perceptually identical have considerably different classifier confidences. While I’m sure there are ways to design around this, I found it very illuminating to understand why the networks were behaving as I’d observed.

I’d definitely be interested. Could you point me in the right direction regarding which workshops to check out?

Yes these papers are very useful for helping understand these issues, but I don’t think they tell us anything about how important the issues are in practice. Only real experiments on real data in real contexts can do that.

Here’s a few to get you started


Interesting paper from folks at MIT related to this issue:

This also got some coverage in the press:

The results from the paper are really interesting. Certainly highlights the need for a lot more work in this space.