Lesson 11 Discussion

I am reading the paper discussed in class
Training Deep Neural Networks on Noisy Labels with Bootstrapping

I am having trouble understanding equation 6. What is L_soft? Is it the loss? The paper says regression targets. If it is a regression target, then shouldn’t it change for each class (one hot encoded)?

Can someone explain how it is computed?

Sure. Glad you’re checking out that paper! The notation is introduced all over the place in it, so to find out what things are, you’ll need to search around a bit!

So here’s the equation:

q are the predicted probabilities. t are the actual labels. And inside the sum() these are being indexed as t[k] and q[k]. If you replace the bit I highlighted with just t[k], then you have the standard cross-entropy loss, which we’ve used for nearly all of our classification models (and we have an Excel spreadsheet showing it).

So we’re simply creating a new function which replaces the label, t[k], with a mix of a bit of the true label t[k] and a bit of the prediction q[k] (using a parameter beta which the paper says they set to 0.8).

So basically this is a lot like pseudo-labeling, except that it’s happening for the labeled data, rather than unlabeled.
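
To make the mix concrete, here is a minimal Python sketch of the soft bootstrapping loss for a single example, with t the one-hot label and q the predicted probabilities. It computes sum_k [beta*t[k] + (1 - beta)*q[k]] * log(q[k]). The function names, the eps guard, the toy vectors, and the leading minus sign (so that lower is better) are assumptions made for illustration, not taken from the paper's code. Note that the blended target is a vector with one entry per class, so it does change for each class.

```python
import math

def soft_bootstrap_loss(t, q, beta=0.8, eps=1e-12):
    """Soft bootstrapping loss for one example.

    t: one-hot (possibly noisy) label, one entry per class
    q: predicted class probabilities, one entry per class
    beta: weight on the given label; (1 - beta) goes to the prediction
    """
    total = 0.0
    for t_k, q_k in zip(t, q):
        target_k = beta * t_k + (1.0 - beta) * q_k  # blended target for class k
        total += target_k * math.log(q_k + eps)
    return -total  # sign flipped so the value can be minimized (assumption)

def cross_entropy(t, q, eps=1e-12):
    """Standard cross-entropy, i.e. the same sum with the target fixed to t[k]."""
    return -sum(t_k * math.log(q_k + eps) for t_k, q_k in zip(t, q))

# Toy example (hypothetical values): 3 classes, label says class 1.
t = [0.0, 1.0, 0.0]
q = [0.2, 0.7, 0.1]
print(soft_bootstrap_loss(t, q, beta=1.0), cross_entropy(t, q))  # identical when beta = 1
print(soft_bootstrap_loss(t, q, beta=0.8))
```

With beta = 1 the blended target is just the label, so the function reduces to plain cross-entropy. Because everything is computed from the label and the prediction, nothing has to be relabeled ahead of time in this sketch.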


A question for the babi-memnn notebook:

In the paper “End-To-End Memory Networks”, the query embedding is used before both Softmax layers:

The same relationship is shown in the diagram:

But in the notebook, the query embedding emb_q is merged only with emb_story, but not with emb_c:

Does anybody know why we are skipping the query embedding before the 2nd Softmax here?

BTW, another difference I noticed is that, in the paper “+” (sum) is used before the 2nd Softmax, while in our notebook ‘dot’ is used. It seems that different architectures lead to similar results.
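
For reference, here is a rough pure-Python sketch of a single hop as I read it from the paper: p = Softmax(u · m_i) over the story memories, o = sum_i p_i * c_i, and the answer is Softmax(W(o + u)). All names here (memory_hop, answer_probs, the add_query flag) and the toy numbers are hypothetical, and this is not the notebook's Keras code; add_query=False is only meant to mimic skipping the query embedding before the 2nd Softmax.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def memory_hop(u, m, c):
    """One hop: u is the query embedding, m the input memories, c the output memories."""
    p = softmax([dot(u, m_i) for m_i in m])  # 1st softmax: attention over sentences
    o = [sum(p_i * c_i[d] for p_i, c_i in zip(p, c)) for d in range(len(u))]
    return p, o

def answer_probs(u, o, W, add_query=True):
    """2nd softmax over answers; the paper's form adds u to o before applying W."""
    h = [o_d + u_d for o_d, u_d in zip(o, u)] if add_query else list(o)
    return softmax([dot(w_row, h) for w_row in W])

# Toy inputs (hypothetical): 2 sentences, embedding size 2, 3 candidate answers.
u = [1.0, 0.0]
m = [[1.0, 0.0], [0.0, 1.0]]
c = [[0.5, 0.5], [1.0, -1.0]]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

p, o = memory_hop(u, m, c)
print(p, o)
print(answer_probs(u, o, W, add_query=True))
print(answer_probs(u, o, W, add_query=False))
```

Flipping add_query on the same inputs makes it easy to see how much the query term changes the answer distribution in this toy setting; it says nothing about how the trained notebook model behaves.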


Hmmm. I wonder if I made a mistake… Did you try changing this? Did it make it better or worse?

I think that might be doing the same thing, since it’s just a vector. I haven’t checked carefully though… Are you sure it’s different here?

So, if I have to implement bootstrapping in Keras, do I have to explicitly relabel examples in each minibatch? Or can I implement a custom loss function to handle it?

I added the emb_q before Softmax like this:

The result from the original model:

The result from the above modified model:

It looks to me that it does not make a big difference. I trained them a few times, sometimes one is a bit better than another, but overall the result is quite similar.

I will try it on two hops later to see whether it makes a difference.

The result for 2 hops:

Original model:

Modified model:

Quite comparable to me as well.

Classes not present in ImageNet: Any insight into how we are able to find images for w2v classes which are not defined in ImageNet, e.g. net and rod (Lesson 11 video, 7:45)?

OK, so now try changing to the ‘two supporting facts’ dataset, and use multiple hops (you can just uncomment the relevant line at the top of the notebook). That would be interesting, since I had a lot of trouble getting that to fit.

Well, in the two-facts multi-hop case the modified model performed a lot worse. The original model is actually not too bad:

Here is the modified result:

I migrated my old homework code off the AWS instance and onto my own deep learning server, which I built this week (yay!), but I’m running into new issues getting the code to run that didn’t happen before. Would appreciate some ideas / help debugging.

With DCGAN.ipynb, getting this error:

With wgan-pytorch.ipynb getting this error:

Ideas on how to fix?

Your python is too old - needs 3.6

Hi @thunderingtyphoons,
can a soft_loss look something like this in Keras?

Note: the complete collection of Part 2 video timelines is available in a single thread for keyword search.
Part 2: complete collection of video timelines

Lesson 11 video timeline:

00:00:30 Tips on using notebooks and reading research papers

00:03:15 Follow-up on lesson 10 and more word-to-image searches

00:07:30 Linear algebra cheat sheet for deep learning (student’s post on Medium)
& Zero-Shot Learning by Convex Combination of Semantic Embeddings (arXiv)

00:10:00 Systematic evaluation of CNN advances on ImageNet (arXiv)
ELU better than RELU, learning rate annealing, different color transformations,
Max pooling vs Average pooling, learning rate & batch size, design patterns.

00:27:15 Data Science Bowl 2017 (Cancer Diagnosis) on Kaggle

00:36:30 DSB 2017: full preprocessing tutorial, + others.

00:48:30 A non-deep-learning approach to find lung nodules (research)

00:53:00 Clustering (and why Jeremy wasn’t a fan before)

01:08:00 Using Pytorch with GPU for ‘meanshift’ (clustering cont.)

01:22:15 Candidate Generation and LUNA 16 (Kaggle)

01:26:30 Accelerating K-Means on GPU via CUDA (research)

01:27:15 ChatBots ! (long section)
Starting with “memory networks” at Facebook (research)

01:57:30 Recurrent Entity Networks: an exciting area of research in Memory Networks

01:58:45 Concept of “Attention” and “Attentional Models”

I played around with improving the execution speed of the mean shift algorithm. Here is the link:


For the paper “Training Deep Neural Networks on Noisy Labels with Bootstrapping” there appears to be an implementation in the tensorflow/models repo that was contributed along with the “object_detection” model in tensorflow/models#1561.

Here’s a link to the loss implementation class BootstrappedSigmoidClassificationLoss, including citation!

Thanks for linking that code! Do you understand why they perform a sigmoid on the prediction_tensor?

Is there any Keras or PyTorch implementation of this paper available to the public?