Lesson 4 - Non-beginner discussion

This is the topic for any non-beginner discussion around lesson 4. It won’t be actively monitored by Jeremy or me tonight, but we will answer any outstanding questions in here tomorrow (if needed).


What if we have noisy/soft labels? Is there an alternative to the mean that could be used for the loss reduction and that places less weight on outliers?

Of course, as mentioned, median wouldn’t work since it is not differentiable.

You can look at label smoothing, which is designed to deal with data that is not perfect. It still uses the mean, but changes the labels.
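To make "still uses the mean, but changes the labels" concrete, here is a minimal sketch in plain Python. The function names and the eps=0.1 default are my own choices for illustration, and I'm assuming the common convention where eps is spread evenly over all classes; check your library's version before relying on the exact numbers.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax for one example.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def smoothed_targets(label, n_classes, eps=0.1):
    # Assumed convention: every class gets eps / n_classes,
    # and the true class gets an extra (1 - eps).
    return [(1 - eps) * (1.0 if i == label else 0.0) + eps / n_classes
            for i in range(n_classes)]

def soft_cross_entropy(logits, target):
    # Cross-entropy against an arbitrary target distribution.
    return -sum(t * lp for t, lp in zip(target, log_softmax(logits)))

def label_smoothing_loss(batch_logits, labels, eps=0.1):
    # The reduction is still a plain mean; only the targets changed.
    losses = [soft_cross_entropy(lg, smoothed_targets(y, len(lg), eps))
              for lg, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)
```

With eps=0 this reduces to ordinary cross-entropy, which is a quick sanity check if you adapt it.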


Thank you. I am aware of label smoothing, but was wondering if adjusting this mean-based loss reduction could be an alternative to, or a complement to, label smoothing when dealing with noisy/soft labels.

There are different forms of weighted loss functions, although they’re mostly used for dealing with class imbalance rather than noisy/soft labels.
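For reference, a class-weighted reduction might look roughly like the sketch below. The function name and the class_weights mapping are hypothetical, and I'm assuming you already have one loss value per example. Here the weighted sum is normalized by the sum of the weights; some implementations divide by the batch size instead.

```python
def weighted_mean_loss(losses, labels, class_weights):
    # losses: one loss value per example
    # labels: the class index of each example
    # class_weights: hypothetical mapping from class index to weight
    weights = [class_weights[y] for y in labels]
    weighted_sum = sum(w * l for w, l in zip(weights, losses))
    # Normalize by total weight so equal weights recover the plain mean.
    return weighted_sum / sum(weights)
```

The weights depend only on the class, not on how trustworthy an individual label is, which is why this mostly helps with imbalance rather than noise.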


I guess the problem with trying to deal with that in your loss is that you don’t actually know which labels are the outliers or the noisy ones. In the case of soft labels I guess you do, so you could try something there.
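One thing that could be tried, purely as a sketch and not something anyone here has tested, is a trimmed mean: drop the largest few per-example losses before averaging, on the assumption that very high loss is a rough proxy for a bad label. The function name and trim_frac parameter are made up for illustration.

```python
def trimmed_mean_loss(losses, trim_frac=0.1):
    # losses: one loss value per example in the batch
    # trim_frac: assumed fraction of the highest losses to ignore
    if not 0 <= trim_frac < 1:
        raise ValueError("trim_frac must be in [0, 1)")
    n_drop = int(len(losses) * trim_frac)
    # Keep the smallest losses; the dropped ones contribute nothing.
    kept = sorted(losses)[:len(losses) - n_drop]
    return sum(kept) / len(kept)
```

In an autodiff framework the same idea would still be a mean over the kept examples, so gradients flow through those as usual. The obvious risk is that hard but correctly labeled examples get thrown away along with the noisy ones.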

Has anyone ever tried using a very small dataset to train, or really to seed, a much larger dataset? Suppose I took the weights of a model trained on Wikipedia and retrained them by overfitting on a very small dataset. This dataset would be a sample of similar-meaning words and phrases of variable length (“isn’t that ridiculous?”, “didn’t that just take the cake”, “now I’ve seen it all”, “preposterous!”). These samples, as we see here, may have no words in common. What they do have in common is the concept that they all, to one degree or another, embody.

We then retrain the Wikipedia model by reversing the normal flow of ML: each of these multiphrases would essentially be an example of labeled data. Now, you could try to just use these phrases as labels and retrain the model to recognize them, without the step I’m suggesting. But there’s a problem: the initial training on Wikipedia was based only on word proximities to one another, and had nothing to do with the higher-level idea of, well, ‘idea’ or concept: the thing those word groups have in common, their recognizable pattern. So it is unlikely to do very well.

So what if you then tried tuning the large corpus (Wikipedia) and its accompanying model in a way that would inject those weights of meaning into it? You are then using the small dataset to tune the large model. You run a new example that you humanly recognize as having the same meaning as something in the original small dataset.

The problem I’m thinking about is how to make a more direct attempt at getting actual meaning out of a language model, not just close vector-space numbers which reflect the words around a given word.

Check out this paper: https://openreview.net/pdf?id=BJlzm64tDH

You might also like this one as a counterpoint. Maybe language models already encode explicit knowledge through the way they are trained: https://www.aclweb.org/anthology/D19-1250.pdf

Very interesting, thank you! I’m making my way through both.

Sorry, taking a while. Difficult to prioritize with the amount of stuff to learn. Hard to make my way through these technical articles (he admitted sheepishly)

For the fit methods (e.g. .fit, .fine_tune, .fit_one_cycle), is there a way to turn off the results tables? I’m running a parameter search script (100+ runs), and the tables keep getting appended one after another.

I saw there was a parameter in lr_find(show_plot=False), but did not find one in the fit methods.

As an indirect method, I’m currently using IPython.display.clear_output.
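In case it helps anyone else, the workaround looks roughly like this. run_search and train_fn are hypothetical names standing in for my script; train_fn is assumed to build a learner, call one of the fit methods, and return a metric. The fallback stub is only there so the loop also runs outside a notebook.

```python
try:
    from IPython.display import clear_output
except ImportError:
    # Outside IPython/Jupyter there is no cell output to clear.
    def clear_output(wait=False):
        pass

def run_search(configs, train_fn):
    # configs: iterable of hyperparameter settings (any objects)
    # train_fn: hypothetical callable that trains one model and
    #           returns a metric for the given config
    results = []
    for cfg in configs:
        metric = train_fn(cfg)
        # Wipe the table printed by the last fit call so the
        # outputs don't pile up across runs.
        clear_output(wait=True)
        results.append((cfg, metric))
    return results
```

I believe recent fastai versions also have learn.no_logging() and learn.no_bar() context managers that may do this more directly, but I haven't confirmed that against the version I'm running.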

