Lesson 4 - Non-beginner discussion

sgugger · April 15, 2020, 1:43am

This is the topic for any non-beginner discussion around lesson 4. It won’t be actively monitored by Jeremy or me tonight, but we will answer standing questions in here tomorrow (if needed).

ilovescience · April 15, 2020, 2:07am

What if we have noisy/soft labels? Is there an alternative to the mean that could be used and that places less weight on outliers?

Of course, as mentioned, median wouldn’t work since it is not differentiable.

sgugger · April 15, 2020, 2:09am

You can look at label smoothing, which is designed to deal with data that is not perfect. It still uses the mean, but changes the labels.
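To make the "still uses the mean, but changes the labels" point concrete, here is a minimal pure-Python sketch of label smoothing. It assumes the common formulation in which each one-hot target is mixed with a uniform distribution over all classes; the function names and the default `eps=0.1` are illustrative and not taken from any particular library.

```python
import math

def smooth_labels(target_index, num_classes, eps=0.1):
    """Mix a one-hot target with a uniform distribution over all classes."""
    base = eps / num_classes          # mass spread evenly over every class
    dist = [base] * num_classes
    dist[target_index] += 1.0 - eps   # the true class keeps most of the mass
    return dist

def log_softmax(logits):
    """Numerically stable log-softmax for one sample."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

def cross_entropy(logits, target_dist):
    """Cross-entropy between a target distribution and softmax(logits)."""
    return -sum(t * lp for t, lp in zip(target_dist, log_softmax(logits)))

def label_smoothing_loss(batch_logits, batch_targets, eps=0.1):
    """Per-sample losses use smoothed targets; the reduction is still a plain mean."""
    num_classes = len(batch_logits[0])
    losses = [
        cross_entropy(logits, smooth_labels(target, num_classes, eps))
        for logits, target in zip(batch_logits, batch_targets)
    ]
    return sum(losses) / len(losses)

# Example: two samples, three classes.
loss = label_smoothing_loss([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]], [0, 2])
```

With `eps=0` this reduces to ordinary cross-entropy, which is one way to see that only the targets change, not the reduction.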

ilovescience · April 15, 2020, 2:12am

Thank you. I am aware of label smoothing, but was wondering whether adjusting this mean-based loss reduction could be an alternative to, or a complement to, label smoothing when dealing with noisy/soft labels.

wdhorton · April 15, 2020, 2:17am

There are different forms of weighted loss functions, although they’re mostly used for dealing with class imbalance rather than noisy/soft labels.

wdhorton · April 15, 2020, 2:17am

I guess the problem with trying to deal with that in your loss is that you don’t actually know which labels are the outliers or the noisy ones. In the case of soft labels, I guess you do, so you could try something there.
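As a rough illustration of "trying something" with soft labels, the sketch below replaces the plain mean with a weighted mean over per-sample losses. The weighting rule (use the largest probability in each soft label as that sample's weight) is a hypothetical heuristic chosen for the example, not an established recipe, and the numbers are made up.

```python
def weighted_mean_loss(per_sample_losses, weights):
    """Weighted mean reduction: samples with larger weights count for more."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not all be zero")
    weighted_sum = sum(w * l for w, l in zip(weights, per_sample_losses))
    return weighted_sum / total_weight

def confidence_weights(soft_labels):
    """Hypothetical heuristic: weight each sample by its most confident class probability."""
    return [max(dist) for dist in soft_labels]

# Made-up per-sample losses; the third sample has an ambiguous label and a large loss.
losses = [0.2, 0.3, 5.0]
soft_labels = [[0.9, 0.1], [0.8, 0.2], [0.5, 0.5]]

plain = sum(losses) / len(losses)
weighted = weighted_mean_loss(losses, confidence_weights(soft_labels))
```

Here the ambiguous sample contributes less to the weighted result than to the plain mean. Whether that helps in practice depends on how well the soft labels reflect real uncertainty.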

quantum · April 16, 2020, 7:58am

Has anyone ever tried using a very small dataset to train, or really to seed, a much larger dataset? Suppose I took the weights trained on Wikipedia and retrained them by overfitting on a very small dataset. This dataset would be a sample of similar-meaning words and phrases of variable length (“isn’t that ridiculous?”, “didn’t that just take the cake”, “now I’ve seen it all”, “preposterous!”). As we see here, these samples may have no words in common. What they do have in common is the concept that they all, to one degree or another, embody.

We then retrain the Wikipedia model by reversing the normal flow of ML: each of these multiphrases would essentially be an example of labeled data. Now, you could just use these phrases as labels and retrain the model to recognize them, without the step I’m suggesting. But there’s a problem here: the initial training on Wikipedia was based only on word proximities, and had nothing to do with the higher-level idea of, well, ‘idea’ or concept: the thing that those word groups have in common, their recognizable pattern. So it is unlikely it would do very well.

So what if you then tried tuning the large corpus (Wikipedia) and its accompanying model in a way that would inject those weights of meaning into it? You are then using the small dataset to tune the large model. You then run a new example that you humanly recognize as having the same meaning as something in the original small dataset.

The problem I’m thinking about is how to make a more direct attempt at getting actual meaning out of a language model, not just close vector-space numbers which reflect the words around a given word.

wdhorton · April 16, 2020, 12:31pm

Check out this paper: https://openreview.net/pdf?id=BJlzm64tDH

wdhorton · April 16, 2020, 12:37pm

You might also like this one as a counterpoint. Maybe language models already encode explicit knowledge through the way they are trained: https://www.aclweb.org/anthology/D19-1250.pdf

quantum · April 17, 2020, 12:22am

Very interesting, thank you! I’m making my way through both.

quantum · April 18, 2020, 6:28pm

Sorry, taking a while. Difficult to prioritize with the amount of stuff to learn. Hard to make my way through these technical articles (he admitted sheepishly)

DanielLam · April 18, 2020, 7:04pm

For the fit methods (e.g. .fit, .fine_tune, .fit_one_cycle), is there a way to turn off the data table plots? I’m running a parameter search script (100+ runs), and the tables get appended one after another.

I saw there was a parameter in lr_find(show_plot=False), but did not find one in the fit methods.

As an indirect method, I’m currently using IPython.display.clear_output.

Thanks,
Daniel
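The `clear_output` workaround described above might look roughly like the sketch below. `train_once` is a hypothetical stand-in for building a Learner and calling one of the fit methods, and its return value is a fake validation loss; only `IPython.display.clear_output` is a real call here. Results are collected in a list so that clearing the cell output does not lose them.

```python
import itertools

try:
    from IPython.display import clear_output
except ImportError:
    # Fallback so the sketch also runs where IPython is not installed.
    def clear_output(wait=False):
        pass

def train_once(lr, wd):
    """Hypothetical stand-in for one training run; returns a fake validation loss."""
    print(f"epoch table for lr={lr}, wd={wd}")  # imitates the per-run output
    return (lr - 1e-3) ** 2 + wd

def run_search(lrs, wds):
    """Run every (lr, wd) combination, keeping results but clearing the display each time."""
    results = []
    for lr, wd in itertools.product(lrs, wds):
        val_loss = train_once(lr, wd)
        results.append({"lr": lr, "wd": wd, "val_loss": val_loss})
        clear_output(wait=True)  # replace the previous run's output instead of appending
    return results

results = run_search([1e-3, 1e-2], [0.0, 0.1])
best = min(results, key=lambda r: r["val_loss"])
```

Depending on the installed fastai version, the `Learner.no_logging()` and `Learner.no_bar()` context managers may be a more direct way to silence the table and progress bar; that is worth checking against the documentation for the version in use.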