Lesson 3 official topic

Probably not. If the models are so similar that you need a statistical test to see whether the difference is real, then the difference is unlikely to be practically significant anyway!

Furthermore, a statistical test doesn’t make much sense here - what’s the “true population” that you’re sampling from? Instead, you should test on a range of datasets that are representative of the kind of problems your model is designed to solve.
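
For example, here’s a minimal sketch of that idea (the load_dataset helper and the model names are hypothetical stand-ins for your own code): evaluate each candidate model on several representative test sets rather than leaning on a significance test.

from sklearn.metrics import accuracy_score

# Hypothetical: load_dataset returns (X, y) for a named test set,
# and model_a / model_b are already-fitted classifiers.
for name in ["test_set_a", "test_set_b", "test_set_c"]:
    X, y = load_dataset(name)
    print(name,
          "model A:", round(accuracy_score(y, model_a.predict(X)), 3),
          "model B:", round(accuracy_score(y, model_b.predict(X)), 3))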

5 Likes

It’s certainly well worth reading, although it’s not something that I feel video lessons add much to – in this course I’m trying to focus on using code, spreadsheets, etc to explain things in an environment where students can experiment and learn. Chapter 3 is probably best presented as a book chapter, not as a video.

I do plan to touch on concepts from chapter 3 integrated into other lessons, however.

6 Likes

The next lesson will be largely new material.

4 Likes

Thank you very much, Jeremy, for all your help, advice, and responses, and for this course so far. Much and truly appreciated.
The best lecturer I’ve ever come across by far!

6 Likes

And thank you for your thoughtful questions! :smiley:

2 Likes

Oh BTW you can also pre-read the notebook for the next lesson if you like, which is listed here:

3 Likes

Okay so just trying to understand please, in general…

If I work on a paper and create different models as part of that work, then:

Is it okay to compare my models using just machine learning metrics (the same metric for all of them)? For example, if I have 4 models with accuracies of 70%, 65%, 85%, and 95%, then the best one is the one with 95%. But as we know, accuracy is generally not the ideal metric, due to possible class imbalance and so on. So what would your advice be on which metrics to choose, how many to choose, and then, if we have multiple metrics, how to decide which is the “best model” from the study for the paper? I’d just like to get your expert advice on these aspects, please.
For example, I’ve read papers where the authors choose a set of metrics (they almost always include accuracy, even when it isn’t ideal) and then say that model X is the best model in their study, out of all the models they tried, because it had the highest value for, say, 3 of the 4 metrics.

What is your advice on all of this?

Also, one should always report in one’s paper what other authors have done in their studies, i.e. the performance of their models. But what if the datasets those authors used were different? Should one only report other authors’ best published models if they used exactly the same dataset you are using? And what if you create a new dataset in a different way, so there are no previous papers to compare against directly, even though there has been work in the same field before? Is it okay then to just discuss the other authors’ results in general terms in the discussion section of the paper, since in the results section there won’t be anything to compare against, given that the dataset is newly curated and hasn’t been used by others before?

I guess it’s a case of finding the compromise between keeping it simple, keeping it consistent with others in the field, and also reporting something that you feel comfortable adequately shows the pros and cons of your approach.

For instance, using accuracy is often fine for comparing models, as long as they’re all on the same dataset, since any issues caused by an unbalanced dataset are going to be shared by all of them.

Unless, however, you’re showing off something that’s specifically designed to address dataset imbalance in a special way that’s not adequately shown when using accuracy.
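
For instance, a minimal sketch (assuming scikit-learn, with hypothetical fitted models and a shared test set) of reporting accuracy alongside a couple of imbalance-aware metrics on the same data:

from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Hypothetical: models is a dict of already-fitted classifiers,
# X_test / y_test is the single shared held-out set.
for name, model in models.items():
    preds = model.predict(X_test)
    print(name,
          "acc:", round(accuracy_score(y_test, preds), 3),
          "balanced acc:", round(balanced_accuracy_score(y_test, preds), 3),
          "macro F1:", round(f1_score(y_test, preds, average="macro"), 3))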

4 Likes

One of my lecturers back at university used to say that one HAS to do statistical tests after creating different machine learning models and computing the metric values for all of them.
However, I’ve seen lots of papers that do not do this, implying that standard machine learning metrics suffice… So in the past, because of the above, I was a bit confused.

Glad to have an expert to ask these questions to, now, and happy to be learning and getting advice from you, Jeremy. Really, and sincerely, it means a lot.

Hope it’s fine to ask these questions here.

As a side note: if I remember my statistics textbooks correctly, many (most?) statistical tests were developed for datasets that are relatively small by modern standards, and for models with dozens or hundreds of parameters. Some stats books, especially older ones, consider datasets of tiny sizes. I still remember a book’s advice to use the t-distribution rather than the normal one for sample sizes of N < 30… (Hope I’ve formulated this advice correctly; not a professional statistician, sorry!)

These days we have huge models with millions or billions of parameters and enormous datasets with millions or billions of samples. In this case, we can approximate distributions pretty accurately, so I’d say that this time the law of large numbers is on our side. And pragmatic metrics, as well as visual interpretation or loss landscape visualization, help us see more than was possible before big data emerged.
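
For instance, a quick scipy sketch of that textbook rule of thumb: the t-distribution’s critical values differ noticeably from the normal’s for tiny samples and converge as N grows.

from scipy import stats

# 97.5th percentile (two-sided 95% cutoff) for Student's t vs the normal:
# the gap matters for tiny samples and vanishes as N grows.
for n in (5, 30, 1000):
    print("N =", n, "-> t critical:", round(stats.t.ppf(0.975, df=n - 1), 3))
print("normal critical:", round(stats.norm.ppf(0.975), 3))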

3 Likes

That’s a problem I haven’t solved yet! I tried going with a “cleaner” dataset, as there are various mineral databases. However, the problem then became what to do when the photos contained multiple minerals. That’s something that’ll be covered later and definitely a topic I need a refresher on!

Personally, I understand your lecturer’s sentiment. In some contexts, it’s a great mentality to have. But, in the real world, what Jeremy said is correct; in almost every practical case the difference in p-value or t-test scores will translate to relatively tiny upside in your performance metrics.

Remember that most (though certainly not all) common statistical testing revolves around some relationship between effect size and sample size. In other words, either the effect size you’re observing is so large relative to your sample that the effect is ‘obviously’ non-random, or, more commonly, the observed effect size is small and difficult to detect — but it occurs frequently enough that it’s also likely non-random.

So, using that simple heuristic, if the problem itself is trying to find some small-magnitude-but-critical effect in your data, then (absent changing dimensionality, feature engineering, etc.) all the various ML methods will just have relatively minor variations in what they can extract from your data in the first place.

This is one of the keys to real world machine learning: distinguishing the forest from the trees. Knowing which elements of your results are proverbial “forest” and which elements are the proverbial “trees” is vital to making sense of results in production and tweaking them.

I’ll also note that multiple-comparisons bias begins to come into play here. The more methods you test on a dataset, the more likely you are to find some method that produces a useful-looking result by chance alone. Being cognisant of this bias is also critical to real-world success.
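
To make the multiple-comparisons point concrete, here’s a toy simulation (just numpy, nothing domain-specific): evaluate enough purely random “models” on the same labels and the best of them will look decent by chance alone.

import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=100)   # 100 binary labels

# 50 "models" that just guess at random on the same test set
accs = [(rng.integers(0, 2, size=100) == y).mean() for _ in range(50)]
print("expected accuracy of a single random guesser: 0.50")
print("best of 50 random models:", max(accs))   # typically around 0.60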

4 Likes

I have run into this problem myself. The challenge is that you build a system which takes sample images and you train on those images, but small variations in lighting conditions etc. can badly throw the model off at test time. The key is to select the right augmentations to capture these variations. I must confess that I was not too successful in this effort. The open question is whether the augmentations we use really represent the variations that actually occur in the physical world. Any insights on this?

2 Likes

Using OpenCV as part of augmentation is the way to go. Can anyone post an example?
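
Here’s a minimal sketch of one way to do it with plain OpenCV (the lighting_jitter function and its parameter ranges are just illustrative, and it would still need to be wired into whatever data pipeline you use):

import cv2
import numpy as np

def lighting_jitter(img, alpha_range=(0.8, 1.2), beta_range=(-20, 20)):
    # Randomly scale contrast (alpha) and shift brightness (beta),
    # roughly mimicking variation in lighting conditions.
    alpha = np.random.uniform(*alpha_range)
    beta = np.random.uniform(*beta_range)
    return cv2.convertScaleAbs(img, alpha=alpha, beta=beta)

# img is an HxWx3 uint8 array, e.g. img = cv2.imread("sample.jpg")
# augmented = lighting_jitter(img)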

Thanks Jeremy!

I had a couple of questions based on this lesson that I wasn’t able to fully figure out, at least from the lecture and the text of chapter 4 of the book itself:

  1. In chapter 4 of the book, at the beginning of the section entitled “The MNIST Loss Function”, we transform our rank-3 tensors into rank-2 tensors. As I understood it, we basically arrange all the 784 individual pixel values so that they’re one after another instead of presented as a 28x28 grid. Why do we do that?
  2. In the section entitled “Adding a Nonlinearity” the basic neural network code includes lines like res = xb@w1 + b1. What is res? What does it stand for?

I guess it gives you a 2D matrix instead of a 3D volume. So you have:

(n_samples, h, w) -> (n_samples, h * w)

Then you can compute matrix multiplication with weights.

X.shape == (n_samples, h * w)
W1.shape == (h * w, n_hidden)

# one row of hidden activations per sample
(X @ W1 + b1).shape == (n_samples, n_hidden)

So each 28x28 image is flattened into a 784-element 1-d vector to align the shapes and get the expected matrix multiplication.
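
Here’s a minimal PyTorch sketch of the same idea, with the shapes spelled out (the batch size and hidden size are chosen just for illustration):

import torch

x = torch.randn(64, 28, 28)     # a batch of 64 images, 28x28 each
xb = x.view(-1, 28*28)          # flatten each image -> shape (64, 784)

w1 = torch.randn(28*28, 30)     # 784 inputs -> 30 hidden units
b1 = torch.randn(30)

res = xb @ w1 + b1              # shape (64, 30): one row of activations per image
print(res.shape)                # torch.Size([64, 30])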

1 Like

Basically, I also think so. The authors probably wanted to encourage the intuition of a data point going through the layer.
If we unroll a data point’s matrix into a vector, we can easily visualize its dot product with the layer’s weights.
In this convenient way, the dimensions match up easily.

3 Likes

Result, probably.

2 Likes

I didn’t fully understand this explanation. Not quite sure what the == and the -> denote.

I guess I had assumed the transformation or conversion into 2D was to make sure the shapes align, but I thought we’d seen a few examples (broadcasting, primarily) where PyTorch was able to do some magic on its side to make non-equivalent dimensions work together. I think I had assumed it’d be able to do it in this case as well?