Lesson 3 official topic

Okay so just trying to understand please, in general…

If I'm working on a paper and create several different models as part of that work, then:

To compare my models, is it okay to just use the same machine learning metric? For example, if I have 4 models with accuracies of 70%, 65%, 85%, and 95%, then the best one would be the one with 95%. But as we know, accuracy is generally not the ideal metric because of possible class imbalance and so on. So: which metrics would you advise choosing, how many metrics, and if we report multiple metrics, how do we say which is the "best model" from the study in the paper? I'd just like your expert advice on these aspects, please.
For example, I've read papers where the authors choose a set of metrics (they almost always include accuracy, even when it isn't ideal) and then say that model X is the best model in their study, out of all the models they tried, because model X had the highest value on, say, 3 of the 4 metrics.

What is your advice on all of this?

Also, one should always report in one's paper what other authors have done in their studies, i.e. the performance of their models. But what if those authors used different datasets? Should one only compare against other authors' best published models if they used the exact same dataset you are using? And what if you create a new dataset in a different way, so there are no previous papers to compare against directly, even though there has been prior work in the same field? In that case, is it okay to just discuss the other authors' results in general terms in the Discussion section of the paper, since in the Results section there won't be anything to compare against, given that the dataset is newly curated and hasn't been used by anyone else before?

I guess it’s a case of finding the compromise between keeping it simple, keeping it consistent with others in the field, and also reporting something that you feel comfortable adequately shows the pros and cons of your approach.

For instance, using accuracy is often fine for comparing models, as long as they’re all on the same dataset, since any issues caused by an unbalanced dataset are going to be shared by all of them.

Unless, however, you’re showing off something that’s specifically designed to address dataset imbalance in a special way that’s not adequately shown when using accuracy.
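
To make that concrete, here's a tiny illustration (the numbers and the sklearn metric calls are just for the sake of example, not from the course):

import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# A made-up, heavily imbalanced test set: 95% of the labels are class 0.
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)  # a "model" that always predicts class 0

print(accuracy_score(y_true, y_pred))           # 0.95 -- looks great
print(balanced_accuracy_score(y_true, y_pred))  # 0.5  -- no better than chance

If every model you compare is evaluated on this same test set, that 95% floor is shared by all of them, which is why accuracy can still be fine for relative comparisons.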

4 Likes

One of my past lecturers used to say that one HAS to run statistical tests after creating different machine learning models and computing the metric values for all of them.
However, I've seen lots of papers that do not do this, implying that standard machine learning metrics suffice, so in the past this left me a bit confused.

Glad to have an expert to ask these questions to, now, and happy to be learning and getting advice from you, Jeremy. Really, and sincerely, it means a lot.

Hope it’s fine to ask these questions here.

As a side note: if I remember my statistics textbooks correctly, many (most?) statistical tests were designed around datasets that are relatively small by modern standards and models with dozens or hundreds of parameters. Some stats books, especially older ones, consider datasets of tiny sizes. I still remember advice from one book to use the t-distribution instead of the normal distribution for sample sizes N < 30… (I hope I've stated that correctly; not a professional statistician, sorry!)

These days we have huge models with millions or billions of parameters and enormous datasets with millions or billions of samples, so we can approximate distributions pretty accurately. I would say that this time the law of large numbers is on our side. Pragmatic metrics, along with visual tools such as loss landscape visualization, let us see more than was possible before big data emerged.
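
Just to illustrate that N < 30 rule of thumb, here's a quick sketch with scipy (the percentile choice is arbitrary):

from scipy import stats

# 97.5th percentile critical values: by ~30 samples the t-distribution is
# already close to the normal's 1.96, and at big-data sizes they coincide.
for df in (5, 30, 1_000, 1_000_000):
    print(df, round(stats.t.ppf(0.975, df), 3))   # 2.571, 2.042, 1.962, 1.96
print(round(stats.norm.ppf(0.975), 3))            # 1.96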

3 Likes

That's a problem I haven't solved yet! I tried going with a "cleaner" dataset, as there are various mineral databases. However, the problem then became what to do when a photo contains multiple minerals. That's something that'll be covered later, and definitely a topic I need a refresher on!

Personally, I understand your lecturer's sentiment. In some contexts, it's a great mentality to have. But in the real world, what Jeremy said is correct: in almost every practical case, the differences you'd confirm with a p-value or t-test translate to a relatively tiny upside in your performance metrics.

Remember that most (though certainly not all) common statistical testing revolves around the interplay between effect size and sample size. In other words, either the effect you're observing is so large relative to the noise in your sample that it's 'obviously' non-random, or, more commonly, the observed effect is small and difficult to detect, but it occurs frequently enough across your samples that it's also likely non-random.

So, using that simple heuristic: if the problem itself is to find some small-magnitude-but-critical effect in your data, then (absent changes in dimensionality, feature engineering, etc.) all the various ML methods will show only relatively minor variations in what they can extract from your data in the first place.
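
To put rough numbers on that heuristic, here's a sketch of a two-proportion z-test (the accuracies and sample sizes are invented purely for illustration):

import numpy as np
from scipy import stats

def accuracy_gap_pvalue(acc_a, acc_b, n):
    # Two-sided z-test for the gap between two accuracies, each measured
    # on n independent test examples (a deliberately simplified sketch).
    p_pool = (acc_a + acc_b) / 2
    se = np.sqrt(p_pool * (1 - p_pool) * (2 / n))
    z = (acc_a - acc_b) / se
    return 2 * stats.norm.sf(abs(z))

# The same 2-point gap, very different conclusions:
print(accuracy_gap_pvalue(0.87, 0.85, n=200))      # ~0.57 -> indistinguishable from noise
print(accuracy_gap_pvalue(0.87, 0.85, n=20_000))   # ~1e-8 -> "significant", but still a tiny gain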

This is one of the keys to real world machine learning: distinguishing the forest from the trees. Knowing which elements of your results are proverbial “forest” and which elements are the proverbial “trees” is vital to making sense of results in production and tweaking them.

I'll also note that multiple-comparisons bias begins to come into play here. The more methods you test on a dataset, the more likely you are to find some method that produces a seemingly useful result by chance alone. Being cognisant of this bias is also critical to real-world success.
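
A quick simulation of that effect (purely synthetic, just to show the mechanism):

import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)   # random labels: there is no signal to find

# "Evaluate" 50 methods that are really just coin flips and keep the best one.
accs = [(rng.integers(0, 2, size=200) == y).mean() for _ in range(50)]
print(max(accs))   # typically ~0.57-0.60: looks like an edge, but it's pure chance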

4 Likes

I have run into this problem myself. The challenge is that when you build a system from sample images and train on them, small variations in lighting conditions etc. can badly throw the model off at test time. The key is to select the right augmentations to capture these variations. I must confess that I was not too successful in this effort. The moot question to answer is whether the augmentations we use really represent the variations that actually occur in the physical world. Any insights on this?

2 Likes

Using OpenCV as part of augmentation is the way to go. Can anyone post an example?
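
Something along these lines is what I have in mind (a rough sketch; the cv2 calls are standard, but the specific transforms and parameter ranges are made up, not taken from any particular pipeline):

import cv2
import numpy as np

def augment(img: np.ndarray) -> np.ndarray:
    # Rough lighting/geometry augmentations with plain OpenCV.
    h, w = img.shape[:2]

    # Random brightness/contrast jitter.
    alpha = np.random.uniform(0.8, 1.2)   # contrast factor
    beta = np.random.uniform(-20, 20)     # brightness offset
    img = cv2.convertScaleAbs(img, alpha=alpha, beta=beta)

    # Small random rotation about the centre.
    angle = np.random.uniform(-10, 10)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    img = cv2.warpAffine(img, M, (w, h), borderMode=cv2.BORDER_REFLECT)

    # Occasional blur to mimic focus variation.
    if np.random.rand() < 0.3:
        img = cv2.GaussianBlur(img, (3, 3), 0)
    return img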

Thanks Jeremy!

I had a couple of questions based off this lesson that I wasn’t able to fully figure out at least from the class lecture and the text of chapter 4 of the book itself:

  1. In chapter 4 of the book, at the beginning of the section entitled “The MNIST Loss Function”, we transform our rank-3 tensors into rank-2 tensors. As I understood it, we basically arrange all the 784 individual pixel values so that they’re one after another instead of presented as a 28x28 grid. Why do we do that?
  2. In the section entitled “Adding a Nonlinearity” the basic neural network code includes lines like res = xb@w1 + b1. What is res? What does it stand for?

I guess it gives you a 2D matrix instead of a 3D volume. So you have:

(n_samples, h, w) -> (n_samples, h * w)

Then you can compute matrix multiplication with weights.

X.shape  == (n_samples, h * w)
W1.shape == (h * w, n_hidden)
b1.shape == (n_hidden,)  # a 1-d row tensor, broadcast across the batch

(X @ W1 + b1).shape == (n_samples, n_hidden)

So a 28x28 image is flattened into a 784-element 1-d vector to align the shapes and get the expected matrix multiplication.
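
In PyTorch terms, something like this (the hidden size of 30 is made up):

import torch

imgs = torch.rand(64, 28, 28)       # a batch of 64 images, rank-3
flat = imgs.view(64, 28 * 28)       # -> (64, 784), rank-2
# equivalently: imgs.flatten(start_dim=1) or imgs.reshape(64, -1)

w1 = torch.randn(28 * 28, 30)
b1 = torch.zeros(30)
print((flat @ w1 + b1).shape)       # torch.Size([64, 30])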

1 Like

Basically, I think so too. The authors probably wanted to encourage the intuition of a data point going through the layer.
If we flatten the matrix of a data point into a 1-d array, we can easily visualize the dot product of the data point with the layer's weights.
In this convenient way, the dimensions match easily.

3 Likes

Result, probably.

2 Likes

I didn’t fully understand this explanation. Not quite sure what the == and the -> denote.

I guess I had assumed the transformation or conversion into 2D was to make sure the shapes align, but I thought we’d seen a few examples (broadcasting, primarily) where PyTorch was able to do some magic on its side to make non-equivalent dimensions work together. I think I had assumed it’d be able to do it in this case as well?

Yeah, sorry, I was about to say that we need to make sure that the shapes are properly aligned, like:

import torch

# ✅ this works
X = torch.ones((64, 28 * 28))
w = torch.randn(28 * 28, 10)
b = torch.ones((1, 10))
X@w + b  # shape (64, 10)

# ❌ this fails
X = torch.ones((64, 28, 28))
w = torch.randn(28 * 28, 10)
b = torch.ones((1, 10))
X@w + b 
# RuntimeError: mat1 and mat2 shapes cannot be multiplied (1792x28 and 784x10)

# ✅ this, for example, works again (though the operation is different this time: a broadcast, batched matmul)
X = torch.ones((64, 1, 28, 28))
w = torch.randn(28, 28, 10)
b = torch.ones((1, 10))
X@w + b  # shape (64, 28, 28, 10)

I would say it depends on the shapes of your tensors and broadcasting rules adopted in torch. I am not quite sure if they are the same as for NumPy. But probably something similar.

So depending on your context, it could be done automatically by the library, or requires explicit shapes transformation.

3 Likes

Mh, let’s take broadcasting apart for a moment. It’s just a convenient way, typical of vectorized and lazy languages, to spare code, headaches, and computer resources.

I’d like to know what leaves you perplexed, please give us more context.

For now, I’ll attempt commenting, but maybe that’s not really what you were asking. Like I said, let me know.

The important thing to grasp is that it's not just about having matching dimensions, but also about what we want to get.

Observe the book's computation of the prediction for a single image, and note that:

  1. The authors call this the prediction for a single image; the sketch below shows the respective shapes.
  2. Forget about minibatches: we are sending just ONE image, flattened into a 1-d array, through a single linear layer.
  3. Set aside the .T transpose operator for now, and note that if a and b are 1-d arrays (rank-1 tensors), then (a*b).sum() is just a@b.
  4. The bias is just a scalar (a rank-0 tensor).
  5. Of course, the result of the whole machinery is also a scalar.
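
A minimal sketch of points 1-5, assuming one flattened 28x28 image and a single output unit (the variable names are mine, not the book's):

import torch

x = torch.randn(28 * 28)   # one image, flattened to a rank-1 tensor
w = torch.randn(28 * 28)   # weights of a single linear unit
b = torch.randn(())        # bias: a scalar (rank-0 tensor)

pred_sum = (x * w).sum() + b   # elementwise multiply, then sum
pred_dot = x @ w + b           # the same thing as a dot product

print(pred_sum.shape)                       # torch.Size([]) -> a scalar
print(torch.allclose(pred_sum, pred_dot))   # True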

If you are unsettled by the fact that the authors flattened the matrix representing the image into a 1-d array, and you care just about matching dimensions, we could instead leave the image as a matrix and arrange the weights into a matrix too.

Now perform the matrix product between the 28x28 matrix representing the image (call it X) and the 28x28 weight matrix W. That is, compute X@W.
What shape will the result have?

2 Likes

I literally gawked at this and couldn't "see" the difference between the two snippets of code. So I popped them into a notebook and, sure enough, the second one doesn't work. After looking at the two statements for X = …, I realized that the second one actually creates a 64x28x28 tensor while the first one creates a 64x784 matrix.

So in the first one we're multiplying a 64x784 matrix by a 784x10 matrix, and sure enough the output is 64x10.

Sometimes I really wish tensor creation required explicit shape signatures, like function or type signatures, because it's just a single comma that differs between the two statements, and that's so easy to gloss over (at least for me).

5 Likes

Hi there all, when we have a large set of labeled images for training, how do we go about, in fastai, loading just a subset into the DataLoaders to test different models?

Thanks

Yeah, absolutely! Tensor shapes get tricky very quickly, especially for sophisticated architectures. I guess that was one of the motivations behind named tensors. einops can also be a good way to reduce complexity by reducing the number of lines. I've seen a Transformer block rewritten in this notation, and it looked quite concise and easy to think about.
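
For instance, the flattening discussed earlier reads almost literally in einops (a tiny sketch; einops is a separate package, not part of fastai):

import torch
from einops import rearrange

x = torch.rand(64, 1, 28, 28)                 # (batch, channel, height, width)
flat = rearrange(x, 'b c h w -> b (c h w)')   # -> (64, 784)
print(flat.shape)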

5 Likes

I have a question: we covered deployment aspects in lesson 2. Could Docker be included, or could we cover it at some point, if possible?