FastGarden - A new ImageNette like competition (just for fun)

I’m okay with this! Please help me make my code better :laughing:

I think we’d have to do something like so:

def get_items(*_, **__): return data

Unless you’ve tested your get_items @radek? (also this was another idea of Lucas’)

The problem with this is that if fastai changes and starts passing two args to get_items, this breaks =/

Another option would be:

def get_items(*_, **__): return data

EDIT: @muellerzr was faster than me hahahah

2 Likes

Ideally, you want your code to break :slight_smile: You don’t want it functioning when something meaningful changes somewhere else.

Also, as a principle, your software shouldn’t rely on code that changes more often than your code does. Things that change more often should depend on things that change less often. Otherwise it becomes a very complex mess that is very hard to navigate.

In summary, I think I would want my code to break should fastai start passing some other parameter there. And also I am a strong believer, like really strong, that code is meant to be read. You don’t write code for the machine, but for the people who will read it. Here I would for sure go for the option that better clarifies your intention, which is imo the one I proposed.
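To make the tradeoff concrete, here is a minimal sketch of the two styles being contrasted (the argument name and the exact signature of the explicit version are illustrative, and `data` is a stand-in for the real list of items):

```python
data = ['item_0', 'item_1', 'item_2']  # stand-in for the real list of items

# Explicit signature: documents that fastai currently passes a single argument
# (the source), and fails loudly with a TypeError if that ever changes.
def get_items(source): return data

# Catch-all signature: keeps working no matter what gets passed in, which also
# means a meaningful upstream change can slip by silently.
def get_items_catchall(*_, **__): return data
```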

7 Likes

Well, I’m going to mark all of that down in my course notebook. These are some super interesting opinions, thanks!

2 Likes

@muellerzr, thank you for sharing the baseline…

What I find interesting is that on Windows I got terrible results, nowhere close to 0.7.

| epoch | train_loss | valid_loss | accuracy | time |
|-------|------------|------------|----------|------|
| 0 | 5.323863 | 4.554179 | 0.073006 | |
| 1 | 4.312253 | 4.180884 | 0.048222 | |
| 2 | 4.185628 | 4.160336 | 0.048222 | |
| 3 | 4.140559 | 4.150017 | 0.058459 | |
| 4 | 3.964385 | 3.919490 | 0.085129 | |

I’m starting to wonder what the cause may be? Mish???
Just ignore the blog’s ugliness, I’m just experimenting with it, too.

The biggest difference I see is that you have some added augmentation (specific parameters in aug_transforms()).
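For anyone curious what added augmentation in the DataBlock can look like, here is a minimal sketch (the parameter values and the get_y/splitter choices are illustrative, not the baseline’s actual settings):

```python
from fastai.vision.all import *

# Illustrative values only - not the exact parameters used in the baseline.
batch_tfms = aug_transforms(
    max_rotate=20.0,    # rotate up to +/- 20 degrees
    max_zoom=1.2,       # zoom in up to 1.2x
    max_lighting=0.3,   # lighting/contrast jitter
    flip_vert=False,    # horizontal flips only
    size=224,           # final training resolution
)

block = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=get_image_files,
    splitter=RandomSplitter(seed=42),
    get_y=parent_label,
    item_tfms=Resize(460),
    batch_tfms=batch_tfms,
)
```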

@muellerzr, the first results were without modification… Thank you… I should try more combinations to reach the baseline…

1 Like

Could I please make a submission? :slightly_smiling_face: 72.36% with only 2 res50 models

This result is maybe not all that interesting in itself, but there are a couple of things here that might be of interest.

I learned what sa stands for -> self attention (defined in layers.py).

I tried to run experiments in the notebook to learn a bit about how res50 would train, but after training 3 models (despite not saving anything) I would get an out of memory (OOM) error. I remember seeing a better way of going about this used by the fast.ai crew, but having forgotten what that was, I opted to reach for Google Fire.

The idea is to run

bash run_experiments.sh | tee res50.txt

This will run the bash script to train the models and save the output into a txt file. Unfortunately, I had a bug in my experiment.py and need to rerun :slight_smile:
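A minimal sketch of what a Fire-driven experiment.py could look like (the dataset and hyperparameters here are stand-ins; I’m using IMAGENETTE_160 just so the snippet runs on its own):

```python
from fastai.vision.all import *
import fire

def train(size=128, epochs=5, lr=1e-3):
    # Each invocation runs in a fresh Python process, so GPU memory is released
    # completely between experiments instead of piling up in a notebook kernel.
    path = untar_data(URLs.IMAGENETTE_160)            # stand-in dataset
    dls = ImageDataLoaders.from_folder(path, valid='val', item_tfms=Resize(size))
    learn = cnn_learner(dls, resnet50, metrics=accuracy, pretrained=False)
    learn.fit_one_cycle(epochs, lr)
    print(learn.validate())                           # captured by `tee` into res50.txt

if __name__ == '__main__':
    fire.Fire(train)
```

run_experiments.sh is then just a few `python experiment.py --size=... --epochs=...` lines, one per model.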

I’m using nbdev because, even for relatively little code, I find it very helpful. It helps me DRY (don’t repeat yourself) up my code.

I guess the interesting bit is the correlation matrix between results. If your models are less correlated, they make uncorrelated mistakes, and you can hope for quite a score boost. It might be a nice way to figure out which models to combine.

image

Not sure if that was @muellerzr’s and @init_27’s intention, but the techniques one can pick up when working on this can be quite helpful for getting one’s feet wet with Kaggle, and probably for getting quite a good result :slight_smile:

5 Likes

It was not, it was mostly meant for a straight average, i.e. model 1 got 72%, model 2 got 74%, so the reported average is 73% with a standard deviation of ~1 (probably less than that, just mental numbers).

However, we could possibly :slight_smile:

Nah, that is okay, so I got this wrong :slight_smile:

In such a scenario it’s not clear why one wouldn’t want to just make a submission of a single, best-scoring model :slight_smile:

So that in general we know how the model performs and it’s not a one-off.

Also the second post is a Wiki post :slight_smile:

PS: Sorry about the weird delay, my phone was glitching and wouldn’t respond :slight_smile:

1 Like

How do you do that?
You train 2 models, then at inference take the mean of both models’ predictions before applying XE loss?

Looks that way to me :wink:

accuracy(preds.mean(0), targs)

(snippet from the end of the notebook)

Is “preds” a concat of both models’ predictions?

Yes, see @radek’s bit on “Combining Predictions”

Which I might add, the covariance stuff is super cool!!!

Edit: I lied, it’s a torch.stack!

Sorry, I did not see that the notebook was provided :sweat_smile:
Indeed, he does a torch.stack of his preds. (Damn, 1000 classes to predict!)
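For anyone following along, a minimal toy sketch of the mechanics (random tensors stand in for the two models’ softmax outputs; in the notebook they come from each learner’s get_preds()):

```python
import torch

n_items, n_classes = 8, 5
preds1 = torch.softmax(torch.randn(n_items, n_classes), dim=1)  # model 1 probs
preds2 = torch.softmax(torch.randn(n_items, n_classes), dim=1)  # model 2 probs
targs  = torch.randint(0, n_classes, (n_items,))                # toy targets

preds = torch.stack([preds1, preds2])   # shape (2, n_items, n_classes)
ens   = preds.mean(0)                   # average over the model dimension
print((ens.argmax(dim=1) == targs).float().mean())  # accuracy of the ensemble
```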

How do you read that covariance matrix though?
Are we looking for bigger numbers in the covariance diagonal, meaning the 2 sets of predictions are strongly correlated?
Or do we want the opposite?
Do we compare the covariance at all to the variances of the individual predictions?

$$\begin{pmatrix} \mathrm{Var}(x) & \mathrm{E}[(x-\mathrm{E}[x])(y-\mathrm{E}[y])] \\ \mathrm{E}[(y-\mathrm{E}[y])(x-\mathrm{E}[x])] & \mathrm{Var}(y) \end{pmatrix}$$

(Apologies if the LaTeX is off; also, somehow numpy returns a flipped matrix with the variances in the other diagonal.)

Yes, this is just the mean of the predictions (probabilities outputted by softmax).

This is a correlation matrix. For anyone who might want to learn more about this, Jeremy covers covariance (and correlation) in p2 v3 here. I also prepared these quiz / notes in case they are helpful - the link will likely stop working sometime this weekend.

Since correlation is covariance normalized to [-1, 1], the diagonal is 1 (the values vary perfectly linearly in the same direction).
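In symbols, the normalization is just

$$\mathrm{corr}(x, y) = \frac{\mathrm{Cov}(x, y)}{\sqrt{\mathrm{Var}(x)\,\mathrm{Var}(y)}}, \qquad \mathrm{corr}(x, x) = \frac{\mathrm{Var}(x)}{\mathrm{Var}(x)} = 1.$$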

Looking at the 2nd row, first column
image

we have 0.93 - this is the correlation of the predictions of the second model with the first.

Ideally, for combining model output (ensembling), we want our models to perform well and their predictions to be as uncorrelated as possible (the lower the number, the better). In other words, we would like the models to err in an uncorrelated way: their uncorrelated errors cancel out and we get better predictions.

For some models (Random Forest for instance) we use bagging and feature sampling to aim for this.
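To make that check concrete, here is a minimal numpy sketch, with toy arrays standing in for two models’ flattened prediction vectors (the real ones would come from the stacked preds tensor in the notebook):

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.random(1000)                  # shared "signal" both models pick up
preds1 = signal + 0.1 * rng.random(1000)   # model 1: signal plus its own noise
preds2 = signal + 0.1 * rng.random(1000)   # model 2: signal plus different noise

corr = np.corrcoef(np.stack([preds1, preds2]))
print(corr)   # 2x2 matrix: diagonal is 1, the off-diagonal entry is what we watch
```

The lower that off-diagonal value (while each model still scores well on its own), the more you can hope to gain from averaging them.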

6 Likes

If pre-trained models are not allowed from the start, is there a loophole to instead try knowledge distillation?
Mainly asking since we’ve seen pretty great results (granted, on an unrelated task) by combining KD models trained at different resolutions, and we think it’d be fun to apply that here!

Thanks, I think that’s what I understood:
The areas where our models differ are the areas that are basically difficult to learn, so taking their mean has a chance of cancelling out their errors.
In your linked notebook it’s still a covariance matrix:
image
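A quick way to see the difference being pointed out (toy arrays again, nothing from the notebook):

```python
import numpy as np

x = np.random.rand(1000)
y = 0.5 * x + 0.5 * np.random.rand(1000)
m = np.stack([x, y])

print(np.cov(m))       # covariance: diagonal holds Var(x), Var(y)
print(np.corrcoef(m))  # correlation: the same matrix normalized, diagonal is 1
```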

You are right, sorry, my bad.

Fixed it now :slight_smile: (the git push failed yesterday when I was working on this and I didn’t notice it)