This is a wiki post - feel free to edit to add links from the lesson or other useful info.

## Lesson resources

- Lesson Videos

## Links from the lesson

- The course’s fashion mnist challenge topic
- Excel optimisers spreadsheet
- Papers
- Papers with Code


When we do `self.schedo = self.sched`, isn’t there a risk that modifying `self.schedo` would change `self.sched`? Or is that intended to happen?

We don’t do that - we do `self.schedo = self.sched(learn.opt)`, which *calls* the `sched` callable to create the object.
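To illustrate the difference, here is a minimal sketch in plain Python. `DummySched` and `SchedCB` are hypothetical stand-ins (not the miniai classes), and it assumes `sched` is a factory such as a `functools.partial`, so calling it builds a fresh scheduler object each time:

```python
from functools import partial

class DummySched:
    """Hypothetical stand-in for a PyTorch LR scheduler."""
    def __init__(self, opt, max_lr):
        self.opt, self.max_lr, self.steps = opt, max_lr, 0

    def step(self):
        self.steps += 1

class SchedCB:
    """Hypothetical callback that stores a scheduler *factory*."""
    def __init__(self, sched):
        self.sched = sched  # a callable, not a scheduler instance

    def before_fit(self, opt):
        # Calling the factory creates a brand-new scheduler object
        self.schedo = self.sched(opt)

sched_factory = partial(DummySched, max_lr=0.1)
cb = SchedCB(sched_factory)
cb.before_fit(opt="fake-optimizer")
cb.schedo.step()  # mutates the scheduler object only, not the factory

print(type(cb.sched).__name__, type(cb.schedo).__name__)
```

Because `self.sched` and `self.schedo` are different objects, stepping the scheduler leaves the factory untouched, and a second fit starts from a fresh scheduler.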


oh ok. Yeah, that makes sense now. Thanks.


Going through the lesson again, I’ve noticed that we don’t pass the norm to the `_conv_block` in the `ResBlock`, so this awesome result is without batchnorm.

```
class ResBlock(nn.Module):
    def __init__(self, ni, nf, stride=1, ks=3, act=act_gr, norm=None):
        super().__init__()
        self.convs = _conv_block(ni, nf, stride, act=act, ks=ks)  # This line is missing norm=norm
```

Fixing the issue gives a lower result of `0.918` (without norm it was `0.922`), but I haven’t played with `lr` yet.


Oops! Well spotted.

I still get 0.922 after fixing it FYI.

Fixing the batchnorm problem, and then removing the line that inits conv2 bn weights to zero, results in all the models I’ve tried so far getting better results.


I’ve updated the “leaderboard” topic with the latest results now:

Regarding calculating flops for models, I discovered that the fvcore library includes a flop counter for PyTorch models.


What blows my mind watching this: the weights are all initialised randomly, and they converge in a few epochs to such a high level of accuracy, not to mention the fully hand-rolled training loop and model architecture.


I’m currently on lesson 17 and it’s just excellent! After seeing your comment, I can’t stop myself from watching lesson 18.


As Jeremy mentioned, the proposed homework for this lesson was indeed a great learning exercise. I had to review Part 2 to practice what we have been taught about Python, PyTorch and miniai. My implementation ended up being inspired by fastai’s scheduler.

It implements `SchedCos`, `SchedExp`, `SchedExpFastai`, `SchedLin`, `SchedNo` and `SchedPoly`.

It is also possible to combine schedulers with `CombineScheds`, and it includes `OneCycleSched` and `FlatCosSched`.
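For readers who have not tried the homework, here is a rough sketch of the idea in plain Python. It is not the code from the notebook: the function names (`sched_lin`, `sched_cos`, `combine_scheds`, `one_cycle`) and the default values are assumptions chosen for illustration. Each schedule maps a training position `pos` in `[0, 1]` to a value, and combining schedules means handing each one its own slice of that range:

```python
import math

def sched_lin(start, end, pos):
    """Linear interpolation from start (pos=0) to end (pos=1)."""
    return start + pos * (end - start)

def sched_cos(start, end, pos):
    """Cosine annealing from start (pos=0) to end (pos=1)."""
    return start + (1 + math.cos(math.pi * (1 - pos))) * (end - start) / 2

def combine_scheds(pcts, scheds):
    """Run each schedule over its fraction of training; pcts must sum to 1."""
    assert len(pcts) == len(scheds) and abs(sum(pcts) - 1.0) < 1e-9
    bounds = [0.0]
    for p in pcts:
        bounds.append(bounds[-1] + p)

    def _inner(pos):
        for i, sched in enumerate(scheds):
            if pos <= bounds[i + 1] or i == len(scheds) - 1:
                # Rescale the global position to this schedule's local [0, 1]
                local = (pos - bounds[i]) / (bounds[i + 1] - bounds[i])
                return sched(min(max(local, 0.0), 1.0))
    return _inner

def one_cycle(lr_max, pct_start=0.25, div=25.0, div_final=1e5):
    """Cosine warm-up to lr_max, then cosine annealing (illustrative defaults)."""
    warm = lambda pos: sched_cos(lr_max / div, lr_max, pos)
    anneal = lambda pos: sched_cos(lr_max, lr_max / div_final, pos)
    return combine_scheds([pct_start, 1 - pct_start], [warm, anneal])

sched = one_cycle(lr_max=0.1)
lrs = [sched(i / 10) for i in range(11)]  # learning rate at 0%, 10%, ..., 100%
print([round(lr, 4) for lr in lrs])
```

In a training loop, `pos` would be the current batch number divided by the total number of batches, and the returned value would be written into the optimizer's learning rate before each step.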

Here is the notebook:


Update: I like using the module summary tools included with TorchEval more than the fvcore library. You can convert the markdown table to a Pandas DataFrame to make it easily filterable.

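```
import pandas as pd

def markdown_to_pandas(table_string):
    """Convert a markdown table string into a pandas DataFrame."""
    rows = table_string.strip().split("\n")
    # First row is the header; drop the empty cells outside the outer pipes
    header = [x.strip() for x in rows[0].split("|")[1:-1]]
    # rows[1] is the |---|---| separator, so the data starts at rows[2]
    data = [[x.strip() for x in row.split("|")[1:-1]] for row in rows[2:]]
    return pd.DataFrame(data, columns=header)
```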

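```
import torch
from torcheval.tools.module_summary import get_module_summary

# train_dataset, device and style_transfer_model come from the surrounding notebook
test_inp = torch.randn(1, 3, *[train_dataset.size]*2).to(device)
summary_df = markdown_to_pandas(f"{get_module_summary(style_transfer_model, [test_inp])}")
# Keep the top-level module (row 0) plus every Conv2d layer
summary_df[(summary_df.index == 0) | (summary_df['Type'] == 'Conv2d')]
```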

This generates the following table:

| | Type | # Parameters | # Trainable Parameters | Size (bytes) | Contains Uninitialized Parameters? | Forward FLOPs | Backward FLOPs | In size |
|---|---|---|---|---|---|---|---|---|
| 0 | TransformerNet | 393 K | 393 K | 1.6 M | No | 6.9 G | 13.6 G | [1, 3, 512, 512] |
| 3 | Conv2d | 448 | 448 | 1.8 K | No | 113 M | 113 M | [1, 3, 514, 514] |
| 6 | Conv2d | 136 | 136 | 544 | No | 33.6 M | 67.1 M | [1, 16, 512, 512] |
| 11 | Conv2d | 528 | 528 | 2.1 K | No | 33.6 M | 67.1 M | [1, 32, 256, 256] |
| 18 | Conv2d | 36.9 K | 36.9 K | 147 K | No | 603 M | 1.2 G | [1, 64, 130, 130] |
| 22 | Conv2d | 36.9 K | 36.9 K | 147 K | No | 603 M | 1.2 G | [1, 64, 130, 130] |
| 28 | Conv2d | 36.9 K | 36.9 K | 147 K | No | 603 M | 1.2 G | [1, 64, 130, 130] |
| 32 | Conv2d | 36.9 K | 36.9 K | 147 K | No | 603 M | 1.2 G | [1, 64, 130, 130] |
| 38 | Conv2d | 36.9 K | 36.9 K | 147 K | No | 603 M | 1.2 G | [1, 64, 130, 130] |
| 42 | Conv2d | 36.9 K | 36.9 K | 147 K | No | 603 M | 1.2 G | [1, 64, 130, 130] |
| 48 | Conv2d | 36.9 K | 36.9 K | 147 K | No | 603 M | 1.2 G | [1, 64, 130, 130] |
| 52 | Conv2d | 36.9 K | 36.9 K | 147 K | No | 603 M | 1.2 G | [1, 64, 130, 130] |
| 58 | Conv2d | 36.9 K | 36.9 K | 147 K | No | 603 M | 1.2 G | [1, 64, 130, 130] |
| 62 | Conv2d | 36.9 K | 36.9 K | 147 K | No | 603 M | 1.2 G | [1, 64, 130, 130] |
| 66 | Conv2d | 16.5 K | 16.5 K | 66.0 K | No | 268 M | 536 M | [1, 128, 128, 128] |
| 71 | Conv2d | 4.2 K | 4.2 K | 16.6 K | No | 268 M | 536 M | [1, 64, 256, 256] |
| 77 | Conv2d | 435 | 435 | 1.7 K | No | 113 M | 226 M | [1, 16, 514, 514] |


Two new optimisers were recently published: Lion (Chen 2023) and D-Adaptation (Defazio 2023). Both need a few more epochs to get good results but are very competitive with AdamW.

I had a deeper look at Lion: it is simpler, faster, and smaller than Adam or DAdaptAdam.

It exposes a somewhat hidden fact: when things go well, Adam updates each parameter by roughly the learning rate, ignoring the gradient scale. Lion makes this explicit by using only the sign of the update.

Have a look at how easy it is (the code updates only one parameter for simplicity):

```
def sgd(lr):  # for comparison with lion
    def sgd_step(w, g):
        return w - lr * g
    return sgd_step
```

```
import numpy as np

def lion(lr=0.1, b1=0.9, b2=0.99):
    lion.exp_avg = 0  # state shared between multiple calls to lion_step
    def lion_step(w, g):
        sign = np.sign(lion.exp_avg * b1 + g * (1 - b1))  # sign is -1, 0 or 1
        lion.exp_avg = lion.exp_avg * b2 + (1 - b2) * g
        return w - lr * sign
    return lion_step
```
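To see the two update rules side by side, here is a small self-contained sketch that minimises the toy loss `(w - 3)**2` with each of them. The helper names (`make_lion`, `make_sgd`, `minimise`) and the toy problem are assumptions made for illustration. Unlike the snippet above, this version keeps the momentum in a closure variable rather than on the function object, so each optimiser instance has its own state:

```python
def sign(x):
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)

def make_lion(lr=0.1, b1=0.9, b2=0.99):
    exp_avg = 0.0  # momentum, private to this optimiser instance

    def lion_step(w, g):
        nonlocal exp_avg
        # The step direction depends only on the sign, never on the gradient scale
        update = sign(exp_avg * b1 + g * (1 - b1))
        exp_avg = exp_avg * b2 + (1 - b2) * g
        return w - lr * update
    return lion_step

def make_sgd(lr=0.1):
    return lambda w, g: w - lr * g

def minimise(step, w=0.0, n_steps=200):
    """Minimise the toy loss (w - 3)**2 starting from w."""
    for _ in range(n_steps):
        g = 2 * (w - 3.0)  # gradient of (w - 3)**2
        w = step(w, g)
    return w

w_sgd = minimise(make_sgd())
w_lion = minimise(make_lion())
print(f"sgd: {w_sgd:.4f}  lion: {w_lion:.4f}")
```

Since every Lion step has the fixed size `lr`, the parameter overshoots the minimum and then hovers around it instead of settling exactly, which is one reason Lion is usually paired with a smaller learning rate or a decaying schedule.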

@Mkardas made a nice notebook exploring how those optimisers work with one variable. I will share it here once we get it polished.


‘fastai native and fused ForEach implementations’ of Lion are also available in Benjamin’s (@bwarner) `fastxtend`:

fastxtend - Lion: EvoLved Sign Momentum Optimizer


Hi all,

Would someone be able to walk through how to calculate the number of parameters for the first layer of the resnet models at about 1h:20m in and 1h:30m in? I went back to the convolutions excel but wasn’t able to piece it together (think I’m having trouble conceptualizing how the resnet addition increases number of params).

I.e. what’s the math to get to 680 params for the first layer of the first example and 6864 params for the first layer of the second example?

Thanks!

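@pack765 A simple way to calculate the number of params of a conv layer is `((kernel * ni) + 1) * nf`, where `kernel` is the number of weights in one kernel slice (`ks * ks`), `ni` is the number of input channels, `nf` is the number of output channels, and the `+ 1` is the bias. The demo below might help understand this.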
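As a worked check, the sketch below applies that formula to a whole `ResBlock`. It rests on several assumptions that are not stated in the posts: that the block consists of two `ks × ks` convs with bias plus a 1×1 conv on the identity path when `ni != nf`, that no norm layers contribute parameters (consistent with the missing `norm=norm` discussed earlier in this thread), and that the two examples are `(ni=1, nf=8, ks=3)` and `(ni=1, nf=16, ks=5)`. The helper names are hypothetical. Under those assumptions the totals match the numbers in the question:

```python
def conv_params(ni, nf, ks, bias=True):
    """Parameters of a 2D conv: ks*ks*ni weights (+1 bias) per output channel."""
    return (ks * ks * ni + (1 if bias else 0)) * nf

def resblock_params(ni, nf, ks):
    """Assumed ResBlock layout: conv(ni->nf), conv(nf->nf), 1x1 idconv if ni != nf."""
    main = conv_params(ni, nf, ks) + conv_params(nf, nf, ks)
    shortcut = conv_params(ni, nf, 1) if ni != nf else 0
    return main + shortcut

print(resblock_params(1, 8, ks=3))   # 80 + 584 + 16 = 680
print(resblock_params(1, 16, ks=5))  # 416 + 6416 + 32 = 6864
```

The addition in the residual connection has no parameters of its own; the extra count comes from the second conv and the 1×1 conv that matches the channel count on the identity path.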


I’ve written a blog post attempting to explain annealing and an implementation of Cosine Annealing using the LRFinder(). I would appreciate any feedback on how to improve it or the website. https://the-learning-mechanic.github.io