Lesson 18 official topic

This is a wiki post - feel free to edit to add links from the lesson or other useful info.

<<< Lesson 17 | Lesson 19 >>>

Lesson resources

Links from the lesson

12 Likes

When we do self.schedo = self.sched , isn’t there a risk that modifying self.schedo would change self.sched? Or is that intended to happen?

We don’t do that - we do self.schedo = self.sched(learn.opt), which calls the sched callable to create the object.

2 Likes

oh ok. Yeah, that makes sense now. Thanks.

1 Like

Going through the lesson again, I’ve noticed that we don’t pass the norm to the _conv_block in the ResBlock, so this awesome result is without batchnorm.

class ResBlock(nn.Module):
    def __init__(self, ni, nf, stride=1, ks=3, act=act_gr, norm=None):
        super().__init__()
        self.convs = _conv_block(ni, nf, stride, act=act, ks=ks) # This line is missing norm=norm

Fixing the issue gives a lower result of 0.918 (without norm it was 0.922), but I haven’t played with the lr yet.
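For completeness, the fix described in the comment above is just to pass the norm through:

    self.convs = _conv_block(ni, nf, stride, act=act, ks=ks, norm=norm)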

1 Like

Oops! Well spotted.

I still get 0.922 after fixing it FYI.

Fixing the batchnorm problem, and then removing the line that inits conv2 bn weights to zero, results in all the models I’ve tried so far getting better results.
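For anyone following along, here is a tiny sketch (illustrative only, not the miniai code) of what that zero-init does: with the second conv’s BatchNorm weight set to zero, the residual branch initially contributes nothing, so the block starts out as an identity.

import torch
import torch.nn as nn

bn = nn.BatchNorm2d(16)    # stand-in for the BatchNorm after the ResBlock's second conv
nn.init.zeros_(bn.weight)  # the zero-init line being removed above
x = torch.randn(2, 16, 8, 8)
print(bn(x).abs().max())   # 0: this branch adds nothing at initialisation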

1 Like

I’ve updated the “leaderboard” topic with the latest results now:

Regarding calculating FLOPs for models, I discovered that the fvcore library includes a FLOP counter for PyTorch models.
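In case it’s useful, here is a minimal sketch of how I’d use it (the torchvision model and input shape are just placeholders):

import torch
from torchvision.models import resnet18
from fvcore.nn import FlopCountAnalysis, flop_count_table

model = resnet18()                     # placeholder model
inp = torch.randn(1, 3, 224, 224)      # dummy input with the shape the model expects
flops = FlopCountAnalysis(model, inp)  # traces the model and counts FLOPs per op
print(flops.total())                   # total FLOPs for one forward pass
print(flop_count_table(flops))         # per-module breakdown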

3 Likes

What blows my mind watching this: the weights are all initialised randomly, and they all converge in a few epochs to such a high level of accuracy, not to mention the fully hand-rolled training loop and model architecture. :exploding_head:

4 Likes

I’m currently in lesson 17 and it’s just excellent! After seeing your comment, I can’t stop myself from watching lesson 18 :smiley:

2 Likes

As Jeremy mentioned, the proposed homework for this lesson was indeed a great learning exercise. I had to review Part 2 to practice what we have been taught about Python, PyTorch and miniai. It ended up being inspired by fastai’s scheduler.

It implements SchedCos, SchedExp, SchedExpFastai, SchedLin, SchedNo and SchedPoly.
It is also possible to combine schedulers with CombineScheds, and it includes OneCycleSched and FlatCosSched.

Here is the notebook:

Lesson 18 homework - Creating schedulers for miniai
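As a rough illustration of the idea (a minimal sketch, not the notebook’s actual code), a cosine schedule can be written as a function of training progress pos in [0, 1], in the spirit of fastai’s SchedCos:

import math

def sched_cos(start, end):
    # anneal from `start` to `end` along half a cosine wave
    def _inner(pos):  # pos = training progress in [0, 1]
        return start + (1 + math.cos(math.pi * (1 - pos))) * (end - start) / 2
    return _inner

sched = sched_cos(1e-2, 1e-4)
print(sched(0.0), sched(0.5), sched(1.0))  # 0.01 -> ~0.005 -> 0.0001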

5 Likes

Update: I like using the module summary tools included with TorchEval more than the fvcore library. You can convert the markdown table to a Pandas DataFrame to make it easily filterable.

import pandas as pd
import torch
from torcheval.tools import get_module_summary  # TorchEval's module summary helper

def markdown_to_pandas(table_string):
    # parse the markdown table produced by the module summary into a DataFrame
    rows = table_string.strip().split("\n")
    header = [x.strip() for x in rows[0].split("|")[1:-1]]
    data = [[x.strip() for x in row.split("|")[1:-1]] for row in rows[2:]]  # rows[1] is the |---| separator
    return pd.DataFrame(data, columns=header)

test_inp = torch.randn(1, 3, *[train_dataset.size]*2).to(device)  # dummy input at the training resolution
summary_df = markdown_to_pandas(f"{get_module_summary(style_transfer_model, [test_inp])}")
summary_df[(summary_df.index == 0) | (summary_df['Type'] == 'Conv2d')]  # model total plus every Conv2d layer

This generates the following table:

| | Type | # Parameters | # Trainable Parameters | Size (bytes) | Contains Uninitialized Parameters? | Forward FLOPs | Backward FLOPs | In size |
|---|---|---|---|---|---|---|---|---|
| 0 | TransformerNet | 393 K | 393 K | 1.6 M | No | 6.9 G | 13.6 G | [1, 3, 512, 512] |
| 3 | Conv2d | 448 | 448 | 1.8 K | No | 113 M | 113 M | [1, 3, 514, 514] |
| 6 | Conv2d | 136 | 136 | 544 | No | 33.6 M | 67.1 M | [1, 16, 512, 512] |
| 11 | Conv2d | 528 | 528 | 2.1 K | No | 33.6 M | 67.1 M | [1, 32, 256, 256] |
| 18 | Conv2d | 36.9 K | 36.9 K | 147 K | No | 603 M | 1.2 G | [1, 64, 130, 130] |
| 22 | Conv2d | 36.9 K | 36.9 K | 147 K | No | 603 M | 1.2 G | [1, 64, 130, 130] |
| 28 | Conv2d | 36.9 K | 36.9 K | 147 K | No | 603 M | 1.2 G | [1, 64, 130, 130] |
| 32 | Conv2d | 36.9 K | 36.9 K | 147 K | No | 603 M | 1.2 G | [1, 64, 130, 130] |
| 38 | Conv2d | 36.9 K | 36.9 K | 147 K | No | 603 M | 1.2 G | [1, 64, 130, 130] |
| 42 | Conv2d | 36.9 K | 36.9 K | 147 K | No | 603 M | 1.2 G | [1, 64, 130, 130] |
| 48 | Conv2d | 36.9 K | 36.9 K | 147 K | No | 603 M | 1.2 G | [1, 64, 130, 130] |
| 52 | Conv2d | 36.9 K | 36.9 K | 147 K | No | 603 M | 1.2 G | [1, 64, 130, 130] |
| 58 | Conv2d | 36.9 K | 36.9 K | 147 K | No | 603 M | 1.2 G | [1, 64, 130, 130] |
| 62 | Conv2d | 36.9 K | 36.9 K | 147 K | No | 603 M | 1.2 G | [1, 64, 130, 130] |
| 66 | Conv2d | 16.5 K | 16.5 K | 66.0 K | No | 268 M | 536 M | [1, 128, 128, 128] |
| 71 | Conv2d | 4.2 K | 4.2 K | 16.6 K | No | 268 M | 536 M | [1, 64, 256, 256] |
| 77 | Conv2d | 435 | 435 | 1.7 K | No | 113 M | 226 M | [1, 16, 514, 514] |
3 Likes

Two new optimisers were recently published: Lion (Chen 2023) and dadaptation (Defazio 2023). Both need a few more epochs to get good results but are very competitive with AdamW.

I had a deeper look at Lion; it is simpler, faster and smaller than Adam or DAdaptAdam.

It exposes a somewhat hidden fact about Adam: when things are going well, it updates parameters by roughly the learning rate, ignoring the gradient's scale; Lion just makes this explicit.
Have a look at how easy it is (the code updates only one parameter for simplicity):

import numpy as np

def sgd(lr):  # for comparison with lion
    def sgd_step(w, g):
        return w - lr * g  # step size scales with the gradient magnitude
    return sgd_step

def lion(lr=0.1, b1=0.9, b2=0.99):
    lion.exp_avg = 0  # shared state between multiple calls to lion_step
    def lion_step(w, g):
        sign = np.sign(lion.exp_avg * b1 + g * (1 - b1))  # sign is +1 or -1
        lion.exp_avg = lion.exp_avg * b2 + (1 - b2) * g
        return w - lr * sign  # step size is always exactly lr
    return lion_step
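A quick toy comparison using the two helpers above (the numbers are made up): SGD's step scales with the gradient, while Lion always moves by exactly lr.

w, g = 1.0, 250.0            # made-up weight and a deliberately large gradient
print(sgd(lr=0.1)(w, g))     # -24.0: the SGD step blows up with the gradient
print(lion(lr=0.1)(w, g))    # 0.9: Lion steps by lr regardless of the gradient's scale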

@Mkardas made a nice notebook exploring how those optimisers work with one variable; I will share it here once we get it polished.

3 Likes

‘fastai native and fused ForEach implementations’ are also available in Benjamin’s (@ bwarner) fastxtend: fastxtend - Lion: EvoLved Sign Momentum Optimizer

1 Like

Hi all,

Would someone be able to walk through how to calculate the number of parameters for the first layer of the resnet models, at about 1h20m and 1h30m into the video? I went back to the convolutions Excel but wasn’t able to piece it together (I think I’m having trouble conceptualizing how the resnet addition increases the number of params).

I.e., what’s the math to get to 680 params for the first layer of the first example and 6864 params for the first layer of the second example?

Thanks!

@pack765 A simple way to calculate the number of params for a conv layer is ((ks * ks * ni) + 1) * nf, i.e. the kernel area times the input channels, plus one for the bias, all multiplied by the number of output filters. The demo below might help you understand this.
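As a quick sanity check of that formula (a toy example of my own, not from the lesson), you can compare it with an actual nn.Conv2d:

import torch.nn as nn

def conv_params(ni, nf, ks):
    return (ks * ks * ni + 1) * nf  # kernel area x input channels, +1 for the bias, per output filter

conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
n_params = sum(p.numel() for p in conv.parameters())
print(n_params, conv_params(3, 8, 3))  # both 224 = (3*3*3 + 1) * 8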

1 Like

I’ve written a blog post attempting to explain annealing, with an implementation of cosine annealing using the LRFinder(). I would appreciate any feedback on how to improve it or the website. https://the-learning-mechanic.github.io :pray: