With the inlining I get 22 seconds per epoch with ReLU and 24 seconds per epoch with Mish, so Mish is roughly 9% slower. I’m not sure about the memory difference though.
As mentioned, the Imagewoof/Imagenette setup is not great for measuring timing. The dataset is so small that the per-epoch dataloader transitions from train to test and back take up a lot of relative time, adding proportionally large overhead and variability. Train or validate on a bigger dataset like ImageNet itself to get a better measure.
A comparative measure taken right before the end of a longer validation run. The numbers in brackets are cumulative averages and quite stable at this stage. GPU utilization is 99%.
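As a rough way to sanity-check the activation overhead outside a full training run, something like the micro-benchmark below works. This is only a sketch of mine, not the setup used for the numbers above; it uses a naive, unfused Mish, so the absolute timings won’t match the inlined version being discussed.

import time
import torch
import torch.nn as nn
import torch.nn.functional as F

class NaiveMish(nn.Module):
    # mish(x) = x * tanh(softplus(x)), composed from standard ops (not inlined/fused)
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))

def bench(act, iters=100, shape=(64, 128, 56, 56)):
    # time forward + backward of one activation over a conv-sized tensor
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    x = torch.randn(shape, device=device, requires_grad=True)
    if device == 'cuda': torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        act(x).sum().backward()
        x.grad = None
    if device == 'cuda': torch.cuda.synchronize()
    return (time.time() - start) / iters

print('relu:', bench(nn.ReLU()))
print('mish:', bench(NaiveMish()))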
Isn’t GPU memory usage directly linked to the number of parameters?
Parameters are just part of it. The input size, the parameters, and a whole lot of little details regarding the forward and backward mechanics and the caching allocator determine the practical memory usage for a given task. Even changing the arguments of a given conv can have a significant impact on the cuDNN workspace size for the forward and/or backward pass, as cuDNN may select a different algorithm (Winograd vs. GEMM vs. others), each of which uses a different tensor data layout and requires different allocations.
The EfficientNets are a great example: even at roughly 1/10 the parameter count, they actually use as much or more GPU memory for a given performance range (accuracy). The increased input size and the conv algorithm selection seem to result in larger workspace sizes, and activations like Swish are (currently) implemented as sequences of Python ops, which means more operations and more intermediate tensors kept around.
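To make that concrete, here is a small sketch (my own, assuming a CUDA GPU and PyTorch’s memory stats API) showing that the same conv layer, with the exact same parameter count, can have very different peak memory depending only on the input size:

import torch
import torch.nn as nn

def peak_mem_mb(model, input_shape):
    # peak allocated memory for one forward + backward pass, in MB
    torch.cuda.reset_max_memory_allocated()
    x = torch.randn(input_shape, device='cuda')
    model(x).sum().backward()
    return torch.cuda.max_memory_allocated() / 1024**2

conv = nn.Conv2d(64, 64, 3, padding=1).cuda()   # same parameter count in both runs
print(peak_mem_mb(conv, (32, 64, 64, 64)))      # smaller input
print(peak_mem_mb(conv, (32, 64, 224, 224)))    # larger input -> much higher peak memory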
@LessW2020 @Seb I am writing a walkthrough notebook on this for my meetup in two weeks, and I’m trying to go through and gather all of the papers that were used. Where did the flat + anneal schedule originate from?
Otherwise here is what I’ve gathered so far (I’ll update this post here in case anyone wants a quick reference to the papers):
@grankin came up with flat+anneal I believe.
simpleselfattention is inspired by Self-Attention GANs; I heavily modified their layer and came up with the positioning in xresnet that we used. Maybe I’ll write a paper if I find a good use for it. https://github.com/sdoria/SimpleSelfAttention (I need to improve that readme)
(I should add that @grankin implemented a “symmetrical” version, which we didn’t use here, and participated in the testing)
A big jump from the leaderboard was fixing that learning rate oversight in learn.py
Another one was from using the full size dataset rather than Imagewoof160
Another smaller one was adding more channels in xresnet.
I think you got the rest.
Thanks Seb! I appreciate the double check. @LessW2020 thanks for the post! I believe you missed the LARS paper though
I will certainly reference him twice then. Once I have the notebook written up, I can post it on your forum if you think it would be nice, Less. I won’t do the 5-for-5 runs like we have been; it’s Colab, so it’ll just be 1 run of 5 for each effort.
I also just put in an “Other Equally Important Notables” section for the non-papers.
I made the promised post to try and provide an overview for everyone on the new techniques we’ve been using here:
re: missed paper - Good spot @muellerzr - I’ve added the LARS link!
Re: notebook - yes, please add it to the github repo, that will be a nice add for sure. Thanks for making that list of papers, that’s a big help for anyone to delve into more details.
@Seb - I had to stretch to summarize the self-attention aspect in my post, so I’ve referenced you in that thread for people to ask for a tutorial about it. It does look promising though, after seeing the results here and a quick read of the paper.
Thanks Less! I will work on it sometime this week, as converting the scheduler to a callback is… a welcome challenge. I’m following this for the scheduler, but I think if I follow the fit_one_cycle code I should be fine.
If I see results that show we can train Imagenette/woof to convergence faster with ssa on different image sizes, then a paper would make sense. So far I’ve only seen that on Imagewoof128 and it didn’t work (equal results) on Imagewoof256. Weird!
Re: Mixup. I’ve just gone by Jeremy’s intuition on the leaderboard. He uses Mixup for 80 epochs and more. When I did runs on 80 epochs, I used it.
I did briefly test with it but similar to @Seb, I figured if Jeremy wasn’t using it then it wasn’t a high priority.
That said, I did see consistently better short-term validation results with it (i.e. the validation curve sat well below the training curve) versus without, but at the same time, at least with OneCycle, I didn’t end up any more accurate.
So I think it’s worth testing now that we have the new lr schedule and Ranger…and for that matter, I think progressive sprinkles is another thing to test as I had really good luck with that (better than cutmix usually).
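If anyone wants to run that mixup test, here is a minimal sketch assuming fastai v1. The dataset and architecture below are just placeholder choices for illustration, not the exact setup from the runs above; the mixup part itself is the one-line .mixup() call on the Learner.

from fastai.vision import *

# illustrative setup only: Imagewoof-160 at size 128 with an unpretrained resnet50
path = untar_data(URLs.IMAGEWOOF_160)
data = ImageDataBunch.from_folder(path, valid='val', ds_tfms=get_transforms(),
                                  size=128, bs=64).normalize(imagenet_stats)
learn = cnn_learner(data, models.resnet50, pretrained=False, metrics=accuracy)
learn = learn.mixup(alpha=0.4)   # alpha is the Beta-distribution parameter
learn.fit_one_cycle(5, 4e-3)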
I made a new simple(r) setup for flat+cosine, fcfit:
#flat and cosine annealer - @mgrankin invented
#let's make it fast and easy - @lessw2020
from fastai.callbacks.general_sched import TrainingPhase, GeneralScheduler
from fastai.callback import annealing_cos

def fcfit(learn, num_epoch=2, lr=4e-3, start_pct=.72, f_show_curve=True):
    if num_epoch < 1:
        raise ValueError("num_epoch must be 1 or higher")
    n = len(learn.data.train_dl)
    anneal_start = int(n*num_epoch*start_pct)  #batch at which annealing starts
    batch_finish = n*num_epoch - anneal_start  #batches spent annealing
    phase0 = TrainingPhase(anneal_start).schedule_hp('lr', lr)
    phase1 = TrainingPhase(batch_finish).schedule_hp('lr', lr, anneal=annealing_cos)
    phases = [phase0, phase1]
    sched = GeneralScheduler(learn, phases)
    #save the setup
    learn.callbacks.append(sched)
    #start the training
    print(f"fcfit: num_epochs: {num_epoch}, lr = {lr}")
    print(f"Flat for {anneal_start} batches, then cosine anneal for {batch_finish} batches")
    learn.fit(num_epoch)
    #bonus - show lr curve?
    if f_show_curve:
        learn.recorder.plot_lr()
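Usage is then a one-liner; for example (the Learner setup is whatever you already have, e.g. the xresnet + Ranger combination used in this thread):

fcfit(learn, num_epoch=5, lr=4e-3, start_pct=0.72)  #flat at 4e-3 for ~72% of batches, then cosine anneal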