Dog breed exercise taking too long to train on Azure?


Sorry, I didn’t find an appropriate post to add this to, hence creating a new one.

On Azure NC6 instance, with the dog breed exercise:

  • …, 3, cycle_len=1, cycle_mult=2) -> 52 minutes
  • …, 3, cycle_len=1) -> 22 minutes

Are these durations expected? Or are they too long, thereby pointing to a problem with my environment?

As an aside, I am planning to follow Faster experimentation for better learning to speed up my coding iterations, and then fire off the full training run later.


Can you check if it’s using the GPU? You can do that by:

import torch
torch.cuda.is_available()  # should return True if PyTorch can see the GPU

Thanks @ramesh.

Yes, that returns True. And I can see GPU usage with “watch nvidia-smi -q -g 0 -d UTILIZATION -l”.

I take it that you also think the duration is longer than expected?

Hi, I’ve tried the plant seedlings exercise on an NC6 and your run time seems OK. Speed also depends on the image size, batch size, and architecture; you can try changing those if you need it to be faster. ResNet is slower above 50 layers, and I think Inception will also be faster.

Thanks @kanishkd4.

Is there a place I can find the list of architectures to play with? We cover 2 of them in Lesson 2, and you mention a 3rd. Are there others, with their respective pros/cons?

I think a graph (accuracy vs. operations) in that blog post was also covered in one of the lectures. It can be hard to find pre-trained weights for all architectures; the fastai library can download some automatically, and you can find another few here.


Try reducing the image size to 256. If you have very high-resolution images, computing all the layers will be slower.
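As a rough back-of-the-envelope illustration (my own sketch, not from this thread): convolutional compute scales roughly with the spatial area of the input, so shrinking the side length of the images pays off quadratically. The 224px baseline below is an assumption (the usual ImageNet input size):

```python
# Rough sketch: conv-layer compute scales ~linearly with input area
# (height x width), so cost relative to a square baseline grows with
# the square of the side length.
def relative_conv_cost(side, baseline=224):
    """Approximate per-image conv cost relative to a square baseline input."""
    return (side / baseline) ** 2

print(relative_conv_cost(256))  # ~1.31x a 224px input
print(relative_conv_cost(512))  # ~5.22x a 224px input
```

This ignores architecture details (strides, pooling, fully connected layers), but it is a reasonable first-order guide to why large images slow down each epoch.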

Freeze all layers except the final few. That might also help with epoch runtime.
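A minimal PyTorch sketch of that freezing idea (the tiny Sequential model below is a hypothetical stand-in for a pretrained backbone, not the actual dog-breed model):

```python
import torch.nn as nn

# Hypothetical stand-in model; in practice this would be a pretrained
# backbone with a classification head on top.
model = nn.Sequential(
    nn.Linear(8, 8),   # "early" layer we freeze
    nn.ReLU(),
    nn.Linear(8, 2),   # final "head" layer we keep trainable
)

# Freeze every parameter, then re-enable gradients only for the last layer.
for p in model.parameters():
    p.requires_grad = False
for p in model[-1].parameters():
    p.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(trainable)  # 18: only the 8x2 weight matrix + 2 biases of the head
```

Frozen parameters get no gradients, so the backward pass and optimizer step touch far fewer weights, which is where the epoch-time saving comes from.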