Since you don't mention what dataset you're using or what network you're training, that's where I would look first for the likely reason it isn't training as fast as you'd like.
For example, with a small/toy dataset like Fashion-MNIST you will have a hard time saturating an A100. In fact, I see similar training speeds on an M1 Pro, a 2080 Ti, and an A100 with heavily optimised data loading code. I've spent quite some time making my models train fast, so I don't have much incentive to re-check where the bottleneck is now, but I got roughly 5x gains just by tuning the data loading.
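If you want to try this yourself, the usual throughput knobs look something like the sketch below. I'm assuming PyTorch here, and the dataset and values are illustrative placeholders, not my exact settings:

```python
import torch
from torch.utils.data import DataLoader

# `train_dataset` is a placeholder for any map-style Dataset.
loader = DataLoader(
    train_dataset,
    batch_size=256,
    shuffle=True,
    num_workers=8,           # parallel worker processes for decoding/augmentation
    pin_memory=True,         # page-locked host memory speeds up host-to-GPU copies
    persistent_workers=True, # keep workers alive between epochs
    prefetch_factor=4,       # batches each worker prepares ahead of time
)
```

Profiling one epoch with and without these changes is usually enough to see whether data loading is the bottleneck at all.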
But do you really need to invest time in fixing this? If the network trains fast enough, just switch the instance to something older and cheaper and run a few experiments at the same time.
If you do want to tune, the easiest win is to load the whole dataset into memory in a form that is fast to slice into batches, as in the sketch below. This alone can give you a substantial gain.
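A minimal sketch of that idea, assuming the dataset fits in RAM (the shapes and batch size below are placeholder values; the random tensors stand in for your real, already-preprocessed data -- Fashion-MNIST as float32 is only ~190 MB):

```python
import torch

# Stand-ins for a preprocessed dataset held entirely in memory.
images = torch.rand(60_000, 1, 28, 28)
labels = torch.randint(0, 10, (60_000,))

def iterate_batches(images, labels, batch_size=256, device="cuda"):
    """Yield shuffled mini-batches by slicing in-memory tensors."""
    perm = torch.randperm(len(images))
    for start in range(0, len(images), batch_size):
        idx = perm[start:start + batch_size]
        # Index the tensors directly: no worker processes,
        # no per-sample Python overhead, no disk reads per batch.
        yield images[idx].to(device), labels[idx].to(device)
```

Compared to a generic per-sample loader, this removes almost all of the Python and I/O overhead between batches, which is exactly where small datasets tend to lose GPU utilisation.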