Hi, I’m new here so hopefully this is the right place to post.
I’ve recently moved my training to AWS SageMaker notebooks and notebook jobs. I quickly found I was running out of memory, so I switched to an instance with more than one GPU. However, I’m struggling to distribute my training across both GPUs.
My understanding is that I need to use the code snippet from the fastai “Distributed training” docs and launch it with accelerate launch. However, I can see no obvious way to do the latter from a notebook instance, or from notebook jobs. My assumption is that I could create a script that holds all of my training code and then call it from my notebook, but this feels very messy.
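For concreteness, the script workaround I have in mind would look roughly like the sketch below. It is untested and rests on a few assumptions: `train.py` is a hypothetical file that would hold the fastai training code (with the fit call wrapped in `learn.distrib_ctx()` as the fastai docs describe), the `accelerate` CLI is on the PATH of the notebook kernel, and the helper names are my own.

```python
import subprocess


def build_launch_command(script_path, num_processes=2):
    """Build the `accelerate launch` command line for a training script.

    `script_path` is a hypothetical script (e.g. train.py) holding the
    training code; `num_processes` should match the number of GPUs.
    """
    return [
        "accelerate", "launch",
        "--num_processes", str(num_processes),
        script_path,
    ]


def run_training(script_path, num_processes=2):
    """Run the training script from a notebook cell and wait for it to finish.

    Assumes the `accelerate` CLI is installed and on PATH. Raises
    subprocess.CalledProcessError if the script exits with a non-zero code.
    """
    cmd = build_launch_command(script_path, num_processes)
    subprocess.run(cmd, check=True)


# Example notebook cell (not executed here):
# run_training("train.py", num_processes=2)
```

This keeps the notebook as the entry point, but the training code itself still has to live in a separate file, which is the part that feels messy to me.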
To clarify, I’m using “Amazon SageMaker Studio” with notebooks and notebook jobs. If possible, I’d like to continue using notebooks so that, at least during active development, I can write code once and use it for both dev and training. However, it feels like I may need to write it in a Python script and launch this as a ‘deployment’? Any help would be hugely appreciated!
Thanks