Distributed training on AWS SageMaker notebooks/notebook jobs

ollyrennard · April 14, 2023, 9:42am

Hi, I’m new here so hopefully this is the right place to post.

I’ve recently moved my training to AWS SageMaker notebooks and notebook jobs. I quickly found I was running out of memory, so I switched to an instance with more than one GPU. However, I’m struggling to distribute my training across both GPUs.

My understanding is that I need to use the code snippet from the fastai “Distributed training” docs and launch it with accelerate launch. However, I can see no obvious way to do the latter from a notebook instance, or from notebook jobs. My assumption is that I could put all of my training code in a script and call that from my notebook, but this feels very messy.
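For concreteness, here is a minimal sketch of what that script-based approach might look like from a notebook cell. It assumes the accelerate CLI is installed in the notebook’s environment and that the training code lives in a hypothetical file called train.py. The helper names build_launch_command and launch_training are made up for illustration and are not part of fastai or accelerate.

```python
import shlex
import subprocess
from typing import List


def build_launch_command(script: str, num_processes: int = 2) -> List[str]:
    """Build the argv for `accelerate launch`, one process per GPU.

    `script` is a hypothetical training script (e.g. train.py) that
    holds the fastai training code.
    """
    if num_processes < 1:
        raise ValueError("num_processes must be at least 1")
    cmd = ["accelerate", "launch"]
    if num_processes > 1:
        # Ask accelerate for multi-GPU (distributed) training.
        cmd.append("--multi_gpu")
    cmd += ["--num_processes", str(num_processes), script]
    return cmd


def launch_training(script: str = "train.py", num_processes: int = 2) -> int:
    """Run the training script in a subprocess and return its exit code."""
    cmd = build_launch_command(script, num_processes)
    print("Running:", shlex.join(cmd))
    # check=False so a failed run returns its exit code instead of raising.
    return subprocess.run(cmd, check=False).returncode
```

From a notebook cell this would be called as launch_training("train.py", num_processes=2). Inside train.py, the training call would presumably be wrapped in fastai’s learn.distrib_ctx() context manager, as the fastai distributed training docs describe.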

To clarify, I’m using “Amazon SageMaker Studio” with notebooks and notebook jobs. If possible, I’d like to keep using notebooks so that, at least during active development, I can write code once and use it for both dev and training. However, it feels like I may need to write it as a Python script and launch that as a ‘deployment’? Any help would be hugely appreciated!

Thanks

nglillywhite · April 17, 2023, 2:05am

Hey @ollyrennard,

Welcome to the forums, and thanks for posting your question!

I’m also looking to start using SageMaker/AWS with fast.ai. If you have any other resources, best practices, or guidance for this sort of ecosystem, I’d appreciate them.

I’m getting started in this world soon and will post back here if and when I run into a similar situation. Otherwise, shoot me a DM or Discord message and we can collaborate to figure this out.

ollyrennard · April 17, 2023, 12:59pm

Hey @nglillywhite, I’ve not made any progress with this specifically, as I need to do some reading, but I’d be happy to give some tips on SageMaker. I’ve found it a bit of a pain to get up and running with!