I have two RTX 3060 GPUs and am trying distributed training with fastai for segmentation, following the fastai Distributed training tutorial. With a single GPU I can train with a batch size of 4 without any issues, but with distributed training I run out of CUDA memory at the same batch size. Reducing the batch size to 2 leads to the same issue.
This is the code I am using in my Jupyter notebook:
```python
from accelerate import notebook_launcher
from fastai.vision.all import *
from fastai.distributed import *

def train():
    print('Creating DataLoader')
    dls = SegmentationDataLoaders.from_label_func(
        base_path, images, get_mask_label, valid_pct=0.2,
        seed=42, codes=codes, batch_tfms=aug_transforms(), bs=4,
        item_tfms=[Resize(size=256, method=ResizeMethod.Squish, pad_mode='zeros')])
    learn = unet_learner(dls, resnet50, metrics=accuracy).to_fp16()
    with learn.distrib_ctx(in_notebook=True, sync_bn=False):
        learn.fine_tune(3)

notebook_launcher(train, num_processes=2)
```
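
In case the memory numbers help with diagnosing this, here is a small helper I could call from inside `train()` to print per-process GPU memory. This is just my own sketch (the helper name and where to call it are my assumptions, not something from the tutorial):

```python
import torch

def report_gpu_memory(tag=''):
    # Report memory stats for the GPU this process is bound to; when called
    # inside train(), each of the two workers should print a different cuda index.
    dev = torch.cuda.current_device()
    alloc = torch.cuda.memory_allocated(dev) / 1024**2
    reserved = torch.cuda.memory_reserved(dev) / 1024**2
    total = torch.cuda.get_device_properties(dev).total_memory / 1024**2
    print(f'{tag} cuda:{dev} allocated={alloc:.0f} MiB '
          f'reserved={reserved:.0f} MiB total={total:.0f} MiB')
```

Calling it once after creating the DataLoaders and once inside `distrib_ctx` would at least show which GPU each worker ends up on and how much memory is already taken before `fine_tune` starts.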