Distributed training running out of CUDA memory

I have 2 RTX 3060s and am trying distributed training with fastai for segmentation, following the fastai - Distributed training tutorial. On a single GPU I can train with a batch size of 4 without any issues, but with distributed training I run out of CUDA memory at the same batch size. Reducing the batch size to 2 gives the same error.

This is the code I am using in my Jupyter notebook:

from fastai.vision.all import *
from fastai.distributed import *
from accelerate import notebook_launcher

def train():
    print('Creating DataLoader')
    dls = SegmentationDataLoaders.from_label_func(
        base_path, images, get_mask_label,
        valid_pct=0.2, seed=42, codes=codes, bs=4,
        batch_tfms=aug_transforms(),
        item_tfms=[Resize(size=256, method=ResizeMethod.Squish, pad_mode='zeros')])
    learn = unet_learner(dls, resnet50, metrics=accuracy).to_fp16()
    # run training inside the distributed context so each process drives one GPU
    with learn.distrib_ctx(in_notebook=True, sync_bn=False):
        learn.fine_tune(3)
        
notebook_launcher(train, num_processes=2)  # one process per RTX 3060
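
In case it helps with debugging, this is a per-process memory check I can add inside train() to compare the single-GPU run against the distributed one. It only uses the standard torch.cuda counters; report_gpu_memory is just a helper name I made up for this post, and I don't have the distributed numbers yet since the processes crash:

import torch

def report_gpu_memory(tag=''):
    # Each launched process sees its own default device, so this prints
    # the peak PyTorch allocation for the GPU that process is driving.
    device = torch.cuda.current_device()
    peak_alloc = torch.cuda.max_memory_allocated(device) / 1024 ** 3
    peak_reserved = torch.cuda.max_memory_reserved(device) / 1024 ** 3
    print(f'{tag} cuda:{device} peak allocated {peak_alloc:.2f} GiB, '
          f'peak reserved {peak_reserved:.2f} GiB')

I would call report_gpu_memory('after fine_tune') right after learn.fine_tune(3), and the same thing in the single-GPU notebook, to check whether both distributed processes end up allocating on the same card.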