Has anyone had success using distrib_ctx?

I have tried a variety of custom configurations with accelerate config and run my scripts with accelerate launch. I have been trying to get multi-GPU training working, but when I run my training script it either uses only one GPU or freezes/stalls completely, even though some memory is allocated on both GPUs.

Then I tried configuring multi-CPU training to see whether the accelerate configuration mattered at all. I got errors saying that too much memory had been allocated on CUDA device 0. That is my GPU! Why does it ignore the configuration and allocate memory on my GPU during multi-CPU training?

Accelerate Config File:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: 0,1
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
main_process_port: 0

Training Script:

from fastai.vision.all import *
from fastai.distributed import *
from fastai.vision.data import SegmentationDataLoaders
from fastai.vision.augment import Resize, aug_transforms, IntToFloatTensor
from fastai.data.transforms import get_image_files

path = Path('./data')

def get_label(o: Path) -> Path:
    return path/'masks'/o.name

dls = SegmentationDataLoaders.from_label_func(
    path,
    bs=4,
    fnames=get_image_files(path/"images"),
    label_func=get_label,
    codes=np.loadtxt('codes.txt', dtype=str),
    valid_pct=0.2,
    seed=42,
    item_tfms=Resize(640, method='crop'),
    batch_tfms=[*aug_transforms(), IntToFloatTensor(div=255)],
    num_workers=0
)

metrics = [foreground_acc, DiceMulti, JaccardCoeffMulti]
learn = unet_learner(dls, resnet18, metrics=metrics).to_fp16()

early_stopping = EarlyStoppingCallback(monitor='valid_loss', min_delta=0.0001, patience=5)
csv_logger = CSVLogger(Path(f'history/model.csv'), append=True)

with learn.distrib_ctx(sync_bn=False):
    learn.fine_tune(30, cbs=[early_stopping, csv_logger])

When using four GPUs, the main script runs four times, meaning the model is downloaded four times, which takes time and causes an out-of-memory error.
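To see why: accelerate launch starts one worker process per GPU, and each worker executes the entire script from the top, so per-process work such as downloading the pretrained weights is repeated once per GPU. Below is a minimal, standalone sketch that makes this visible; it relies only on the RANK and WORLD_SIZE environment variables the launcher sets, nothing from the training script itself.

import os

# Each process started by accelerate launch (or torchrun) runs this whole
# file; RANK and WORLD_SIZE are set by the launcher and identify the worker.
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))
print(f"process {rank} of {world_size} is executing the script")

Launched on four GPUs this prints four lines, one per process, which is exactly why the download in the script above happens four times.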

Hello,
The fastai.distributed module is designed to handle multi-GPU training on its own, and when you use accelerate launch, you are essentially telling Accelerate to do the same thing. This creates a conflict where they both try to manage the GPUs, leading to the issues you’ve described.

Specifically, the line from fastai.distributed import * is telling FastAI to set up its own distributed environment, which then interferes with how Accelerate tries to configure the multi-GPU setup.

To fix this, remove the FastAI distributed training imports. Your script should rely solely on Accelerate for handling the multi-GPU setup.

Here’s the corrected approach:

Remove the FastAI distributed import.
Change from fastai.distributed import * to simply import fastai, and drop the with learn.distrib_ctx(...): wrapper as well, since distrib_ctx is provided by that import. The rest of your script should remain the same; see the sketch after these steps.

Adjust your accelerate config.
Your current config seems correct for multi-GPU training. Ensure that num_processes is set to the number of GPUs you have (e.g., 2 for your two GPUs).

Launch the training script with accelerate launch.
Run your training script using accelerate launch your_script_name.py.
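Putting the steps together, here is a sketch of what the script looks like with those changes applied. Everything except the distributed pieces is copied from the question; nothing else is altered.

from fastai.vision.all import *

path = Path('./data')

def get_label(o: Path) -> Path:
    return path/'masks'/o.name

dls = SegmentationDataLoaders.from_label_func(
    path,
    bs=4,
    fnames=get_image_files(path/"images"),
    label_func=get_label,
    codes=np.loadtxt('codes.txt', dtype=str),
    valid_pct=0.2,
    seed=42,
    item_tfms=Resize(640, method='crop'),
    batch_tfms=[*aug_transforms(), IntToFloatTensor(div=255)],
    num_workers=0
)

learn = unet_learner(dls, resnet18,
                     metrics=[foreground_acc, DiceMulti, JaccardCoeffMulti]).to_fp16()

early_stopping = EarlyStoppingCallback(monitor='valid_loss', min_delta=0.0001, patience=5)
csv_logger = CSVLogger(Path('history/model.csv'), append=True)

# No fastai.distributed import and no distrib_ctx wrapper:
# process and GPU management is left entirely to accelerate launch.
learn.fine_tune(30, cbs=[early_stopping, csv_logger])

This keeps GPU and process management entirely on the Accelerate side, which is the point of the fix described above.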

Yeah, I hit the same issue before. Dropping from fastai.distributed import * and just letting Accelerate handle the GPUs fixed it for me. Once the accelerate config matched my setup, everything ran smoothly. Totally agree, best to let Accelerate take over and skip FastAI’s distributed imports.