Has anyone had success using distrib_ctx?

I have tried a variety of custom configurations with accelerate config and run my scripts with accelerate launch. I have been trying to get multi-GPU training working, but when I run my training script it either uses only one GPU or freezes/stalls completely, even though some memory is allocated on both GPUs.

Then I tried configuring multi-CPU training to see whether the accelerate configuration mattered at all. I got errors saying that too much memory had been allocated on CUDA device 0. That is my GPU! Why does it ignore the configuration and allocate memory on my GPU during multi-CPU training?

Accelerate Config File:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: 0,1
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
main_process_port: 0

Training Script:

from fastai.vision.all import *
from fastai.distributed import *
from fastai.vision.data import SegmentationDataLoaders
from fastai.vision.augment import Resize, aug_transforms, IntToFloatTensor
from fastai.data.transforms import get_image_files

path = Path('./data')

def get_label(o: Path) -> Path:
    return path/'masks'/o.name

dls = SegmentationDataLoaders.from_label_func(
    path,
    bs=4,
    fnames=get_image_files(path/"images"),
    label_func=get_label,
    codes=np.loadtxt('codes.txt', dtype=str),
    valid_pct=0.2,
    seed=42,
    item_tfms=Resize(640, method='crop'),
    batch_tfms=[*aug_transforms(), IntToFloatTensor(div=255)],
    num_workers=0
)

metrics = [foreground_acc, DiceMulti, JaccardCoeffMulti]
learn = unet_learner(dls, resnet18, metrics=metrics).to_fp16()

early_stopping = EarlyStoppingCallback(monitor='valid_loss', min_delta=0.0001, patience=5)
csv_logger = CSVLogger(Path(f'history/model.csv'), append=True)

with learn.distrib_ctx(sync_bn=False):
    learn.fine_tune(30, cbs=[early_stopping, csv_logger])

When using four GPUs, the main script runs four times, meaning the model is downloaded four times, which takes time and causes an out-of-memory error.
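To see why: accelerate launch starts one worker process per GPU, and each worker executes the entire script from the top, so per-process work such as downloading the pretrained weights is repeated once per GPU. Below is a minimal, standalone sketch that makes this visible; it relies only on the RANK and WORLD_SIZE environment variables the launcher sets, nothing from the training script itself.

import os

# Each process started by accelerate launch (or torchrun) runs this whole
# file; RANK and WORLD_SIZE are set by the launcher and identify the worker.
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))
print(f"process {rank} of {world_size} is executing the script")

Launched on four GPUs this prints four lines, one per process, which is exactly why the download in the script above happens four times.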

Hello,
The fastai.distributed module is designed to handle multi-GPU training on its own, and when you use accelerate launch, you are essentially telling Accelerate to do the same thing. This creates a conflict where they both try to manage the GPUs, leading to the issues you’ve described.

Specifically, the line from fastai.distributed import * is telling FastAI to set up its own distributed environment, which then interferes with how Accelerate tries to configure the multi-GPU setup.

To fix this, remove the FastAI distributed training imports. Your script should rely solely on Accelerate for handling the multi-GPU setup.

Here’s the corrected approach:

Remove the FastAI distributed import.
Change from fastai.distributed import * to simply import fastai, and drop the with learn.distrib_ctx(...): wrapper as well, since distrib_ctx is provided by that import. The rest of your script should remain the same; see the sketch after these steps.

Adjust your accelerate config.
Your current config seems correct for multi-GPU training. Ensure that num_processes is set to the number of GPUs you have (e.g., 2 for your two GPUs).

Launch the training script with accelerate launch.
Run your training script using accelerate launch your_script_name.py.
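Putting the steps together, here is a sketch of what the script looks like with those changes applied. Everything except the distributed pieces is copied from the question; nothing else is altered.

from fastai.vision.all import *

path = Path('./data')

def get_label(o: Path) -> Path:
    return path/'masks'/o.name

dls = SegmentationDataLoaders.from_label_func(
    path,
    bs=4,
    fnames=get_image_files(path/"images"),
    label_func=get_label,
    codes=np.loadtxt('codes.txt', dtype=str),
    valid_pct=0.2,
    seed=42,
    item_tfms=Resize(640, method='crop'),
    batch_tfms=[*aug_transforms(), IntToFloatTensor(div=255)],
    num_workers=0
)

learn = unet_learner(dls, resnet18,
                     metrics=[foreground_acc, DiceMulti, JaccardCoeffMulti]).to_fp16()

early_stopping = EarlyStoppingCallback(monitor='valid_loss', min_delta=0.0001, patience=5)
csv_logger = CSVLogger(Path('history/model.csv'), append=True)

# No fastai.distributed import and no distrib_ctx wrapper:
# process and GPU management is left entirely to accelerate launch.
learn.fine_tune(30, cbs=[early_stopping, csv_logger])

This keeps GPU and process management entirely on the Accelerate side, which is the point of the fix described above.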

Yeah, I hit the same issue before. Dropping from fastai.distributed import * and just letting Accelerate handle the GPUs fixed it for me. Once the accelerate config matched my setup, everything ran smoothly. Totally agree, best to let Accelerate take over and skip FastAI’s distributed imports.