Training on multiple GPUs from a notebook

Hello, I am struggling to get training running on 2 GPUs when executing code from a Jupyter notebook.

I have tried two approaches:

from fastai.text.all import *
from torch import nn

# Approach 1: wrap the underlying model in nn.DataParallel
learn = language_model_learner(
    dls, AWD_LSTM, drop_mult=.3,
    metrics=[accuracy]
).to_fp16()
learn.model = nn.DataParallel(learn.model)
learn.fine_tune(1, 3e-2)
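
For reference, this is the bare-PyTorch DataParallel pattern I believe I am following. Per the PyTorch docs, the wrapped module's parameters must already live on the first device in device_ids, and inputs are scattered along the batch dimension on each forward pass (the two device ids below are an assumption for my 2-GPU box):

import torch
from torch import nn

model = nn.Linear(10, 10).to('cuda:0')             # parameters on the primary device
model = nn.DataParallel(model, device_ids=[0, 1])  # replicated onto both GPUs per forward
x = torch.randn(8, 10, device='cuda:0')            # split along dim 0 across the GPUs
y = model(x)                                       # replicas run on cuda:0 and cuda:1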

The second approach:

import torch
from functools import partial
from fastai.distributed import *  # patches parallel_ctx / distrib_ctx onto Learner

# Approach 2: get the number of available GPUs
gpu = None
if torch.cuda.is_available():
    if gpu is not None: torch.cuda.set_device(gpu)
    n_gpu = torch.cuda.device_count()
else:
    n_gpu = None

# Get learner
learn = language_model_learner(
    dls, AWD_LSTM, drop_mult=.3,
    metrics=[accuracy]
).to_fp16()

# The context-manager way of doing DP/DDP; both can handle the single-GPU base case.
if gpu is None and n_gpu is not None:
    ctx = learn.parallel_ctx
    with partial(ctx, gpu)():
        print(f"Training in {ctx.__name__} context on GPUs {list(range(n_gpu))}")
        learn.fine_tune(2)
else:
    learn.fine_tune(2)
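
The partial() indirection is only there so I can pass a device later; as far as I understand, parallel_ctx can equally be entered directly, with device_ids left at its default:

with learn.parallel_ctx():
    learn.fine_tune(2)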

Both approaches raise the same RuntimeError:

RuntimeError: Input and hidden tensors are not at the same device, found input tensor at cuda:1 and hidden tensor at cuda:0
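
My working theory (unverified): AWD_LSTM is stateful and keeps its hidden state as a plain tensor attribute between batches. When DataParallel replicates the module on each forward pass, parameters and buffers are moved to every replica's device, but a plain tensor attribute is shallow-copied and stays where it was created, so the cuda:1 replica ends up pairing a cuda:1 input with a cuda:0 hidden state. A minimal sketch of that mismatch, with StatefulLSTM as a toy stand-in for AWD_LSTM:

import torch
from torch import nn

class StatefulLSTM(nn.Module):
    "Toy stand-in for AWD_LSTM: carries its hidden state across batches."
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(10, 10, batch_first=True)
        self.hidden = None
    def forward(self, x):
        out, _ = self.lstm(x, self.hidden)
        return out

model = StatefulLSTM().to('cuda:0')
# Hidden state created on cuda:0, shaped (layers, per-GPU batch, hidden size)
model.hidden = (torch.zeros(1, 2, 10, device='cuda:0'),
                torch.zeros(1, 2, 10, device='cuda:0'))
model = nn.DataParallel(model, device_ids=[0, 1])
x = torch.randn(4, 5, 10, device='cuda:0')  # batch of 4 -> chunks of 2 per GPU
model(x)  # the cuda:1 replica gets the cuda:0 hidden -> the same RuntimeError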

The only way I have managed to get it running is by executing:

with learn.distrib_ctx():
    learn.fine_tune(1)

as a Python script, launched with python -m fastai.launch train.py.
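
For reference, the full train.py is roughly the following; the IMDB_SAMPLE data and the exact DataLoaders construction are placeholders for my real pipeline:

# train.py
from fastai.text.all import *
from fastai.distributed import *

path = untar_data(URLs.IMDB_SAMPLE)  # placeholder dataset
df = pd.read_csv(path/'texts.csv')
dls = TextDataLoaders.from_df(df, path=path, text_col='text', is_lm=True)

learn = language_model_learner(
    dls, AWD_LSTM, drop_mult=.3,
    metrics=[accuracy]
).to_fp16()

with learn.distrib_ctx():  # DistributedDataParallel, one process per GPU
    learn.fine_tune(1)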

Can someone help me with this?
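
For what it's worth, I have also come across Accelerate's notebook_launcher, which fastai's distributed training tutorial seems to use for exactly this notebook scenario. A sketch of my understanding, not yet verified on my setup (the in_notebook flag and the placeholder data pipeline are taken on faith from that tutorial):

from fastai.text.all import *
from fastai.distributed import *
from accelerate import notebook_launcher

def train():
    # Data and learner have to be built inside the launched function,
    # since notebook_launcher spawns one process per GPU
    df = pd.read_csv(untar_data(URLs.IMDB_SAMPLE)/'texts.csv')  # placeholder data
    dls = TextDataLoaders.from_df(df, text_col='text', is_lm=True)
    learn = language_model_learner(dls, AWD_LSTM, drop_mult=.3, metrics=[accuracy]).to_fp16()
    with learn.distrib_ctx(in_notebook=True):  # in_notebook per the tutorial; unverified
        learn.fine_tune(1)

notebook_launcher(train, num_processes=2)

Is something along those lines the intended route, or should one of the approaches above work from a notebook?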