GAN training hanging with no errors

I am working on a basic super-resolution model using a GAN, and training consistently hangs. It will get to some batch, 109/700 or similar, and just sit there. I suspect it is hitting a bad input image, but I have no way of knowing what the actual error is or which image is corrupt. I couldn't find anything wrong in the code; is there a way to get verbose output, or at least a way to print the error?

When it hangs, the GPUs stay at 100% utilization but draw only about half the wattage they normally do. I have also made sure I am not running out of GPU memory.
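Is there a recommended way to pre-check the data? Something like this is what I had in mind for finding a corrupt file (just a rough sketch using PIL; path_img and the *.jpg pattern are placeholders for wherever my images actually live):

from pathlib import Path
from PIL import Image

# Rough sketch: walk the image folder and flag anything PIL can't parse.
path_img = Path('data/faces')          # placeholder path
for fn in path_img.rglob('*.jpg'):
    try:
        with Image.open(fn) as im:
            im.verify()                # structural check only, doesn't decode pixels
    except Exception as e:
        print(f'Bad image: {fn} ({e})')

(I believe fastai v1 also ships a verify_images helper that does roughly the same thing.)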

Here is what the code looks like; nothing special.

# fastai v1 GAN setup (learn_gen, learn_critic, wd and args are defined earlier in the script)
switcher = partial(AdaptiveGANSwitcher, critic_thresh=0.65)
learn = GANLearner.from_learners(learn_gen, learn_critic, weights_gen=(1., 50.), show_img=False, switcher=switcher,
                                 opt_func=partial(optim.Adam, betas=(0., 0.99)), wd=wd)
learn.callback_fns.append(partial(GANDiscriminativeLR, mult_lr=5.))

# Wrap for multi-GPU when launched with torch.distributed.launch; fall back to single GPU otherwise
try:
    learn = learn.to_distributed(args.local_rank)
except:
    pass

print('GAN training')
lr = 1e-4
learn.fit(5, lr)

This also happens if I just run the code in a notebook and don't distribute it across my GPUs.

If and when it hangs running as a single process on a single GPU in a notebook, can you interrupt the kernel and copy/paste the stack trace here so we can take a look?
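If interrupting the kernel doesn't give you a useful traceback, another option (just a sketch, assuming Python 3.3+ on Linux) is to register faulthandler at the top of the script so you can dump every thread's stack from another terminal while it is hung:

import faulthandler, signal, sys

# Dump all threads' Python stacks to stderr when the process receives SIGUSR1,
# so a hung run can be inspected with `kill -USR1 <pid>` without killing it.
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)

# Or dump automatically if nothing completes for a long time:
# faulthandler.dump_traceback_later(timeout=600, repeat=True)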

If I recall correctly, distributed data parallel may not help GAN training much, due to the sequential nature of flipping back and forth between the generator and the critic.

Thank you for the reply. It turns out I was half wrong: it only seems to happen during distributed training. Here is the output after I interrupt the hung run. The warnings at the top about the train set don't actually appear in the notebook; maybe that is a key to the whole thing.

(fastai_01) robin@MOAC-LINUX:/srv/MachineLearning/FaceSuperres_01$ python -m torch.distributed.launch --nproc_per_node=3 FaceSuperres_12-Gan-RN34-SelfAttn-8x.py 
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
/home/robin/anaconda3/envs/fastai_01/lib/python3.7/site-packages/fastai/data_block.py:541: UserWarning: You are labelling your items with CategoryList.
Your train set contained the following unknown labels, the corresponding items have been discarded.
FaceSuperres_01
  if getattr(ds, 'warn', False): warn(ds.warn)
/home/robin/anaconda3/envs/fastai_01/lib/python3.7/site-packages/fastai/data_block.py:541: UserWarning: You are labelling your items with CategoryList.
Your valid set contained the following unknown labels, the corresponding items have been discarded.
FaceSuperres_01
  if getattr(ds, 'warn', False): warn(ds.warn)
/home/robin/anaconda3/envs/fastai_01/lib/python3.7/site-packages/fastai/data_block.py:541: UserWarning: You are labelling your items with CategoryList.
Your train set contained the following unknown labels, the corresponding items have been discarded.
FaceSuperres_01
  if getattr(ds, 'warn', False): warn(ds.warn)
/home/robin/anaconda3/envs/fastai_01/lib/python3.7/site-packages/fastai/data_block.py:541: UserWarning: You are labelling your items with CategoryList.
Your valid set contained the following unknown labels, the corresponding items have been discarded.
FaceSuperres_01
  if getattr(ds, 'warn', False): warn(ds.warn)
/home/robin/anaconda3/envs/fastai_01/lib/python3.7/site-packages/fastai/data_block.py:541: UserWarning: You are labelling your items with CategoryList.
Your train set contained the following unknown labels, the corresponding items have been discarded.
FaceSuperres_01
  if getattr(ds, 'warn', False): warn(ds.warn)
/home/robin/anaconda3/envs/fastai_01/lib/python3.7/site-packages/fastai/data_block.py:541: UserWarning: You are labelling your items with CategoryList.
Your valid set contained the following unknown labels, the corresponding items have been discarded.
FaceSuperres_01
  if getattr(ds, 'warn', False): warn(ds.warn)
epoch     train_loss  valid_loss  gen_loss  disc_loss  time    
^CTraceback (most recent call last):-----------------------------------------------------------------------------| 11.43% [16/140 00:23<03:02 97.1484]
  File "/home/robin/anaconda3/envs/fastai_01/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/robin/anaconda3/envs/fastai_01/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/robin/anaconda3/envs/fastai_01/lib/python3.7/site-packages/torch/distributed/launch.py", line 253, in <module>
    main()
  File "/home/robin/anaconda3/envs/fastai_01/lib/python3.7/site-packages/torch/distributed/launch.py", line 246, in main
    process.wait()
  File "/home/robin/anaconda3/envs/fastai_01/lib/python3.7/subprocess.py", line 1019, in wait
    return self._wait(timeout=timeout)
  File "/home/robin/anaconda3/envs/fastai_01/lib/python3.7/subprocess.py", line 1653, in _wait
    (pid, sts) = self._try_wait(0)
  File "/home/robin/anaconda3/envs/fastai_01/lib/python3.7/subprocess.py", line 1611, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt
(fastai_01) robin@MOAC-LINUX:/srv/MachineLearning/FaceSuperres_01$
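Since it only hangs under torch.distributed.launch, I am guessing the stall is inside an NCCL collective, so on the next run I will try turning on NCCL's own logging (just a guess on my part; these environment variables have to be set before the process group is initialized, so I will add this at the very top of the script):

import os

# Ask NCCL to log what it is doing; must be set before the distributed
# setup runs so it takes effect.
os.environ.setdefault('NCCL_DEBUG', 'INFO')
os.environ.setdefault('NCCL_DEBUG_SUBSYS', 'ALL')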

I am hitting exactly the same problem: GANLearner gets stuck when run with to_distributed. I am using fastai v1.

Did you solve the problem?
Any suggestions? @sgugger @muellerzr Please?