I am working on a basic super-resolution model using a GAN, and training consistently hangs. It gets to some batch, for example 109/700, and just stays there. I suspect it is hitting a bad input image, but I have no way of knowing what the actual error is or which image is corrupt. I couldn’t find anything in the code. Is there a way to get verbose output, or at least a way to print the error?
When it hangs, the GPUs stay at 100% usage but draw only about half their normal wattage. I have also made sure I am not running out of GPU memory.
If and when it hangs running as a single process on a single GPU in a notebook, can you interrupt the kernel and copy/paste the stack trace here so we can take a look?
If I recall correctly, distributed data parallel may not help GAN training much, due to the sequential nature of flipping back and forth between the generator and the critic.
Thank you for the reply. It turns out I was half wrong: it seems to happen only during distributed training. Here is the output. The warnings at the top about the train set don’t appear in the notebook. Maybe this is the key to the whole thing.
(fastai_01) robin@MOAC-LINUX:/srv/MachineLearning/FaceSuperres_01$ python -m torch.distributed.launch --nproc_per_node=3 FaceSuperres_12-Gan-RN34-SelfAttn-8x.py
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
/home/robin/anaconda3/envs/fastai_01/lib/python3.7/site-packages/fastai/data_block.py:541: UserWarning: You are labelling your items with CategoryList.
Your train set contained the following unknown labels, the corresponding items have been discarded.
FaceSuperres_01
if getattr(ds, 'warn', False): warn(ds.warn)
/home/robin/anaconda3/envs/fastai_01/lib/python3.7/site-packages/fastai/data_block.py:541: UserWarning: You are labelling your items with CategoryList.
Your valid set contained the following unknown labels, the corresponding items have been discarded.
FaceSuperres_01
if getattr(ds, 'warn', False): warn(ds.warn)
/home/robin/anaconda3/envs/fastai_01/lib/python3.7/site-packages/fastai/data_block.py:541: UserWarning: You are labelling your items with CategoryList.
Your train set contained the following unknown labels, the corresponding items have been discarded.
FaceSuperres_01
if getattr(ds, 'warn', False): warn(ds.warn)
/home/robin/anaconda3/envs/fastai_01/lib/python3.7/site-packages/fastai/data_block.py:541: UserWarning: You are labelling your items with CategoryList.
Your valid set contained the following unknown labels, the corresponding items have been discarded.
FaceSuperres_01
if getattr(ds, 'warn', False): warn(ds.warn)
/home/robin/anaconda3/envs/fastai_01/lib/python3.7/site-packages/fastai/data_block.py:541: UserWarning: You are labelling your items with CategoryList.
Your train set contained the following unknown labels, the corresponding items have been discarded.
FaceSuperres_01
if getattr(ds, 'warn', False): warn(ds.warn)
/home/robin/anaconda3/envs/fastai_01/lib/python3.7/site-packages/fastai/data_block.py:541: UserWarning: You are labelling your items with CategoryList.
Your valid set contained the following unknown labels, the corresponding items have been discarded.
FaceSuperres_01
if getattr(ds, 'warn', False): warn(ds.warn)
epoch train_loss valid_loss gen_loss disc_loss time
^CTraceback (most recent call last):-----------------------------------------------------------------------------| 11.43% [16/140 00:23<03:02 97.1484]
File "/home/robin/anaconda3/envs/fastai_01/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/robin/anaconda3/envs/fastai_01/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/robin/anaconda3/envs/fastai_01/lib/python3.7/site-packages/torch/distributed/launch.py", line 253, in <module>
main()
File "/home/robin/anaconda3/envs/fastai_01/lib/python3.7/site-packages/torch/distributed/launch.py", line 246, in main
process.wait()
File "/home/robin/anaconda3/envs/fastai_01/lib/python3.7/subprocess.py", line 1019, in wait
return self._wait(timeout=timeout)
File "/home/robin/anaconda3/envs/fastai_01/lib/python3.7/subprocess.py", line 1653, in _wait
(pid, sts) = self._try_wait(0)
File "/home/robin/anaconda3/envs/fastai_01/lib/python3.7/subprocess.py", line 1611, in _try_wait
(pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt
(fastai_01) robin@MOAC-LINUX:/srv/MachineLearning/FaceSuperres_01$
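Note that the traceback above comes from the launcher process, which is only waiting in process.wait() for its children, so it does not show where the training workers themselves are stuck. One possible way to get a stack trace out of each worker is Python's standard faulthandler module. The following is a minimal sketch, not a fastai feature: the helper name enable_hang_diagnostics and its parameters are hypothetical, and it assumes you can call it near the top of the training script so that every launched process runs it.

```python
import faulthandler
import signal
import sys


def enable_hang_diagnostics(timeout_s=600, log_file=None):
    """Make this process dump Python stack traces when it appears to hang.

    Hypothetical helper for illustration; call it once per worker process.
    """
    if log_file is None:
        log_file = sys.stderr

    # On Unix, `kill -USR1 <pid>` makes the process print the stack of
    # every thread without terminating it.
    if hasattr(signal, "SIGUSR1"):
        faulthandler.register(signal.SIGUSR1, file=log_file, all_threads=True)

    # Additionally dump all stacks every `timeout_s` seconds for as long as
    # the process keeps running, so a silent hang still leaves a trace.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)


def disable_hang_diagnostics():
    """Stop the periodic dumps, e.g. after training finishes normally."""
    faulthandler.cancel_dump_traceback_later()
```

With something like this in place, the dump from each worker should show which Python call it is blocked in (for example a data loader or a collective operation), which the launcher's KeyboardInterrupt trace cannot tell you. Choose timeout_s longer than a normal epoch, otherwise healthy runs will also print traces.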