Get meaningful error messages out of data_bunch loaders

Hi!

I am trying to build an out-of-core ItemList. My first attempt uses HDF5 as block storage, since it supports out-of-core reads and writes. However, HDF5 does not play nice with multiprocessing: I suspect you are not safe even if you open the same file separately in each process. [at least that is what the comments in some h5 examples suggest]
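
For reference, this is roughly the access pattern I mean (a minimal sketch, not my actual code; the 'items'/'labels' dataset names are made up). Even deferring the h5py.File open until the first __getitem__ call, so that each worker opens its own handle after the fork, is the pattern those comments warn about:

import h5py
from torch.utils.data import Dataset

class H5Items(Dataset):
    # Minimal sketch; 'items' and 'labels' are hypothetical dataset names.
    def __init__(self, path):
        self.path = path
        self._file = None              # do NOT open in __init__: it runs before the fork
    def __len__(self):
        with h5py.File(self.path, 'r') as f:
            return len(f['items'])
    def __getitem__(self, i):
        if self._file is None:         # first access happens inside the worker process
            self._file = h5py.File(self.path, 'r')
        return self._file['items'][i], self._file['labels'][i]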

I tried writing such an ItemList, and the data_bunch.one_batch() call silently crashes with a KeyboardInterrupt exception. The exception is triggered somewhere in the code following the data_bunch.one_batch() call (e.g. in a print()), which is why I suspect a multiprocessing issue.

Below is an example of the exception:

Exception ignored in: <function _DataLoaderIter.__del__ at 0x7f6561071bf8>
Traceback (most recent call last):
  File "/home/visoft/miniconda3/envs/ta/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 717, in __del__
    self._shutdown_workers()
  File "/home/visoft/miniconda3/envs/ta/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 713, in _shutdown_workers
    w.join()
  File "/home/visoft/miniconda3/envs/ta/lib/python3.7/multiprocessing/process.py", line 140, in join
    res = self._popen.wait(timeout)
  File "/home/visoft/miniconda3/envs/ta/lib/python3.7/multiprocessing/popen_fork.py", line 48, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/home/visoft/miniconda3/envs/ta/lib/python3.7/multiprocessing/popen_fork.py", line 28, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt: 

Question: How can I get meaningful error messages from inside the data loader workers?

p.s. My next focus is numpy's memory-mapped files, and if that fails, something heavier.

Updates: HDF5 is not an option; memory-mapped files are.
At least in the few tests that I did, the HDF5-backed ItemList returned misaligned items and labels. That is a no-go.

I tried the same logic with numpy memory-mapped files and it worked: I got consistent items and labels. HOWEVER, while debugging the code I still saw strange behavior:
setting a breakpoint below the data_bunch.one_batch() call crashes the code with a KeyboardInterrupt exception inside lib/python3.7/site-packages/numpy/core/arrayprint.py. So my original question is still relevant.
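
For completeness, a rough sketch of the memmap version that gave me consistent items and labels (placeholder names, shape and dtype, not my actual code; assuming fastai v1's data_block API). The key point is opening the np.memmap lazily in get(), so each DataLoader worker maps the file itself instead of inheriting a handle across the fork:

import numpy as np
import torch
from fastai.data_block import ItemList

class MemmapItemList(ItemList):
    # Rough sketch: items are row indices into a memory-mapped float32 array.
    def __init__(self, items, mmap_path=None, shape=None, **kwargs):
        super().__init__(items, **kwargs)
        self.mmap_path, self.shape = mmap_path, shape
        self.copy_new += ['mmap_path', 'shape']   # carry the attrs through split/label calls
        self._data = None                         # opened lazily, once per worker

    def get(self, i):
        if self._data is None:                    # first access runs inside the worker
            self._data = np.memmap(self.mmap_path, dtype='float32',
                                   mode='r', shape=self.shape)
        return torch.from_numpy(np.array(self._data[int(i)]))  # copy the row out of the map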

When I run the code without the debugger, there is no exception, though. [I use the PyCharm IDE]

Thank you!

LE: I guess the error is triggered while cleaning up resources (e.g. probably closing a memmap file opened in another process?).

To get meaningful errors from the PyTorch DataLoader, you have to disable multiprocessing by passing num_workers=0.
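
With fastai's data block API this just means passing num_workers through .databunch(), which forwards it to the underlying DataLoader. A rough sketch (MemmapItemList, from_df and the labeling call are placeholders for whatever pipeline you are building):

data = (MemmapItemList.from_df(df, path)     # placeholder list / constructor
        .split_by_rand_pct()
        .label_from_df()
        .databunch(bs=64, num_workers=0))    # single process: errors surface directly
data.one_batch()                             # worker exceptions now show a real traceback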

Hi! Thanks for looking into it!

The error has something to do with (1) using multiprocessing AND (2) running in debug mode.
Setting num_workers=0 raises no errors. Once I set it to >= 4 (why 4? I have a 4-core Intel processor), I got errors [only when running in debug mode from the IDE]. Not always the same error, but all of them linked to some multiprocessing/multithreading issue.

Process finished with exit code 130 (interrupted by signal 2: SIGINT)

or

Exception ignored in: <function _DataLoaderIter.__del__ at 0x7f6e28793d90>
Traceback (most recent call last):
  File "/home/visoft/miniconda3/envs/ta/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 717, in __del__
    self._shutdown_workers()
  File "/home/visoft/miniconda3/envs/ta/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 713, in _shutdown_workers
    w.join()
  File "/home/visoft/miniconda3/envs/ta/lib/python3.7/multiprocessing/process.py", line 140, in join
    res = self._popen.wait(timeout)
  File "/home/visoft/miniconda3/envs/ta/lib/python3.7/multiprocessing/popen_fork.py", line 48, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/home/visoft/miniconda3/envs/ta/lib/python3.7/multiprocessing/popen_fork.py", line 28, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt: 

Thank you!

@visoft Check this post here: Link. It might help.

Thanks. I added torch.multiprocessing.set_sharing_strategy('file_system'). No change in behavior.
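
For the record, the suggestion amounts to this one-liner, run before creating the data bunch (no effect in my case, but noting it in case it helps someone else):

import torch.multiprocessing

# Make worker processes share tensors through the filesystem rather than
# file descriptors, as suggested in the linked post.
torch.multiprocessing.set_sharing_strategy('file_system')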

Update 2: It seems I can ignore this error for now; when debugging, I just set num_workers to 0. Also, regarding the reason I stumbled upon this error in the first place: Out of core data block ItemList backed up by memmap files