I think the latest to_detach change broke RNNCore’s forward method. I am getting a RuntimeError telling me that the input and hidden tensors are not on the same device. Since this wasn’t marked as a breaking change, I guess it is a bug. How should I proceed?
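For context, this error is easy to trigger with plain pytorch whenever a detached hidden state ends up on the CPU while the input stays on the GPU. A minimal sketch (assumes a CUDA machine; the LSTM here is just a stand-in for the actual RNNCore):

```python
import torch
import torch.nn as nn

rnn = nn.LSTM(10, 20).cuda()                      # model and input on the GPU
x = torch.randn(5, 3, 10, device="cuda")
# hidden state left on the CPU, e.g. by a detach helper that also moves to cpu
h = (torch.zeros(1, 3, 20), torch.zeros(1, 3, 20))
out, h = rnn(x, h)  # RuntimeError: Input and hidden tensors are not at the same device
```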
Most likely it’s related to this pytorch issue. We would need to first reproduce the problem, and then reduce it to a simple test case we could use to file an issue against pytorch.
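A hedged sketch of the kind of minimal, fastai-free repro worth reducing to before filing upstream (the dataset is a hypothetical stand-in; the point is a bare torch DataLoader with worker processes, which is where the failure quoted below occurs):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class TinyDataset(Dataset):
    """Hypothetical stand-in for the fastai image pipeline."""
    def __len__(self): return 32
    def __getitem__(self, i): return torch.randn(3, 8, 8), i

# num_workers > 0 spawns worker processes; a segfault in a worker
# surfaces as "DataLoader worker (pid ...) is killed by signal"
dl = DataLoader(TinyDataset(), batch_size=16, num_workers=2)
for xb, yb in dl:
    pass
```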
I’ve now updated our CI to run the correct, up-to-date conda package on macOS. They confusingly renamed pytorch-nightly-cpu to pytorch-nightly a few weeks back, but this build works fine.
So it’s still something related to the pypi build, and other than potential nuances between the two different package builds, the main difference seems to be that the conda and pypi install targets are on different drives on the CI build. That’s why I thought it could be related to this pytorch issue. Is there a chance you could try to reproduce it with the env and the data on different mount points? Basically, moving the test suite to another /mnt/ point. See: https://github.com/pytorch/pytorch/issues/4969#issuecomment-381132009
And, for the sake of searchers, the error is:
=================================== FAILURES ===================================
______________________ test_image_to_image_different_tfms ______________________

    def test_image_to_image_different_tfms():
        get_y_func = lambda o:o
        mnist = untar_data(URLs.COCO_TINY)
        x_tfms = get_transforms()
        y_tfms = [[t for t in x_tfms[0]], [t for t in x_tfms[1]]]
        y_tfms[0].append(flip_lr())
        data = (ImageItemList.from_folder(mnist)
                .random_split_by_pct()
                .label_from_func(get_y_func)
                .transform(x_tfms)
                .transform_y(y_tfms)
                .databunch(bs=16))
>       x,y = data.one_batch()

tests/test_vision_data_block.py:96:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
fastai/basic_data.py:115: in one_batch
    try: x,y = next(iter(dl))
fastai/basic_data.py:47: in __iter__
    for b in self.dl:
/Users/vsts/hostedtoolcache/Python/3.6.5/x64/lib/python3.6/site-packages/torch/utils/data/dataloader.py:631: in __next__
    idx, batch = self._get_batch()
/Users/vsts/hostedtoolcache/Python/3.6.5/x64/lib/python3.6/site-packages/torch/utils/data/dataloader.py:610: in _get_batch
    return self.data_queue.get()
/Users/vsts/hostedtoolcache/Python/3.6.5/x64/lib/python3.6/multiprocessing/queues.py:94: in get
    res = self._recv_bytes()
/Users/vsts/hostedtoolcache/Python/3.6.5/x64/lib/python3.6/multiprocessing/connection.py:216: in recv_bytes
    buf = self._recv_bytes(maxlength)
/Users/vsts/hostedtoolcache/Python/3.6.5/x64/lib/python3.6/multiprocessing/connection.py:407: in _recv_bytes
    buf = self._recv(4)
/Users/vsts/hostedtoolcache/Python/3.6.5/x64/lib/python3.6/multiprocessing/connection.py:379: in _recv
    chunk = read(handle, remaining)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

signum = 20, frame = <frame object at 0x1050e8048>

    def handler(signum, frame):
        # This following call uses `waitid` with WNOHANG from C side. Therefore,
        # Python can still get and update the process status successfully.
>       _error_if_any_worker_fails()
E       RuntimeError: DataLoader worker (pid 1201) is killed by signal: Unknown signal: 0.

/Users/vsts/hostedtoolcache/Python/3.6.5/x64/lib/python3.6/site-packages/torch/utils/data/dataloader.py:274: RuntimeError
----------------------------- Captured stderr call -----------------------------
ERROR: Unexpected segmentation fault encountered in worker.
Oh, sorry, I don’t know OSX - I assumed it’s the same as linux (mount-points-wise), but perhaps it’s not. I guess you need to work backwards from that solution to reproduce the problem. Does that make sense?
I am not sure yet about testing it on linux - I will do that shortly myself. The CIs on linux and OSX are configured identically, and only OSX fails. But since the original bug report was on linux, I will certainly test that to rule it out.
ok, I can do it on OSX - just wanted to make sure that’s what you wanted.
I noticed in that thread that one person mentioned setting num_workers=0 to avoid masking what was actually wrong. I’ve experienced that myself on several occasions - if an assertion fails in a worker process that is doing the data loading, the real error gets eaten/masked - might be worth trying in the Azure env.
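A hedged sketch of that debugging step, simplified from the quoted test. It assumes fastai v1’s .databunch() forwards keyword arguments such as num_workers to the underlying DataLoader, which is how I understand it to work:

```python
from fastai.vision import *   # fastai v1 imports, as in the quoted test

path = untar_data(URLs.COCO_TINY)
data = (ImageItemList.from_folder(path)
        .random_split_by_pct()
        .label_from_func(lambda o: o)
        .transform(get_transforms())
        .databunch(bs=16, num_workers=0))  # single-process loading: a crashing
                                           # worker can no longer mask the error
x, y = data.one_batch()                    # any loading exception now raises here
```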
Meantime, I will split the data and the python env onto different mount points.
I mounted a remote box via SSHFS, so that’s where the data/repo lived, while the anaconda (python) env was on my local OSX box - the test_vision_data_block tests all passed, 7/7.
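A quick way to confirm the two locations really are on different filesystems (the data path below is a hypothetical placeholder):

```python
import os, sys

data_dev = os.stat("/path/to/data").st_dev   # hypothetical data/repo location
env_dev  = os.stat(sys.prefix).st_dev        # filesystem holding the python env
# st_dev identifies the device a file lives on, so it differs across mounts
print("on different mounts:", data_dev != env_dev)
```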
Something is different in the pypi setup that leads to this problem.
So far the only potential difference I have identified is that conda installs its packages under /usr/local/miniconda/envs/fastai-cpu/, whereas pypi installs into /Users/vsts/… and the checkout also goes into /Users/vsts/
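A quick sanity check for this kind of path comparison is to print where the interpreter and the relevant packages actually resolve from in each CI job:

```python
import sys
import torch, fastai

print("python :", sys.executable)   # which interpreter is running
print("torch  :", torch.__file__)   # where the torch build was installed
print("fastai :", fastai.__file__)  # conda prefix vs. /Users/vsts/... for pypi
```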
Working (conda) env:

cwd: /Users/vsts/agent/2.142.1/work/1/s

=== Software ===
python version : 3.6.7
fastai version : 1.0.29.dev0
torch version  : 1.0.0.dev20181126
torch cuda ver
torch cuda is  : **Not available**

=== Hardware ===
No GPUs available

=== Environment ===
platform       : Darwin-17.7.0-x86_64-i386-64bit
conda env      : fastai-cpu
python         : /usr/local/miniconda/envs/fastai-cpu/bin/python
sys.path       :
/usr/local/miniconda/envs/fastai-cpu/lib/python36.zip
/usr/local/miniconda/envs/fastai-cpu/lib/python3.6
/usr/local/miniconda/envs/fastai-cpu/lib/python3.6/lib-dynload
/usr/local/miniconda/envs/fastai-cpu/lib/python3.6/site-packages
/Users/vsts/agent/2.142.1/work/1/s
/usr/local/miniconda/envs/fastai-cpu/lib/python3.6/site-packages/IPython/extensions

no supported gpus found on this system
Failing (pypi) env:

=== Software ===
python version : 3.6.5
fastai version : 1.0.29.dev0
torch version  : 1.0.0.dev20181125
torch cuda ver
torch cuda is  : **Not available**

=== Hardware ===
No GPUs available

=== Environment ===
platform       : Darwin-17.7.0-x86_64-i386-64bit
conda env      : Unknown
python         : /Users/vsts/hostedtoolcache/Python/3.6.5/x64/python
sys.path       :
/Users/vsts/hostedtoolcache/Python/3.6.5/x64/lib/python36.zip
/Users/vsts/hostedtoolcache/Python/3.6.5/x64/lib/python3.6
/Users/vsts/hostedtoolcache/Python/3.6.5/x64/lib/python3.6/lib-dynload
/Users/vsts/hostedtoolcache/Python/3.6.5/x64/lib/python3.6/site-packages
/Users/vsts/agent/2.142.1/work/1/s
/Users/vsts/hostedtoolcache/Python/3.6.5/x64/lib/python3.6/site-packages/IPython/extensions

no supported gpus found on this system
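For searchers: both reports above come from fastai’s environment-report helper. A hedged way to regenerate one - the helper’s exact import path has moved between fastai v1 releases, so treat it as an assumption:

```python
# Prints the === Software / Hardware / Environment === report shown above.
from fastai.utils.collect_env import show_install
show_install()
```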
Yes, in general you need to switch to pytorch-nightly on macos/cpu (see this), since you’re using an outdated build. But I have already eliminated this as a potential culprit.
yes, this is where the culprit was hiding. thank you.
I removed the anaconda python from my path (.bash_profile),
installed python 3.6 from the python website,
followed the install script, replacing python with python3 and pip with pip3 (because on mac python 2.7 is the builtin),
and ran pip3 install -e .[dev] and pytest from a freshly git-cloned repo.
Previously I was on anaconda; I didn’t realize that would mask the problem.