I have searched the forum, and although similar issues are discussed, I could not find a definitive solution. Perhaps someone has a suggestion?
I have a DL box at home with a 1080Ti GPU and 16GB of CPU memory (I know this is small, but I don’t believe it’s an issue) running Ubuntu 16.04 with the 384.145 NVIDIA driver, CUDA 9.0, cuDNN 7.4.2, Python 3.6.8, PyTorch 1.x (I just tried the most recent nightly, 1.1.0.dev20190504) and a recent fastai (1.0.52).
I get an Illegal instruction (core dumped) error in lesson1-pets.ipynb on line learn.fit_one_cycle(4) if I run the script from the Python command line.
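A crash like this leaves no Python traceback by default. One way to get more detail is the standard library's faulthandler module, which can dump the Python stack when the process receives a fatal signal such as SIGILL. The following is only a minimal sketch: the helper name enable_crash_trace and the log file name crash_trace.log are hypothetical, and the helper would need to be called at the top of the script, before training starts.

```python
import faulthandler


def enable_crash_trace(path="crash_trace.log"):
    """Dump the Python stack of all threads to `path` on a fatal signal.

    faulthandler installs handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS
    and SIGILL. `path` is a hypothetical file name chosen for this sketch.
    """
    log_file = open(path, "w")
    # The file must stay open for as long as the handler is enabled,
    # so it is returned to the caller instead of being closed here.
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Hypothetical usage at the top of the training script:
# crash_log = enable_crash_trace()
# ... the lesson1-pets code up to learn.fit_one_cycle(4) would follow ...
```

The dump only shows Python frames, so it would narrow down which Python call was active at the time of the crash, not the native instruction that triggered it. Running the script with python -X faulthandler should give the same kind of dump on stderr without any code changes.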
It starts, runs for 5-10% of the progress bar on all 4 cores, and then fails no matter what the batch size is (I tried as small as bs=2). I have set num_workers=0. If I run it from Jupyter, the error is “The kernel appears to have died. It will restart automatically.” It does not look like I am running out of either CPU or GPU memory: I am monitoring both as the script runs, and barely 25% is consumed.
If I don’t set num_workers=0 in DataBunch, I get a different error, “RuntimeError: DataLoader worker (pid 2781) is killed by signal: Illegal instruction.”, again after it has executed for 5-10% on all 4 CPU cores. This has also been reported on this forum.
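The same "killed by signal" information can be obtained outside of the DataLoader by running the training code as a child process and inspecting its return code. On POSIX systems, subprocess reports a return code of -N when the child was terminated by signal N. The following is only a minimal sketch: the helper names describe_exit and run_and_report, and the script name train_pets.py, are hypothetical.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a readable description."""
    if returncode < 0:
        # On POSIX, -N means the child was terminated by signal N.
        return "killed by signal " + signal.Signals(-returncode).name
    return "exited with status " + str(returncode)


def run_and_report(argv):
    """Run `argv` as a child process and describe how it ended."""
    completed = subprocess.run(argv)
    return describe_exit(completed.returncode)


# Hypothetical usage; train_pets.py is an assumed script name:
# print(run_and_report([sys.executable, "train_pets.py"]))
```

Because the parent process survives the crash, this would also make it easy to rerun the script with different settings (for example different values of bs or num_workers) and compare how each run ends.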
I get a similar core dump error from Python if I run another notebook with large images from lesson 3 (I will post details shortly). Otherwise, I have no problem executing any of the other notebooks, or running any of the DL1 or DL2 notebooks from v2 of the class (2018) using fastai v0.7 on the same DL box in another conda environment.
From reading the forum, it seems that the problem may be related to the PyTorch DataLoader, but I thought that was supposed to be fixed in PyTorch 1.0. What is it related to, then? Do I need to update CUDA to v10 or upgrade the NVIDIA drivers? Has anyone had a similar problem?