PyTorch segfault in fastai Docker images

Hello, all,

I’m trying to get a local install of fastai running. I was hoping that using the fastai Docker images would spare me having to install the libraries myself, but running the first notebook demo from the book causes a segfault in PyTorch, which I haven’t been able to fix.

My setup:
CPU: Intel® Core™ i7 CPU 950 @ 3.07GHz
RAM: 12GB
Video card: GeForce RTX 2070 SUPER
OS: Ubuntu 20.04 (clean, just re-installed)
Nvidia driver version: Ubuntu-provided nvidia-450
Followed instructions to install docker and nvidia-docker extensions.
Running torch.cuda.is_available() in the fastai docker container returns True.
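
For reference, this is roughly the sanity check I ran inside the container (a minimal sketch; the version and device-name prints are just extra context I added here):

```python
# Quick check run inside the fastai container (Python REPL or a notebook cell).
import torch

print(torch.__version__)               # packaged PyTorch version
print(torch.cuda.is_available())       # True on my setup, so the GPU is visible
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # should report the RTX 2070 SUPER
```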

The error:
In Jupyter, the kernel crashes and is restarted at the first “learn.fine_tune(1)” line.
In dmesg, there’s a line that says: traps: python[1910] trap invalid opcode ip:7fc9d0d63869 sp:7fff30e315a0 error:0 in libtorch_cpu.so[7fc9cfa41000+6754000]

Has anyone seen this before, or have an idea of what I did wrong?

Thanks.

If it helps, I had fastai v1 working on this hardware back in March, but got derailed by life and only just picked it back up. So I know this hardware can work; something has changed since March that doesn’t agree with my setup.

In case it helps others, I’ve solved this. The problem was with PyTorch. It seems that torch 1.6 (and I think 1.7) was built with CPU instructions in libtorch_cpu that aren’t supported on my CPU. My system has an old CPU but a modern GPU (I resurrected my old gaming rig by putting a recent video card in it), so the packaged PyTorch wouldn’t run on my machine.
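
If you want to check whether your own CPU is affected, a rough way on Linux is to look at the instruction-set flags in /proc/cpuinfo. My understanding (not verified against the actual build flags) is that the prebuilt binaries assume at least AVX, which older chips like the i7 950 don’t have:

```python
# Rough check (Linux only): does this CPU advertise the AVX instruction sets?
# My understanding is that the prebuilt libtorch_cpu assumes at least AVX; an
# older CPU without it hits the "invalid opcode" trap seen in dmesg.
def cpu_flags():
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
for isa in ("sse4_2", "avx", "avx2", "avx512f"):
    print(f"{isa}: {'yes' if isa in flags else 'no'}")
```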

That means building PyTorch from source on my machine (which is a separate challenge). Helpfully, Jeremy Howard updated fastai to support PyTorch 1.7 this weekend, and it looks like that also works against PyTorch master. So manually compiling pytorch and torchvision, along with installing the master branch of fastai, seems to be working for me at this point.
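
In case it’s useful, this is roughly how I sanity-checked the source build afterwards (a sketch, not part of any notebook; it just exercises a CPU op that goes through libtorch_cpu, plus the GPU path):

```python
# After building torch/torchvision from source and installing fastai master:
import torch, torchvision, fastai

print(torch.__version__, torchvision.__version__, fastai.__version__)

# A CPU matmul goes through libtorch_cpu, which is roughly where the packaged
# wheel hit the invalid-opcode trap; with the source build it runs fine.
x = torch.randn(256, 256)
print((x @ x).sum().item())

# And confirm the GPU path still works.
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```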