Python process refuses to obey SIGKILL

balnazzar · October 23, 2018, 11:38pm

I have a python process from the fastai v1 env that does not quit after being sent signal 9.

Note that it is NOT in D state (uninterruptible sleep), so it is not executing kernel code. So, it should quit immediately. Of course it is occupying one thread at 100%.

Suggestions? I refuse to solve the problem by rebooting.

Also, what could have caused such a strange behaviour?
The notebook freezes while executing this cell from Lesson1:

data = ImageDataBunch.from_name_re(path_img, fnames, pat, ds_tfms=get_transforms(), size=224)
data.normalize(imagenet_stats)

It is fastai 1.0.11
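
For comparison, here is a minimal sketch of what signal 9 normally does, assuming a POSIX system. The child script and the function names are made up for illustration and have nothing to do with fastai. The child ignores SIGTERM and spins at 100% CPU, yet SIGKILL still ends it, and Python will not even let you install a handler for SIGKILL.

```python
import signal
import subprocess
import sys

# Hypothetical child: ignores SIGTERM, reports readiness, then spins forever.
CHILD = (
    "import signal; signal.signal(signal.SIGTERM, signal.SIG_IGN); "
    "print('ready', flush=True)\n"
    "while True: pass"
)

def kill_demo():
    """Return (survived_sigterm, returncode_after_sigkill) for the child."""
    child = subprocess.Popen([sys.executable, "-c", CHILD], stdout=subprocess.PIPE)
    child.stdout.readline()      # wait until SIGTERM is being ignored
    child.terminate()            # sends SIGTERM, which the child ignores
    try:
        child.wait(timeout=1)
        survived_term = False
    except subprocess.TimeoutExpired:
        survived_term = True
    child.kill()                 # sends SIGKILL
    child.wait(timeout=10)
    child.stdout.close()
    return survived_term, child.returncode

def can_catch_sigkill():
    """Try to install a handler for SIGKILL; the OS rejects the attempt."""
    try:
        signal.signal(signal.SIGKILL, lambda signum, frame: None)
        return True
    except (OSError, ValueError):
        return False
```

A negative return code from `subprocess` means the child was terminated by that signal number, so a healthy process killed this way reports `-signal.SIGKILL`.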

stas · October 24, 2018, 12:23am

Could you turn it into a small reproducible test, that would be the easiest to debug?

There was a similar problem in a vision test that jeremy fixed; it was also due to data.normalize being called several times. Basically, use the existing test and create your own version that reproduces your problem.

balnazzar · October 24, 2018, 3:32pm

Thanks for your reply.

I’m not quite catching what you mean by:

Meanwhile, I’ll try to provide a better description:

I run the Lesson1 nb cell by cell.
When I try to execute the cell you can see above, the python kernel spawned by jupyter hangs.
Kernel interrupt, restart and shutdown have no effect.
If I check with ps ax or htop, they show the python process (from fastai v1 anaconda env) as running (R state), absorbing 100% of one CPU thread.
Interestingly enough, top shows the same PID, but it’s labeled libzmq (ZeroMQ is a library for asynchronous process communication).
The process refuses to quit despite numerous SIGKILLs. Note that any process which is not in state D has to quit immediately, so this is unprecedented and theoretically impossible.
The machine remains usable and stable, but you cannot reboot it or shut it down. One has to do a hard reset by pressing the button. That machine is rock solid; it performs numerous tasks and has never shown any such problem.
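
The state letter that ps and htop display can also be read straight from the kernel. Below is a minimal sketch, assuming Linux with /proc mounted; `parse_status` and `process_state` are illustrative names, not part of any library.

```python
def parse_status(text):
    """Turn the text of /proc/<pid>/status into a dict of field -> value."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

def process_state(pid):
    """Return the one-letter state code (R, S, D, Z, T, ...) of a Linux process."""
    with open(f"/proc/{pid}/status") as f:
        # The State field looks like "R (running)"; keep only the letter.
        return parse_status(f.read())["State"].split()[0]
```

Calling `process_state(pid)` in a loop while sending signals shows whether the process ever leaves R, for instance for D or Z.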

marcmuc · October 24, 2018, 4:36pm

I have had similar problems in the past (not fastai related). The reason then was that the python process was actually a subprocess that was spawned by some other process. In that state the spawning process can intercept all SIGKILLs etc. so that the process doesn’t seem to react. There can be unrecoverable states I think, when the spawning process is somehow half-dead. I think for me in one case it was solved by closing all terminals and killing a number of other processes. But I have also had to hard reset in other cases…

Just googled this; it is really old but sounds like a plausible explanation…
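
Finding out which process spawned the stuck one can be sketched as below, assuming Linux with /proc mounted. The names `ancestors` and `proc_lookup` are made up for this example. The lookup is passed in as a parameter so that the chain-walking logic can be tried without touching /proc.

```python
def ancestors(pid, lookup):
    """Return [(pid, name), (parent pid, parent name), ...] up to the root.

    `lookup` maps a pid to a (name, parent_pid) tuple.
    """
    chain = []
    seen = set()
    while pid > 0 and pid not in seen:   # stop at PID 0 or on a cycle
        seen.add(pid)
        name, ppid = lookup(pid)
        chain.append((pid, name))
        pid = ppid
    return chain

def proc_lookup(pid):
    """Read (name, parent_pid) from /proc/<pid>/stat on Linux."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The name sits in parentheses and may contain spaces or parentheses,
    # so locate the first "(" and the LAST ")".
    name = data[data.index("(") + 1 : data.rindex(")")]
    ppid = int(data[data.rindex(")") + 2 :].split()[1])
    return name, ppid
```

`ancestors(pid, proc_lookup)` then lists the candidates worth inspecting or killing before resorting to a hard reset.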

stas · October 24, 2018, 5:14pm

It’s hard to debug this kind of problem when you have a whole notebook to deal with, so it’s always good to reduce it to the smallest possible amount of code that reproduces the problem - and that’s the test.

In the link I posted in my reply, a hanging problem was also occurring, though not as severe. I reduced it to:

def test_clean_tear_down(path):
    docstr = "test DataLoader iter doesn't get stuck"
    data = ImageDataBunch.from_folder(path, ds_tfms=(rand_pad(2, 28), []))
    data.normalize()
    data = ImageDataBunch.from_folder(path, ds_tfms=(rand_pad(2, 28), []))
    data.normalize()

and added it to test_vision.py. Once this was done, jeremy quickly knew what the problem was and fixed it.

Is my suggestion clearer now?

Further notes:

Note that jupyter introduces another level of complexity, so taking the code out of that environment into a plain python script is very helpful.
If you do that and succeed at writing a script that reproduces your problem, then I’d try to bisect pytorch builds first (i.e. try some earlier nightly builds and see if the problem is still there) and then fastai versions.
If you continue using the notebook instead of the script, then I’d also try to bisect other versions of jupyter notebook (especially as @marcmuc pointed to a similar issue report there).

And if you do make a short script please share it, perhaps it’d be useful for the fastai test suite.
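
When bisecting builds with such a script, it helps to run it under a timeout so that a hang shows up as a result instead of blocking the terminal. Below is a minimal sketch using only the standard library; `run_with_timeout` and the script name `repro.py` are hypothetical.

```python
import subprocess
import sys

def run_with_timeout(argv, timeout_s):
    """Run argv as a child process; return 'ok', 'failed', or 'hung'."""
    try:
        result = subprocess.run(argv, timeout=timeout_s, capture_output=True)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        return "hung"
    return "ok" if result.returncode == 0 else "failed"

# Example (repro.py is a placeholder for your own reproduction script):
# print(run_with_timeout([sys.executable, "repro.py"], timeout_s=120))
```

If the child is truly unkillable, as described earlier in this thread, the harness itself may block while waiting for the child to exit, so treat this as a convenience for ordinary hangs only.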

Since a few thousand people have run lesson1 without this problem, it probably has something to do with your unique setup/build. Can you please post the output of:

python -c "import fastai; fastai.show_install()"

and whether you built anything from source for that env.

You can find details about the fastai test suite here.

jeremy · October 24, 2018, 6:33pm

Also, and perhaps most importantly: these kinds of problems are unlikely to occur if you use one of the platforms we recommend - if you use your own machine, your particular combo of libs, OS, hardware, etc can result in problems that are impossible for us to debug.

balnazzar · October 24, 2018, 7:00pm

@marcmuc: Thanks!

@stas: Crystal clear. I’ll try and run some tests as soon as I get home.

@jeremy:

Yes. But I also use fastai (0.7 and now 1.x) at work for production stuff. I installed 1.x on a DGX some days ago, and I have to be sure it doesn’t hang on that machine (which does not belong to me, of course).
In other words, using fastai is not just about learning for me. Thanks for the feedback, I’ll keep you all posted!

jeremy · October 24, 2018, 8:27pm

Thanks for the clarification - my feedback is mainly a reminder for other folks reading.

Ah OK well if you send us one of those we’ll absolutely test it for you.

balnazzar · October 25, 2018, 4:38pm

I always imagined the fastai headquarters like some mega complex, umbrella corporation-like and preferably subterranean, full of DGXes!

jeremy · October 25, 2018, 4:43pm

Yes, when I’m working in my pyjamas on my couch I also like to imagine that.