Jupyter notebook dies/freezes during training

benediktschifferer · April 23, 2017, 1:09pm

Hello everyone,

When I start training a neural network, the Jupyter notebook dies or freezes.
Has anyone had a similar experience? How can I fix it?

I use an AWS p2 instance with the fast.ai image.
It happened with both the Theano backend (part I) and the TensorFlow backend (part II).

I write my Python code in a Jupyter notebook. When I execute the command:

model.fit(map_train, y_train, batch_size=128, nb_epoch=1, verbose=1, validation_data=(map_valid, y_valid))

The notebook shows “busy” and doesn’t respond anymore. I think it is still training the model in the background, but I cannot see any updates.

After a while, the browser asks me whether I want to wait or kill the application.
I can still access the machine via SSH.

kzuiderveld · April 23, 2017, 2:27pm

You might be running out of GPU memory. Try reducing the batch size to 16 or 32.
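
One way to check whether GPU memory really is the limit is to look at what nvidia-smi reports while the model trains. The following is a minimal sketch, not part of the original post: the helper names parse_gpu_memory and query_gpu_memory are made up for illustration, and the sample numbers are invented. It assumes an NVIDIA driver with nvidia-smi on the PATH.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' MiB pairs (one line per GPU) into a list of tuples."""
    usage = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        usage.append((used, total))
    return usage


def query_gpu_memory():
    """Ask nvidia-smi for per-GPU memory use; needs an NVIDIA driver installed."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        universal_newlines=True,
    )
    return parse_gpu_memory(out)


# Sample text in the shape nvidia-smi prints (the numbers are made up).
sample = "10871, 11439\n"
for used, total in parse_gpu_memory(sample):
    print("GPU memory: {} / {} MiB ({:.0%} used)".format(used, total, used / total))
```

On the training machine, calling query_gpu_memory() from an SSH session while the notebook is busy should show whether usage is close to the card's total.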

Even · April 23, 2017, 5:23pm

I highly recommend changing to verbose=2, which prints one line per epoch rather than updating every batch. Keras can flood stdout with progress updates to the point where the notebook freezes up. The model is still training in the background, but you often won’t even get the update that it’s finished.
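
To get a feel for the scale, here is a rough sketch that counts how many progress-bar redraws a per-batch display could emit, compared with one line per epoch. It assumes the bar may redraw as often as once per batch (Keras may throttle redraws, so treat the result as an upper bound). The function name and the 50,000-sample dataset size are made up for illustration.

```python
import math


def max_progress_redraws(n_samples, batch_size, epochs=1):
    """Upper bound on per-batch progress updates: one per batch, every epoch."""
    batches_per_epoch = math.ceil(n_samples / batch_size)
    return batches_per_epoch * epochs


# Hypothetical dataset size; batch_size matches the fit() call in the question.
n_samples, batch_size, epochs = 50000, 128, 1
print("per-batch display: up to", max_progress_redraws(n_samples, batch_size, epochs), "updates")
print("per-epoch display:", epochs, "line(s)")
```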

If you do want per-batch updates on your model in the notebook, there is a package called tqdm that you can use to replace the default progress bar. I’d recommend checking out this issue: https://github.com/fchollet/keras/issues/4880

iNLyze · April 23, 2017, 8:25pm

I’d rather say this looks like running out of memory, either GPU memory or CPU RAM. If you run out of CPU RAM you typically won’t notice unless you are monitoring with top and see free memory going down. At least for me, this is the most frequent reason for dying Python kernels. Usually the rest of the system stays fine, which is good. I sometimes have situations where one model is running fine on the GPU while another notebook runs out of CPU RAM. The latter dies, the former stays active. That’s rather benign.
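
Besides watching top, the process can report its own peak memory from inside the notebook. This is a minimal sketch using Python's standard resource module, which is only available on Unix-like systems; the helper name peak_rss_mb is made up for illustration.

```python
import resource
import sys


def peak_rss_mb():
    """Peak resident set size of the current process, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


print("Peak resident memory so far: {:.0f} MiB".format(peak_rss_mb()))
```

Running this in a cell after loading the training data gives a quick check of how close the kernel is to the machine's RAM.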

simoneva · April 24, 2017, 3:53pm

There is a bug in Jupyter’s handling of progress bars. The best solution is to set verbose=0 and install tqdm_keras, which gives working progress bars.
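
A minimal sketch of that setup follows. It rests on assumptions: the package I know for this is published as keras-tqdm and imported as keras_tqdm, exposing TQDMNotebookCallback, which may not be exactly what the post calls tqdm_keras. The wrapper fit_quietly is a hypothetical helper, not part of Keras.

```python
try:
    # Assumption: installed with `pip install keras-tqdm`, imported as keras_tqdm.
    from keras_tqdm import TQDMNotebookCallback
except ImportError:
    TQDMNotebookCallback = None  # fall back to silent training


def fit_quietly(model, *args, **kwargs):
    """Call model.fit with the built-in progress bar off (hypothetical helper)."""
    kwargs["verbose"] = 0  # silence the default per-batch progress bar
    callbacks = list(kwargs.get("callbacks") or [])
    if TQDMNotebookCallback is not None:
        # Use the tqdm-based notebook progress bar instead.
        callbacks.append(TQDMNotebookCallback())
    kwargs["callbacks"] = callbacks
    return model.fit(*args, **kwargs)
```

With the call from the question, this would be used as fit_quietly(model, map_train, y_train, batch_size=128, nb_epoch=1, validation_data=(map_valid, y_valid)).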

Surya501 · April 25, 2017, 7:28am

There is no way to upvote… this is the solution to your issue. Also, don’t forget to check your GPU usage, as others suggested.

benediktschifferer · April 29, 2017, 8:35am

hello @simoneva, thank you very much… that worked really well for me

mribbons · May 27, 2017, 10:31pm

I had this issue on a server I built myself; the AWS AMI didn’t have the problem for me.

The telltale sign of this issue is that after the Epoch x/y [=====>…] output you see a bunch of rectangles or other characters, similar to this post:

Note that I don’t agree with the answers on that post. It should work without tqdm_keras, which is just a nice enhancement.

I was able to resolve the issue by running

conda update --all

Then restart jupyter.

WaterRocket8236 · October 23, 2017, 9:49am

This is a minor bug in Jupyter. I encountered the same issue a few days ago. Reinstalling TensorFlow and Anaconda helped me overcome it.

shuishoudage · February 1, 2018, 12:10am

@benediktschifferer
I have my own workstation. I encountered the same problem when training the lesson 1 vgg16 model on my local machine: jupyter-notebook would completely power off my GPU. The issue is caused by the jupyter-notebook progress bar. I applied @simoneva’s solution and it worked for me. I think the reason you cannot view the notebook is that it crashed your GPU.

tmcanty · May 2, 2018, 6:47pm

What is the impact of lowering the batch size on the rate at which the learner converges? If we decrease the batch size, should we increase the number of cycles and epochs?

What about the trade-off in the sz that we pass into get_data?