Lesson 3 on a GTX 1080 (8 GB) + TF allow_growth=True: final_model.fit_generator -> OOM [RESOLVED: use the Theano backend instead of TensorFlow]

Hi,

I am trying to run the lesson 3 notebook on a standard Hetzner ex51-ssd-gpu system (GTX 1080 [8 GB], 64 GB of RAM), but I keep getting this OOM error when running this cell:

final_model.fit_generator(batches, samples_per_epoch=batches.nb_sample, nb_epoch=1, 
                        validation_data=val_batches, nb_val_samples=val_batches.nb_sample)

Error message:

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[32,64,226,226]
	 [[Node: Conv2D_40 = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](transpose_156, transpose_157)]]
	 [[Node: mul_64/_1211 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_2232_mul_64", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

Jeremy noted somewhere that TensorFlow can be configured to use GPU memory more efficiently, so I added this option in my first cell (it didn't help):

from keras import backend as K

def limit_mem():
    # Replace the default TensorFlow session with one that allocates
    # GPU memory on demand instead of grabbing it all up front.
    K.get_session().close()
    cfg = K.tf.ConfigProto()
    cfg.gpu_options.allow_growth = True
    K.set_session(K.tf.Session(config=cfg))

limit_mem()
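
For reference, the same session trick can also cap TensorFlow at a fixed share of the card's memory (useful when something else needs the GPU). This is only a sketch using the same Keras 1.x / TF 1.x session API as above, and limit_mem_fraction is just a name I made up:

def limit_mem_fraction(fraction=0.8):
    # Sketch: cap TensorFlow at a fixed fraction of GPU memory and still
    # let it allocate incrementally; 0.8 is an arbitrary example value.
    K.get_session().close()
    cfg = K.tf.ConfigProto()
    cfg.gpu_options.allow_growth = True
    cfg.gpu_options.per_process_gpu_memory_fraction = fraction
    K.set_session(K.tf.Session(config=cfg))

limit_mem_fraction()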

I also tried halving the sample counts used for fitting, like so (didn't help):

final_model.fit_generator(batches, samples_per_epoch=batches.nb_sample/2, nb_epoch=1,
                          validation_data=val_batches, nb_val_samples=val_batches.nb_sample/2)
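
(Note: samples_per_epoch only controls how many samples are drawn per epoch; the per-step memory footprint is set by the generator's batch size, which is the leading 32 in the OOM tensor shape above. Below is a sketch of recreating the generators with a smaller batch, assuming the course's get_batches helper from utils.py and a path variable pointing at the data directory.)

# GPU memory per step scales with batch_size, not samples_per_epoch,
# so shrink the batch instead (16 is an arbitrary example).
batch_size = 16

batches     = get_batches(path + 'train', batch_size=batch_size)
val_batches = get_batches(path + 'valid', batch_size=batch_size * 2, shuffle=False)

final_model.fit_generator(batches, samples_per_epoch=batches.nb_sample, nb_epoch=1,
                          validation_data=val_batches, nb_val_samples=val_batches.nb_sample)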

nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 381.22                 Driver Version: 381.22                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 0000:01:00.0     Off |                  N/A |
| 33%   33C    P8    10W / 180W |   7843MiB /  8113MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     29731    C   /usr/bin/python2                              7829MiB |
+-----------------------------------------------------------------------------+

Any other ideas I could try?

You don't have another notebook still running in the background, like a previous version or another Fast.ai lesson notebook?

I sometimes get those OOM messages until I make sure all other notebooks are closed with the "Shutdown" button (i.e. the icon goes from green to black in the Jupyter file manager).
Then I do a proper 'Kernel -> Restart and Clear Output' on the current notebook, or even a full restart of Jupyter Notebook from the terminal/command line.

Edit 02/08: even faster, you can also see (and close) all currently running notebooks by clicking the "Running" tab, to the right of "Files" below the Jupyter logo.

E.

If there were another notebook running, nvidia-smi would show more than one PID (I just tried opening a second notebook)... And yes, I already tried restarting, but I never got past cell #57 so far. I suppose I could free GPU memory somehow, but I'm not sure what I could remove. If I run limit_mem() just before #57, I get TensorFlow errors about uninitialized variables:

FailedPreconditionError (see above for traceback): Attempting to use uninitialized value convolution2d_1_W
	 [[Node: convolution2d_1_W/read = Identity[T=DT_FLOAT, _class=["loc:@convolution2d_1_W"], _device="/job:localhost/replica:0/task:0/gpu:0"](convolution2d_1_W)]]
	 [[Node: Mean_23/_77 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_2230_Mean_23", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

It seems that TensorFlow thinks bn_model still needs its GPU memory, so there is no GPU RAM left for final_model.
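
In other words, limit_mem() only helps if it runs before any model is built: closing the session throws away weights that are already initialised on the GPU, which is exactly the FailedPreconditionError above. A sketch of the ordering that avoids it:

from keras import backend as K

def limit_mem():
    K.get_session().close()
    cfg = K.tf.ConfigProto()
    cfg.gpu_options.allow_growth = True
    K.set_session(K.tf.Session(config=cfg))

# Replace the session *before* any Keras model exists, otherwise
# already-initialised weights (e.g. bn_model's) are lost.
limit_mem()

# ... only now define bn_model, final_model, etc. and call fit_generator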

Hi @gai, the models in part 1 are compatible with the Theano backend only.


Oh, so I configured Keras incorrectly by pointing it at TensorFlow?
I am retrying now using the ~/.keras/keras.json described here:

OK, I just saw that the notebook now runs through when using the Theano backend. Thanks a lot!
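
For anyone else hitting this: a typical ~/.keras/keras.json for the Theano backend under Keras 1.x looks roughly like the sketch below (the field values are the Keras 1 defaults plus the Theano settings, not the exact file from this thread). The "image_dim_ordering": "th" setting matters because the part 1 notebooks assume Theano's channels-first layout.

{
    "image_dim_ordering": "th",
    "epsilon": 1e-07,
    "floatx": "float32",
    "backend": "theano"
}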

This was helpful, thanks.
