Lesson 3 Advanced Discussion ✅

When I try to use fp16 training in the Lesson 3 notebook (planet dataset), the kernel dies. I am not sure why it happens, or whether it is related to my setup or drivers. I guess similar problems are discussed in this thread. In my case, I am getting a KeyboardInterrupt exception, but I guess it could be something else on other machines:

---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-17-be1ab4476b35> in <module>
----> 1 learn.fit_one_cycle(5, slice(lr))

~/code/fastai_v1/repo/fastai/train.py in fit_one_cycle(learn, cyc_len, max_lr, moms, div_factor, pct_start, wd, callbacks, **kwargs)
     20     callbacks.append(OneCycleScheduler(learn, max_lr, moms=moms, div_factor=div_factor,
     21                                         pct_start=pct_start, **kwargs))
---> 22     learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
     23 
     24 def lr_find(learn:Learner, start_lr:Floats=1e-7, end_lr:Floats=10, num_it:int=100, stop_div:bool=True, **kwargs:Any):

~/code/fastai_v1/repo/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    160         callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
    161         fit(epochs, self.model, self.loss_func, opt=self.opt, data=self.data, metrics=self.metrics,
--> 162             callbacks=self.callbacks+callbacks)
    163 
    164     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

~/code/fastai_v1/repo/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     82             for xb,yb in progress_bar(data.train_dl, parent=pbar):
     83                 xb, yb = cb_handler.on_batch_begin(xb, yb)
---> 84                 loss = loss_batch(model, xb, yb, loss_func, opt, cb_handler)
     85                 if cb_handler.on_batch_end(loss): break
     86 

~/code/fastai_v1/repo/fastai/basic_train.py in loss_batch(model, xb, yb, loss_func, opt, cb_handler)
     28         opt.step()
     29         cb_handler.on_step_end()
---> 30         opt.zero_grad()
     31 
     32     return loss.detach().cpu()

~/code/fastai_v1/repo/fastai/callback.py in zero_grad(self)
     42     def zero_grad(self)->None:
     43         "Clear optimizer gradients."
---> 44         self.opt.zero_grad()
     45 
     46     #Hyperparameters as properties

~/anaconda3/envs/fastai/lib/python3.7/site-packages/torch/optim/optimizer.py in zero_grad(self)
    161                 if p.grad is not None:
    162                     p.grad.detach_()
--> 163                     p.grad.zero_()
    164 
    165     def step(self, closure):

KeyboardInterrupt:

Basically, nvidia-smi shows that the GPU is used and that memory consumption is about half of the total memory; then it just fails.
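
For reference, GPU memory can also be checked from inside the notebook with standard torch calls (a small sketch; these figures reflect PyTorch's caching allocator, so they can differ from what nvidia-smi reports):

import torch

print(torch.cuda.get_device_name(0))
print(f"allocated: {torch.cuda.memory_allocated() / 1024**2:.0f} MB")
print(f"cached:    {torch.cuda.memory_cached() / 1024**2:.0f} MB")
print(f"peak:      {torch.cuda.max_memory_allocated() / 1024**2:.0f} MB")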

The only change I’ve introduced into the original notebook’s code is the to_fp16 call:

learn = create_cnn(data, arch, metrics=[acc_02, f_score]).to_fp16()
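
For anyone wondering what to_fp16 roughly does under the hood, my understanding is that it follows the usual master-weights / loss-scaling pattern. A minimal pure-PyTorch sketch (the loss_scale value, names, and loop are illustrative only, not fastai’s actual implementation; train_dl and loss_func come from the notebook):

import torch

model = model.cuda()
# keep an fp32 "master" copy of the weights for the optimizer to update
master_params = [p.detach().clone().float().requires_grad_(True) for p in model.parameters()]
model = model.half()                      # fp16 copy used for forward/backward
opt = torch.optim.SGD(master_params, lr=1e-2)
loss_scale = 512.0                        # keeps small fp16 gradients from underflowing to zero

for xb, yb in train_dl:
    xb, yb = xb.cuda().half(), yb.cuda()
    loss = loss_func(model(xb).float(), yb)   # loss computed in fp32
    (loss * loss_scale).backward()
    # copy fp16 gradients into the fp32 master weights and unscale them
    for master, p in zip(master_params, model.parameters()):
        master.grad = p.grad.detach().float() / loss_scale
    opt.step()                            # update the fp32 master weights
    opt.zero_grad()
    with torch.no_grad():
        for master, p in zip(master_params, model.parameters()):
            p.copy_(master)               # copy updated weights back into the fp16 model
    for p in model.parameters():
        p.grad = None

Note that the traceback above points at opt.zero_grad(), but since CUDA errors are reported asynchronously, the real failure may have happened earlier in the step.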

Update: I am also getting this error:

RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /opt/conda/conda-bld/pytorch-nightly_1541148374828/work/aten/src/THC/generic/THCTensorCopy.cpp:20

It was already mentioned earlier in this thread as well:

It seems that mixed-precision training is a bit broken, or maybe it is something with the drivers? Do we need to rebuild PyTorch from source to solve these issues? I am using PyTorch with CUDA 9.2 and the 410 driver.
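
This is what I use to dump the relevant environment details from the notebook, in case it is a version mismatch (just standard torch calls, nothing fastai-specific):

import torch

print(torch.__version__)                    # PyTorch build
print(torch.version.cuda)                   # CUDA version PyTorch was built against
print(torch.backends.cudnn.version())       # cuDNN version
print(torch.cuda.get_device_name(0))        # GPU model
print(torch.cuda.get_device_capability(0))  # compute capability (tensor cores need 7.0+)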
