When I try to use fp16 training in the Lesson 3 notebook (planet dataset), the kernel dies. I am not sure why it happens, or whether it is something related to my setup or drivers. I guess similar problems are discussed in this thread. In my case, I am getting a KeyboardInterrupt
exception, but I guess it could be something else on other machines:
---------------------------------------------------------------------------
KeyboardInterrupt Traceback (most recent call last)
<ipython-input-17-be1ab4476b35> in <module>
----> 1 learn.fit_one_cycle(5, slice(lr))
~/code/fastai_v1/repo/fastai/train.py in fit_one_cycle(learn, cyc_len, max_lr, moms, div_factor, pct_start, wd, callbacks, **kwargs)
20 callbacks.append(OneCycleScheduler(learn, max_lr, moms=moms, div_factor=div_factor,
21 pct_start=pct_start, **kwargs))
---> 22 learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
23
24 def lr_find(learn:Learner, start_lr:Floats=1e-7, end_lr:Floats=10, num_it:int=100, stop_div:bool=True, **kwargs:Any):
~/code/fastai_v1/repo/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
160 callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
161 fit(epochs, self.model, self.loss_func, opt=self.opt, data=self.data, metrics=self.metrics,
--> 162 callbacks=self.callbacks+callbacks)
163
164 def create_opt(self, lr:Floats, wd:Floats=0.)->None:
~/code/fastai_v1/repo/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
82 for xb,yb in progress_bar(data.train_dl, parent=pbar):
83 xb, yb = cb_handler.on_batch_begin(xb, yb)
---> 84 loss = loss_batch(model, xb, yb, loss_func, opt, cb_handler)
85 if cb_handler.on_batch_end(loss): break
86
~/code/fastai_v1/repo/fastai/basic_train.py in loss_batch(model, xb, yb, loss_func, opt, cb_handler)
28 opt.step()
29 cb_handler.on_step_end()
---> 30 opt.zero_grad()
31
32 return loss.detach().cpu()
~/code/fastai_v1/repo/fastai/callback.py in zero_grad(self)
42 def zero_grad(self)->None:
43 "Clear optimizer gradients."
---> 44 self.opt.zero_grad()
45
46 #Hyperparameters as properties
~/anaconda3/envs/fastai/lib/python3.7/site-packages/torch/optim/optimizer.py in zero_grad(self)
161 if p.grad is not None:
162 p.grad.detach_()
--> 163 p.grad.zero_()
164
165 def step(self, closure):
KeyboardInterrupt:
Basically, nvidia-smi
shows that the GPU is in use, and memory consumption is about half of the total memory. Then it just fails.
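To cross-check what nvidia-smi reports, this is a small sketch (plain torch.cuda calls, run in the same notebook right before the training cell) that prints how much memory PyTorch itself has allocated on the device:
import torch

# Compare PyTorch's own view of GPU memory with what nvidia-smi shows.
def report_gpu_memory(device=0):
    allocated = torch.cuda.memory_allocated(device) / 1024**3
    cached = torch.cuda.memory_cached(device) / 1024**3      # held by the caching allocator
    total = torch.cuda.get_device_properties(device).total_memory / 1024**3
    print(f'allocated {allocated:.2f} GiB, cached {cached:.2f} GiB, total {total:.2f} GiB')

report_gpu_memory()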
The only change I've introduced into the original notebook's code is the to_fp16
call:
learn = create_cnn(data, arch, metrics=[acc_02, f_score]).to_fp16()
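For context, the surrounding cells look roughly like this (arch and data come from earlier cells, exactly as in the lesson notebook). I have not checked whether to_fp16 accepts a loss_scale argument in this build, but if it does, that would be another thing to experiment with:
from fastai.vision import *

# Defined earlier in the planet notebook
acc_02 = partial(accuracy_thresh, thresh=0.2)
f_score = partial(fbeta, thresh=0.2)

learn = create_cnn(data, arch, metrics=[acc_02, f_score]).to_fp16()

# Untried variant: pass an explicit loss scale to the mixed-precision wrapper,
# assuming this version of to_fp16 takes that argument.
# learn = create_cnn(data, arch, metrics=[acc_02, f_score]).to_fp16(loss_scale=512)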
Update: I am also getting this error:
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /opt/conda/conda-bld/pytorch-nightly_1541148374828/work/aten/src/THC/generic/THCTensorCopy.cpp:20
It was already mentioned in this thread as well.
It seems that mixed-precision training is a bit broken, or maybe it is something with the drivers? Do we need to rebuild PyTorch from source to solve these issues? I am using PyTorch with CUDA 9.2 and the 410 driver.
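For completeness, this is how I can dump the exact environment from within the notebook (all standard PyTorch calls); as far as I understand, fp16 training only really pays off on GPUs with compute capability 6.0 or higher, and tensor cores need 7.0:
import torch

print(torch.__version__)                    # PyTorch build
print(torch.version.cuda)                   # CUDA version PyTorch was compiled against
print(torch.backends.cudnn.version())       # cuDNN version
print(torch.cuda.get_device_name(0))        # GPU model
print(torch.cuda.get_device_capability(0))  # compute capability, e.g. (7, 0) for a V100

# For the illegal-memory-access error, launching the notebook with
# CUDA_LAUNCH_BLOCKING=1 should make the stack trace point at the actual failing kernel.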