Lesson 3 Advanced Discussion ✅

This is a place to talk about more advanced or tangential topics related to the Lesson 3 lecture. This will not be monitored during class, but we will read it afterwards.

Feel free to discuss anything you like, as long as it’s at least somewhat related to what’s happening in class.

Here is an advanced question on the learning rate (LR).
Many times I get an LR curve with this pattern:

  1. Decreasing from the top (expected)
  2. After reaching a low, it rises a bit, then stays flat for many iterations
  3. Rising a bit, coming down, not yet rocketing
  4. After many iterations, it finally rockets

From which part of the curve should we choose the LR here: the flat region that lasts for many iterations, or the region where the loss decreases sharply?
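One heuristic that is often suggested for curves like this is to pick an LR from the region where the loss is falling most steeply, rather than from the flat region or the point just before it rockets. The sketch below is only an illustration of that heuristic on made-up numbers: the `lrs` and `losses` lists are synthetic, and `steepest_descent_lr` is a hypothetical helper, not a fastai function.

```python
import math

def steepest_descent_lr(lrs, losses):
    """Return the LR at the start of the segment where the loss
    falls fastest per decade of learning rate."""
    best_index, best_slope = 0, 0.0
    for i in range(len(lrs) - 1):
        # Slope of loss against log10(lr) between two neighbouring points.
        decades = math.log10(lrs[i + 1]) - math.log10(lrs[i])
        slope = (losses[i + 1] - losses[i]) / decades
        if slope < best_slope:
            best_index, best_slope = i, slope
    return lrs[best_index]

# Synthetic curve: slow decrease, sharp drop, flat stretch, then blow-up.
lrs = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]
losses = [0.95, 0.94, 0.90, 0.60, 0.58, 0.59, 3.00]

suggested_lr = steepest_descent_lr(lrs, losses)
print(suggested_lr)
```

On real curves the loss is noisy, so some smoothing before taking slopes would probably be needed; that step is left out here to keep the sketch short.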

In previous courses, one of the steps was to take the complete dataset and resize everything to something like 256, and have a separate dataset for maybe 512 later on. For planets and camvid, this no longer seems to be the case.

Is this because of the new transformation API? Don’t we still have the overhead of opening large images? I also notice the transform library uses PIL and not OpenCV. Should we be using Pillow-SIMD? What are the best practices now?


For the resnet architectures, how do we know if the loss function for each is convex or non-convex? If they are nonconvex, does fast.ai automatically run multiple different starting points?


When I try to use fp16 training in the Lesson 3 notebook (planet dataset), the kernel dies. I'm not sure why it happens, or whether it is related to my setup or drivers. I guess similar problems are discussed in this thread. In my case, I am getting a KeyboardInterrupt exception, but I guess it could be something else on other machines:

KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-17-be1ab4476b35> in <module>
----> 1 learn.fit_one_cycle(5, slice(lr))

~/code/fastai_v1/repo/fastai/train.py in fit_one_cycle(learn, cyc_len, max_lr, moms, div_factor, pct_start, wd, callbacks, **kwargs)
     20     callbacks.append(OneCycleScheduler(learn, max_lr, moms=moms, div_factor=div_factor,
     21                                         pct_start=pct_start, **kwargs))
---> 22     learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
     24 def lr_find(learn:Learner, start_lr:Floats=1e-7, end_lr:Floats=10, num_it:int=100, stop_div:bool=True, **kwargs:Any):

~/code/fastai_v1/repo/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    160         callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
    161         fit(epochs, self.model, self.loss_func, opt=self.opt, data=self.data, metrics=self.metrics,
--> 162             callbacks=self.callbacks+callbacks)
    164     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

~/code/fastai_v1/repo/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     82             for xb,yb in progress_bar(data.train_dl, parent=pbar):
     83                 xb, yb = cb_handler.on_batch_begin(xb, yb)
---> 84                 loss = loss_batch(model, xb, yb, loss_func, opt, cb_handler)
     85                 if cb_handler.on_batch_end(loss): break

~/code/fastai_v1/repo/fastai/basic_train.py in loss_batch(model, xb, yb, loss_func, opt, cb_handler)
     28         opt.step()
     29         cb_handler.on_step_end()
---> 30         opt.zero_grad()
     32     return loss.detach().cpu()

~/code/fastai_v1/repo/fastai/callback.py in zero_grad(self)
     42     def zero_grad(self)->None:
     43         "Clear optimizer gradients."
---> 44         self.opt.zero_grad()
     46     #Hyperparameters as properties

~/anaconda3/envs/fastai/lib/python3.7/site-packages/torch/optim/optimizer.py in zero_grad(self)
    161                 if p.grad is not None:
    162                     p.grad.detach_()
--> 163                     p.grad.zero_()
    165     def step(self, closure):


Basically, nvidia-smi shows that the GPU is in use, and memory consumption is about half of total memory. Then it just fails.

The only change I’ve introduced into the original notebook’s code is the to_fp16 call:

learn = create_cnn(data, arch, metrics=[acc_02, f_score]).to_fp16()

Update: Also I am getting this error:

RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /opt/conda/conda-bld/pytorch-nightly_1541148374828/work/aten/src/THC/generic/THCTensorCopy.cpp:20

It was already mentioned in this thread as well:

It seems that mixed-precision training is a bit broken. Or is it something with the drivers? Do we need to rebuild PyTorch from source to solve these issues? I am using PyTorch with CUDA 9.2 and the 410 driver.


What do you mean by, “if the loss function for each is convex or non convex?”

For our simple example, the loss function is (y(act)-y(pred))^2 = (y-Ax)^2. If you plot this function, you’ll see that it has one global minimum (and not a series of several minima). So, once you see that your gradient is close to zero, you know you’ve found the one and only minimum. For a convex function, any local minimum is also a global minimum (and, when the function is twice differentiable, its Hessian is positive semi-definite everywhere). However, there are functions that are non-convex and have multiple local minima. Think about a sine wave or various polynomial functions. In these cases, you can get “stuck” in a local minimum that may be much worse than the true overall minimum. In this context, if your parameters come from a poor local minimum, there are much better parameters out there at the global minimum.
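To make the difference concrete, here is a minimal sketch of plain gradient descent started from two different points. The two functions are my own toy choices for illustration (they are not from the lesson): the convex one is (x-3)^2, and the non-convex one is x^4 - 3x^2 + x, which has a local minimum and a separate, lower global minimum.

```python
def gradient_descent(grad, x0, lr=0.01, steps=2000):
    """Plain gradient descent on a one-dimensional function."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Convex toy function: f(x) = (x - 3)^2, with a single minimum at x = 3.
def convex_grad(x):
    return 2 * (x - 3)

# Non-convex toy function: f(x) = x^4 - 3x^2 + x, with two separate minima.
def nonconvex(x):
    return x**4 - 3 * x**2 + x

def nonconvex_grad(x):
    return 4 * x**3 - 6 * x + 1

# Same two starting points for both functions.
convex_from_left = gradient_descent(convex_grad, -2.0)
convex_from_right = gradient_descent(convex_grad, 2.0)
nonconvex_from_left = gradient_descent(nonconvex_grad, -2.0)
nonconvex_from_right = gradient_descent(nonconvex_grad, 2.0)

print("convex:    ", convex_from_left, convex_from_right)
print("non-convex:", nonconvex_from_left, nonconvex_from_right)
print("losses:    ", nonconvex(nonconvex_from_left), nonconvex(nonconvex_from_right))
```

On the convex function both starts end up at the same point. On the non-convex one, the start at x = 2 settles in the higher local minimum, while the start at x = -2 finds the lower one, which is the "stuck" situation described above.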


[Image: a nonconvex function]

Here’s an example of what a nonconvex function with lots of local minima looks like.


Optimizing neural networks is generally a non-convex problem. There’s a paper somewhere that basically showed you never find the global minimum, but there are many local minima that are very close to it so they’re “good enough”.


Found the paper I was thinking of, it’s here: https://arxiv.org/abs/1412.0233


Thanks, will have to check that out!

Thanks @wdhorton that’s very interesting. Explains why larger networks are advantageous in getting close to the global minimum.


Other than medical imaging, what are some of the practical use cases for image segmentation? Jeremy mentioned self-driving cars, but I can’t imagine the effort that goes into pixelwise labelling of millions of images from self-driving car cameras. Is there a way to fast-track the labelling process?

Another practical use case of image segmentation: seismic imagery https://www.kaggle.com/c/tgs-salt-identification-challenge. It’s also commonly used for satellite imagery.


What about the loss function for multi-label classification? Does the same loss function (cross entropy) work for multi-label classification as well?
In Keras, I use binary cross entropy + sigmoid for multi-label; it’s not clear how fastai takes care of this.


Sorry, I answered in the other chat before I saw you posted here:


Cross-entropy is a generalization of binary logloss for multiple (>2) classes.
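A small numeric sketch may help connect the two posts above. It checks that cross-entropy over a two-class softmax gives the same number as binary log loss, and then shows the per-label sigmoid + binary cross entropy setup that the question describes using in Keras. Everything here is plain Python written for illustration; the helper names and the example logits are assumptions, and nothing below shows what fastai does internally.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def binary_log_loss(p, y):
    """Log loss for one predicted probability p and a 0/1 target y."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def cross_entropy(probs, target_index):
    """Cross-entropy for one example with a single correct class."""
    return -math.log(probs[target_index])

# Two classes: softmax over logits [0, z] gives P(class 1) = sigmoid(z),
# so cross-entropy and binary log loss agree.
z = 1.5
two_class_ce = cross_entropy(softmax([0.0, z]), 1)
binary_loss = binary_log_loss(sigmoid(z), 1)
print(two_class_ce, binary_loss)

# Multi-label: each label gets its own sigmoid and its own binary log loss,
# so several labels can be "on" at once.
def multilabel_bce(logits, targets):
    losses = [binary_log_loss(sigmoid(z), y) for z, y in zip(logits, targets)]
    return sum(losses) / len(losses)

logits = [2.0, -1.0, 0.5]
targets = [1, 0, 1]
multilabel_loss = multilabel_bce(logits, targets)
print(multilabel_loss)

# Softmax probabilities sum to 1 (one winner); sigmoid probabilities do not.
print(sum(softmax(logits)), sum(sigmoid(z) for z in logits))
```

The last line is the practical reason softmax cross-entropy is a poor fit when an image can carry several labels: softmax forces the labels to compete for a total probability of 1, while independent sigmoids do not.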

Is leaky ReLU used more than ReLU?


Once an image is segmented, is there a way to identify the coordinates of the segmented part?
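One possible approach, assuming the segmentation output is available as a 2D grid of class ids (one id per pixel), is to collect the pixel positions that carry the class of interest and then take their bounding box. The sketch below uses a tiny hand-written mask and hypothetical helper names; it is not a fastai API.

```python
def mask_coordinates(mask, class_id):
    """Return (row, col) positions of every pixel labelled class_id."""
    return [
        (r, c)
        for r, row in enumerate(mask)
        for c, value in enumerate(row)
        if value == class_id
    ]

def bounding_box(coords):
    """Return (min_row, min_col, max_row, max_col) around the given pixels."""
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return min(rows), min(cols), max(rows), max(cols)

# Toy 4x5 mask: 0 is background, 1 is the segmented object.
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

coords = mask_coordinates(mask, 1)
box = bounding_box(coords)
print(coords)
print(box)
```

If a class appears as several separate blobs, this gives one box around all of them; splitting them apart would need a connected-components step, which is not shown here.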

Mainly in LSTMs. I think it’s because of the vanishing gradient problem. I don’t have any reference to back up my argument. It’s just a practice that I have observed.