Lesson 3 Advanced Discussion ✅

rachel · November 9, 2018, 2:28am

This is a place to talk about more advanced or tangential topics related to the Lesson 3 lecture. This will not be monitored during class, but we will read it afterwards.

Feel free to discuss anything you like, as long as it’s at least somewhat related to what’s happening in class.

champs.jaideep · November 9, 2018, 2:35am

Here is an advanced question on the learning rate (LR).
I often get an LR curve with this pattern:

1) It decreases from the top (expected).
2) After reaching a low point, it rises a bit, then stays flat for many iterations.
3) It rises a bit and comes down again, but does not rocket yet.
4) After many more iterations, it finally rockets.

From which part of the curve should we choose the LR here: the region that stays flat for many iterations, or the sharply decreasing one?
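
One common heuristic for a curve like this is to pick the learning rate where the smoothed loss is falling fastest. That skips both the long flat stretch (slope near zero) and the final blow-up (positive slope). The sketch below assumes the recorded learning rates and losses are available as plain Python lists; `suggest_lr` and its `smooth` parameter are hypothetical names for illustration, not part of fastai.

```python
import math

def suggest_lr(lrs, losses, smooth=0.9):
    """Return the lr where the smoothed loss drops fastest per unit of log10(lr)."""
    # Exponentially smooth the losses so a single noisy batch does not dominate.
    smoothed, avg = [], 0.0
    for i, loss in enumerate(losses):
        avg = smooth * avg + (1 - smooth) * loss
        smoothed.append(avg / (1 - smooth ** (i + 1)))  # bias correction

    # Find the most negative slope between consecutive points in log10(lr) space.
    best_i, best_slope = None, 0.0
    for i in range(1, len(lrs)):
        step = math.log10(lrs[i]) - math.log10(lrs[i - 1])
        slope = (smoothed[i] - smoothed[i - 1]) / step
        if slope < best_slope:
            best_i, best_slope = i, slope

    if best_i is None:
        raise ValueError("the loss never decreases over the recorded range")
    return lrs[best_i]
```

The result is only a starting point; it is still worth looking at the plot and trying a value somewhat lower or higher.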

erikg · November 9, 2018, 3:19am

In previous courses, one of the steps was to take the complete dataset and resize everything to something like 256, and have a separate dataset for maybe 512 later on. For planets and camvid, this no longer seems to be the case.

Is this because of the new transformation API? Don’t we still have the overhead of opening large images? I also notice the transform library uses PIL and not OpenCV. Should we be using Pillow-SIMD? What are the best practices now?

sequoia_kings · November 9, 2018, 3:30am

For the resnet architectures, how do we know if the loss function for each is convex or non-convex? If they are nonconvex, does fast.ai automatically run multiple different starting points?

devforfu · November 9, 2018, 3:31am

When I try to use fp16 training in the Lesson 3 notebook (planet dataset), the kernel dies. I am not sure why it happens, or whether it is related to my setup or drivers. I guess similar problems are discussed in this thread. In my case I am getting a KeyboardInterrupt exception, but it could be something else on other machines:

---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-17-be1ab4476b35> in <module>
----> 1 learn.fit_one_cycle(5, slice(lr))

~/code/fastai_v1/repo/fastai/train.py in fit_one_cycle(learn, cyc_len, max_lr, moms, div_factor, pct_start, wd, callbacks, **kwargs)
     20     callbacks.append(OneCycleScheduler(learn, max_lr, moms=moms, div_factor=div_factor,
     21                                         pct_start=pct_start, **kwargs))
---> 22     learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
     23 
     24 def lr_find(learn:Learner, start_lr:Floats=1e-7, end_lr:Floats=10, num_it:int=100, stop_div:bool=True, **kwargs:Any):

~/code/fastai_v1/repo/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    160         callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
    161         fit(epochs, self.model, self.loss_func, opt=self.opt, data=self.data, metrics=self.metrics,
--> 162             callbacks=self.callbacks+callbacks)
    163 
    164     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

~/code/fastai_v1/repo/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     82             for xb,yb in progress_bar(data.train_dl, parent=pbar):
     83                 xb, yb = cb_handler.on_batch_begin(xb, yb)
---> 84                 loss = loss_batch(model, xb, yb, loss_func, opt, cb_handler)
     85                 if cb_handler.on_batch_end(loss): break
     86 

~/code/fastai_v1/repo/fastai/basic_train.py in loss_batch(model, xb, yb, loss_func, opt, cb_handler)
     28         opt.step()
     29         cb_handler.on_step_end()
---> 30         opt.zero_grad()
     31 
     32     return loss.detach().cpu()

~/code/fastai_v1/repo/fastai/callback.py in zero_grad(self)
     42     def zero_grad(self)->None:
     43         "Clear optimizer gradients."
---> 44         self.opt.zero_grad()
     45 
     46     #Hyperparameters as properties

~/anaconda3/envs/fastai/lib/python3.7/site-packages/torch/optim/optimizer.py in zero_grad(self)
    161                 if p.grad is not None:
    162                     p.grad.detach_()
--> 163                     p.grad.zero_()
    164 
    165     def step(self, closure):

KeyboardInterrupt:

Basically, nvidia-smi shows that the GPU is in use, and memory consumption is about half of total memory. Then it just fails.

The only change I’ve introduced into the original notebook’s code is the to_fp16 call:

learn = create_cnn(data, arch, metrics=[acc_02, f_score]).to_fp16()

Update: Also I am getting this error:

RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /opt/conda/conda-bld/pytorch-nightly_1541148374828/work/aten/src/THC/generic/THCTensorCopy.cpp:20

It was already mentioned in this thread as well:

Seems that mixed-precision training is a bit broken. Or probably something with drivers? Do we need to re-build PyTorch from sources to solve these issues? I am using PyTorch with CUDA 9.2 and 410 driver.

paws · November 9, 2018, 3:34am

What do you mean by, “if the loss function for each is convex or non convex?”

sequoia_kings · November 9, 2018, 3:44am

For our simple example, the loss function is (y(act)-y(pred))^2 = (y-Ax)^2. If you plot this function, you’ll see that it has one global minimum (and not a series of many minima). So, once you see that your gradient is close to zero, you know you’ve found the one and only minimum. For a convex function, any local minimum is also a global minimum (and, if the function is twice differentiable, its Hessian is positive semi-definite everywhere). However, there are functions that are non-convex and have multiple minima. Think about a sine wave or various higher-degree polynomials. In these cases, you can get “stuck” in a local minimum that may be much worse than the true overall minimum. In this context, if your parameters come from a poor local minimum, there are much better parameters out there at the global minimum.
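
To make the difference concrete, here is a minimal sketch that runs plain gradient descent from two starting points on a convex function and on a non-convex one. Both functions are one-dimensional toy stand-ins chosen for illustration (they are not a neural network loss), and `gradient_descent` is a hypothetical helper.

```python
def gradient_descent(grad, x0, lr=0.01, steps=2000):
    """Follow the negative gradient from x0 and return the final position."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

def convex_grad(x):
    # Gradient of the convex function f(x) = (x - 2)^2.
    return 2 * (x - 2)

def nonconvex(x):
    # A quartic with two separate minima.
    return x**4 - 3 * x**2 + x

def nonconvex_grad(x):
    return 4 * x**3 - 6 * x + 1

# Convex case: both starting points end up at the same minimum.
convex_ends = [gradient_descent(convex_grad, x0) for x0 in (-5.0, 5.0)]

# Non-convex case: the result depends on where the descent starts.
nonconvex_ends = [gradient_descent(nonconvex_grad, x0) for x0 in (-2.0, 2.0)]

print("convex:", convex_ends)
print("non-convex:", nonconvex_ends, [nonconvex(x) for x in nonconvex_ends])
```

In the non-convex case the run that starts on the right settles in the shallower minimum, which is the “stuck” situation described above.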

sequoia_kings · November 9, 2018, 3:47am

[Image: plot of a nonconvex function]

Here’s an example of what a nonconvex function with lots of local minima looks like.

wdhorton · November 9, 2018, 3:49am

Optimizing neural networks is generally a non-convex problem. There’s a paper somewhere that basically showed you never find the global minimum, but there are many local minimums that are very close to it so they’re “good enough”.

wdhorton · November 9, 2018, 3:50am

Found the paper I was thinking of, it’s here: https://arxiv.org/abs/1412.0233

sequoia_kings · November 9, 2018, 4:00am

Thanks, will have to check that out!

jcatanza · November 9, 2018, 4:02am

Thanks @wdhorton that’s very interesting. Explains why larger networks are advantageous in getting close to the global minimum.

pbanavara · November 9, 2018, 4:06am

Other than medical imaging, what are some of the practical use cases for image segmentation? Jeremy mentioned self-driving cars, but I can’t imagine the effort that goes into pixelwise labelling of millions of images from self-driving car cameras. Is there a way to fast-track the labelling process?

wdhorton · November 9, 2018, 4:07am

Another practical use case of image segmentation: seismic imagery https://www.kaggle.com/c/tgs-salt-identification-challenge. It’s also commonly used for satellite imagery.

krash · November 9, 2018, 4:09am

What about the loss function for multi-label classification? Does the same loss function (cross entropy) work for multi-label classification as well?
In Keras, I use binary cross entropy + sigmoid for multi-label; it’s not clear how fastai takes care of this.

wdhorton · November 9, 2018, 4:11am

Sorry, I answered in the other chat before I saw you posted here:

jcatanza · November 9, 2018, 4:11am

Cross-entropy is a generalization of binary logloss for multiple (>2) classes.
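
The two losses discussed here can be sketched side by side in plain Python. This is only an illustration of the math, assuming raw scores (logits) as input; the function names are hypothetical and nothing here is taken from fastai’s source. The multi-label version scores each label independently with a sigmoid, so several labels can be “on” at once, while the softmax version assumes exactly one correct class.

```python
import math

def binary_cross_entropy_with_logits(logits, targets):
    """Multi-label loss: mean of independent sigmoid + binary cross-entropy terms."""
    total = 0.0
    for z, t in zip(logits, targets):
        # Numerically stable form of -(t*log(sigmoid(z)) + (1-t)*log(1-sigmoid(z))).
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

def softmax_cross_entropy(logits, target_index):
    """Single-label loss: negative log of the softmax probability of the true class."""
    m = max(logits)  # subtract the max before exponentiating for stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]

# Two labels present at once is fine for the multi-label loss.
print(binary_cross_entropy_with_logits([5.0, 5.0, -5.0], [1, 1, 0]))
# The softmax loss takes a single class index instead.
print(softmax_cross_entropy([5.0, 0.0, -5.0], 0))
```

In PyTorch the corresponding built-ins are `torch.nn.BCEWithLogitsLoss` and `torch.nn.CrossEntropyLoss`.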

miwojc · November 9, 2018, 4:45am

Is leaky ReLU used more than ReLU?

harikrishnanrajeev · November 9, 2018, 5:06am

Once an image is segmented, is there a way to identify the coordinates of the segmented part?
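
One simple way to do this, assuming the segmentation output is a 2D grid of per-pixel class ids, is to collect the (row, column) positions of the class of interest and take their extremes as a bounding box. The sketch below uses plain nested lists; `region_coordinates` is a hypothetical helper, and the tiny mask is made up for illustration.

```python
def region_coordinates(mask, class_id):
    """Return the pixel coordinates of class_id and their bounding box.

    The bounding box is (min_row, min_col, max_row, max_col), or None
    when the class does not appear in the mask.
    """
    coords = [
        (r, c)
        for r, row in enumerate(mask)
        for c, value in enumerate(row)
        if value == class_id
    ]
    if not coords:
        return [], None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return coords, (min(rows), min(cols), max(rows), max(cols))

# Made-up 4x5 mask with background (0) and two classes (1 and 2).
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 2],
]
coords, bbox = region_coordinates(mask, class_id=1)
print(coords, bbox)
```

Note that this treats all pixels of a class as one region; separating several disconnected objects of the same class would need a connected-components step on top.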

SHAR1 · November 9, 2018, 6:54am

Mainly in LSTMs; I think it’s because of the vanishing gradient problem. I don’t have any reference to back up my argument. It’s just a practice that I have observed.