Wiki: Lesson 2

YJP · February 11, 2018, 2:57am

Hello,

In your code, there are three places to modify:

log_preds, y = learn.TTA(is_test=True)
probs = np.mean(np.exp(log_preds), axis=0)
accuracy_np(probs, y), metrics.log_loss(y, probs)

If you search for the error message you get, it has been discussed somewhere in the forum and you will usually find an answer there. Hope this helps.
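
As a rough illustration of the averaging step in the snippet above, here is a minimal NumPy sketch. The values are made up, and the array layout (augmentations, images, classes) is an assumption about what learn.TTA returns, not something stated in this thread.

```python
import numpy as np

# Hypothetical probabilities: 2 augmented passes, 3 images, 2 classes.
probs_per_aug = np.array([
    [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]],
    [[0.7, 0.3], [0.4, 0.6], [0.5, 0.5]],
])
# Stand-in for the log-probabilities a model would return.
log_preds = np.log(probs_per_aug)

# Convert back to probabilities, then average over the augmentation axis.
probs = np.mean(np.exp(log_preds), axis=0)

print(probs.shape)  # (3, 2): one row of class probabilities per image
```

Averaging after np.exp means the mean is taken over probabilities rather than over log-probabilities, so each row of probs still sums to 1.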

nikhil.ikhar · February 11, 2018, 11:49am

I have a question. I’m trying to solve a multi-class prediction problem on the Kaggle fruits-classification data set.
Data screen shot

I have tried classification as explained in Lessons 1 and 2, but my approach is failing. My notebook.

First I keep lr=0.1 and n_cycle=3. I’m not clear whether this training is resulting in overfitting.

Next, I keep lrs=0.01 and n_cycle=3. The error looks comparable.

Here y should be a 2D array, as explained in Lesson 2, but I’m getting a 1D array.

As suggested in Lesson 1, I tried to visualize my predictions, and they are incorrect in many places.

Any help on where I’m going wrong would be really appreciated.

-Nikhil

balnazzar · February 11, 2018, 12:14pm

@jeremy chose satellite imgs just to point out that kind of qualitative difference.

When you trained the earliest layers of Resnet on your dogs/cats dataset, you set a learning rate TWO orders of magnitude lower than that of the last two layers (the ones added by us). That was because you did NOT want to spoil those layers’ weights: dogs and cats are very similar to the Imagenet images on which they were laboriously tuned.

The same holds, to a lesser extent, for the middle layers (one order of magnitude): the ones that recognize slightly more complex patterns.

Now you have to unfreeze and train the earlier and middle layers on images that are qualitatively more different from Imagenet’s images, so you have to perturb them a lot more.

Indeed, if lr = 10^-2 is the learning rate of the late layers, lr/9 is a lot MORE than lr * 10^-2 (= 10^-4), and the same holds for lr/3 (that is, 10^-2 * (1/3)) vs 10^-3.
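
The arithmetic above can be checked with a few lines of plain Python. This is only a sketch: the three-group layout (early, middle, late layers) follows the description in this post, and the variable names are made up for illustration.

```python
lr = 1e-2  # learning rate of the late layers, as in the post

# Images similar to ImageNet: early/middle groups get 100x and 10x smaller rates.
similar = [lr / 100, lr / 10, lr]
# Images unlike ImageNet: early/middle groups get only 9x and 3x smaller rates.
different = [lr / 9, lr / 3, lr]

for name, rates in [("similar", similar), ("different", different)]:
    print(name, ["%.2e" % r for r in rates])

# How much harder the early and middle layers are pushed in the second setup.
early_ratio = different[0] / similar[0]
middle_ratio = different[1] / similar[1]
print("early x%.1f, middle x%.1f" % (early_ratio, middle_ratio))
```

With these numbers the early layers train roughly eleven times faster, and the middle layers roughly three times faster, than in the dogs/cats setup.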

The gist is that the more your images differ qualitatively from those used for pretraining, the higher the learning rates you should be willing to try on the earlier layers.

Let us know whether this helps

Pomo · February 12, 2018, 12:54am

Hi all. I had a couple of points of uncertainty after Lesson 2 that I would like to clear up before moving on to Lesson 3. (Where they may well be answered.)

I understand that data augmentation takes each training image and applies a random visual transformation to it before using it to train. When augmentation is specified in get_data and precompute is true, does fit() then automagically refrain from applying augmentation and rather precompute activations using the original images? Or is augmentation applied to each training image once to precompute the activations used to train the last layer? The former makes more sense.
The pre-trained resnext in its last layers takes a large number of activations (features) and maps them to a thousand category activations, which are in turn scaled by softmax into a probability distribution across the categories. As I understand it so far.

Our dog breed classification problem starts with resnext, freezes most of it, and classifies images across 120 categories. I see this process described variously as adding a layer, as retraining the last layer, and as retraining the penultimate layer.

What exactly is happening here? Are we training a new layer that takes the thousand ImageNet category raw activations and reduces them to 120 breeds, followed by softmax? Or are we replacing resnext’s one thousand category (outputs) layer with one that maps the same incoming activations down to 120 breed categories, and applies softmax? The latter, I hope, otherwise I am quite confused.

Thanks for clarifying!

wespiser · February 12, 2018, 4:06am

No, I don’t think this is overfitting, which would show up as training set loss continuing to fall while validation set loss rises. The mental “model” for overfitting is that the model learns the examples in the training set to the detriment of generalization. Hope this helps! Adam

nikhil.ikhar · February 12, 2018, 10:16am

Thanks, @wespiser. This is a new perspective. My mental model was that if validation loss is greater than training loss, it can be a case of overfitting. I was not sure how much greater validation loss should be before calling it overfitting.

Now I have got my notebook working. It seems to be correct apart from printing images for the wrong classifications.
One of my own doubts was why y is a 1D array. When we read data from a csv, it gives us a 2D array. When we read from the file system, we get a 1D array.
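
For readers puzzled by the same thing, the two label layouts carry the same information. The NumPy sketch below uses made-up labels and is not the library’s loading code; it only shows how a 1D array of class indices relates to a 2D one-hot array.

```python
import numpy as np

# Hypothetical labels for 4 images over 3 classes, stored as class indices (1D).
y_idx = np.array([2, 0, 1, 2])

# The same labels as a 2D one-hot array: one row per image, one column per class.
n_classes = 3
y_onehot = np.eye(n_classes, dtype=int)[y_idx]
print(y_onehot.shape)  # (4, 3)

# argmax along the class axis recovers the 1D form.
recovered = y_onehot.argmax(axis=1)
```

The 2D layout is the one that can also express several labels per image, which is presumably why it shows up for multi-label data.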

This small project helped me to learn a lot.

Edit: Another thing to mention is that, using fastai, I got an accuracy of 0.986533717. In the comments, they mention accuracies from 88% to 98%.

Thanks.

leehagoodjames · February 12, 2018, 9:34pm

I am listening to the explanation of the learning-rate-finder, and I wonder if there is a good theory on how the magnitude of the gradient changes across its domain?

Given that this learning-rate-finder starts with such a high loss, it seems that we are moving through the domain of the gradient as we vary the learning rate. Therefore it would seem that this program would only have validity if the overall magnitude of the gradient was relatively constant throughout its very large space/domain. Is there any evidence that suggests that the magnitude of the gradient is anywhere close to constant throughout its domain?

I have drawn a picture to try to explain my question, using a loss function that varies in only one dimension. My question is: do loss functions ever have shapes like this, such that the magnitude of the gradient varies across the domain?

wespiser · February 13, 2018, 6:33am

We are not moving through the domain of the gradient in the lr_find() method; we are moving through loss space.
The lr_find() method’s domain, or input, is a learning rate, and its range, or output, is a loss, calculated using the neural network and a few epochs. The idea is that we just want to find a range of useful learning rates to plug into our cyclical learning rate algorithm.

leehagoodjames · February 13, 2018, 11:43pm

As it was described and to my understanding, the weights of the network are not reset as the learning rate is incrementally increased. Therefore, the network’s weights and biases are different for each chosen learning rate. Since the gradient is a function of the weights and biases (its inputs/domain), changing the weights and biases inherently moves through the space/domain of the gradient, because the weights/biases at the start of each learning rate increment are different. (Unless they are reset each time the learning rate increases)

wespiser · February 14, 2018, 3:33am

So here’s the paper: https://arxiv.org/pdf/1506.01186.pdf

The important equation is
Weight[t + 1] = Weight[t] - learning_rate * gradient(loss)

And section 3.3:

There is a simple way to estimate reasonable minimum and maximum boundary values with one training run of the network for a few epochs. It is a “LR range test”; run your model for several epochs while letting the learning rate increase linearly between low and high LR values. This test is enormously valuable whenever you are facing a new architecture or dataset.

So, to answer your question, this measure is a heuristic and doesn’t have any theoretical basis, not yet at least. No one really knows what the shape of the loss function is. There’s some evidence that most deep neural networks have one global minimum, even though they live in an incredibly high-dimensional space.

I would also look up Stochastic Gradient Descent; it’s a pretty simple algorithm, and we lack a satisfying answer for why it works so well for convolutional neural networks!
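
To make the update equation and the “LR range test” quoted above concrete, here is a toy sketch in plain Python. It assumes a one-parameter loss, loss(w) = w**2, and uses made-up bounds; it is not fastai’s lr_find(), and real loss curves are much noisier.

```python
def lr_range_test(low_lr, high_lr, n_steps, w0=5.0):
    """Sweep the learning rate linearly while training on loss(w) = w**2."""
    w = w0
    lrs, losses = [], []
    for i in range(n_steps):
        # Linear interpolation between low_lr and high_lr.
        lr = low_lr + (high_lr - low_lr) * i / (n_steps - 1)
        loss = w ** 2
        grad = 2 * w
        lrs.append(lr)
        losses.append(loss)
        # Plain gradient descent update; the weight is NOT reset between steps.
        w = w - lr * grad
    return lrs, losses

lrs, losses = lr_range_test(0.01, 3.0, 100)
best = min(range(len(losses)), key=losses.__getitem__)
print("lowest loss at lr = %.3f" % lrs[best])
```

Because the weight carries over from one step to the next, each learning rate is tried from a different starting point, which is the concern raised earlier in the thread. On this toy loss the recorded values fall while the rate is small and blow up once it gets too large, and that turning region is what the test is meant to locate.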

balnazzar · February 14, 2018, 2:35pm

Mh, but that’s just the plain gradient descent update equation

wespiser · February 15, 2018, 2:21am

Well, it looks like the learning rate is scaling linearly with iteration, or at least I think that’s how this undocumented callback is working:

https://github.com/fastai/fastai/blob/541ef2f2f91838b395b4b3d1029a545918c51831/fastai/sgdr.py#L83


        self.linear = linear
        ratio = end_lr/layer_opt.lr
        self.lr_mult = (ratio/nb) if linear else ratio**(1/nb)
        super().__init__(layer_opt)

    def on_train_begin(self):
        super().on_train_begin()
        self.best=1e9

    def calc_lr(self, init_lrs):
        mult = self.lr_mult*self.iteration if self.linear else self.lr_mult**self.iteration
        return init_lrs * mult

    def on_batch_end(self, loss):
        if math.isnan(loss) or loss>self.best*4:
            return True
        if (loss<self.best and self.iteration>10): self.best=loss
        return super().on_batch_end(loss)

    def plot(self, n_skip=10):
        plt.ylabel("loss")
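
The multiplier logic in this excerpt can be replayed outside the callback. The standalone function below is a sketch that copies the two formulas from the excerpt; the function name and the example settings (1e-5 to 10 over 100 batches) are hypothetical.

```python
def lr_at(init_lr, end_lr, nb, iteration, linear=False):
    """Learning rate after `iteration` batches, using the excerpt's multiplier rule."""
    ratio = end_lr / init_lr
    lr_mult = (ratio / nb) if linear else ratio ** (1 / nb)
    mult = lr_mult * iteration if linear else lr_mult ** iteration
    return init_lr * mult

# Hypothetical settings: sweep from 1e-5 to 10 over 100 batches.
for it in (1, 50, 100):
    exponential = lr_at(1e-5, 10, 100, it)
    linear = lr_at(1e-5, 10, 100, it, linear=True)
    print(it, exponential, linear)
```

Both branches reach end_lr at the last batch, but they get there differently: the linear branch adds a fixed amount per batch, while the other branch multiplies by a fixed factor per batch. Which one lr_find() uses depends on how linear is set, which the excerpt does not show.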

prusswan · February 15, 2018, 3:47pm

Trying to run the code in lesson2-image_models.ipynb just to see how it works, but it appears to be very slow

At sz=128,

learn.unfreeze()
learn.fit(lrs, 3, cycle_len=1, cycle_mult=2)

has these time estimates [1:48:10<10:49:01, 6490.32s/it], so this part is expected to finish in 10+ hours.

What else is missing? Also, there is hardly any GPU load while running this code, is it possible that the notebook kernel is not using the GPU correctly?

Edit: Solved by restarting the kernel, now it completes within minutes.

balnazzar · February 15, 2018, 4:38pm

Something is definitely not right, but in the absence of further information, I can hardly offer a cogent opinion.

balnazzar · February 15, 2018, 4:38pm

I think so, as well.

prusswan · February 15, 2018, 5:36pm

I restarted the kernel and it is much faster now. But the cause of the problem (ConvLearner running very slowly or not using the GPU properly) is still unknown.

balnazzar · February 15, 2018, 8:21pm

Try asking torch whether it sees the GPU, e.g.:

import torch

# The last three calls are expected to raise an error if no CUDA device is visible.
print(torch.cuda.is_available(), torch.backends.cudnn.enabled,
      torch.cuda.device_count(),
      torch.cuda.current_device(),
      torch.cuda.get_device_capability(0),
      torch.cuda.get_device_name(0))

jk23541 · February 23, 2018, 8:16am

At around the 1 hour 30 min mark of the video, when it was talking about making histograms of the heights and widths of the images, I saw this: row_sz, col_sz = list(zip(*size_d.values()))

Can someone explain what is happening here and how 2 different values are able to be assigned?
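
A small standalone example may help with the zip(*...) line. The dictionary below is a hypothetical stand-in for size_d, assuming it maps each file name to a pair of sizes.

```python
# Hypothetical stand-in for size_d: file name -> pair of image sizes.
size_d = {"a.jpg": (500, 375), "b.jpg": (333, 500), "c.jpg": (480, 640)}

# size_d.values() yields the pairs; * passes them to zip as separate
# arguments, and zip regroups them by position (all firsts, all seconds).
row_sz, col_sz = list(zip(*size_d.values()))

print(row_sz)  # (500, 333, 480)
print(col_sz)  # (375, 500, 640)
```

The list has exactly two elements, one tuple per position, so Python can unpack it into the two names on the left.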

Also, in this line of code in the get_data function: return data if sz>300 else data.resize(340, 'tmp'), what is the 'tmp'?

And finally, how exactly does the accuracy(log_preds, y) function work?
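
On the accuracy question, a common way to compute accuracy from log-probabilities is to take the highest-scoring class per row and compare it with the true label. The NumPy sketch below shows that idea with made-up numbers; it is not the library’s implementation, and the function name is hypothetical.

```python
import numpy as np

def accuracy_sketch(log_preds, y):
    """Fraction of rows whose highest-scoring class equals the true label."""
    preds = np.argmax(log_preds, axis=1)
    return (preds == y).mean()

# Hypothetical log-probabilities for 4 images over 3 classes, plus true labels.
log_preds = np.log(np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.5, 0.4, 0.1],
]))
y = np.array([0, 1, 2, 1])
print(accuracy_sketch(log_preds, y))  # 0.75
```

Since the logarithm is monotonic, taking argmax of the log-probabilities picks the same class as taking argmax of the probabilities.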

pkanade · February 26, 2018, 4:49pm

Revisiting the question I asked and your reply, I now understand what you were trying to convey.

paultimothymooney · February 27, 2018, 4:39am

When running the Lesson 2 notebook on Crestle, I got the following error: “FileNotFoundError: test-jpg folder doesn’t exist or is empty” after running the following cell: “data = get_data(256)”. I fixed this error by adding the following code into the 4th cell near the top of the script: “!ln -s /datasets/kaggle/planet-understanding-the-amazon-from-space/test-jpg {PATH}”.

Hopefully this is helpful in case anyone else is having the same issue.