Lesson 1: my experiments with resnet34 and some questions

balnazzar · February 3, 2018, 3:28am

I posted what follows in “Lesson 1 discussion”, but that thread seems to be a bit dead, so I thought it worthwhile to open a new thread.

If this infringes any rule, please let me know and remove the present thread. Thanks.

I’m playing with the resnet34 and, like Jeremy suggested, running some experiments on various datasets.

Now, I was feeding it Kaggle’s monkey species dataset, which is multi-class. As you may see from the shape, the pictures are pretty large (== feature rich), although they are rescaled to 224. There is no test set, just a validation set.

[Image: monkey]

Let’s see what happens here:

[Image: monkeys1]

Note that, over 15 epochs:

at the last epoch, the model still underfits (trn loss is still decreasing). Is that normal??
we attain tremendous accuracy almost immediately.
the accuracy stops improving (hard to improve over 1) and yet the validation loss is still decreasing at the last epoch. Why’s that??

But the question I’d really like you to answer regards the learning rate. Let’s run the finder:

[Image: lrate]

As you may see, the loss is still plummeting at the very right of the graph, suggesting we should test even larger LRs.

I’d like to know:

How does the fastai library select the interval over which it tests the learning rate vs. loss? How can I change the interval?
What does the n_skip parameter do?
No matter the graph above, LRs over 10^-1 do seem unreasonable. What do you think about this?
Last but absolutely not least, the finder's loss in the range of ~2.x is not congruous with the losses seen during training.

Thanks a lot.

jeremy · February 4, 2018, 3:39am

The ‘discussion’ threads were the in-class discussions during the in-person course. So yup they’re pretty dead! The ‘wiki’ threads are the more active ones now. Having said that, there’s no rules, and it’s really up to you to decide whether you want to start a new thread, or piggyback on an existing one. In this case, you’ve got a lot of thoughts and questions, so it seems rather “thread worthy” to me…

(Very minor: It might be better to use numbered lists with lots of questions so we can refer to them more easily.) I’ll refer to these in order:

1. Yup 15 epochs isn’t that many - it depends on your dataset size, amount of dropout, batch size, and more as to how many epochs you’ll need to get your best fit
2. These images are similar to what you’ll find in imagenet, so it’s well tuned for that. I’m surprised you get 100% however. Is there a chance your validation and test sets overlap?
3. The loss is cross-entropy loss. As you’ll learn later in the course, more confident predictions, if correct, get better scores. So your model is becoming more confident, whilst maintaining accuracy
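
As a minimal sketch of that last point (the probabilities below are made up for illustration, and the helper names are not from fastai), two sets of predictions can have identical accuracy while the more confident one has a lower cross-entropy loss:

```python
import numpy as np

def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the true class."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels])))

def accuracy(probs, labels):
    """Fraction of rows whose highest-probability class is the true class."""
    return float(np.mean(np.argmax(probs, axis=1) == np.asarray(labels)))

labels = [0, 1, 2]
# Both prediction sets pick the right class every time (made-up numbers).
hesitant = [[0.5, 0.3, 0.2], [0.2, 0.6, 0.2], [0.3, 0.2, 0.5]]
confident = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.02, 0.03, 0.95]]

for name, p in [("hesitant", hesitant), ("confident", confident)]:
    print(name, "accuracy:", accuracy(p, labels), "loss:", round(cross_entropy(p, labels), 3))
```

Accuracy only looks at which class wins, so it cannot improve past 1, whereas the loss keeps falling as the winning probability moves closer to 1.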

1. Happy to answer. But probably more useful for you for me to help you answer your own question. Do you remember how to view the source of a function? Try using it to dig into lr_find (and the functions it calls) to see if you can find the right source code. Or (maybe even better) try using the python debugger inside the notebook. Once you’ve found the right spot, I think you’ll be able to answer this yourself…
2. See (1)
3. LRs over 0.1 are totally possible. Depending on the optimizer, they can be up to 10 or more!
4. Based on the shape, I think you have a small dataset. Which means it ran very few batches, so didn’t train much. The LR finder doesn’t work great on small datasets (at least, as written - it could easily be improved to run more epochs, and PRs would be welcome to implement this. It would be an excellent first fastai contribution project and I’d be happy to help). The issue is that it doesn’t run enough batches to be very helpful. Try making the batch size 8 or so to make it a bit better

(Once you’ve got more batches going through, the loss will have more of a chance to come down to the levels you’re seeing in your training)

balnazzar · February 4, 2018, 9:35pm

@Jeremy, I imagine you are a pretty busy guy, with a lot of stuff to boil down every day. It is truly wonderful that you find time to assist every one of us, with our noob questions, for free, and in detail.
I think I’m speaking for everyone here.

That said, coming to us… The parts of your answers I don’t address, I consider pretty clear.

Copy that

I admit I didn’t investigate that directly, but I’m inclined to say no, since it’s an official Kaggle dataset. However, they could very well have had them overlapping, just to see how one manages such an issue

Right. It will be instructive. I used ?? every now and then (I should do it more often), but I never used the integrated debugger.
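
For reference, a rough sketch of the same kind of source digging outside the notebook's ?? shorthand, using the standard library's inspect module. Here json.dumps is only a stand-in target; in the course notebook the target would presumably be learn.lr_find, assuming learn is the learner object.

```python
import inspect
import json

# Stand-in target: any pure-Python function or method works here.
source = inspect.getsource(json.dumps)
print(source[:300])                        # first few lines of the definition
print(inspect.getsourcefile(json.dumps))   # file the function was defined in

# To step through code interactively, the standard debugger can be
# started at any point with:
#     import pdb; pdb.set_trace()
```

From the printed source one can then repeat the call on each function it invokes, which is the "dig into the functions it calls" step.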

The train set has just 1098 images, over 10 categories, though… I selected that dataset on purpose: I wanted to discover how such a refined model would perform on a small dataset with just ~100 training examples per category.
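
To put numbers on "very few batches", here is a small arithmetic sketch. It assumes the last partial batch is kept, and the batch size of 64 is only an illustrative larger value, not something stated in this thread.

```python
import math

def batches_per_epoch(n_images, batch_size):
    """Number of batches in one pass, keeping the last partial batch."""
    return math.ceil(n_images / batch_size)

n_train = 1098  # size of the training set mentioned above
for bs in (64, 16, 8):
    print("batch size", bs, "->", batches_per_epoch(n_train, bs), "batches per epoch")
```

Since the finder takes one learning-rate step per batch, a smaller batch size gives it more points along the curve within a single epoch.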

Mh, good to know.

I had hoped to be able to make contributions, no matter how small, once (and if) I had attained more maturity, both in DL and in python development.
But if you (or other senior fastai-ers) say that you would guide me, I’d be more than happy!

Copy that too.

Let me think a bit about it, I’ll come back to this thread once I encounter problems (which I almost surely will).

Sorry if my counter-reply was delayed by 18h. I had a very messy Sunday, and if I’m not mistaken, there are 9 hours between the west coast and central Europe timezones.
Again, thanks.

jeremy · February 5, 2018, 4:03am

My suggestion: don’t delay, get started now. You’ll learn a lot from the process, as long as you don’t mind fighting your way through plenty of challenges along the way…

balnazzar · February 5, 2018, 9:19am

I already started. Thanks!

@jeremy:
n_skip is crystal clear now. Furthermore, your advice of selecting a smaller batch size (16) worked. Indeed:

[Image: Capturej]

Note that one has to select values around 10^-2 or a bit smaller to obtain best results. If you select 10^-1 the SGD does not converge at all.

Why’s that? After all, with 10^-1 we chose an LR where the loss is still rapidly decreasing.

Coming to the main issue: Improving the LR finder.

I’m looking at how lr_find and LR_Finder are intermingled. sgdr is very interesting, but quite complex, with all those callbacks and classes passing themselves all around. Let us leave LR_Finder alone for a moment, and focus on getting the finder run more epochs in the simplest manner.

A rather dull method could be the following: add one more parameter and a for loop inside lr_find, for example:

    def lr_find(self, start_lr=1e-5, end_lr=10, wds=None, linear=False, nepochs=1):
        self.save('tmp')
        for i in np.arange(nepochs):
            layer_opt = self.get_layer_opt(start_lr, wds)
            self.sched = LR_Finder(layer_opt, len(self.data.trn_dl), end_lr, linear=linear)
            self.fit_gen(self.model, self.data, layer_opt, 1)
        self.load('tmp')

This is obviously wrong, since it doesn’t run more epochs: it just starts from the beginning every time.

So I decided to cheat. To which function did we pass the number of epochs before? To fit(). Thus, I went looking at how fit() does more epochs at once, to use it as a model.

It turns out that the argument “epochs” corresponds to the n_cycle parameter. The only place inside fit() where it gets used is when it calls fit_gen(). Plus, fit_gen() is called by lr_find() immediately after LR_Finder. I thought I nailed it, but I was wrong.

Now, fit_gen() is quite complex in itself (and, er…, not much commented…), but I think you get it to do N epochs by specifying n_cycle=N. At least, we do that when we call fit().

So, I did this:

    def lr_find(self, start_lr=1e-5, end_lr=10, wds=None, linear=False, nepochs=1):
        self.save('tmp')
        layer_opt = self.get_layer_opt(start_lr, wds)
        self.sched = LR_Finder(layer_opt, len(self.data.trn_dl), end_lr, linear=linear)
        self.fit_gen(self.model, self.data, layer_opt, nepochs)
        self.load('tmp')

I just passed n_cycle, which you fixed at 1, as a new argument, nepochs.

But that darn contraption does hang immediately after completing the first epoch:

I’m having difficulties in making sense of that behaviour. I mean, both fit() and lr_find now call fit_gen() with the same arguments, that is:

self.model, self.data, layer_opt, n_cycle

Why does the latter hang whilst the former doesn’t? I’m flabbergasted.

I think I need one hint or two.

balnazzar · February 12, 2018, 8:57pm

@jeremy

Ok, I think I did it.

Let me run some additional tests, then I’ll post my patched code, and if you find time to revise it and it’s correct, you could perhaps incorporate it.

Thanks for having encouraged me, it was instructive

jeremy · February 13, 2018, 1:33am

Awesome! Feel free to add comments to the code as appropriate based on whatever bits you found most necessary. I haven’t checked the code, but I’m guessing the issue was here:

self.sched = LR_Finder(layer_opt, len(self.data.trn_dl), end_lr, linear=linear)

That second param presumably needs to be multiplied by # epochs?
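
A rough sketch of why that multiplication matters, assuming the finder raises the learning rate by a constant factor per batch so that it reaches end_lr after the given number of iterations. The function below and the batch count are hypothetical illustrations, not fastai's actual code.

```python
def lr_schedule(start_lr, end_lr, n_iters):
    """Exponentially spaced learning rates, one per batch, from start_lr to end_lr."""
    mult = (end_lr / start_lr) ** (1 / n_iters)
    return [start_lr * mult ** i for i in range(n_iters + 1)]

batches = 69  # hypothetical number of batches in one epoch

# Schedule sized for one epoch: end_lr is reached when the first epoch ends.
one_epoch = lr_schedule(1e-5, 10, batches)
# Schedule sized for three epochs: after one epoch the LR is still small.
three_epochs = lr_schedule(1e-5, 10, 3 * batches)

print("LR at end of epoch 1, schedule sized for 1 epoch: ", one_epoch[batches])
print("LR at end of epoch 1, schedule sized for 3 epochs:", three_epochs[batches])
```

If the schedule length is left at one epoch while training is asked to run longer, the sweep is already finished by the end of the first epoch, so the extra epochs have nothing left to explore.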

balnazzar · February 13, 2018, 4:18pm

@jeremy: yes, it is the easiest way to do it, though I (guiltily) needed a while to figure it out

One could patch the code (and comments) as follows:

    def lr_find(self, start_lr=1e-5, end_lr=10, wds=None, linear=False, run_for=1):
        """Helps you find an optimal learning rate for a model.

        It uses the technique developed in the 2015 paper
        `Cyclical Learning Rates for Training Neural Networks`, where
        we simply keep increasing the learning rate from a very small value,
        until the loss starts decreasing.

        Args:
            start_lr (float/numpy array) : Passing in a numpy array allows you
                to specify learning rates for a learner's layer_groups
            end_lr (float) : The maximum learning rate to try.
            wds (iterable/float)
            run_for (int) : The number of cycles we want to run the finder over.

        Examples:
            As training moves us closer to the optimal weights for a model,
            the optimal learning rate will be smaller. We can take advantage of
            that knowledge and provide lr_find() with a starting learning rate
            1000x smaller than the model's current learning rate as such:

            >> learn.lr_find(lr/1000)

            >> lrs = np.array([ 1e-4, 1e-3, 1e-2 ])
            >> learn.lr_find(lrs / 1000)

        Notes:
            lr_find() may finish before going through each batch of examples if
            the loss decreases enough.

        .. _Cyclical Learning Rates for Training Neural Networks:
            http://arxiv.org/abs/1506.01186

        """
        self.save('tmp')
        layer_opt = self.get_layer_opt(start_lr, wds)
        self.sched = LR_Finder(layer_opt, run_for*len(self.data.trn_dl), end_lr, linear=linear)
        self.fit_gen(self.model, self.data, layer_opt, run_for)
        self.load('tmp')

A further param has been added: run_for (in order to avoid confusion with other stuff, I deliberately avoided naming it epochs, cycles, etc).

The docstrings have been edited accordingly.

It basically works, but the red bar is still displayed, which it shouldn’t be:

Indeed, it is my understanding that the red bar (dangerbar in tqdm parlance) indicates an error. But we have no errors: the finder quits just because an established condition about loss has been successfully met.

By the way, note that just half an epoch more allowed it to reach a satisfactory loss/accuracy, in contrast with the first epoch, where they were both disastrous and useless for LR discovery purposes.

One solution could simply be to patch the part of the code where one calls tqdm to display a dangerbar.

But I was wondering if we could do more extensive work. Before running amok and proposing fancy modifications to the code, I’d like to hear your opinion.

lr_find (which belongs to learner.py) calls LR_Finder and then fit_gen (defined in sgdr and learner respectively).

fit_gen, in turn, calls fit. It’s not the learner’s fit, but the model’s one. The part we are interested in is the following:

[Image: fit()]

After each batch, fit() either returns or keeps going, depending on the answer it gets from cb.on_batch_end(debias_loss):

[Image: on_batch_end()]

Now, the second if is triggered when it discovers a learning rate which produces a loss lower than the desired threshold, and that’s that and it’s ok.

The first if is the one that causes cb.on_batch_end(debias_loss) to return (without our modification above) even if nothing good has been discovered.
It should not always return True.
If one specifies run_for = N, N > 1, it should keep running until a certain condition regarding the loss is reached, or until the specified number of epochs has been run.
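
A minimal sketch of the stopping rule described above, written as a standalone callback. The FinderStop class, its blowup parameter and the run helper are hypothetical names for illustration, not fastai's LR_Finder, and the loss condition used here (stop when the loss is NaN or exceeds a multiple of the best loss so far) is an assumption.

```python
import math

class FinderStop:
    """Hypothetical stand-in for an LR-finder callback; not fastai's LR_Finder."""

    def __init__(self, batches_per_epoch, run_for=1, blowup=4.0):
        self.max_iters = run_for * batches_per_epoch  # total batch budget
        self.blowup = blowup                          # assumed loss condition
        self.iteration = 0
        self.best = math.inf

    def on_batch_end(self, loss):
        """Return True when the caller should stop training."""
        self.iteration += 1
        if math.isnan(loss) or loss > self.blowup * self.best:
            return True  # loss condition met
        self.best = min(self.best, loss)
        return self.iteration >= self.max_iters  # batch budget used up

def run(cb, losses):
    """Feed losses to the callback; return how many batches ran before it stopped."""
    for n, loss in enumerate(losses, start=1):
        if cb.on_batch_end(loss):
            return n
    return len(losses)

print(run(FinderStop(3, run_for=1), [2.0] * 10))           # budget of one epoch
print(run(FinderStop(3, run_for=2), [2.0] * 10))           # budget of two epochs
print(run(FinderStop(3, run_for=2), [2.0, 1.0, 5.0, 1.0])) # loss condition fires first
```

With the budget expressed as run_for times the batches per epoch, the end of an epoch is no longer a stop signal by itself; only the loss condition or the full budget ends the run.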

What do you think about all this atrocious babbling of mine?