Hey guys. I finally got a learner up and running on a structured dataset I’ve been working on. When I run lr_find, it shows only a tiny window where the loss is improving before flatlining. Since this seems to imply there is only a very small range of effective learning rates, I wanted to see if anyone had advice on the implications of using cyclical learning rates in this situation. See the attached code screenshot. Happy to provide more details or code as needed. Thanks in advance for any tips!
The learning rate plot does not show an increase in the loss at higher learning rates.
You could try extending the range, for example up to 1000:
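In the old fastai API that would be something like `m.lr_find(start_lr=1e-5, end_lr=1000)` (argument names from memory, so double-check against your version). Under the hood the finder just sweeps the learning rate exponentially from the start value to the end value while recording the loss at each step; a minimal self-contained sketch of that sweep:

```python
def lr_sweep(start_lr, end_lr, num_steps):
    """Exponentially spaced learning rates from start_lr up to end_lr --
    the sequence the LR finder steps through while recording the loss."""
    ratio = end_lr / start_lr
    return [start_lr * ratio ** (i / (num_steps - 1)) for i in range(num_steps)]

# Extending end_lr to 1000 just stretches the top of the sweep:
lrs = lr_sweep(1e-5, 1000, num_steps=100)
```

If the loss is still flat at the top of a sweep like this, the plot simply hasn’t reached the divergence point yet.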
Based on the plot you’ve shown, you should be able to use a much higher learning rate.
Or you could simply try out a higher learning rate (and post the results)?
Maybe somebody else has some suggestions?
Ok great, I’ll try that out and post results @MicPie
Also, I have tried out a large variety of batch sizes, and the only one that works seems to be a batch size of 32.
I’m wondering if this is because my data is sorted going into the model data loader, so there isn’t enough variety in a given mini-batch, and simply by luck a batch size of 32 produces more variety throughout the epoch. What’s the best way to shuffle the data in structured datasets? I know I’ve seen that feature in some of the image data code, but I’m not seeing it as an argument in structured.py.
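I don’t recall a shuffle argument in structured.py either; one workaround is to permute the rows yourself before building the model data. A plain-Python sketch of the idea (`rows` here is a stand-in for your sorted dataframe; with pandas it would just be `df.iloc[idx]`):

```python
import random

def shuffled_indices(n, seed=42):
    """Return a random permutation of row indices so each mini-batch
    mixes rows from across the (originally sorted) dataset."""
    idx = list(range(n))
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(idx)
    return idx

rows = [f"row{i}" for i in range(8)]   # stand-in for a sorted dataframe
idx = shuffled_indices(len(rows))
shuffled = [rows[i] for i in idx]
```

Doing the permutation once, up front, means every batch size should see a representative mix rather than a sorted slice.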
Hmmm. Well, after extending the LR range to 1000 the curve is still flat. I’ll try 10,000 and report back. It definitely seems like either something is wrong or the window of effective learning rates is extremely small.
Ok, so instead of investigating larger learning rates, I’m now zooming in on the curve itself from 1e-10 to 1e-4, which shows a more normal shape:
Notice that your loss is actually on the order of 1e9 or 1e8 (see the multiplier in the top-left corner of your chart). Maybe investigate why your loss is so huge to start with!
So I’ve had a breakthrough in my understanding of what is going on here, and a lot of it came down to my own lack of understanding of how lr_find() works. My first mistake was using the default parameters, because I didn’t realize that the lr_finder’s sweep defines a SCHEDULE of learning rates rather than just picking the largest possible learning rate, as described in the video. It was clear to me that I should pick the largest learning rate where the loss is still decreasing; what was unclear was that the rest of the curve would be used as the schedule for decreasing the learning rate under CLR. So when I zoomed in to the range 1e-7 to 1e-4, I started to get much better and more consistent results than with the default sweep, if that makes sense.
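For anyone following along, here is a toy version of a triangular cyclical schedule (my own sketch, not the library’s implementation), where `base_lr` and `max_lr` are the two ends you read off the lr_find curve:

```python
def triangular_clr(step, step_size, base_lr, max_lr):
    """Triangular cyclical LR: linearly up from base_lr to max_lr over
    step_size steps, then linearly back down, repeating."""
    cycle_pos = step % (2 * step_size)
    if cycle_pos < step_size:
        frac = cycle_pos / step_size           # rising half of the cycle
    else:
        frac = 2 - cycle_pos / step_size       # falling half
    return base_lr + (max_lr - base_lr) * frac

# One full cycle between the two ends picked off the lr_find curve:
lrs = [triangular_clr(s, step_size=10, base_lr=1e-7, max_lr=1e-4) for s in range(20)]
```

The narrower you set the `base_lr`/`max_lr` band, the more the cycle stays inside the effective window, which matches why zooming the sweep in helped.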
With respect to the size of the loss, it is consistent with the scale of my target variable, which runs from roughly negative tens of millions to positive tens of millions. As far as I could tell there wasn’t an advantage to scaling the target, just the inputs, but perhaps I have misunderstood?
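Scaling the target to zero mean and unit variance doesn’t change what the model can learn, but it rescales the loss into a readable range (instead of 1e8–1e9) and you can always undo it at prediction time. A sketch in plain Python (my own hypothetical helpers, not fastai API):

```python
def standardize(ys):
    """Scale targets to zero mean, unit variance; also return the
    (mean, std) pair needed to undo the transform on predictions."""
    mean = sum(ys) / len(ys)
    var = sum((y - mean) ** 2 for y in ys) / len(ys)
    std = var ** 0.5 or 1.0        # guard against a constant target
    return [(y - mean) / std for y in ys], mean, std

def unstandardize(ys_scaled, mean, std):
    """Map scaled predictions back to the original target scale."""
    return [y * std + mean for y in ys_scaled]

ys = [-30_000_000, 5_000_000, 42_000_000]   # targets in the tens of millions
ys_s, mu, sd = standardize(ys)
```

With sklearn this is what `StandardScaler` does; the point is just that the loss magnitude follows the target scale.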
I will say that I’ve been working on a structured learner for a few months and the shape of your LR curve looks very familiar. I messed with cyclical learning rates and they didn’t screw anything up, but they also didn’t seem to help, so I have just been using static rates.
@Ralph I recommend altering your lr_find call to zoom in on the sharp part of the curve. It has made a big difference on my end and is now performing better than static learning rates. Just specify a beginning and an ending point like so:
m.lr_find(start_lr=1e-7, end_lr=1e-4)
Interesting that lr_find() had a sharp elbow at about 1.2e-5, but lr_find(start_lr=1e-8, end_lr=1e-4) seems to go horizontal at about 5e-7. I may experiment with some tiny-LR fine-tuning on my currently trained model to see if I can shave off any more error.