I am absolutely fascinated by the idea of the universality theorem as introduced by Jeremy. I intended to use it to count cars in a parking lot, without describing the concept of counting or what cars look like. To approach this incrementally I created a “unit test” first: just count rectangles with synthetic and homogeneous shapes. This works great: Notebook here.
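To make the “unit test” setup concrete, here is a minimal sketch of how such synthetic counting data could be generated; `make_rect_image` and all its parameters are made-up names for illustration, not the notebook's actual code:

```python
import numpy as np

def make_rect_image(n_rects, size=64, rect=8, seed=0):
    """Generate one synthetic grayscale image containing n_rects filled,
    homogeneous, axis-aligned rectangles, plus its count label.
    Hypothetical helper for illustration only."""
    rng = np.random.default_rng(seed)
    img = np.zeros((size, size), dtype=np.float32)
    for _ in range(n_rects):
        # pick a top-left corner so the rectangle fits inside the image
        y = int(rng.integers(0, size - rect))
        x = int(rng.integers(0, size - rect))
        img[y:y + rect, x:x + rect] = 1.0
    return img, n_rects

img, label = make_rect_image(3)
```

Note that randomly placed rectangles can overlap, so a real data generator would either reject overlapping placements or accept that the label counts placements rather than visually distinct shapes.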
But I also observe that the training loss is greater than the validation loss, basically all the time.
This seems to be a recurring theme. We have a couple of threads here that I read, and I also looked into Leslie Smith’s paper “A Disciplined Approach to Neural Network Hyper-Parameters”.
I still do not understand why this happens in my case. Details are in the Notebook referenced above, but the gist is in this plot as well:
How can the loss on examples the model was trained on be higher than the loss on previously unseen examples?
The two suggestions from the threads and Jeremy’s lectures are to (a) reduce the learning rate or (b) train longer.
I don’t understand how either suggestion applies to the situation at hand.
With respect to (a): I am using fit_one_cycle(), which should cap the maximum learning rate at something effective, and I already use a low learning rate anyway. The training and validation losses both flatten out and no longer show much variance.
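For readers less familiar with it, the point of fit_one_cycle() is that the learning rate never exceeds the max_lr you pass in. A simplified sketch of such a one-cycle schedule (linear warm-up, then cosine anneal; fastai's actual schedule differs in detail, and `one_cycle_lr` is my own illustrative function, not a fastai API):

```python
import math

def one_cycle_lr(step, total_steps, max_lr, pct_start=0.25, div=25.0):
    """Simplified one-cycle schedule: warm up linearly from max_lr/div
    to max_lr, then cosine-anneal back down. Sketch only."""
    warm = int(total_steps * pct_start)
    lo = max_lr / div
    if step < warm:
        # linear warm-up phase
        return lo + (max_lr - lo) * step / max(warm - 1, 1)
    # cosine annealing phase
    t = (step - warm) / max(total_steps - warm - 1, 1)
    return lo + (max_lr - lo) * (1 + math.cos(math.pi * t)) / 2

schedule = [one_cycle_lr(s, 100, 1e-3) for s in range(100)]
```

With this shape, the schedule peaks exactly at max_lr and spends most of its time well below it, which is why passing an already-low max_lr leaves little room to reduce the effective rate further.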
Also (b), training longer: I tried that and it does not change the picture. If anything changes, it is that the validation loss starts going up at some point.
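To be precise about “going up at some point”: past a certain epoch the validation loss bottoms out and then rises, which is the usual sign that more epochs only overfit. A tiny helper to locate that turning point (the function name and the toy loss numbers are mine, for illustration only):

```python
def best_epoch(valid_losses):
    """Return the 0-based epoch with the lowest validation loss;
    training beyond this point only increases validation loss."""
    return min(range(len(valid_losses)), key=valid_losses.__getitem__)

# toy validation-loss curve: improves, then turns back up
losses = [0.90, 0.55, 0.40, 0.35, 0.37, 0.42]
print(best_epoch(losses))  # -> 3
```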
Am I missing something here?