As Jeremy repeats in the lectures, I have data, I have an architecture, and I have a loss function.
However, after running quite a few iterations with different hyperparameters, I can’t improve my loss beyond a certain point.
Are there any heuristics to understand whether the problem is because there’s too little data to learn from, whether the architecture is limiting, or whether the data actually contains no useful information?
Phrased differently: if your loss stops improving, how do you know where to spend your time investigating?
That is a good question.
To start your investigation, I would recommend analyzing both your training and validation losses to determine whether you are underfitting or overfitting (Underfitting vs overfitting).
Then, once the diagnosis is done, you have 2 conceptual directions to explore:
- Underfitting -> the model needs more effective representational capacity (to simplify: more trainable parameters)
- Overfitting -> the model needs more regularization (https://www.deeplearningbook.org/contents/regularization.html)
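To make the diagnosis concrete, you can compare the recorded training and validation losses directly. Here is a minimal, illustrative sketch (the function name, threshold, and heuristic rules are my own assumptions, not a standard API):

```python
def diagnose(train_losses, valid_losses, gap_ratio=1.5):
    """Rough underfit/overfit heuristic based on final-epoch losses.

    gap_ratio is an assumed, tunable threshold: if validation loss is
    much higher than training loss, the model is likely overfitting.
    """
    t, v = train_losses[-1], valid_losses[-1]
    if v > gap_ratio * t:
        return "overfitting: validation loss much higher than training loss"
    if len(train_losses) < 2 or t > min(train_losses) * 1.05:
        return "inconclusive: training loss still high or unstable"
    return "possibly underfitting: both losses plateaued close together"

# Example: training loss keeps dropping while validation loss stalls
print(diagnose([1.0, 0.6, 0.3, 0.2], [1.0, 0.7, 0.6, 0.65]))
```

In practice you would eyeball the full curves rather than just the last values, but this captures the basic rule of thumb.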
I hope it helps.
@nwf This question reminded me of the “Fast/Good/Cheap, can pick only 2” .
Another dimension worth considering is "cost".
If acquiring more labeled data is expensive (in money or time) compared with experimenting with architectures, etc., the best heuristic is to try the least expensive option first…
Regarding the underfit/overfit problem, the model seems to start off with equal reductions in loss for both training and validation sets, and then after 8 epochs or so, begins to overfit to the training data. What I’ve found is that even the overfitting seems to cap out eventually. Also, thanks for the links, I’m reading through the chapter on regularization now.
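Since validation loss starts diverging after roughly 8 epochs, a simple mitigation is early stopping: track the best validation loss and stop once it has not improved for a few epochs. A minimal sketch, assuming you control the training loop (the function name and `patience` default are my own choices; fastai also provides callbacks for this):

```python
def early_stop_index(valid_losses, patience=3):
    """Return the epoch index at which training should stop:
    the point where validation loss has not improved for
    `patience` consecutive epochs."""
    best, bad = float("inf"), 0
    for i, v in enumerate(valid_losses):
        if v < best:
            best, bad = v, 0  # new best: reset the counter
        else:
            bad += 1
            if bad >= patience:
                return i
    return len(valid_losses) - 1  # never triggered: ran to the end

# Validation loss bottoms out at epoch 2, then climbs for 3 epochs
print(early_stop_index([1.0, 0.8, 0.7, 0.72, 0.75, 0.8, 0.9]))  # 5
```

You would then restore the model weights saved at the best-validation epoch rather than the final one.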
As for data, I’m always trying to add more data to the collection. The problem is that even doubling or tripling the items still leaves me with a comparatively small amount of data. I currently only have around 7,000 items.
@nwf Are you training from scratch or using Transfer Learning?
From scratch. The data is not in any format that commonly available pretrained networks have been made for. It’s a very specific deconstruction of music waveforms.
I’m using a modification of the language learner structure from lesson 10, basically an LSTM followed by a PoolingLinearClassifier. I experimented with adding dropouts of varying percentages, but I must have done something incorrect, because my BinaryCrossEntropy loss quickly jumped from around 0.55 to 15 and beyond.
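For reference, here is one conventional way to place dropout in an LSTM-plus-pooling classifier in plain PyTorch: between stacked LSTM layers and before the linear head, never inside the recurrence itself. This is a hypothetical sketch (the class name, dimensions, and concat-pooling layout are my own assumptions, not the exact fastai PoolingLinearClassifier):

```python
import torch
import torch.nn as nn

class LSTMPoolClassifier(nn.Module):
    """Sketch: LSTM encoder + concat pooling + linear head, with
    dropout applied between layers rather than on the raw inputs."""
    def __init__(self, in_dim, hid_dim, n_classes, p=0.3):
        super().__init__()
        # nn.LSTM's `dropout` arg applies between stacked layers only
        self.lstm = nn.LSTM(in_dim, hid_dim, num_layers=2,
                            batch_first=True, dropout=p)
        self.drop = nn.Dropout(p)  # dropout before the classifier head
        # concat pooling: last time step + max pool + mean pool over time
        self.head = nn.Linear(hid_dim * 3, n_classes)

    def forward(self, x):
        out, _ = self.lstm(x)            # (batch, seq, hid)
        last = out[:, -1]                # last time step
        mx = out.max(dim=1).values       # max over the time dimension
        mn = out.mean(dim=1)             # mean over the time dimension
        return self.head(self.drop(torch.cat([last, mx, mn], dim=1)))

model = LSTMPoolClassifier(in_dim=16, hid_dim=32, n_classes=2, p=0.3)
logits = model(torch.randn(4, 10, 16))
print(logits.shape)  # torch.Size([4, 2])
```

A sudden loss explosion like 0.55 to 15 usually points to a wiring bug (e.g. dropping out the logits themselves, or a shape mismatch) rather than to dropout being too aggressive, so it may be worth re-checking where the dropout layers were inserted.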