Some adventures with the MovieLens Mini Net, and a request for help.
Thanks for the clear lesson on embeddings and what happens under the hood. I am still astonished that machine learning can extract humanly meaningful patterns (embedding features) from data that seems unrelated to them. Having spent some time “feature engineering” for biology papers and stock trading, I find it truly remarkable that a computer can do this automatically, and perhaps better than an expert.
I decided to play around with the “Mini Net” from Lesson 5, and made some mistakes that could be instructive for us beginners. Jeremy’s output function is:

```python
return F.sigmoid(self.lin2(x)) * (max_rating - min_rating + 1) + min_rating - 0.5
```
This scrunches the range of outputs into (0, 5.5), 0.5 points above and below the range of actual ratings (0.5 to 5). Since Jeremy said this compression into the actual range makes it easier for the model to learn an output, I thought that making the task even easier might improve the results. The function above is centered on an input of zero, so why not shift that center to the middle of the ratings spread at 2.75, and add a scaling factor so that 0.5 and 5 map exactly to themselves? The final output would then correspond exactly to the actual ratings whenever the linear layer output the right value.
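Concretely, the modified output function I tried looked roughly like this (reconstructed from memory, so the exact constants in the notebook may differ):

```python
import math
import torch

min_rating, max_rating = 0.5, 5.0
lower, upper = min_rating - 0.5, max_rating + 0.5      # (0, 5.5), the same padded range as Jeremy's
center = (min_rating + max_rating) / 2                 # 2.75, middle of the ratings spread

# Choose the scale so that a raw activation of max_rating maps exactly to max_rating;
# by symmetry, min_rating then maps exactly to min_rating as well.
p = (max_rating - lower) / (upper - lower)             # 10/11
scale = math.log(p / (1 - p)) / (max_rating - center)  # ≈ 1.02

def shifted_output(x):
    # x is the raw output of the final linear layer (a tensor)
    return torch.sigmoid(scale * (x - center)) * (upper - lower) + lower
```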
So I tried it, and the error got worse. Of course! A linear layer specializes in learning the best shift and scale for the input to the sigmoid; doing it manually was just redundant. After playing around some more, I saw that the initial error was higher with the shifted sigmoid than with the one centered at zero. This makes sense if the default initialization already generates outputs centered around zero: shifting the sigmoid was starting the model at a worse place in parameter space.
Maybe the above is obvious, but I had to go through the experiments to “get it”.
Next, I tried varying the range of the sigmoid. Jeremy’s version allowed 0.5 points above and below the actual range. What if there is a better value for this padding of the range? It turns out that there is, I think, and the best value may even be negative. But after dozens of runs and comparisons, I realized I was caught up in the infamous “hyperparameter tuning” loop: there was no end to the experiments, and the whole process was starting to feel a bit obsessive. Yet… this padding value is merely a number k used in the model. Why can’t the model itself find an optimal value for k? Then I could sit back and watch while the GPU does the work I had been doing manually.
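Spelled out, the padding is just a number k replacing Jeremy’s hard-coded 0.5; a minimal sketch (the function name is mine):

```python
import torch

def padded_output(raw, k, min_rating=0.5, max_rating=5.0):
    # k = 0.5 reproduces Jeremy's formula; the output range is (min_rating - k, max_rating + k)
    return torch.sigmoid(raw) * (max_rating - min_rating + 2 * k) + min_rating - k
```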
So I tried to add k as a model parameter by reading docs and copying code examples, and was unsuccessful: k stays at its initial value. Would someone who is further along with fastai and Python please look at this Jupyter notebook and correct it? (My rough understanding of how it is supposed to work is sketched after the link.) Thanks!
https://gist.github.com/PomoML/f940ae18237552ce419293a9b774f23a
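As I understand it, wrapping the tensor in nn.Parameter and assigning it as a module attribute should register k in model.parameters(), so the optimizer updates it along with the weights. The sketch below simplifies the notebook’s model (layer sizes, dropout values, and names are placeholders), so treat it as illustrative rather than the actual gist code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingNetK(nn.Module):
    def __init__(self, n_users, n_movies, n_factors=50, n_hidden=10,
                 min_rating=0.5, max_rating=5.0):
        super().__init__()
        self.u = nn.Embedding(n_users, n_factors)
        self.m = nn.Embedding(n_movies, n_factors)
        self.lin1 = nn.Linear(n_factors * 2, n_hidden)
        self.lin2 = nn.Linear(n_hidden, 1)
        self.drop1 = nn.Dropout(0.05)
        self.drop2 = nn.Dropout(0.5)
        self.min_rating, self.max_rating = min_rating, max_rating
        # The key line: nn.Parameter registers k in model.parameters(),
        # so the optimizer learns it; a plain tensor attribute would stay fixed.
        self.k = nn.Parameter(torch.tensor(0.5))

    def forward(self, users, movies):
        x = self.drop1(torch.cat([self.u(users), self.m(movies)], dim=1))
        x = self.drop2(F.relu(self.lin1(x)))
        # Output range is (min_rating - k, max_rating + k), with k now learned
        return (torch.sigmoid(self.lin2(x))
                * (self.max_rating - self.min_rating + 2 * self.k)
                + self.min_rating - self.k)
```

If k is registered correctly, it should show up in `list(model.parameters())` and drift away from 0.5 after a few batches of training; that is the behavior I cannot get in my notebook.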
BTW, the notebook shows a method for running reproducible tests: the initial weights are saved once and reloaded before each experiment. I was stumped for a while by inconsistent results from the same parameters, until I saw that dropout uses the (pseudo)random number generator. Once the randomizer seed is set consistently, the same run yields the same result.
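The pattern looks roughly like this (a sketch, assuming `model` is the PyTorch module being trained; the file name and seed value are arbitrary):

```python
import torch

SEED = 42

# Save the freshly initialized weights once...
torch.save(model.state_dict(), 'init_weights.pth')

# ...then, before each experiment, restore the same starting point and reseed
# the RNG so that dropout masks (and any other random ops) repeat exactly.
model.load_state_dict(torch.load('init_weights.pth'))
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
```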
HTH someone, and looking forward to learning how to add parameter k.