I was following along with the lesson 6 SGD Jupyter notebook and wanted to implement Adam optimization in PyTorch. The problem is that I couldn’t make it converge with the default recommended parameters of beta1=0.9 and beta2=0.999. I used the same data distribution as the one Jeremy used for batch gradient descent. However, it did converge with beta1=0.7 and beta2=0.8.
Here’s the code I used.
Is it just that the model is too simple, while the default parameters work well for most neural networks? Or is it a bug in my implementation?
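For anyone comparing against their own version, here is a minimal sketch of the standard Adam update with bias correction on a one-variable linear fit. It is not the code from this post: it uses plain Python instead of PyTorch so the arithmetic is visible, and the function name `adam_linear_fit`, the learning rate of 0.1, the starting guess, the noise level, and the step count are all assumptions chosen for illustration.

```python
import math
import random


def adam_linear_fit(xs, ys, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Fit y ~ a*x + b with full-batch Adam on mean squared error."""
    params = [-1.0, 1.0]  # assumed starting guess for [a, b]
    m = [0.0, 0.0]        # running mean of the gradients
    v = [0.0, 0.0]        # running mean of the squared gradients
    n = len(xs)
    losses = []
    for t in range(1, steps + 1):  # t starts at 1 so the bias correction is defined
        errs = [params[0] * x + params[1] - y for x, y in zip(xs, ys)]
        losses.append(sum(e * e for e in errs) / n)
        # Gradients of the mean squared error with respect to a and b.
        grads = [
            2.0 * sum(e * x for e, x in zip(errs, xs)) / n,
            2.0 * sum(errs) / n,
        ]
        for i, g in enumerate(grads):
            m[i] = beta1 * m[i] + (1 - beta1) * g
            v[i] = beta2 * v[i] + (1 - beta2) * g * g
            # Bias correction: both averages start at zero and are too small early on.
            m_hat = m[i] / (1 - beta1 ** t)
            v_hat = v[i] / (1 - beta2 ** t)
            params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params, losses


# Synthetic data, assumed for illustration: y = 3x + 2 plus a little noise.
random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(100)]
ys = [3.0 * x + 2.0 + random.gauss(0, 0.05) for x in xs]

(a, b), losses = adam_linear_fit(xs, ys)
print(f"a={a:.3f} b={b:.3f} first loss={losses[0]:.4f} last loss={losses[-1]:.4f}")
```

Two things that may be worth checking against this sketch: that the step counter `t` starts at 1 and increases on every update, and that the squared gradient (not the gradient itself) goes into `v`. If the bias correction is left out, the very first update has magnitude `lr * 0.1 / sqrt(0.001)`, roughly 3.2 times `lr`, with the default betas, compared with roughly 0.67 times `lr` for beta1=0.7 and beta2=0.8.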