I was reading a paper comparing the Newton-Raphson (NR) algorithm with neural networks (NNs). The comparison itself is straightforward to follow, but I noticed that the authors only compared the forward-pass time of a trained NN with the execution time of the original NR solver. They justified excluding training time by saying it would be amortized over time. I agree with the idea of amortization (otherwise even the simplest MLP trained on the best available GPU would probably struggle to beat NR in total time), but I'm not sure that argument is strong enough on its own. Other factors I can think of at the moment include the type and number of hardware devices available, how well-optimized the ML library is, and (I admittedly know nothing about this) whether NR involving more variables can be computed "block-wise" or in parallel the way NN training can.
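To make the amortization argument concrete, here's a toy break-even calculation. All the timing numbers below are placeholders I made up, not figures from the paper:

```python
# Break-even point for amortizing NN training time.
# All numbers are hypothetical, for illustration only.
t_train = 3600.0   # one-off NN training time (s), assumed
t_forward = 1e-4   # NN forward-pass time per solve (s), assumed
t_nr = 1e-2        # cold-start NR time per solve (s), assumed

# The NN route wins once  t_train + n * t_forward < n * t_nr,
# i.e. after  n > t_train / (t_nr - t_forward)  solves.
n_breakeven = t_train / (t_nr - t_forward)
print(f"training amortized after ~{n_breakeven:.0f} solves")
```

Whether including training time matters then reduces to whether the expected number of solves over the model's lifetime is far above or below that break-even point.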
Now, let's assume the dataset represents the great majority of realistic physical events, i.e. we would see nearly identical training and validation loss throughout training, and once a model is trained, we can be confident that a forward pass on new (physically realistic) data will yield good predictions. To settle the question of final model accuracy, let's use the NN-predicted values as the initial conditions for another NR run, i.e. "hot-starting" NR, instead of random/default initial conditions, i.e. "cold-starting" NR, so that both methods, NN+NR (hot) and NR (cold), reach the same final mismatch level. Given these two assumptions, should we still include NN training time in the performance evaluation criteria? Why or why not? Thanks, everyone!
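For concreteness, here is a toy scalar sketch of what I mean by hot- vs cold-starting NR. The function, the tolerance, and the "NN prediction" are all made up for illustration; real power-flow-style NR is multivariate, but the iteration-count effect is the same idea:

```python
def newton(f, df, x0, tol=1e-10, max_iter=100):
    """Scalar Newton-Raphson; returns (root, iteration count)."""
    x = x0
    for i in range(1, max_iter + 1):
        step = f(x) / df(x)
        x -= step
        if abs(step) < tol:
            return x, i
    return x, max_iter

# Toy problem: solve x^2 - 2 = 0 (root is sqrt(2)).
f = lambda x: x**2 - 2.0
df = lambda x: 2.0 * x

# Cold start: generic initial guess far from the root.
root_cold, iters_cold = newton(f, df, x0=100.0)
# Hot start: pretend an NN predicted ~1.4 as the initial condition.
root_hot, iters_hot = newton(f, df, x0=1.4)
print(iters_cold, iters_hot)
```

Both runs converge to the same mismatch level (the same `tol`), so the only difference in cost is the iteration count plus, for the hot start, one NN forward pass.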