Hello! This is not really fastai related, but I got many good answers from the community so I decided to ask this here, too.

I read the NeuralODE paper and I am a bit confused, so I want to make sure that what I understood so far is correct.

Assume we have (for simplicity) two data points: z_0 measured at t_0 and z_1 measured at t_1. In the standard NN approach, one would train a network to predict z_1 given z_0, i.e. NN(z_0)=z_1. In the NeuralODE approach, the goal is to train the network to approximate a function f(z) (I will ignore the explicit time dependence) defining the ODE \frac{dz}{dt}=f(z), which is approximated as \frac{dz}{dt}=NN(z). Solving this ODE from t_0 to t_1 with some conventional (non-learned) ODE integrator (Euler’s method, for example), starting from z_0, should then give a solution at time t_1 that is close to z_1. So basically the NN now approximates the tangent of the trajectory (\frac{dz}{dt}) instead of the function itself (z(t)).
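To make sure I'm reading it right, here is a toy sketch of what I think the forward pass looks like. Everything here is my own illustration, not code from the paper: the "network" f is just a linear map f(z) = theta * z, and the true dynamics are dz/dt = -z so I can check against the exact solution.

```python
import numpy as np

# Hypothetical toy: the true dynamics are dz/dt = -z, so z(t) = z0 * exp(-t).
# A "network" f(z, theta) stands in for the NN; here it is just theta * z.
def f(z, theta):
    return theta * z

def euler_integrate(z0, t0, t1, theta, n_steps=100):
    """Solve dz/dt = f(z, theta) from t0 to t1 with Euler's method."""
    h = (t1 - t0) / n_steps
    z = z0
    for _ in range(n_steps):
        z = z + h * f(z, theta)  # one Euler step
    return z

z0, t0, t1 = 1.0, 0.0, 1.0
# With theta = -1 (the true dynamics), the Euler solution at t1
# should land close to the exact value exp(-1).
z1_pred = euler_integrate(z0, t0, t1, theta=-1.0)
```

If that matches the paper's setup, then training just means adjusting theta until z1_pred matches the measured z_1.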

Is my understanding so far correct?

So I am a bit confused about the training itself. I understand that they use the adjoint method. What I don’t understand is what exactly is being updated. As far as I can see, the only things that are free (i.e. not measured data) are the parameters of the function f, i.e. the NN approximating it. So one would need to compute \frac{\partial loss}{\partial \theta}, where \theta are the parameters (weights and biases of the network).

Why would I need to compute, for example (as they do in the paper), \frac{\partial loss}{\partial z_0}? z_0 is the input, which is fixed, so I don’t need to update it. What am I missing here?
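Here is how I currently picture the backward pass, again as my own scalar toy rather than anything from the paper. If I backpropagate through the Euler steps by hand, the quantity that gets swept backward in time is a(t) = \frac{\partial loss}{\partial z(t)} (the adjoint), and \frac{\partial loss}{\partial \theta} is accumulated from it at every step; \frac{\partial loss}{\partial z_0} then falls out for free at the end. Is that where the z-gradient comes from?

```python
# Scalar sketch (my own toy): dynamics f(z) = theta * z, loss = 0.5*(z1 - target)^2.
def forward_and_grad(z0, theta, z1_target, t0=0.0, t1=1.0, n_steps=100):
    h = (t1 - t0) / n_steps
    # Forward pass: Euler steps z <- z + h * theta * z, storing the trajectory.
    zs = [z0]
    for _ in range(n_steps):
        zs.append(zs[-1] + h * theta * zs[-1])
    loss = 0.5 * (zs[-1] - z1_target) ** 2

    # Backward pass: a is dL/dz at the current time, swept from t1 back to t0.
    a = zs[-1] - z1_target            # dL/dz(t1)
    dtheta = 0.0
    for z in reversed(zs[:-1]):
        dtheta += a * h * z           # each step contributes a * d(step)/dtheta
        a = a * (1.0 + h * theta)     # chain rule through z <- (1 + h*theta) * z
    return loss, dtheta, a            # a is now dL/dz0
```

So if I understand correctly, dL/dz(t) is not computed in order to update z_0; it is the intermediate quantity the parameter gradient is built from.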

Secondly, if what I said in the first part is correct, it seems like in principle one can get great results with a reasonably simple function f, such as (for example) a 3-layer fully connected NN, and one only needs to update the parameters of that small network. On the other hand, ResNets can have tens or hundreds of layers.

Am I missing a step here or is this new approach so powerful that with a lot fewer parameters one can get very good results?

I feel like a ResNet, even with just 2 blocks, should be more powerful than an Euler-method NeuralODE, as a ResNet allows more freedom in the sense that the 2 blocks don’t need to be the same, while in the NeuralODE with Euler’s method one reuses the same (single) block at every step.
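To spell out the comparison I have in mind (my own illustration, not from the paper): both models apply residual updates z ← z + something, but a ResNet gets a fresh block per layer while fixed-step Euler reuses one shared f, scaled by the step size h. A fixed-step Euler NeuralODE thus looks like a weight-tied ResNet.

```python
# ResNet: each layer has its own block (its own parameters).
def resnet_forward(z, blocks):
    for block in blocks:
        z = z + block(z)
    return z

# Fixed-step Euler NeuralODE: the same f (same parameters) at every step.
def euler_forward(z, f, n_steps, h):
    for _ in range(n_steps):
        z = z + h * f(z)
    return z
```

So my question is really whether the continuous/adaptive-solver view buys back the expressiveness that weight tying appears to give up.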

Lastly, I am not sure I understand what they mean by (continuous) depth in this case. What is the definition of depth here (I assume it is not just the depth of f)?