The loss function is shown below, where N is the number of time series. Ignore N for now and take N = 1. T is the length of the prediction horizon.
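Written out, a loss of this kind usually takes the following form. This is a sketch that assumes a DeepAR-style autoregressive model; the symbols $z_{i,t}$ (observed value of series $i$ at time $t$), $\mathbf{h}_{i,t}$ (network state), and $\theta(\cdot)$ (the mapping from state to likelihood parameters) are assumed notation for illustration.

$$\mathcal{L} = \sum_{i=1}^{N} \sum_{t=t_0}^{T} \log \ell\left(z_{i,t} \mid \theta(\mathbf{h}_{i,t})\right)$$

With N = 1 the outer sum disappears and only the sum over time steps remains.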

During training, t_0 starts at encoder time step 0, so the sum runs all the way to the last decoder time step T.

How does the sum of the **individual** negative binomial (nbinom) log-likelihoods at each time step contribute to the convergence of the nbinom parameters?

How does maximizing the likelihood of **individual** time steps result in convergence of the nbinom parameters?
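To make concrete what "summing individual log-likelihoods" means here, this is a minimal sketch in plain Python. It assumes the `scipy.stats.nbinom` parameterization (number of successes `n`, success probability `p`); the function names, the observations, and the per-step parameter values are all hypothetical and chosen only for illustration.

```python
import math

def nbinom_logpmf(k, n, p):
    """Log-pmf of a negative binomial for count k.

    Assumed parameterization (same as scipy.stats.nbinom):
    pmf(k) = C(k + n - 1, k) * p**n * (1 - p)**k
    """
    return (math.lgamma(k + n) - math.lgamma(k + 1) - math.lgamma(n)
            + n * math.log(p) + k * math.log1p(-p))

def sequence_log_likelihood(observations, params):
    """Sum the per-time-step log-likelihoods, one (n, p) pair per step."""
    return sum(nbinom_logpmf(z, n, p)
               for z, (n, p) in zip(observations, params))

# Hypothetical observed counts and the (n, p) pair predicted at each step.
observations = [1, 3, 0, 2]
params = [(5.0, 0.75), (5.0, 0.6), (2.0, 0.8), (5.0, 0.7)]

total = sequence_log_likelihood(observations, params)
print(total)  # a single scalar; training would maximize this sum
```

Each term depends only on that step's observation and that step's (n, p) pair, which is exactly what makes me wonder how the sum pins the parameters down.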

For example, say the input to a time step is 1, and the parameters (5, 0.75) maximize its likelihood.

However, because the input can come from any sub-sequence, the value at this time step could instead have come from another distribution, say (10, 0.5).
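As a quick check of the two parameter pairs, here is a small sketch that evaluates the likelihood of observing 1 under each. It again assumes the `scipy.stats.nbinom` convention (integer number of successes `n`, success probability `p`); `nbinom_pmf` is a hypothetical helper written for this example.

```python
import math

def nbinom_pmf(k, n, p):
    """Negative binomial pmf for integer n (assumed scipy.stats.nbinom convention)."""
    return math.comb(k + n - 1, k) * p**n * (1 - p)**k

observed = 1
lik_a = nbinom_pmf(observed, 5, 0.75)   # roughly 0.297
lik_b = nbinom_pmf(observed, 10, 0.5)   # roughly 0.0049

print(lik_a, lik_b)
```

Under this parameterization, an observation of 1 is far more likely under (5, 0.75) than under (10, 0.5), so the gradient from this single step pushes towards the first pair even if the value was actually generated by the second.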

The encoder input comes from an arbitrary sub-sequence of the time series (i.e. it is phase shifted), so the learnt nbinom parameters cannot be tied to a fixed time step. During the initial stages of training, how does the network know which nbinom parameters to update? Would it first have to detect the phase of the encoder input, and can it do that at all?
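To illustrate what I mean by phase-shifted encoder input, here is a minimal sketch of drawing training windows at random offsets. The function name, the window lengths, and the toy series are assumptions for illustration; I am not claiming this is how any particular library samples its windows.

```python
import random

def sample_window(series, encoder_len, decoder_len, rng):
    """Cut one training window at a random start offset.

    Returns the offset plus the encoder and decoder parts of the window.
    """
    window_len = encoder_len + decoder_len
    start = rng.randrange(len(series) - window_len + 1)
    window = series[start:start + window_len]
    return start, window[:encoder_len], window[encoder_len:]

rng = random.Random(0)        # fixed seed so the sketch is repeatable
series = list(range(20))      # toy series; each value equals its absolute time index

samples = [sample_window(series, 5, 3, rng) for _ in range(4)]
for start, encoder_part, decoder_part in samples:
    print(start, encoder_part, decoder_part)
```

The same absolute time index can land at a different relative position in each window, which is why it is unclear to me how the parameters at a given network step could be associated with a fixed point in the series.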