Now I see how you can think of training a model as repeatedly applying the same function. I had never thought of it that way before.

The example you cite diverges when the numbers stray off one given, unstable starting orbit. It is a pathological example specifically designed to illustrate a theoretical problem. I don’t think the principle applies to training a model.

First, using simple SGD with small enough learning rate, loss is decreasing. Therefore there are no fixed orbits in the weight space for numerical errors to accumulate in. Second, we WANT weights to fall out of quasi-stable states and converge to a loss basin. That’s what training is supposed to do. The only place I can see for this example to apply is that nearby weight initializations may end up in different basins. But we already knew that, and this fact may turn out to be a feature, not a bug.

Caveat: argument is informal and my math is rusty.