So as I was watching lesson 11, an idea struck me: what if we used a Kalman filter in our optimizer instead of an exponentially weighted moving average?
The 1000ft description of a Kalman filter is that it’s a state-space model that operates recursively on streams of noisy input data to produce a statistically optimal estimate of the underlying system state (optimal in the linear, Gaussian-noise case). I think of it as something in the family of hidden Markov models.
For simple systems, like a random walk + noise, the Kalman filter's gain settles to a constant, and from then on its update is equivalent to an exponentially weighted moving average. However, even in this case, a Kalman filter gives you an uncertainty (a variance on the state estimate) that can be used to create a confidence interval around the exponentially weighted moving average estimate. I can imagine this being useful in an evolution of the LAMB optimizer: where averages across a layer are being taken, points with high variance could be discounted in that average. Just an idea.
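To make the random walk + noise case concrete, here is a minimal sketch of a scalar Kalman filter. It is not the code from the gist; the function names (`kalman_local_level`, `steady_state_gain`, `inverse_variance_mean`) and the noise values `q` and `r` are made up for illustration. The inverse-variance weighting at the end is just one assumed way of "discounting high-variance points", not something LAMB does.

```python
import math
import random


def kalman_local_level(zs, q, r, x0=0.0, p0=1.0):
    """Scalar Kalman filter for x_t = x_{t-1} + w, z_t = x_t + v.

    q is the variance of the process noise w, r the variance of the
    measurement noise v. Returns the state estimates, their variances,
    and the Kalman gain used at each step.
    """
    x, p = x0, p0
    means, variances, gains = [], [], []
    for z in zs:
        p_pred = p + q                 # predict: uncertainty grows by q
        k = p_pred / (p_pred + r)      # Kalman gain, between 0 and 1
        x = x + k * (z - x)            # same form as an EWMA update
        p = (1.0 - k) * p_pred         # uncertainty after seeing z
        means.append(x)
        variances.append(p)
        gains.append(k)
    return means, variances, gains


def steady_state_gain(q, r):
    """Constant gain the filter converges to (the equivalent EWMA alpha)."""
    p_pred = 0.5 * (q + math.sqrt(q * q + 4.0 * q * r))
    return p_pred / (p_pred + r)


def inverse_variance_mean(values, variances):
    """Average that discounts high-variance points (assumed weighting)."""
    weights = [1.0 / v for v in variances]
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


# Simulate a noisy random walk (illustrative values for q and r).
random.seed(0)
q, r = 0.01, 1.0
level, zs = 0.0, []
for _ in range(500):
    level += random.gauss(0.0, math.sqrt(q))
    zs.append(level + random.gauss(0.0, math.sqrt(r)))

means, variances, gains = kalman_local_level(zs, q, r)
alpha = steady_state_gain(q, r)

# 95% confidence interval around the latest estimate.
half_width = 1.96 * math.sqrt(variances[-1])
interval = (means[-1] - half_width, means[-1] + half_width)
```

After enough steps `gains[-1]` matches `alpha`, so the update `x + k * (z - x)` is exactly an exponentially weighted moving average with that smoothing factor, plus a variance for free. One caveat: with fixed `q` and `r` the variance depends only on those two values and the number of steps, not on the data, so for it to reflect how noisy the gradients actually are, `q` and `r` would have to be estimated somehow. That part is not shown here.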
Background on Kalman Filters:
http://aircconline.com/ijcses/V8N1/8117ijcses01.pdf
http://greg.czerniak.info/guides/kalman1/
https://en.wikipedia.org/wiki/Kalman_filter
I created a gist here that’s a clone of Jeremy’s 09_optimizer notebook, with the addition of a Kalman filter plotted on the same data:
The disclaimer here is that I’m not the best mathematician, but I thought it was worth sharing to see what this group could make of the idea.