Optimizer Tweak Idea: Replace EWMA with Kalman Filter

So as I was watching lesson 11, an idea struck me. What if instead of using an exponentially weighted moving average in our optimizer we used a Kalman filter instead.

The 1000ft description of a Kalman Filter is that it’s a state-space model that operates recursively on streams of noisy input data to produce a statistically optimal estimate of the underlying system state. I think of it as something in the family of hidden markov models

For simple systems, like a random walk + noise, Kalman Filters reduce to equivalence with exponentially weighted moving averages. However, even in this case, a Kalman Filter gives you an uncertainty that can be used to create a confidence interval around the exponentially weighted moving average estimate. I can imagine this being useful in an evolution of the LAMB optimizer where averages across a layer are being taken, points with high variance could be discounted in that average, just as an idea.

Background on Kalman Filters:
http://aircconline.com/ijcses/V8N1/8117ijcses01.pdf
http://greg.czerniak.info/guides/kalman1/
https://en.wikipedia.org/wiki/Kalman_filter

I created a gist here that’s a clone of Jeremy’s 09_optimizer notebook with the addition of a kalman filter plotted on the same data:

09_Kalman_Optimizer_Gist

so the disclaimer here is I’m not the best mathematician but I thought it was worth sharing to see what this group could make of the idea.

5 Likes

Great idea. One thing to keep in mind is that the Kalman Filter is computationally much more demanding per step, so you would need to have a quite large improvement in convergence speed (in terms of number of steps) to compete with EWMA.
kfac was an optimizer that looked like it might do that, but it didn’t become the default (but it had things like not being agnostic of what the parameters were etc.).
What results are you getting with Kalman in the optimizer?

Best regards

Thomas

1 Like

Totally agree. No results yet, I’m still thinking through how to model the transition covariance matrix. The upside is there is so much more flexibility to a Kalman Filter, the downside, besides the computation cost, is all the flexibility!

Is the memory use the same as ewma?

I would think it’s the same because all it needs is the previous state and the current state. Structurally it’s very similar to ewma in that way.

1 Like

Has there been any followup/results regarding this?