I am working on a Kaggle project and saw a notebook using a Gaussian Mixture Model on a bimodal / multimodal distribution.

To understand it better, I started learning about Gaussian Mixture Models from here

Now I have a basic high-level understanding of this concept and understand why it was used, but I **couldn’t** understand the full article, especially the following:

Expectation-Maximization in Gaussian Mixture Models

E-Step

M-Step

and the intuition behind complicated math

Here are my questions

Is it alright to have a high-level understanding or will it be wiser to deep dive?

How can I understand such complicated material better, so I can be a better ML resource? I have tried but am unable to fully comprehend it. Any tips or suggestions are welcome.

I find such struggles frustrating and disheartening and at times I find myself taking my eyes off the main goal (Competitions / make applied AI apps) and find myself going down one Rabbit hole after another and yet end up with lots of confusion and wasted time.

One step at a time. Jeremy’s “dive right in” approach is great for getting stuff done – I know of no faster way to get to state-of-the-art results. But it can be high-frustration because it postpones the understanding. At my work some people love it and some hate it.

But if your goal is to do competitions and make AI apps, maybe pick a part of it that resonates, and gradually branch out from there. For example, you might specialize in either supervised or unsupervised, etc.

OK, GMMs. Sounds like the bugbear is the EM algorithm, not necessarily the idea of a mixture.

Good news: EM is simple unless you try to write the math. It’s a very general algorithm. Even K-means is a special case of it, one where each point belongs entirely to its nearest cluster. What changes with different models are the parameters you use to define your clusters, and that can (a) complicate the math, and (b) change the behavior. For Gaussians, the parameters are mean (center) and covariance (width along each axis, plus any tilt). And where K-means has hard boundaries, Gaussians can overlap, so a point can belong partly to several clusters.
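To make the hard-versus-overlapping distinction concrete, here is a minimal sketch for one point sitting between two 1-D clusters. The centers, widths, and the point itself are made-up numbers, and both clusters are assumed equally likely; `gaussian_height` is just a helper name for this illustration.

```python
import numpy as np

def gaussian_height(x, mean, std):
    """Height of a 1-D Gaussian curve at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Two illustrative clusters (assumed values, equal cluster probabilities).
centers = np.array([0.0, 4.0])
stds = np.array([1.0, 1.0])
x = 1.5  # a point between the two centers, closer to the first

# K-means style: the point belongs entirely to the nearest center.
hard = np.zeros(2)
hard[np.argmin(np.abs(x - centers))] = 1.0

# Gaussian style: the point is shared in proportion to each curve's height.
heights = gaussian_height(x, centers, stds)
soft = heights / heights.sum()

print("hard assignment:", hard)
print("soft assignment:", soft)
```

The hard assignment puts all of the weight on one cluster, while the soft one leaves a small share for the farther cluster.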

So, EM:

1. Drop some clusters onto the region. Guess their parameters and relative probabilities π.
   - Note: choosing these well can help dramatically, but any guess will do.

2. Find the relative weight w_i of each data point x_i, for each cluster.
   - Calculate the usual p(x_i) - the curve height at x_i - for each cluster, and normalize so that each point's weights sum to 1 across clusters.
   - Remember to multiply each p(x_i) by its cluster probability π before normalizing.

3. Use those weights to re-estimate the cluster parameters.
   - Estimate cluster parameters as usual, except weight each x_i by its w_i. For example, the mean becomes ∑ w_i x_i / ∑ w_i.
   - Estimate new π using the average w_i. For each cluster, π = ∑ w_i / N.

4. Iterate steps 2 and 3 until convergence.

For simplicity I omitted cluster subscripts above. Because w_i are separate for each cluster, all w_i should really be w_ij, for cluster j. And the cluster probabilities would be π_j. So the last line would be, “For each cluster j, π_j = ∑ w_ij / N”. Etc.
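The EM steps above can be sketched in a few lines of NumPy. This is a minimal 1-D illustration, not a production implementation. The function name `fit_gmm_1d`, the quantile-based starting guess, the small variance floor, and the log-likelihood stopping rule are all assumptions made for this sketch; the array `w[i, j]` plays the role of w_ij and `pis[j]` the role of π_j.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Height of a 1-D Gaussian curve at each x."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def fit_gmm_1d(x, n_clusters=2, n_iter=200, tol=1e-8):
    n = x.shape[0]
    # Step 1: initial guess (assumption: spread the means over data quantiles).
    means = np.quantile(x, (np.arange(n_clusters) + 0.5) / n_clusters)
    variances = np.full(n_clusters, x.var())
    pis = np.full(n_clusters, 1.0 / n_clusters)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # Step 2 (E-step): pi_j * p_j(x_i), normalized across clusters.
        weighted = pis * gaussian_pdf(x[:, None], means, variances)  # (n, k)
        totals = weighted.sum(axis=1, keepdims=True)
        w = weighted / totals
        # Step 3 (M-step): the usual estimates, with each x_i weighted by w_ij.
        nk = w.sum(axis=0)
        means = (w * x[:, None]).sum(axis=0) / nk
        variances = (w * (x[:, None] - means) ** 2).sum(axis=0) / nk
        variances = variances + 1e-6  # assumed floor so a cluster cannot collapse
        pis = nk / n                  # pi_j = sum_i w_ij / N
        # Step 4: stop once the log-likelihood stops improving.
        ll = np.log(totals).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return means, variances, pis, w

# Synthetic bimodal data (made-up numbers): 30% around -2, 70% around 3.
rng = np.random.default_rng(42)
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 700)])
means, variances, pis, w = fit_gmm_1d(x)
print("means:", means)
print("variances:", variances)
print("pis:", pis)
```

Each row of `w` sums to 1, so every point distributes one unit of weight across the clusters.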

On the importance of good initialization, here’s my post on K-means initialization. The real credit there goes to pfh - I just helped localize the problem by re-implementing in Python.

Happily, scikit-learn has good implementations of all these. If you are clustering, then Gaussian Mixture Models will often give better results than K-means, since their clusters can stretch and overlap, but they might be much slower. Scikit-learn’s clustering page has a fantastic visual comparison of the various clustering algorithms.
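As a rough sketch of what that looks like in practice, the snippet below fits scikit-learn's `GaussianMixture` and `KMeans` to the same synthetic bimodal data. The data and the settings (`n_components=2`, `n_init`, `random_state`) are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Synthetic bimodal data (made-up numbers), shaped (n_samples, 1) for scikit-learn.
rng = np.random.default_rng(0)
X = np.concatenate(
    [rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 700)]
).reshape(-1, 1)

# Gaussian mixture: soft memberships plus per-cluster weights and means.
gmm = GaussianMixture(n_components=2, n_init=5, random_state=0).fit(X)
soft = gmm.predict_proba(X)   # one row per point, one column per cluster
gmm_labels = gmm.predict(X)   # most likely cluster for each point

# K-means on the same data: hard labels only.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
km_labels = km.labels_

print("GMM weights:", gmm.weights_)
print("GMM means:", gmm.means_.ravel())
print("K-means centers:", km.cluster_centers_.ravel())
```

`predict_proba` returns the per-point cluster weights, and `weights_` holds the fitted cluster probabilities.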