Scratch everything I said above.
I’m keeping my post above so folks who were confused about how PCA works in the same way I was can see where I went wrong with my assumptions. I spent this past weekend really digging into PCA … how it works, where it is useful, and to what objectives it can be applied. I include a list of resources below that I found helpful, but let’s get to answering your two questions first:
Why do we choose to fit the transpose of the movie embedding rather than the embedding itself?
The answer lies in part by asking ourselves two questions:
- “What does the data represent?”
- “What is the problem we are attempting to solve?”
The answer to the first question is that our (3000, 50) matrix represents 3,000 movie reviews, each described by 50 things the model has learned about movie reviews that make them meaningful for language modeling and classification. We don’t know what each of these 50 embedding values represents, but we do know that whatever they are, they provide a good representation of each movie review because they have proven useful in the classification task.
The answer to the second question is that we are trying to reduce these 50 things to 3 things, and then figure out how strongly each of the 3k movie reviews relates to each of those 3 things so we can infer what the 3 things represent. We are asking, “How can we cluster these 50 different learned things into 3 big areas?” (Notice that we are NOT trying to reduce the dimensionality of each movie review. If we were, we wouldn’t want to transpose the embedding matrix and we wouldn’t care about the components.)
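To make that distinction concrete, here’s a minimal sketch using scikit-learn. A random matrix stands in for `movie_emb` (the real one would come from the trained model), so only the shapes are meaningful here:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for the (3000, 50) movie review embedding matrix.
rng = np.random.default_rng(0)
movie_emb = rng.normal(size=(3000, 50))

# Reducing each review from 50 dims to 3 dims -- NOT what we want here:
per_review = PCA(n_components=3).fit_transform(movie_emb)
print(per_review.shape)  # (3000, 3) -- one 3-dim vector per review

# Reducing the 50 learned "things" to 3, treating reviews as features:
per_factor = PCA(n_components=3).fit_transform(movie_emb.T)
print(per_factor.shape)  # (50, 3) -- one 3-dim vector per learned factor
```

Same `PCA` call both times; only the orientation of the matrix changes what gets reduced.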
So …
By transposing the matrix, our examples become the 50 things each embedding has learned about the movie reviews, and our features become the 3k movie reviews. So when we call pca.fit(movie_emb.T), we are asking PCA to figure out how much each review plays a part in each of the 3 learned principal components. That is exactly what we want!
Why do we take the components from the fit PCA rather than transforming the matrix?
Because they represent the eigenvector for each principal component.
What is an ‘eigenvector’ and why do we care about this?
Simply put, an eigenvector tells you how big a part each feature (here, the 3k reviews) plays in composing its PC. The higher the value (in absolute terms), the more important the review is for that PC, and therefore, the more representative it is of what the PC means.
So look at the dimensions of .components_
and you’ll notice that they are (3, 3000)
. The first row is the eigenvector for PC 1 … and each value in there tells you how much each review played a part in learning that PC. Thus, if you sort the values in the first row in descending order, you will have the reviews that are most strongly associated with PC 1. That information can then be used in turn to infer what PC 1 means.
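A quick sketch of that step, again using a random stand-in for `movie_emb` so the shapes line up but the actual rankings are meaningless:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for the (3000, 50) embedding matrix.
rng = np.random.default_rng(0)
movie_emb = rng.normal(size=(3000, 50))

pca = PCA(n_components=3)
pca.fit(movie_emb.T)          # examples = 50 learned factors, features = 3000 reviews
print(pca.components_.shape)  # (3, 3000) -- one row (eigenvector) per PC

# Reviews most strongly (positively) weighted in PC 1:
pc1 = pca.components_[0]
top_reviews = np.argsort(pc1)[::-1][:10]  # indices of the 10 highest-weighted reviews
```

You would then read the actual review texts at those indices to get a feel for what PC 1 captures.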
Grab a cold one and think on all this, and I guarantee it will start to make sense.
Helpful resources:
- https://www.youtube.com/watch?v=FgakZw6K1QQ
- https://sebastianraschka.com/Articles/2015_pca_in_3_steps.html
- https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60
- https://medium.com/bluekiri/understanding-principal-component-analysis-once-and-for-all-9f75e7b33635
- https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html
- https://stats.stackexchange.com/questions/311908/what-is-pca-components-in-sk-learn