Lesson 8 official topic

This post is for topics related to lesson 8 of the course. This lesson is based partly on chapter 8 of the book.

This is a wiki post - feel free to edit to add links from the lesson or other useful info.


Lesson resources


Stream looks good now


Is PCA useful for this kind of visualization in other areas, if you have some domain knowledge?

This is a great post on visualising high-dimensional data, with sample Python code for PCA and t-SNE.


Another interesting non-linear technique is “UMAP”, which uses a kind of train-inference approach (i.e. you can fit the model once and then apply it to unseen data).


…You usually use your “domain knowledge” to interpret (i.e. give a meaning to) the resulting reduced dimensions.


Is there an issue where the bias components are overwhelmingly determined by the “non-experts” in a genre?
For instance, you used two different phrasings for high and low biases:

  1. If the bias is low, you state that “even if you like this type of film,” you wouldn’t like this one

  2. If the bias is high, the opposite “even if you don’t like this type of film,” you would like this one

It seems like the population in statement 2 is going to be much larger, and so would have a much larger impact on the bias? I can’t put my finger on it, but it feels vaguely related to fast-food chains getting 5 out of 5 stars on Yelp… even though a burger expert might have a pointedly different opinion?
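To make the question above concrete, here is a minimal sketch of a dot-product-plus-bias predictor. The sizes, the random parameters and the `predict` helper are all made up for illustration, and the sketch leaves out any output scaling the lesson's model may apply. It only shows that a movie's bias is a single number shared by everyone who rates that movie, which is why the mix of raters could plausibly matter.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, n_factors = 4, 3, 5

# Hypothetical "learned" parameters (random here, purely for illustration).
user_factors = rng.normal(size=(n_users, n_factors))
movie_factors = rng.normal(size=(n_movies, n_factors))
user_bias = rng.normal(size=n_users)
movie_bias = rng.normal(size=n_movies)

def predict(user, movie):
    """Taste match (dot product) plus the per-user and per-movie biases."""
    match = user_factors[user] @ movie_factors[movie]
    return match + user_bias[user] + movie_bias[movie]

# Raising one movie's bias shifts its prediction for every user by the same amount.
before = np.array([predict(u, 0) for u in range(n_users)])
other_before = predict(0, 1)
movie_bias[0] += 1.0
after = np.array([predict(u, 0) for u in range(n_users)])
shift = after - before
```

Because the bias is shared, every rating of that movie pulls on the same number during training, whoever the rater is.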


Is collaborative filtering related in any way to correspondence analysis (which also represents both the rows and columns of a matrix in the same embedding space)?


On a separate note…

If I recall correctly, PCA is equivalent to a single-layer autoencoder with a linear activation function (so arguably not a neural network, since there is no non-linearity)…

Basically, autoencoders are a generalization of PCA: both learn a function that maps a high-dimensional dataset to a low-dimensional embedding space, but PCA can only learn a linear one…
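A rough sketch of that view, on synthetic data: treat the top principal directions as the weights of a linear "encoder" and their transpose as the "decoder", then compare the reconstruction error with a random linear projection of the same size. The data, the tied-weights setup and the `reconstruction_error` helper are assumptions made for this example, and nothing is trained here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 points in 10 dimensions that mostly vary along 2 directions.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 10))
X = X - X.mean(axis=0)  # PCA assumes centred data

def reconstruction_error(X, W):
    """Encode with W (features x k), decode with W.T, return mean squared error."""
    codes = X @ W          # linear "encoder"
    X_hat = codes @ W.T    # linear "decoder" with tied weights
    return float(np.mean((X - X_hat) ** 2))

# PCA weights: top-2 right singular vectors of the centred data.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
W_pca = Vt[:2].T

# Baseline: a random orthonormal 2-D projection.
W_rand, _ = np.linalg.qr(rng.normal(size=(10, 2)))

err_pca = reconstruction_error(X, W_pca)
err_rand = reconstruction_error(X, W_rand)
```

On this data `err_pca` comes out below `err_rand`, which is the sense in which PCA is the best linear encoder/decoder pair of that width.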


So, word embeddings can be used for both categories in tabular data and NLP data.

How do we choose the size of the embedding (5 in this case)? And how are the individual values in each embedding determined?
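On the size question, one common approach is a rule of thumb based on the number of categories. The sketch below uses a formula that, as far as I recall, matches fastai's default rule for tabular embeddings (`emb_sz_rule`); treat the constants as an assumption rather than a quote from the library, and `embedding_size` as a name made up for this example. The values themselves are not chosen by hand: they start out random and are learned during training like any other weights.

```python
def embedding_size(n_categories: int) -> int:
    """Heuristic embedding width for a categorical variable with n_categories levels."""
    # Grows slowly with the number of categories and is capped at 600.
    return min(600, round(1.6 * n_categories ** 0.56))

# Widths suggested for a few category counts.
sizes = {n: embedding_size(n) for n in (10, 100, 1_000, 100_000)}
```

In practice the width is a hyperparameter, so a heuristic like this is a starting point to tune from.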


I don’t really get why the word embedding needs 4 columns (latent variables?) in the matrix when each word in the vocabulary already has a one-dimensional unique identifier, its index. I have a gap in my mental picture.

Well, the embeddings are usually learned so that they carry actual semantic meaning.

For example, the famous “word2vec” embeddings are known to have some interesting properties. You can do arithmetic with the word embeddings, e.g.
“king - man + woman”
and get an embedding close to the one for “queen”.

You can imagine that having that semantic information would be much more helpful to a neural network model than the unique indices…
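A toy illustration of that arithmetic, using hand-made 3-D vectors rather than real word2vec output. The words, the values and the `nearest` helper are all invented for this sketch; real embeddings are learned and have many more dimensions.

```python
import numpy as np

# Hand-made 3-D vectors (hypothetical, NOT real word2vec output).
# Dimensions, loosely: [royalty, maleness, femaleness]
emb = {
    "king":  np.array([0.9, 0.9, 0.1]),
    "queen": np.array([0.9, 0.1, 0.9]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.0, 0.2, 0.2]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(vec, exclude=()):
    """Word whose embedding is most similar to vec, skipping the excluded words."""
    candidates = {w: v for w, v in emb.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(vec, candidates[w]))

target = emb["king"] - emb["man"] + emb["woman"]
result = nearest(target, exclude=("king", "man", "woman"))
```

A bare index like 3 or 17 has no such geometry: nothing about the number says which words are similar, whereas several columns give the model directions it can reuse.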


Interesting point of view on PCA!
I was always thinking about it as SVD / pseudo-inverse.

SVD: Singular Value Decomposition (SVD): Overview - YouTube
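For anyone who wants to see the SVD view side by side with the covariance view, here is a small NumPy sketch on synthetic data (the data and variable names are made up for the example). It checks that the squared singular values of the centred data give the same variances as the eigenvalues of the covariance matrix, and builds the 2-D coordinates you would plot.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))  # correlated synthetic features
Xc = X - X.mean(axis=0)                                  # centre each column

# Route 1: SVD of the centred data matrix.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
var_svd = S**2 / (len(Xc) - 1)

# Route 2: eigendecomposition of the covariance matrix.
cov = np.cov(Xc, rowvar=False)
var_eig = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh returns ascending order

# 2-D coordinates for plotting: project onto the top two principal directions.
coords_2d = Xc @ Vt[:2].T
```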


Does taking embeddings trained by another model (a neural network) and using them in a random forest cause data leakage?


It shouldn’t, if you’re using the same train/validation split for both models.


Can anyone recommend or share a notebook that demonstrates a good implementation of the two-step process described in the lesson (word embeddings fed into a boosted tree or random forest), using the fastai library?


BTW I have found this resource helpful for visualizing convolutions:


(I think a similar diagram might be in the fastai book)…


Is there typically a different dropout mask for EACH activation layer, or only near the end of your convolution stack?

If applied repeatedly, it seems like even a small dropout % could drastically reduce the network’s ability to learn?
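A small NumPy sketch of the mechanics behind this question, assuming the usual "inverted dropout" formulation, where surviving activations are rescaled by 1 / (1 - p). The `dropout` helper and the all-ones activations are made up for illustration, and a real network would have weight layers between the two dropout calls, so the compounding here is a worst-case picture.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p, rng):
    """Inverted dropout: zero each activation with probability p, rescale the rest."""
    mask = rng.random(x.shape) >= p       # fresh mask sampled on every call
    return x * mask / (1.0 - p), mask

acts = np.ones((1000, 50))
out1, mask1 = dropout(acts, p=0.1, rng=rng)   # after "layer 1"
out2, mask2 = dropout(out1, p=0.1, rng=rng)   # after "layer 2", independent mask

kept_after_two = float((out2 != 0).mean())     # roughly 0.9 * 0.9
mean_after_two = float(out2.mean())            # stays near 1.0 thanks to rescaling
```

So with p = 0.1 applied twice, about 81% of the values survive, and the rescaling keeps the average activation roughly unchanged.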

End of the course :slightly_frowning_face:
Thanks for another great course, especially with a surprise Lesson 8 :smile:

Looking forward to Part 2! :tada: