Embeddings: can latent factors serve as training labels?

I just finished Lesson 4 and the Lesson 4 notebook. Toward the end, there’s an exercise where the learned embeddings are decomposed with PCA to show where the movies lie on various interesting axes, such as ‘critically acclaimed’ and ‘violent’. I found this remarkably powerful, as we’re able to divine these interpretable features without directly modeling the content of the movies (for example, we’re not using NLP to evaluate the scripts).
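For anyone curious, here's a minimal sketch of that decomposition step. The embedding matrix, shapes, and variable names below are assumptions standing in for the notebook's trained weights, not the actual lesson code:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for a learned movie-embedding matrix of shape
# (n_movies, n_factors), e.g. 50 latent factors per movie.
rng = np.random.default_rng(0)
movie_emb = rng.normal(size=(1000, 50))

# Project the embeddings onto their top principal components.
pca = PCA(n_components=3)
components = pca.fit_transform(movie_emb)  # shape: (n_movies, 3)

# Each column is a candidate "interpretable axis"; sorting movies by a
# column surfaces its extremes (e.g. most/least 'critically acclaimed').
top_axis = components[:, 0]
extremes = np.argsort(top_axis)[-10:]  # indices at one end of the axis
```

The axes themselves carry no names; interpretation comes from eyeballing which movies land at each extreme.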

This made me wonder: what if we had a related task where we want to model the relationship between content and concepts such as ‘violence’, but have no training labels? Could we use our learned embeddings as target labels?

For example, if we rounded up all of the movie posters for the MovieLens films, could we then train a CNN to learn visual features that help to predict conceptual labels such as ‘violent’? It seems to me that there are many instances where we might have a lot of associative data like MovieLens, but few labels along semantic/conceptual dimensions.
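In case it helps make the idea concrete, here's a rough sketch of what that training setup could look like: a small CNN regresses poster images onto the movies' latent-factor vectors with an L2 loss. Everything here (architecture, shapes, random stand-in data) is my own assumption, not anything from the lesson:

```python
import torch
import torch.nn as nn

n_factors = 50  # assumed size of the learned latent-factor vectors

# A toy CNN whose final layer maps image features into embedding space.
cnn = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, n_factors),
)

posters = torch.randn(8, 3, 64, 64)   # stand-in batch of poster images
targets = torch.randn(8, n_factors)   # those movies' learned embeddings

# Treat the embeddings as regression targets: minimize L2 distance
# between the CNN's output and each movie's latent-factor vector.
opt = torch.optim.Adam(cnn.parameters(), lr=1e-3)
loss = nn.functional.mse_loss(cnn(posters), targets)
loss.backward()
opt.step()
```

Once trained, the CNN's outputs could be projected onto the same PCA axes, giving approximate ‘violent’ / ‘critically acclaimed’ scores for posters of movies the collaborative filter never saw.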


Yep! Check out the DeViSE paper, which did just that with ImageNet and WordNet.


Fascinating! Thanks so much for the reference. What a tremendous resource this course and community are!