Great work @morgan!
I found this a really useful way of quickly gaining a high-level understanding of a large unknown corpus. Since XLM-R is multilingual you can also just use it as-is with many languages. I tested it with Norwegian (fairly close to English, to be fair) and got good results. The total run time for processing 25k documents was also only around 20 minutes. A first scan like this can surface insights into your corpus that would otherwise require a lot of reading time.
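For anyone who wants to try this on their own corpus, here's a minimal sketch of the embedding step I have in mind. Mean pooling over the last hidden state is just one common choice, and @morgan's exact setup may well differ:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base").eval()

@torch.no_grad()
def embed(texts, batch_size=32):
    """Mean-pool the last hidden state to get one vector per document."""
    vecs = []
    for i in range(0, len(texts), batch_size):
        batch = tokenizer(texts[i:i + batch_size], padding=True,
                          truncation=True, max_length=256, return_tensors="pt")
        out = model(**batch).last_hidden_state          # (B, T, 768)
        mask = batch["attention_mask"].unsqueeze(-1)    # (B, T, 1)
        vecs.append((out * mask).sum(1) / mask.sum(1))  # masked mean over tokens
    return torch.cat(vecs).numpy()

embeddings = embed(["Dette er et dokument på norsk.", "And one in English."])
```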
In my experience PCA tends to clump most of the data points together, while t-SNE/UMAP give much more interesting results. Chris Olah, btw, has a very nice post on dimensionality reduction. t-SNE/UMAP also sometimes require a bit of fiddling with the hyperparameters to give great results, see e.g. this Distill article.
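To make the comparison concrete, these are the knobs that usually need the fiddling (standard sklearn / umap-learn parameter names; the values below are just common starting points, not recommendations):

```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap  # pip install umap-learn

# PCA: fast, but often collapses fine-grained cluster structure into one blob.
xy_pca = PCA(n_components=2).fit_transform(embeddings)

# The usual knobs: perplexity for t-SNE, n_neighbors / min_dist for UMAP.
xy_tsne = TSNE(n_components=2, perplexity=30).fit_transform(embeddings)
xy_umap = umap.UMAP(n_neighbors=15, min_dist=0.1).fit_transform(embeddings)
```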
UMAP/t-SNE do suffer a bit, though, as the data size grows. One option is to run PCA first to get the data down to a more manageable dimensionality. Also, the sklearn t-SNE is very slow compared to UMAP, but RAPIDS has GPU-accelerated implementations that make both algorithms much faster, see e.g. t-SNE and UMAP.
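Something like this for the PCA-first trick, with the RAPIDS classes as optional drop-ins (assuming a working RAPIDS install; `cuml.manifold` mirrors the sklearn/umap-learn constructors for the common parameters):

```python
from sklearn.decomposition import PCA

# embeddings: (n_docs, 768) array from the embedding snippet above.
# Reducing to ~50 dims first is cheap and keeps t-SNE/UMAP tractable.
reduced = PCA(n_components=50).fit_transform(embeddings)

# GPU-accelerated drop-ins from RAPIDS cuML:
from cuml.manifold import TSNE, UMAP
xy = TSNE(n_components=2, perplexity=30).fit_transform(reduced)
# or: xy = UMAP(n_neighbors=15, min_dist=0.1).fit_transform(reduced)
```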
Finally, t-SNE results seem to depend on how the embedding is initialized. I haven't looked into this in much detail, but I couldn't help wondering whether there's a parallel to how neural nets need to be properly initialized in order to learn.
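The easy experiment, if anyone wants to check: sklearn's t-SNE takes an `init` argument, so you can compare a random start against a PCA start on the same data:

```python
from sklearn.manifold import TSNE

# reduced: the (n_docs, 50) PCA output from the snippet above.
# Same data, same seed, different initialization; the maps can look
# noticeably different, which is the sensitivity I mean.
xy_random = TSNE(init="random", random_state=0).fit_transform(reduced)
xy_pca    = TSNE(init="pca", random_state=0).fit_transform(reduced)
```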
Anyway, I'm really interested in seeing other use cases of XLM-R + UMAP or similar!