Doesn’t sound like big documents, so I’d see if any of the pre-trained multilingual models work for you, so you don’t risk losing information translating non-English texts to English. It’s a win if you can do this, since training a custom LM per language, whether with a transformer architecture or ULMFiT, is going to take considerable time and resources. A rough sketch of what that looks like is below.
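To make that concrete, here's a minimal sketch of getting document embeddings straight from a pretrained multilingual checkpoint with huggingface, no translation step. The checkpoint name (`xlm-roberta-base`) and the mean-pooling approach are just my assumptions for illustration; any multilingual model on the hub would slot in the same way.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# "xlm-roberta-base" is one multilingual checkpoint you could try; swap in
# whatever covers your languages.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")
model.eval()

texts = ["Das ist ein Beispieldokument.", "This is an example document."]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, dim)

# Mean-pool over real (non-padding) tokens to get one embedding per document.
mask = batch["attention_mask"].unsqueeze(-1)
embeddings = (hidden * mask).sum(1) / mask.sum(1)
print(embeddings.shape)  # torch.Size([2, 768])
```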
Re: ULMFiT, like Pablo, I’ve had good success with it for LM, document embeddings, and sequence classification tasks on English texts. For things like NER and summarization I use huggingface, and I plan to explore it for document embeddings and classification as well to see how it compares with what I get from ULMFiT.
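For the NER side of that workflow, the huggingface pipeline API gets you going in a couple of lines. The checkpoint here is just the stock CoNLL-03 English one as an example, not necessarily what I'd use for your data:

```python
from transformers import pipeline

# Example NER setup; pick a checkpoint fine-tuned for your languages.
ner = pipeline(
    "ner",
    model="dbmdz/bert-large-cased-finetuned-conll03-english",
    aggregation_strategy="simple",  # merge word pieces into whole entities
)
print(ner("Pablo works on ULMFiT at fast.ai in San Francisco."))
```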
I’ve read both their papers, and I lean towards Longformer as it appears friendlier to customize for various downstream tasks. I can’t remember the exact detail, but while Reformer may let you train on longer sequences than Longformer, something about it limits its usability (or at least makes it difficult) for tasks outside of LM.
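As an example of what I mean by Longformer being friendly to customize: the standard huggingface task heads work on it out of the box with 4,096-token inputs. A sketch for sequence classification, where the 2-label setup is just an illustrative assumption:

```python
import torch
from transformers import LongformerForSequenceClassification, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096", num_labels=2
)

long_text = "some very long document " * 500
batch = tokenizer(long_text, truncation=True, max_length=4096, return_tensors="pt")

# Give the <s> (CLS) token global attention so the classification head
# can attend to the whole document.
global_attention_mask = torch.zeros_like(batch["input_ids"])
global_attention_mask[:, 0] = 1

with torch.no_grad():
    logits = model(**batch, global_attention_mask=global_attention_mask).logits
print(logits)
```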