This sounds like a pretty interesting project to me.
I’ve usually tried to set min_count a bit lower than that, but I haven’t seen any strong rules about where the value should be.
I’d leave the extra symbols in if you think they carry meaning; a question mark in English can really change the tone of a sentence (you’re talking to me / you’re talking to me?).
I think your project sounds more like word2vec than ResNet. It would probably be easiest to save the encoder and share that file around; anyone who’s trying to build a language model is going to be pretty cluey.
Yes, you’re right. I’m thinking about saving the vectors to a commonly used format like fastText’s .vec, so people can take them and drop them into their models. On the other hand, I also want to offer a quick-and-dirty way to do Thai text classification by letting users replace the last layer, like transfer learning for images.
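For anyone curious, the .vec format is just plain text: a header line with the vocabulary size and dimension, then one word plus its floats per line. A minimal sketch of a writer (toy words and vectors, not the real thai2vec data):

```python
import numpy as np

def save_vec(path, words, vectors):
    """Write embeddings in the plain-text .vec format used by fastText:
    a header line "<vocab_size> <dim>", then one "<word> <floats...>" line per word."""
    vectors = np.asarray(vectors)
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"{len(words)} {vectors.shape[1]}\n")
        for word, vec in zip(words, vectors):
            f.write(word + " " + " ".join(f"{x:.4f}" for x in vec) + "\n")

# toy example: two Thai words with 3-dimensional vectors
save_vec("toy.vec", ["สวัสดี", "ไทย"], [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
```

Anything that reads word2vec-style text files (e.g. gensim’s KeyedVectors loader) should then be able to pick the vectors up.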
Most of the language models I’ve seen are LSTMs, so retraining for transfer learning isn’t quite as simple as just replacing the last layer, but it’s definitely possible. Maybe you could publish a working notebook that would let someone edit it, add their own data, and get a good result?
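To make the “replace the last layer” idea concrete, here is a minimal PyTorch sketch (toy sizes, hypothetical class names, nothing to do with the actual AWD-LSTM code): the classifier keeps the pretrained embedding and LSTM and only swaps the next-token decoder for a fresh classification head.

```python
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    """Toy stand-in for a pretrained LSTM language model."""
    def __init__(self, vocab_size=1000, emb_dim=32, hid_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.Linear(hid_dim, vocab_size)  # predicts the next token

    def forward(self, x):
        out, _ = self.lstm(self.embedding(x))
        return self.decoder(out)

class LSTMClassifier(nn.Module):
    """Reuse the pretrained encoder; swap the decoder for a classifier head."""
    def __init__(self, lm, n_classes=4):
        super().__init__()
        self.embedding, self.lstm = lm.embedding, lm.lstm  # pretrained parts
        self.head = nn.Linear(lm.lstm.hidden_size, n_classes)  # new, untrained

    def forward(self, x):
        out, _ = self.lstm(self.embedding(x))
        return self.head(out[:, -1])  # classify from the final hidden state

lm = LSTMLanguageModel()   # in practice: load pretrained weights here
clf = LSTMClassifier(lm)
logits = clf(torch.randint(0, 1000, (8, 20)))  # batch of 8 sequences, length 20
print(logits.shape)  # torch.Size([8, 4])
```

The subtlety Jeremy mentions is that in practice you also want gradual unfreezing and careful learning rates for the pretrained layers, not just a new head.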
Hi Jeremy. Thank you for your reply! I’ve managed to get min_count = 10 with a smaller batch size. I eyeballed the words with frequency lower than that and found them to be very random “words” miscut by the segmentation algorithm (thaiwordsareusuallywrittenwithoutspaces), so I think 10 is reasonable. This gave me about 50k embeddings.
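The frequency cutoff itself is simple enough to sketch (a toy illustration with made-up tokens, not my actual pipeline):

```python
from collections import Counter

def build_vocab(tokenized_texts, min_count=10):
    """Keep only tokens seen at least min_count times; rare 'words' are often
    segmentation errors, since Thai is written without spaces and miscuts happen."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    return {tok for tok, c in counts.items() if c >= min_count}

# toy example with min_count=2; the last token is a segmentation miscut
texts = [["กิน", "ข้าว"], ["กิน", "น้ำ"], ["กินข้า"]]
vocab = build_vocab(texts, min_count=2)
print(vocab)  # {'กิน'}
```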
I think I’ve saved both the models and the embeddings with lfs in this repo https://github.com/cstorm125/thai2vec. Next step is trying to do the “transfer learning” of this model for text classification, as shown with the sentiment analysis in the lesson 4 notebook.
Updates: the “transfer learning” approach was a success! I think I’ve probably achieved state-of-the-art in Thai text classification at 94.4% on the most common open source dataset available. I say “probably” because I couldn’t find any previous literature, but Facebook’s fastText only gets 65% (it’s a four-label classification problem). The most surprising thing is that I changed almost nothing from the lesson 4 notebook!!
We follow the Howard and Ruder (2018) approach to finetuning language models for text classification. The language model is the one trained previously, the fast.ai version of the AWD LSTM language model. The dataset is NECTEC’s BEST, labeled as article, encyclopedia, news, and novel. We preprocessed the text to remove the segmentation token and used an 80/20 train/validation split, resulting in 119,241 sentences in the training set and 29,250 sentences in the validation set. We achieved 94.4% accuracy on the four-label classification with the finetuned model, compared to 65.2% by fastText using their own pretrained embeddings.
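The 80/20 split can be sketched roughly like this (a toy stand-in with made-up sentences, not the actual preprocessing code):

```python
import random

def split_dataset(sentences, labels, valid_frac=0.2, seed=42):
    """Shuffle and split labeled sentences into train/validation sets."""
    idx = list(range(len(sentences)))
    random.Random(seed).shuffle(idx)  # fixed seed for a reproducible split
    n_valid = int(len(idx) * valid_frac)
    valid_idx, train_idx = idx[:n_valid], idx[n_valid:]
    train = [(sentences[i], labels[i]) for i in train_idx]
    valid = [(sentences[i], labels[i]) for i in valid_idx]
    return train, valid

# toy example: 10 sentences over the four BEST labels
sents = [f"sentence {i}" for i in range(10)]
labs = ["article", "encyclopedia", "news", "novel"] * 2 + ["news", "novel"]
train, valid = split_dataset(sents, labs)
print(len(train), len(valid))  # 8 2
```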
My question now, if anyone’s still following this thread: how do I package the model nicely into a library (I’m contributing to an open source one called pyThaiNLP) so people can use it without having to download all the PyTorch dependencies?
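One way I’m considering to sidestep the heavy dependency, at least for the embeddings: ship only the embedding matrix as a NumPy archive, so users need numpy but not PyTorch to load it. A sketch with invented file and array names:

```python
import numpy as np

# Training side (toy stand-in for the real matrix): in practice this would be
# model.embedding.weight.detach().cpu().numpy() and the vocabulary list.
words = np.array(["สวัสดี", "ไทย", "ครับ"])
vectors = np.random.rand(3, 5).astype("float32")
np.savez_compressed("thai2vec_emb.npz", words=words, vectors=vectors)

# User side: only numpy is needed to load the archive and look up a vector.
data = np.load("thai2vec_emb.npz")
stoi = {w: i for i, w in enumerate(data["words"])}
vec = data["vectors"][stoi["ไทย"]]
print(vec.shape)  # (5,)
```

This only covers the embeddings, of course; shipping the full classifier would still need PyTorch (or an export to something like ONNX).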
I want to create a dataset of language text from materials science papers to see if we can predict the authors’ core training/expertise, irrespective of their formal affiliation. For example, a chemist could think/write like a mathematician, etc. It could make for a better recruiting algorithm.
My assumption is that these characteristics of language cannot be learned from the IMDB or Wikipedia datasets. Is that right?
Do you have any pointers on how to start creating such a dataset? The easiest source is PDF files; can the text from PDF files be copied to create the dataset, or does it have to be HTML files?
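Copied PDF text usually works, but it comes out messy: words hyphenated across line breaks and hard line wraps inside paragraphs. Assuming you pull the raw text out with something like pdfminer.six’s extract_text, a rough cleanup sketch (the example strings are made up):

```python
import re

def clean_pdf_text(raw):
    """Tidy raw text extracted from a PDF: rejoin words hyphenated across
    line breaks, then collapse line wraps inside paragraphs while keeping
    blank-line paragraph boundaries."""
    text = re.sub(r"-\n(\w)", r"\1", raw)  # "mathe-\nmatical" -> "mathematical"
    paragraphs = [re.sub(r"\s*\n\s*", " ", p).strip()
                  for p in re.split(r"\n\s*\n", text)]
    return [p for p in paragraphs if p]

raw = "We study mathe-\nmatical models\nof alloys.\n\nSecond para-\ngraph here."
print(clean_pdf_text(raw))
# ['We study mathematical models of alloys.', 'Second paragraph here.']
```

Note this naive de-hyphenation also joins genuinely hyphenated compounds that happen to break across lines, so spot-check the output.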
I think we need to separate the problem into two parts:
Language modeling: for this one, I think you can definitely train on Wikipedia, even for research papers.
Text classification: as I’ve done with Thai and Jeremy has done with sentiment analysis, you can do transfer learning from the language model you’ve trained, IF you have a dataset of papers already labeled with the authors’ expertise.
I think the most difficult part would be the labeling of the papers.