Language Model in Thai

Hi all,

I’m building a Thai language model using the LSTM with dropout from the Lesson 4 notebook. So far I’ve got quite acceptable performance, with perplexity around 36 when training on the Thai Wikipedia dump. I have several questions:

  • I set the minimum word frequency to 50, mostly because that’s the most my AWS instance can handle. I figured that if a word appears fewer than 50 times in all of Wikipedia, it’s probably alright to relegate it to the unknown token (see the sketch after this list). What do you think?
  • I found many symbols in the word embedding layer, not only things like (*)/? but also some Thai-specific symbols and English words. Should I clean them out first? I compared my work with fastText and found that their embeddings also include many of these.
  • My ultimate goal is to save the model as a pretrained model (a ResNet for Thai text classification, if you will). What do you think is the best way to do this?
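
Here is the sketch mentioned in the first bullet: a minimal, illustrative version of the frequency cutoff in plain Python rather than the actual notebook code. The unknown-token symbol and the toy corpus are placeholders.

```python
from collections import Counter

UNK = "_unk_"  # placeholder unknown token; the notebook may use a different symbol

def replace_rare(tokens, min_freq=50):
    """Keep words seen at least `min_freq` times; map everything else to the unknown token."""
    freq = Counter(tokens)
    vocab = {w for w, c in freq.items() if c >= min_freq}
    return [w if w in vocab else UNK for w in tokens]

# Toy corpus standing in for the segmented Wikipedia dump.
tokens = ["กิน", "ข้าว", "กิน", "ข้าว", "rareword"]
print(replace_rare(tokens, min_freq=2))  # ['กิน', 'ข้าว', 'กิน', 'ข้าว', '_unk_']
```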

Thank you!

10 Likes

This sounds like a pretty interesting project to me :slight_smile:

I’ve usually tried to set the min count a bit lower than that, but I haven’t seen any strong rules about where the values should be.

I’d leave the extra symbols in if you think they carry meaning; a question mark in English can really change the tone of a sentence (you’re talking to me / you’re talking to me?).

I think your project sounds more like word2vec than ResNet :slight_smile: It would probably be easiest to save the encoder and share that file around; anyone who’s trying to build a language model is going to be pretty cluey.

Thanks for the comment!

Yes, you’re right. I’m thinking about saving the vectors to a commonly used format like fastText’s .vec, so people can take them and drop them into their own models. On the other hand, I also want to provide a quick-and-dirty way to do Thai text classification by letting users replace the last layer, like transfer learning for images.
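
For the .vec route, a minimal sketch of the fastText text format (a header line with the count and dimension, then one word per line followed by its vector). `itos` and `emb` are placeholders for the model’s vocab list and embedding matrix:

```python
import numpy as np

def save_vec(path, itos, emb):
    """Write embeddings in fastText .vec text format: header line, then `word v1 v2 ...`."""
    n_words, dim = emb.shape
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"{n_words} {dim}\n")
        for word, vec in zip(itos, emb):
            f.write(word + " " + " ".join(f"{x:.5f}" for x in vec) + "\n")

# Placeholders; in practice itos/emb would come from the trained language model.
itos = ["กิน", "ข้าว"]
emb = np.random.randn(2, 300).astype("float32")
save_vec("thai2vec.vec", itos, emb)
```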

1 Like

Most of the language models that I’ve seen are LSTMs, so retraining for transfer learning isn’t quite as simple as just replacing the last layer - but it’s definitely possible. Maybe you could publish a working notebook that someone could edit, add their own data to, and get a good result from?
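
For illustration, a rough PyTorch sketch of what reusing the encoder with a new classification head could look like. All layer names and sizes are placeholders, not the fastai code itself:

```python
import torch.nn as nn

class LMClassifier(nn.Module):
    """Pretrained embedding + LSTM encoder with a freshly initialised classification head."""
    def __init__(self, emb, lstm, n_hidden, n_classes):
        super().__init__()
        self.emb = emb                              # taken from the language model
        self.lstm = lstm                            # taken from the language model
        self.head = nn.Linear(n_hidden, n_classes)  # the new "last layer"

    def forward(self, x):                 # x: (batch, seq_len) of token ids
        out, _ = self.lstm(self.emb(x))   # out: (batch, seq_len, n_hidden)
        return self.head(out[:, -1])      # classify from the final time step

# Placeholder encoder standing in for the pretrained weights.
emb = nn.Embedding(50000, 300)
lstm = nn.LSTM(300, 500, batch_first=True)
model = LMClassifier(emb, lstm, n_hidden=500, n_classes=4)
for p in list(model.emb.parameters()) + list(model.lstm.parameters()):
    p.requires_grad = False               # freeze the encoder to start with
```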

1 Like

Done! https://github.com/cstorm125/thai2vec

5 Likes

Quite impressive work…! :grinning:

1 Like

Just found @jeremy 's paper on fine-tuning language models for classification. Maybe I’ll try to get a junior version of that done.

1 Like

Thanks bruh

I’ve found a vocab size of about 30,000 is a good compromise between performance and utility. My guess is that a min_freq of 50 may be too high - see how large your vocab is as a result.
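
A quick way to check is to sweep a few thresholds over the token counts; an illustrative snippet, where the tiny token list stands in for the real corpus:

```python
from collections import Counter

# `tokens` stands in for the full tokenised corpus (tiny placeholder list here).
tokens = ["กิน", "ข้าว", "กิน", "น้ำ"]
freq = Counter(tokens)
for min_freq in (5, 10, 25, 50):
    vocab_size = sum(1 for c in freq.values() if c >= min_freq)
    print(f"min_freq={min_freq:>3}  vocab size={vocab_size:,}")
```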

Symbols can be an important part of language. I leave them in.

Providing a pre-trained model would be great! Perhaps GitHub Large File Storage (LFS)? Or use Dreamhost? I can put it on our Dreamhost account if you’re interested.

1 Like

Hi Jeremy. Thank you for your reply! I’ve managed to get min_count = 10 with a smaller batch size. I eyeballed the words with frequency lower than that and found them to be very random “words” miscut by the segmentation algorithm (thaiwordsareusuallywrittenwithoutspaces), so I think 10 is reasonable. This gave me about 50k embeddings.
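
For anyone unfamiliar with the segmentation step: Thai has no spaces between words, so a tokenizer such as pyThaiNLP’s word_tokenize has to guess the boundaries. A quick sketch, where the example sentence and default engine are just for illustration and not necessarily what was used on the Wikipedia dump:

```python
# pip install pythainlp
from pythainlp.tokenize import word_tokenize

text = "ภาษาไทยเขียนติดกันโดยไม่มีเว้นวรรค"  # "Thai is written without spaces"
print(word_tokenize(text))  # list of word strings; boundary mistakes give the "miscut" tokens above
```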

I think I’ve saved both the models and the embeddings with LFS in this repo: https://github.com/cstorm125/thai2vec. The next step is the “transfer learning” from this model to text classification, as shown with the sentiment analysis in the Lesson 4 notebook.

3 Likes

Updates: the “transfer learning” approach was a success! I think I’ve probably achieved state of the art in Thai text classification at 94.4% on the most common open-source dataset available. I say “probably” because I couldn’t find any previous literature, but Facebook’s fastText only gets 65% (it’s a four-label classification problem). The most surprising thing is that I barely changed anything from the Lesson 4 notebook!

We follow Howard and Ruder’s (2018) approach of fine-tuning a language model for text classification. The language model is the one previously trained, the fast.ai version of the AWD-LSTM language model. The dataset is NECTEC’s BEST corpus, whose texts are labeled as article, encyclopedia, news, or novel. We preprocessed the text to remove the segmentation tokens and used an 80/20 split for training and validation, which resulted in 119,241 sentences in the training set and 29,250 in the validation set. We achieved 94.4% accuracy on the four-label classification with the fine-tuned model, compared to 65.2% for fastText using their own pretrained embeddings.
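
For reference, the Howard and Ruder recipe relies on discriminative learning rates (smaller for earlier layers) and gradual unfreezing. A hedged PyTorch sketch with placeholder layer names and made-up ratios, not the actual fastai implementation:

```python
import torch
import torch.nn as nn

# Placeholder model with the same three parts as the classifier sketch earlier in the thread.
model = nn.ModuleDict({
    "emb":  nn.Embedding(50000, 300),
    "lstm": nn.LSTM(300, 500, batch_first=True),
    "head": nn.Linear(500, 4),
})

base_lr = 1e-3
optimizer = torch.optim.Adam([
    {"params": model["emb"].parameters(),  "lr": base_lr / 9},  # earliest layer: smallest lr
    {"params": model["lstm"].parameters(), "lr": base_lr / 3},
    {"params": model["head"].parameters(), "lr": base_lr},      # new head: largest lr
])

# Gradual unfreezing: start with only the head trainable, unfreeze one more group per epoch.
for name in ("emb", "lstm"):
    for p in model[name].parameters():
        p.requires_grad = False
```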

8 Likes

My question now, if anyone’s still following this thread: how do I package the model nicely into a library (I’m contributing to an open-source one called pyThaiNLP) so that people can use it without having to download all the PyTorch dependencies?

2 Likes

Exciting news! How about making it a conda package and adding an environment.yml file like we do?

2 Likes

I will read up on that. Thanks!

@charin, exciting work, thanks for sharing!

I want to create a dataset of language text from materials science papers to see if we can predict the authors’ core training/expertise, irrespective of their formal affiliation. For example, a chemist could think/write like a mathematician, etc. It would make for a better recruiting algorithm.

My assumption is that these characteristics of language cannot be learnt from the IMDb or Wikipedia datasets. Is that right?

Do you have any pointers on how I can start creating such a dataset? The easiest source is PDF files; can the text from PDF files be copied to create the dataset, or does it have to be HTML files?

Yes, I used this website before. It is very easy to use without the need for installation.

Apache Tika should work well for a large volume of documents in multiple formats, but I haven’t tried it yet.
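
From its docs, the Python wrapper looks roughly like this; untested on my side, and it assumes the `tika` package’s `parser.from_file` interface plus a Java runtime:

```python
# pip install tika   (needs a Java runtime; the package starts a Tika server behind the scenes)
from tika import parser

parsed = parser.from_file("paper.pdf")  # also handles HTML, DOCX, etc.
text = parsed.get("content") or ""      # extracted plain text (can be None if extraction fails)
with open("paper.txt", "w", encoding="utf-8") as f:
    f.write(text.strip())
```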

Thanks @Moody! Apache Tika seems more like what I need for batch-processing PDF files. If I can access HTML versions of the papers, do you know of a way to convert those into the dataset format?
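
For the HTML case, one generic option would be to strip the tags with BeautifulSoup and save plain text; a sketch, not tailored to any particular publisher’s markup, with placeholder directory names:

```python
# pip install beautifulsoup4
from pathlib import Path
from bs4 import BeautifulSoup

Path("papers_txt").mkdir(exist_ok=True)
for html_path in Path("papers_html").glob("*.html"):
    soup = BeautifulSoup(html_path.read_text(encoding="utf-8"), "html.parser")
    for tag in soup(["script", "style"]):  # drop non-content elements
        tag.decompose()
    text = soup.get_text(separator="\n")
    Path("papers_txt", html_path.stem + ".txt").write_text(text, encoding="utf-8")
```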

@Moody I also found this NLTK-based solution to create a corpus from *.txt files: https://stackoverflow.com/questions/4951751/creating-a-new-corpus-with-nltk

I am guessing this is basically a way to format the txt from multiple files and put them in one corpus? Also, I don’t see too many people talking about using NLTK - is there a reason?
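
For reference, the linked solution boils down to pointing NLTK’s PlaintextCorpusReader at a directory of .txt files. A minimal sketch, where the directory name is a placeholder and the word/sentence methods need the “punkt” tokenizer data:

```python
import nltk
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

nltk.download("punkt")  # tokenizer data used by .words() and .sents()

corpus = PlaintextCorpusReader("papers_txt", r".*\.txt")  # directory of .txt files
print(corpus.fileids())      # which files it picked up
print(len(corpus.words()))   # all tokens across the corpus
print(corpus.sents()[:2])    # first couple of sentences
```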

I am going to give this solution a try:

I think we need to separate the problem into two parts:

  • Language modeling: for research papers, I think you can definitely train this on Wikipedia.
  • Text classification: as I’ve done with Thai and Jeremy has done with sentiment analysis, you can do transfer learning from the language model you’ve trained, IF you already have a dataset of papers labeled with each author’s expertise.

I think the most difficult part would be the labeling of the papers.

Woah! This is pretty neat. I might have some future questions for you. :clap: