Two-step language model training

If my understanding is correct, the training of a language model may be divided into two parts: an unsupervised phase, in which the embeddings are trained, and a second phase, in which the weights of the neural network (an AWD-LSTM in the case of ULMFiT) are learned. I suppose it could be a good idea to fit the embeddings on one corpus and then train the network on another corpus, making the two processes independent. Is it possible to implement this somehow?

If I understand your question, this process is shown in Part 1, Lesson 4.

Haven’t found it there, though I inspected the whole thread :frowning:

Why do you want to make these processes independent? If you read the ULMFiT paper, the approach achieved SOTA results precisely because the language model is first trained on all available text and then fine-tuned on the target corpus before classification; that is how the authors showed transfer learning could be applied to text.

I will describe the real case: I would like to train a Russian ULMFiT to get a SOTA Russian LM. I have trained it on 17 million tweets, and it often follows the grammar and predicts valid results, but when I try to additionally train it on a Wikipedia corpus, it does not progress at all (I have tried many different learning rates, and they were not the issue). Moreover, since Twitter lexicon is often non-standard and contains many jargon words, the maximum vocabulary (60k) fills up, and many valid words that occur in the wiki corpus do not get their own embeddings. So, to make the model more adequate, I would like to train the word embeddings (unsupervised learning) on the wiki corpus to set the vocabulary correctly, and then learn the weights of the LM on Twitter. I have been trying to do that for a week but did not succeed, hence I'm asking for help :slight_smile:
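One standard way to carry embeddings trained with one vocabulary over to a model with a different vocabulary is to copy the rows for shared tokens and initialize the rest from the mean embedding; this is essentially what the ULMFiT fine-tuning setup does when switching corpora. A minimal sketch in plain Python (no fastai; the helper name `remap_embeddings` is made up for illustration):

```python
# Sketch: transfer pretrained embeddings to a new vocabulary.
# Tokens shared between the two vocabularies keep their trained vectors;
# tokens new to the target vocabulary start from the mean of all
# pretrained vectors, a common warm-start heuristic.

def remap_embeddings(pretrained, old_vocab, new_vocab):
    """pretrained: list of vectors aligned with old_vocab (list of tokens)."""
    old_index = {tok: i for i, tok in enumerate(old_vocab)}
    dim = len(pretrained[0])
    # Mean vector used as the initializer for tokens absent from old_vocab.
    mean = [sum(vec[d] for vec in pretrained) / len(pretrained) for d in range(dim)]
    return [
        list(pretrained[old_index[tok]]) if tok in old_index else list(mean)
        for tok in new_vocab
    ]

# Tiny usage example with 2-dimensional "embeddings".
old_vocab = ["привет", "мир"]
pretrained = [[1.0, 0.0], [0.0, 1.0]]
new_vocab = ["мир", "энциклопедия"]  # one shared token, one new token
print(remap_embeddings(pretrained, old_vocab, new_vocab))
# [[0.0, 1.0], [0.5, 0.5]]
```

In a real run you would build the new vocabulary from the wiki corpus, remap the tweet-trained embedding matrix this way, and continue training from there.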

My understanding is that you should first train the language model on a broader domain, in your case general Russian. Once the model "speaks" Russian, you fine-tune it on the domain task, in your case Russian tweets. Now you have a model that speaks the language of Russian tweets, and you can start the classification.
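The two-stage flow above (broad pretraining, then domain fine-tuning) can be illustrated with a toy count-based language model. This is only a sketch; the class name `UnigramLM` and the mixing-weight idea are inventions for illustration, standing in for a real AWD-LSTM fine-tuned with a small learning rate:

```python
from collections import Counter

# Toy illustration of pretrain-then-fine-tune: fit a unigram "LM" on a
# broad general corpus, then fine-tune on a smaller domain corpus by
# mixing upweighted domain counts into the general ones (the weight
# plays the role of a fine-tuning learning rate).

class UnigramLM:
    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def pretrain(self, corpus):
        for tok in corpus:
            self.counts[tok] += 1
            self.total += 1

    def finetune(self, corpus, weight=5):
        # Domain tokens are upweighted so a small domain corpus can
        # actually shift the distribution.
        for tok in corpus:
            self.counts[tok] += weight
            self.total += weight

    def prob(self, tok):
        return self.counts[tok] / self.total

general = ["я", "читаю", "книгу", "я", "читаю"]
domain = ["лол", "читаю"]

lm = UnigramLM()
lm.pretrain(general)
lm.finetune(domain)
print(lm.prob("читаю") > lm.prob("лол"))  # True: general knowledge retained
print(lm.prob("лол") > 0)                 # True: domain word now covered
```

The point is that fine-tuning adapts the general model rather than replacing it, which is why the order (general corpus first, tweets second) matters.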

Also, you can change the max vocab size. Based on the tests in the NLP challenge post, you could try the SentencePiece tokenizer to get a subword structure, which is currently reported to work better than traditional space-based tokenization.
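SentencePiece itself is a library, but the subword idea behind it can be illustrated with a miniature byte-pair-encoding pass. This is a pure-Python sketch (the helper name `learn_merges` is invented; real SentencePiece additionally handles whitespace markers and can use a learned unigram model):

```python
from collections import Counter

# Miniature BPE: repeatedly find the most frequent adjacent symbol pair
# and merge it into one token. With subword units, a rare word can be
# represented as pieces of known tokens instead of falling out of a
# capped word-level vocabulary.

def learn_merges(words, num_merges):
    """words: list of strings; returns (learned merges, segmented words)."""
    seqs = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        for i, seq in enumerate(seqs):
            out, j = [], 0
            while j < len(seq):
                if j + 1 < len(seq) and (seq[j], seq[j + 1]) == best:
                    out.append(merged)
                    j += 2
                else:
                    out.append(seq[j])
                    j += 1
            seqs[i] = out
    return merges, seqs

words = ["low", "low", "lower", "lowest"]
merges, pieces = learn_merges(words, 2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
print(pieces)  # 'lowest' becomes ['low', 'e', 's', 't']
```

With subwords, your 60k budget goes to frequent pieces rather than whole words, so wiki-only words still get representations built from known fragments.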

Finally, train a backward language model; ensembling the two will also give you a boost in accuracy.
