Language Model Zoo šŸ¦

Iā€™ll give it a try and will probably come back to the forums soon with a ton of questions! Thanks

FYI I found this recent baseline paper for the Chinese language, and the competition is open until the end of this month.

2 Likes

Thanks to Jeremy for recommending this paper for the Chinese classification problem.

1 Like

btw I speak Mandarin fluently and read/write Chinese (Simplified mostly - but also Traditional) so if you wanted any help tackling Chinese let me know! This looks like an interesting challenge.

I wonder if characters could be treated more like images for the CNN to recognize? I know when I am reading Chinese itā€™s kind of like an instant recognition of the whole character as one unique ā€œobjectā€, which is quite different from reading English or other phonetically based languages.

Most Chinese characters, especially with Traditional, have a meaning component aka the ā€œRadicalā€ and also a phonetic component. In fact, to look up a word in a Chinese dictionary the process is to first search for the radical - and then in that subset of that radical you locate the word based on the number of strokes. So Iā€™m also wondering if the characters could be broken down in some meaningful way to capture these features which are embedded in the characters.

1 Like

Iā€™ve begun to train a French Language Model on Wikipedia.fr (well, part of it). Tokenizing always made me run out of memory, so I changed Jeremyā€™s functions a bit to save the tokens for each chunk instead of concatenating them all into one big array. All the details are in this notebook if someone has the same problem.
I discovered afterward that I had 600 million tokens so be sure to listen to Jeremyā€™s advice regarding this!
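
For anyone hitting the same memory wall, the idea looks roughly like this. It is only a sketch of the approach in the notebook above, not the exact code: the csv name, the single text column, the chunk size and the output paths are all placeholders (fastai 0.7 API).

import pickle
from pathlib import Path

import pandas as pd
from fastai.text import *  # fastai 0.7: Tokenizer, partition_by_cores

out_dir = Path("tmp/tok_fr")
out_dir.mkdir(parents=True, exist_ok=True)

chunksize = 50000  # rows per chunk; placeholder value
for i, chunk in enumerate(pd.read_csv("wiki_fr.csv", header=None, chunksize=chunksize)):
    texts = chunk[0].astype(str).tolist()                   # assume one text column
    toks = Tokenizer().proc_all_mp(partition_by_cores(texts))
    with open(out_dir / f"tok_{i}.pkl", "wb") as f:          # save this chunk, then let it go
        pickle.dump(toks, f)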

Iā€™ve begun to train with high learning rates (it will go up to 8!) and Leslieā€™s 1cycle policy. First results are looking good: the first epoch gave me a validation loss of 3.98 and 27.7% accuracy (one epoch is 33K iterations, though). Iā€™ve launched a cycle of 10 epochs (might be a bit of overkill since it represents 330K iterations) and will share the results tomorrow!

7 Likes

Thatā€™s exciting - no-one has looked at this before AFAIK so thereā€™s probably a paper to be written there, if youā€™re interested (and Iā€™m happy to help if wanted). I think it would basically be a case of doing lots of experiments to show how 1cycle impacts learning rates, gradient clipping, and regularization in AWD LSTM and NLP classification. Ideally youā€™d find you can decrease the regularization quite a bit, which should help particularly where you have less data.

6 Likes

That sounds super interesting, but Iā€™m not sure I would know where to begin. Letā€™s also see the final results first; the training still has plenty of time to diverge :wink:
To get an idea of what counts as a good result, what were the validation loss (or perplexity) and accuracy of your final model trained on wiki103, the one we fine-tune from?

You may find this paper interesting. Enjoy :blush:

You really know how to appreciate Traditional Chinese characters. I was born in Hong Kong and studied ancient Chinese literature. While I am still working on ā€œmodernā€ Traditional Chinese, hopefully a future step will be to build a language model based on 四åŗ«å…Øę›ø (the Siku Quanshu) :blush:

1 Like

So I couldnā€™t resist launching another machine on the same training but with less epochs overnight. In two epochs (still 64K iterations), I got a validation loss of 3.53 (perplexity 34.28) and an accuracy of 31.5%.
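
(For anyone mapping between those two numbers: perplexity is just the exponential of the cross-entropy validation loss, so they are the same measurement on different scales.)

import math
print(math.exp(3.53))  # ~34.1; the reported 34.28 presumably comes from the unrounded loss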

To detail the steps that led me there: the model is exactly the same as in the imdb notebook, and I used the dropouts Jeremy shared:

drops = np.array([0.25, 0.1, 0.2, 0.02, 0.15])

I didnā€™t use Adam since it doesnā€™t seem to work well with high learning rates, so my line for opt_fn is:

opt_fn = partial(optim.SGD, momentum=0.9)
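
For completeness, here is roughly how those pieces plug into the model construction from the (fastai 0.7) imdb notebook. The sizes, the batch parameters and the data objects below are placeholders, so treat it as a sketch rather than my exact script:

from functools import partial

import numpy as np
from torch import optim
from fastai.text import *  # fastai 0.7: LanguageModelLoader, LanguageModelData, seq2seq_reg

em_sz, nh, nl = 400, 1150, 3   # embedding size, hidden units, LSTM layers (imdb notebook values)
bs, bptt = 64, 70              # batch size and backprop-through-time length (placeholders)

# trn_lm / val_lm are the numericalized token arrays from the tokenization step,
# PATH is the working directory and vs is the vocabulary size (len(itos)).
trn_dl = LanguageModelLoader(np.concatenate(trn_lm), bs, bptt)
val_dl = LanguageModelLoader(np.concatenate(val_lm), bs, bptt)
md = LanguageModelData(PATH, 1, vs, trn_dl, val_dl, bs=bs, bptt=bptt)

drops = np.array([0.25, 0.1, 0.2, 0.02, 0.15])
opt_fn = partial(optim.SGD, momentum=0.9)

learner = md.get_model(opt_fn, em_sz, nh, nl,
                       dropouti=drops[0], dropout=drops[1], wdrop=drops[2],
                       dropoute=drops[3], dropouth=drops[4])
learner.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)  # AR/TAR regularization, as in the notebook
learner.clip = 0.3                                      # gradient clipping (value is a placeholder)
learner.metrics = [accuracy]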

I admit I forgot the weight decay, so I just ran the lr_finder (version 2, to avoid running a full epoch since itā€™s sooooooo long) and here is the plot:
[learning rate finder plot]

10 sounded a little bit too dangerous, so I picked 8 for my maximum learning rate, then ran my favorite line:

learner.fit(lr, 1, cycle_len=2, use_clr_beta=(10,10,0.95,0.85))
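
In case that tuple looks cryptic, here is how I read the use_clr_beta arguments in fastai 0.7 (worth double-checking against the source):

learner.fit(lr, 1, cycle_len=2,
            use_clr_beta=(10,     # ratio between the peak lr and the starting lr
                          10,     # % of the cycle spent annealing the lr back down at the end
                          0.95,   # momentum while the lr is low (start and end of the cycle)
                          0.85))  # momentum while the lr is at its peak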

And here are the results for a cycle of length 2:
[training results table for the 2-epoch cycle]
Now waiting for the cycle of length 10 to be completed on my other machine, but itā€™s already at a loss of 3.48 on epoch 6, so it should be better.

Iā€™d love to test this on the wiki103 set, since there are benchmarks available. Jeremy, if you have a way to share the numericalized version of it, that would be tremendously helpful, since I wouldnā€™t have to spend the day tokenizing it.

6 Likes

My first step would be to ask Sebastian Ruder to help us :slight_smile:

I didnā€™t find a great way of comparing, because I found that, depending on dropout, I could get the same effectiveness after fine-tuning on imdb from a wide range of different perplexities and accuracies on wt103. A perplexity of anything under 50 is pretty great. Hereā€™s a table from smerityā€™s latest:

5 Likes

Letā€™s ask him then.
Not sure if you saw my earlier post since we replied at the same time, but training didnā€™t diverge and gave some pretty good results.

Again, not sure if you saw my earlier post but if you happen to have your numericalized wiki103 available it would be great for me to begin testing on this dataset.

2 Likes

For Swahili, Iā€™ve collected a set of 1573 pdfs from Tanzaniaā€™s Parliament website. I wrote a bash script to read all of them in sequence, convert the pdfs to TIFFs, and extract the raw text using tesseract. So far this has got me 101 files from the Hansard, totaling about 4M words (counted with cat * | wc -w). This is already bigger than Wikipedia, but there are OCR errors that make this estimate inaccurate, and mistaken tokens might be elided when tokenizing by using min_count=3 or something like that. The process is ongoing, and Iā€™ll be updating the dataset in multiple phases until we get to 10M.
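
In case it helps anyone reproduce this, here is roughly what that pipeline looks like rewritten in Python. The directory names, the 300 dpi setting and the swa tesseract language pack are my assumptions here, not necessarily what the original bash script used:

import subprocess
from pathlib import Path

pdf_dir, out_dir = Path("hansard_pdfs"), Path("hansard_txt")  # placeholder paths
out_dir.mkdir(exist_ok=True)

for pdf in sorted(pdf_dir.glob("*.pdf")):
    prefix = out_dir / pdf.stem
    # pdftoppm (poppler-utils) renders each page to <prefix>-<page>.tif
    subprocess.run(["pdftoppm", "-tiff", "-r", "300", str(pdf), str(prefix)], check=True)
    for tif in sorted(out_dir.glob(f"{pdf.stem}-*.tif")):
        # tesseract writes <base>.txt; -l swa selects the Swahili model
        subprocess.run(["tesseract", str(tif), str(tif.with_suffix("")), "-l", "swa"], check=True)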

7 Likes

I didnā€™t see that - quite terrific! Looks like you should be able to drop your dropout by a lot and get much better ppl, FYI.

Iā€™ve popped the numericalized file and vocab here for you: http://files.fast.ai/data/wt103_ids.tgz

3 Likes

:open_mouth: that would be amazing if you could create your LM based on 四åŗ«å…Øę›ø! Have you read all of å››å¤§åč‘— (the Four Great Classical Novels)? I used to read a lot of those books when I was in school :grin:

Also, thanks for sharing the paper on using CNNs for classifying sentiment in Chinese text, although it looks like they are just using 1D CNNs as an alternative to RNNs. I was still thinking of using RNNs, but instead of just tokenizing on Unicode characters, perhaps each unique character could be treated as an image (represented by CNN feature extraction). In the same way that video is a sequence of images, we would apply the RNN to a sequence of feature-rich characters.
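
To make the idea concrete, here is a very rough PyTorch sketch (all the sizes, the 32x32 glyph resolution and the module names are made up): a small CNN encodes each character glyph image into a feature vector, and an LSTM runs over the sequence of those features instead of learned embeddings.

import torch
import torch.nn as nn

class GlyphEncoder(nn.Module):
    """Encode a batch of 32x32 grayscale character images into feature vectors."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 32 -> 16
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 16 -> 8
        )
        self.fc = nn.Linear(64 * 8 * 8, feat_dim)

    def forward(self, x):                  # x: (batch, 1, 32, 32)
        return self.fc(self.conv(x).flatten(1))

class GlyphLM(nn.Module):
    """Next-character model over sequences of character images."""
    def __init__(self, vocab_size, feat_dim=128, hidden=256):
        super().__init__()
        self.encoder = GlyphEncoder(feat_dim)
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)   # still predict character ids

    def forward(self, imgs):               # imgs: (batch, seq_len, 1, 32, 32)
        b, t = imgs.shape[:2]
        feats = self.encoder(imgs.reshape(b * t, *imgs.shape[2:])).reshape(b, t, -1)
        out, _ = self.rnn(feats)
        return self.head(out)

# Shape check with random "glyphs": 2 sequences of 10 characters each.
logits = GlyphLM(vocab_size=5000)(torch.randn(2, 10, 1, 32, 32))
print(logits.shape)   # torch.Size([2, 10, 5000])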

1 Like

Thanks for the files!
I hadnā€™t realized that the vocab size for the Language Model wasnā€™t truncated to 60k words like in our fine-tuning (which makes sense if we want it to be as general as possible), so I canā€™t use the same batch size (128), and itā€™ll probably impact the loss since itā€™s harder to predict one word out of that many.

Speaking of which, training the French model on a longer cycle (10 epochs) with a very high learning rate allowed it to find a better minimum: the final validation loss was 3.33 (27.9 perplexity!) and the accuracy was around 33%.

In all my experiments, Leslieā€™s 1cycle policy works better when training a new model from scratch than when fine-tuning something. It can give you a decent model very fast with super-convergence, but it will also gain in accuracy if you let the cycle be longer. And one long cycle is far better than two shorter ones in a row.

2 Likes

Oh, I apologize - I didnā€™t mean to give you that one: my attempts at using a larger vocab didnā€™t result in better models, and they were much harder to train. Iā€™ve uploaded a replacement with a 30k vocab to the same place now. Thatā€™s what I used in practice.

BTW, when we donā€™t use the same vocab size as a paper weā€™re comparing to, we canā€™t compare exactly, since it impacts the difficulty of the problem. Also, IIRC French is an easier language for an LM. Iā€™ll be very interested to hear how the en model goes with similar settings - that fr model result is really encouraging.

2 Likes

Try using sentencepiece. It should give you a nearly-optimal segmentation.
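
A minimal usage sketch with the sentencepiece Python API; the corpus path, vocab size and model type below are just placeholders to tune for your language:

import sentencepiece as spm

# Train an unsupervised subword model straight from a raw text file.
spm.SentencePieceTrainer.Train(
    "--input=corpus.txt --model_prefix=sp_lm --vocab_size=30000 --model_type=unigram"
)

sp = spm.SentencePieceProcessor()
sp.Load("sp_lm.model")
print(sp.EncodeAsPieces("some raw text in the target language"))  # subword pieces
print(sp.EncodeAsIds("some raw text in the target language"))     # integer ids for the LM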

1 Like

Do you recommend using sentencepiece in a situation where one does not have a readily available tokenizer? Iā€™m building a Swahili AWD-LSTM model on Wikipedia, fine-tuned to the Hansard I posted about earlier, but spacy doesnā€™t have a model for it, and there are no publicly available tokenizers.

30k words is much easier indeed, thanks for the new file. I can now go to a batch size of 256 and even higher learning rates! :sunglasses:

2 Likes

Ndiyo! (Yes!)

4 Likes