Language Modelling for under-resourced languages

Hi there,

I’ve downloaded Wikipedia dumps for some of the African languages (Zulu, Afrikaans and Sotho). After tokenization I have fewer than 300K unique tokens for all languages except Afrikaans, which has a little over 1M unique tokens. I’ve read in the forums that we should cap the number of tokens at 100M. Is there a lower limit too?
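
In case it helps anyone compare numbers, here is a minimal sketch of how total and unique token counts can be computed. It assumes plain whitespace tokenization with lowercasing, which is a simplification and not necessarily the tokenizer behind the figures above. The function name `token_stats` and the sample sentences are purely illustrative.

```python
from collections import Counter


def token_stats(lines):
    """Return (total_tokens, unique_tokens) for an iterable of text lines.

    Assumption: tokens are lowercased, whitespace-separated strings.
    A real pipeline would likely use a proper tokenizer instead.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return sum(counts.values()), len(counts)


# Illustrative sample only, not taken from the actual dumps.
sample = ["Die kat sit op die mat", "Die hond slaap"]
total, unique = token_stats(sample)
print(f"total tokens: {total}, unique tokens: {unique}")
```

Reporting both numbers is useful because a cap such as 100M usually refers to the total token count, while the unique count is the vocabulary size.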

I’m in the process of building language models for these languages, and was wondering what sort of success anyone else has had building language models with this few tokens. What sort of things can I look at to improve the results?

I’ll post some results as I get them, but am keen to hear from anyone who has faced similar issues.

Thanks a lot.