On African Languages and SentencePiece
(Since Africa has 2,000+ languages, when I say "African" I mostly mean Bantu languages, and by Bantu languages I'm really thinking of the Nguni languages of Southern Africa, in particular the ones with clicks.)
Since there aren't many NLP resources for most African languages, I've been using SentencePiece and a lot of web scraping for my chosen language, Xhosa. If I can achieve good results I'd like to publish at least a blog post on dealing with "minority" languages that don't currently have language models, or don't have good ones.
Integrating SentencePiece into FastAI
So far this has been hit or miss for me. I'm able to tokenize the words (essentially what spaCy is doing). SentencePiece produces two files after training on a corpus of text: a "model" file and a "vocab" file. The model file is what SentencePiece uses to tokenize text into a list of IDs, and the "vocab" file seems to be a list of tokens with a float weight for each.
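For reference, this is roughly what my training and tokenizing step looks like (the file names, vocab size and sample sentence are just placeholders, not my actual settings):

```python
import sentencepiece as spm

# Train on the scraped Xhosa corpus; this writes xhosa.model and xhosa.vocab
spm.SentencePieceTrainer.Train(
    '--input=xhosa_corpus.txt --model_prefix=xhosa '
    '--vocab_size=30000 --model_type=unigram'
)

# Load the trained model and turn raw text into IDs / subword pieces
sp = spm.SentencePieceProcessor()
sp.Load('xhosa.model')
ids = sp.EncodeAsIds('Molo, unjani?')        # list of ints
pieces = sp.EncodeAsPieces('Molo, unjani?')  # the subword tokens themselves
```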
I'm not sure how to take my "vocab" file (the list of tokens and associated weights) and pass it into FastAi.text in place of the expected "h5" file; for example, FastAi.text uses `PRE_LM_PATH = PRE_PATH/'fwd_wt103.h5'`. I'm unsure whether I can plug in my vocab file with minimal changes to it. (I intend to look at the FastAi.text library to see how it uses the weights, and maybe that will provide insight.) However, if anyone using SentencePiece has gone through this process already (cc @jeremy), I'd certainly appreciate your direction.
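My current understanding (which could be off) is that fwd_wt103.h5 holds the pretrained WikiText-103 language model weights rather than a vocabulary, so the SentencePiece vocab file wouldn't be a drop-in replacement for it; for Xhosa I'd be training the LM weights from scratch, and the model/vocab files would only supply the token-to-ID mapping. A minimal sketch of how I think I'd build the itos/stoi mapping the fastai notebooks expect (the pickle file name is just a placeholder):

```python
import pickle
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load('xhosa.model')

# itos: index -> subword piece, in the same order SentencePiece assigns its IDs
itos = [sp.IdToPiece(i) for i in range(sp.GetPieceSize())]
stoi = {piece: i for i, piece in enumerate(itos)}

# saved in the same shape the course notebook uses for its itos pickle
pickle.dump(itos, open('itos_xhosa.pkl', 'wb'))
```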
Memory with SentencePiece and FastAI
Training SentencePiece on a document of about ~500,000 words was fast and used very little memory: less than a minute on my laptop (I know that isn't very specific). However, when I tried to train my language model with `learner.fit(lrs/2, 1, wds=wd, use_clr=(32,2), cycle_len=1)` on the same document, I get out-of-memory errors on my laptop and also on an AWS p2.xlarge.
I tried cutting my batch size (bs) in half, which didn't help. I'm not sure whether I should keep the embedding size, hidden size and number of layers as they are (`em_sz,nh,nl = 400,1150,3`), or whether those affect memory. At any rate, I feel I'm well beyond my expertise here.
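To get some intuition about whether those hyperparameters matter, here's a back-of-the-envelope parameter count, assuming fastai's language model is the usual 3-layer AWD-LSTM with tied embeddings (the vocab size is a placeholder for whatever I give SentencePiece):

```python
# rough parameter count for a 3-layer LSTM LM with tied input/output embeddings
vs = 30000                       # placeholder SentencePiece vocab size
em_sz, nh, nl = 400, 1150, 3

def lstm_params(n_in, n_out):
    # 4 gates, each with an input weight matrix, a recurrent weight matrix and a bias
    return 4 * (n_in * n_out + n_out * n_out + n_out)

embedding = vs * em_sz                      # shared with the output decoder when tied
lstms = (lstm_params(em_sz, nh)             # layer 1: em_sz -> nh
         + lstm_params(nh, nh)              # layer 2: nh -> nh
         + lstm_params(nh, em_sz))          # layer 3: nh -> em_sz

print(f'~{(embedding + lstms) / 1e6:.1f}M parameters')   # ~32M with these settings
```

So em_sz, nh and the vocab size do drive model size, but my guess is the out-of-memory errors are mostly about activations, which scale with bs × bptt, so lowering bptt alongside bs is probably the next thing I'll try.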
At any rate, if I get this to work, I want to start looking into speech-to-text. I'd like to crowdsource audio recordings of sentences from my corpus (which I scraped from Wikipedia, Twitter, the Xhosa Bible and online Xhosa books) from hundreds of Xhosa speakers… I have a friend from Senegal whose mother only speaks Wolof and therefore can't use voice commands on her phone. I think tech can be localized/diversified much further if more people could use it easily in the languages they are most comfortable and familiar with.