I want to train a Hindi language model,
but I run into a problem when creating the vocabulary.
Here txt is the content of a Hindi text file of around 8 MB.
from fastai.text.all import *
# spacy is a word tokenizer object defined earlier (not shown)
tkn = Tokenizer(spacy)
print(coll_repr(tkn(txt), 31))
Output:
(#765035) ['xxbos','रुरुक','(','अयोध्या','के','सूर्यवंशी','राजा',')','\n','अयोध्या','के','सूर्यवंशी','राजा','।','\n','वृक','(','अयोध्या','के','सूर्यवंशी','राजा',')','\n','अयोध्या','के','सूर्यवंशी','राजा','।','\n','बाहु','('...]
num = Numericalize()
num.setup(toks)
coll_repr(num.vocab,20)
(#264) ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj','ा','र','क','्','े','ि','त','स','न','ह','ं'...]
Why does the vocab contain only 264 entries?
Why does it contain single characters instead of whole words?
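
One possible explanation, sketched in plain Python rather than with fastai itself: if the vocabulary is built by looping over each item passed to setup and then over each element of that item, a flat list of token strings would be counted character by character. The helper count_tokens below is hypothetical and only mimics that assumed counting scheme, and the short toks list is a stand-in for the tokenized text above.

```python
from collections import Counter

# Stand-in for the flat token list shown above (one tokenized document).
toks = ['xxbos', 'अयोध्या', 'के', 'सूर्यवंशी', 'राजा', '।']

def count_tokens(docs):
    # Hypothetical helper: assumes the vocab is counted by iterating over
    # each document and then over each element inside that document.
    return Counter(tok for doc in docs for tok in doc)

# Flat list: every "document" is a str, so its elements are single characters.
flat = count_tokens(toks)

# Nested list: every "document" is a list of tokens, so whole words are counted.
nested = count_tokens([toks])

print(sorted(flat))
print(sorted(nested))
```

If that assumption holds, passing a collection of tokenized documents (for example [toks], or the tokens of several texts) instead of the flat token list should give a word-level vocab. That would also account for the small size, since a text has far fewer distinct characters than distinct words.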