Hindi language model from scratch

I want to train a Hindi language model from scratch,
but I get an unexpected result when creating the vocab.

Here txt is a Hindi text file of around 8 MB.

tkn = Tokenizer(spacy)
print(coll_repr(tkn(txt), 31))

Output:
(#765035) ['xxbos','रुरुक','(','अयोध्या','के','सूर्यवंशी','राजा',')','\n','अयोध्या','के','सूर्यवंशी','राजा','।','\n','वृक','(','अयोध्या','के','सूर्यवंशी','राजा',')','\n','अयोध्या','के','सूर्यवंशी','राजा','।','\n','बाहु','('...]

num = Numericalize()
num.setup(toks)
coll_repr(num.vocab,20)
(#264) ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj','ा','र','क','्','े','ि','त','स','न','ह','ं'...]

Why does the vocab have only 264 entries?
Why does it contain single characters instead of whole words?
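
One guess, not confirmed against the fastai source: if the vocab is built by counting every item inside every element of the collection it receives, then passing one flat list of tokens would make each token string be walked character by character. The plain-Python sketch below only illustrates that assumed behavior; count_items is a hypothetical helper, not a fastai function, and the token list is a made-up sample.

```python
from collections import Counter

def count_items(docs):
    # Hypothetical stand-in for the assumed counting step:
    # treat `docs` as a collection of token lists and count each inner item.
    return Counter(item for doc in docs for item in doc)

# Made-up sample: one tokenized document as a flat list of strings
tokens = ["xxbos", "अयोध्या", "के", "राजा", "के"]

# Flat list: each "doc" is a string, so its characters get counted
flat_counts = count_items(tokens)

# List of token lists: whole tokens get counted
nested_counts = count_items([tokens])

print(sorted(flat_counts, key=flat_counts.get, reverse=True)[:5])
print(nested_counts.most_common(3))
```

If that guess is right, calling setup on a list of tokenized documents rather than on a single token list should give a word-level vocab.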