I want to be able to access the numericalized tokens from TextLMDataBunch. It used to be as simple as:
data_lm = TextLMDataBunch.from_csv(path=PATH, csv_name='train.csv', test='test.csv')
numericalized_tokens = [data_lm.train_ds[i][0] for i in range(len(data_lm.train_ds))]
As JH said about tokenization: each thing that we've got with spaces around it is represented as a token, and then rare words are replaced with special tokens.
During numericalization, we take all of the unique tokens that appear and create a big list of them (this big list of unique possible tokens is called the vocabulary), and then we replace each token with its ID in that list.
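To make sure I understand the step above, here is a minimal sketch of it in plain Python. The function names (`build_vocab`, `numericalize`) and the special-token names are my own for illustration, not fastai's API; the sketch assumes the vocab is ordered by token frequency, most common first:

```python
from collections import Counter

def build_vocab(token_lists, max_vocab=60000, min_freq=1):
    # Count every token across all documents, then keep the most
    # common ones; the resulting list is ordered by frequency.
    freq = Counter(tok for toks in token_lists for tok in toks)
    itos = ['xxunk', 'xxpad']  # special tokens first (names assumed)
    itos += [tok for tok, c in freq.most_common(max_vocab) if c >= min_freq]
    stoi = {tok: i for i, tok in enumerate(itos)}  # the {token: id} dict
    return itos, stoi

def numericalize(tokens, stoi):
    # Unknown tokens fall back to the id of 'xxunk' (0 here).
    return [stoi.get(tok, 0) for tok in tokens]

tokenized = [['the', 'cat', 'sat'], ['the', 'dog', 'sat', 'the']]
itos, stoi = build_vocab(tokenized)
ids = numericalize(tokenized[0], stoi)  # ids back to tokens: [itos[i] for i in ids]
```

In fastai v1 itself, I believe the equivalents live on the vocab object as `data_lm.vocab.itos` (id-to-token list) and `data_lm.vocab.stoi` (token-to-id dict), but treat those attribute names as something to verify in the docs.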
Is there a way to create a dictionary mapping each token to its numerical value, or does fastai already provide one? If so, how do I access that {token: id} dictionary?
Also, how are the tokens arranged in the big list? By number of occurrences?