How to access numericalized tokens from TextLMDataBunch

I want to be able to access the numericalized tokens from TextLMDataBunch. It used to be as simple as:

data_lm = TextLMDataBunch.from_csv(path=PATH, csv_name='train.csv', test='test.csv')
numericalized_tokens=[data_lm.train_ds[i][0] for i in range(len(data_lm.train_ds))]

Much appreciated if anyone can help.


Same as before, but with .data appended, i.e. data_lm.train_ds[i][0].data

Thanks much!


As JH said, in tokenization each piece of text with spaces around it is represented as a token, and rare words are replaced with special tokens.
During numericalization, we collect all of the unique tokens that appear into one big list, called the vocabulary, and then replace each token with its ID.

  1. Is there a way to get a dictionary of {token : id}, mapping each token to its numerical value? Or is it already there in …? If so, how do I access it?
  2. How are the tokens arranged in the big list? Is it based on number of occurrences?
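
To make the two questions above concrete, here is a minimal pure-Python sketch of frequency-ordered numericalization. It is not fastai's actual code: `build_vocab`, `numericalize`, the `UNK` constant, and the `max_vocab` / `min_freq` defaults are illustrative names and values chosen for this example. As far as I recall, fastai v1 keeps the equivalent mappings on the vocab object as `data_lm.vocab.itos` (id to token) and `data_lm.vocab.stoi` (token to id), but please check that against your installed version.

```python
from collections import Counter

UNK = "xxunk"  # stand-in for rare/unknown tokens (name assumed for this sketch)

def build_vocab(tokenized_docs, max_vocab=60000, min_freq=2):
    """Return (itos, stoi): an id->token list and a token->id dict."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    # Most frequent tokens come first; tokens seen fewer than
    # min_freq times are left out and later map to UNK.
    frequent = [tok for tok, n in counts.most_common(max_vocab)
                if n >= min_freq and tok != UNK]
    itos = [UNK] + frequent
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(doc, stoi):
    """Replace each token with its id, falling back to the UNK id."""
    return [stoi.get(tok, stoi[UNK]) for tok in doc]

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
itos, stoi = build_vocab(docs, min_freq=2)
ids = [numericalize(d, stoi) for d in docs]
decoded = [[itos[i] for i in row] for row in ids]
```

In this sketch the ID of a token is simply its position in `itos`, so the most frequent token gets the lowest ID after the special token, and `stoi` is the {token : id} dictionary asked about. Decoding with `itos` shows which rare tokens were replaced by `xxunk`.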