How to access numericalized tokens from TextLMDataBunch

DavidBressler · November 27, 2018, 10:03pm

I want to be able to access the numericalized tokens from TextLMDataBunch. It used to be as simple as:

data_lm = TextLMDataBunch.from_csv(path=PATH, csv_name='train.csv', test='test.csv')
numericalized_tokens = [data_lm.train_ds[i][0] for i in range(len(data_lm.train_ds))]

Much appreciated if anyone can help.

Thanks

sgugger · November 27, 2018, 10:55pm

Same, but with a little .data after data_lm.train_ds[i][0]

DavidBressler · November 27, 2018, 11:40pm

Thanks much!

jbo · July 17, 2019, 4:32am

Hi,

As JH said, in tokenization each thing with spaces around it is represented as a token, and rare words are then replaced with special tokens.
During numericalization, we take all of the unique tokens that appear and build a big list of them. This list of unique possible tokens is called the vocabulary. We then replace each token with its ID.
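
To make that concrete, here is a minimal plain-Python sketch of the idea. It is not fastai's actual implementation: the names `build_vocab` and `numericalize`, the `min_freq` and `max_vocab` parameters, and the frequency-based ordering are all assumptions made for illustration, and `"xxunk"` is used only as a placeholder for the special "unknown" token.

```python
from collections import Counter

def build_vocab(tokenized_texts, max_vocab=60000, min_freq=2, unk="xxunk"):
    """Hypothetical helper: build the id->token list and token->id dict."""
    # Count how often each token occurs across all texts.
    counts = Counter(tok for text in tokenized_texts for tok in text)
    # Keep the most frequent tokens; rare ones fall back to the unknown token.
    itos = [unk] + [tok for tok, c in counts.most_common(max_vocab)
                    if c >= min_freq and tok != unk]
    # Invert the list to get the {token: id} dictionary.
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi, unk="xxunk"):
    """Hypothetical helper: replace each token with its ID."""
    unk_id = stoi[unk]
    return [stoi.get(tok, unk_id) for tok in tokens]

# Toy corpus that is already tokenized (an assumption of this sketch).
texts = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "end"]]
itos, stoi = build_vocab(texts, min_freq=2)
ids = numericalize(["the", "cat", "sat"], stoi)
print(itos)  # ['xxunk', 'the', 'sat']
print(ids)   # [1, 0, 2]
```

In this sketch "cat" occurs only once, so it is below `min_freq` and maps to the unknown token's ID, 0.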

Is there a way to create a dictionary mapping each token to its numerical value? Or is it already there in fast.ai, and if so, how do I access it?

So, is there a way to get the dictionary of {token: id}?
How are these tokens arranged in the big list? Is it based on number of occurrences?

Thanks