Why is creating a DataBunch for a language model such a relatively slow process? For example, it takes a few minutes for the 100K reviews of IMDB.
Isn't it supposed to be a simple linear process: tokenize the text, build a vocabulary of all unique words (up to some predefined size), and then convert the tokens in the text to their numeric IDs? A sketch of what I mean is below.
Shouldn't this process take just a few seconds instead of a few minutes?
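For reference, here is a minimal sketch of the linear pipeline I have in mind. The helper names (`tokenize`, `build_vocab`, `numericalize`) are my own illustration, not the fastai API, and the whitespace tokenizer is deliberately naive; fastai's real tokenizer (spaCy-based, with special tokens and preprocessing rules) does far more work per document, which is presumably part of the cost:

```python
from collections import Counter

def tokenize(text):
    # Naive whitespace tokenizer; real tokenizers do much more per document.
    return text.lower().split()

def build_vocab(token_lists, max_size=60000):
    # Count token frequencies and keep only the most common ones.
    counts = Counter(tok for toks in token_lists for tok in toks)
    itos = ["<unk>"] + [tok for tok, _ in counts.most_common(max_size - 1)]
    return {tok: i for i, tok in enumerate(itos)}

def numericalize(tokens, stoi):
    # Map each token to its vocabulary index; 0 (<unk>) if unseen.
    return [stoi.get(tok, 0) for tok in tokens]

texts = ["This movie was great", "This movie was terrible"]
token_lists = [tokenize(t) for t in texts]
stoi = build_vocab(token_lists)
ids = [numericalize(toks, stoi) for toks in token_lists]
print(ids)
```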
@sgugger I saw your recent tweet. So the tokenization process will now become much faster?
Yes, that’s what I observed at least.
Should I update to the new version of the fastai library to see the improvement?
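Assuming you installed with pip, a typical way to pull in the latest release would be (conda users would use the conda channel instead):

```bash
pip install fastai --upgrade
```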