I am trying to implement the code in course-v3/nbs/dl1/lesson3-imdb.ipynb and ran into a problem that makes the vocab list saved as itos.pkl inconsistent. Running TextDataBunch.from_csv on the IMDB sample .csv dataset seems to return a vocabulary of a different size each time. Is this just the vocab of a particular batch? If so, why is it saved as itos.pkl, and how can I generate the correct vocab list for the whole corpus?
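
    from fastai.text import *  # fastai v1 import, as used in the lesson notebook

    path = untar_data(URLs.IMDB_SAMPLE)
    path.ls()
    # Build two data bunches from the same CSV with identical arguments
    data_lm = TextDataBunch.from_csv(path, 'texts.csv')
    data_lm2 = TextDataBunch.from_csv(path, 'texts.csv')
    # Compare the sizes of the two vocab lists
    print(len(data_lm.vocab.itos), len(data_lm2.vocab.itos))
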
Notice that the two vocab sizes are different.
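To see which tokens account for the difference, a comparison along these lines might help. This is only a minimal sketch: `vocab_diff` is a hypothetical helper of my own, not part of fastai, and it assumes nothing more than that `vocab.itos` is a plain list of token strings. The toy lists below stand in for the real vocab lists.

```python
def vocab_diff(itos_a, itos_b):
    """Return the tokens unique to each vocab list, sorted for stable output."""
    set_a, set_b = set(itos_a), set(itos_b)
    only_a = sorted(set_a - set_b)
    only_b = sorted(set_b - set_a)
    return only_a, only_b


# Toy stand-ins for data_lm.vocab.itos and data_lm2.vocab.itos (made-up tokens)
itos_a = ["xxunk", "xxpad", "the", "movie", "great"]
itos_b = ["xxunk", "xxpad", "the", "movie", "terrible", "plot"]

only_a, only_b = vocab_diff(itos_a, itos_b)
print(len(itos_a), len(itos_b))
print("only in first:", only_a)
print("only in second:", only_b)
```

With the real objects, the call would be `vocab_diff(data_lm.vocab.itos, data_lm2.vocab.itos)`.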