TextDataBunch returning inconsistent vocab lengths

I am trying to implement the code in course-v3/nbs/dl1/lesson3-imdb.ipynb and ran into a problem that makes the vocab list saved as itos.pkl inconsistent. Running TextDataBunch on the IMDB sample .csv returns a vocabulary of a different size each time. Is this just the vocab of a particular batch? If so, why is it saved as itos.pkl, and how can I generate the correct vocab list for the whole corpus?

To reproduce:

from fastai.text import *

path = untar_data(URLs.IMDB_SAMPLE)
path.ls()
data_lm = TextDataBunch.from_csv(path, 'texts.csv')
data_lm2 = TextDataBunch.from_csv(path, 'texts.csv')
print(len(data_lm.vocab.itos), len(data_lm2.vocab.itos))

Notice that the two vocab sizes are different.


I thought this happened because TextDataBunch randomly splits the data into train and validation sets, so you get a different random validation sample each time. However, the problem also occurs with TextLMDataBunch, which shouldn't need a held-out split for this (I could be wrong here).
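If a random split is the culprit, the effect is easy to reproduce without fastai. Here is a minimal pure-Python sketch (toy data and a hypothetical build_vocab helper, not fastai's actual tokenizer) showing that a vocabulary with a min-frequency cutoff, built from a random 80/20 train split, can change size from run to run, while fixing the RNG seed makes it reproducible:

```python
import random
from collections import Counter

def build_vocab(texts, min_freq=2):
    # Keep only tokens seen at least min_freq times (a common cutoff);
    # rare tokens near the threshold are sensitive to the split.
    counts = Counter(tok for t in texts for tok in t.split())
    return sorted(tok for tok, c in counts.items() if c >= min_freq)

# Toy corpus: each "rare{i}" token appears in exactly two documents,
# so it survives min_freq=2 only if *both* copies land in the train split.
corpus = [f"the movie was rare{i}" for i in range(50) for _ in range(2)]

def vocab_size(seed=None):
    rng = random.Random(seed)
    train = rng.sample(corpus, 80)  # random 80/20 split of 100 docs
    return len(build_vocab(train))

# Unseeded runs usually produce several different sizes...
print({vocab_size() for _ in range(20)})
# ...while a fixed seed gives the same vocab every time:
print(vocab_size(seed=42) == vocab_size(seed=42))
```

The same logic suggests a workaround in the notebook: fix the random seed before building the DataBunch so the train split, and hence the saved itos.pkl, is reproducible.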
