How to use the exact same vocab as the language model | Size mismatch error while loading the encoder

I am working on the Kaggle Amazon mobile phone review dataset.

Steps:

  1. Loaded the CSV file into pandas.

  2. Created a TextList databunch for the language model:

    data_lm = (TextList.from_df(df, path, cols=4)
               .random_split_by_pct(0.1)
               .label_for_lm()
               .databunch())

    data_lm.save("amazon_data_lm")

  3. Checked the vocab size, which was 40405 (seems random though?).

  4. Created a learner, fine-tuned it, and saved the encoder.

  5. Created a TextList databunch for classification.

  6. Created a learner, but when I tried to load the encoder I got a size mismatch error; the vocab size was also different this time (steps 4-6 are sketched right after this list):

    RuntimeError: Error(s) in loading state_dict for MultiBatchRNNCore: size mismatch for encoder.weight:
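
Here is roughly what steps 4-6 looked like. This is only a minimal sketch assuming a recent fastai v1 with the AWD_LSTM pretrained model; the label column (cols=0), epochs, learning rate, and drop_mult values are placeholders, not the exact settings from my notebook:

    from fastai.text import *

    # Step 4: fine-tune the language model and save its encoder
    learn_lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
    learn_lm.fit_one_cycle(1, 1e-2)
    learn_lm.save_encoder('ft_enc')

    # Step 5: classification databunch, built WITHOUT the LM vocab,
    # so the texts are numericalized from scratch with a different vocab
    data_clas = (TextList.from_df(df, path, cols=4)
                 .random_split_by_pct(0.1)
                 .label_from_df(cols=0)
                 .databunch())

    # Step 6: loading the encoder fails, because encoder.weight was saved
    # with shape (len(data_lm.vocab.itos), emb_size) and no longer matches
    # the classifier's vocab size
    learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
    learn.load_encoder('ft_enc')  # RuntimeError: size mismatch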

So I am not able to load the encoder because of the different vocab size? If so, how can I solve the issue? I also checked the learners in the IMDB notebook, and it seems like in both cases (LM and classification) the vocab size is greater than 60 thousand.

You can find the notebooks in this repo.

Thanks

Okay, I made it work. Apparently I just had to pass my language model vocab to the classifier with vocab=data_lm.vocab, and things are working now.
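
For reference, the only change was in how the classification databunch is built (same hedged sketch as above; the label column is a placeholder):

    # Build the classifier data with the SAME vocab the language model used
    data_clas = (TextList.from_df(df, path, cols=4, vocab=data_lm.vocab)
                 .random_split_by_pct(0.1)
                 .label_from_df(cols=0)
                 .databunch())

    learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
    learn.load_encoder('ft_enc')  # loads cleanly, embedding sizes now match

    # Sanity check: both vocabs map ids to the same tokens
    assert data_clas.vocab.itos == data_lm.vocab.itos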

So it is a good idea to build your classifier with the exact same vocab you used to fine-tune your language model: the saved encoder's embedding matrix has one row per token in the LM vocab, so a classifier built with a different vocab cannot load it.

EDIT: Jeremy has just updated the IMDB notebook with the vocab=data_lm.vocab argument.

Hi,
Can you please share your notebook or the code that resolves this issue? I am also facing the same problem.

Thanks.
