Hi all, this is probably a very basic question, but I am currently working my way through the lesson4-imdb notebook. When I download the dataset using the link provided (http://files.fast.ai/data/aclImdb.tgz), I see the train and test folders but no models folder. This results in an error when executing the following line: pickle.dump(TEXT, open(f'{PATH}models/TEXT.pkl','wb')).
This is simply a result of demonstration notebooks being like real notebooks, with parts repeated, skipped, or changed. The notebooks are meant to go along with the talk. Further down in the sentiment section you can see a fit() function having its cycles saved with cycle_save_name= and subsequently loaded with load_cycle(). The load isn't working for you because the notebook doesn't show a preceding save. You don't need to save and load; it is just there to save time. Debugging some of the wrinkles in the notebooks actually proves to be a good learning experience, so hang in there.
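For the missing models folder specifically, one workaround is to create the directory yourself before dumping. This is a minimal sketch, not the notebook's own code: PATH here points at a temporary directory so the snippet runs anywhere, and the TEXT dict is a hypothetical stand-in for the notebook's TEXT object.

```python
import os
import pickle
import tempfile

# Assumption: PATH stands in for the notebook's dataset path;
# a temporary directory is used so this sketch runs anywhere.
PATH = tempfile.mkdtemp() + os.sep

# Hypothetical stand-in for the notebook's TEXT object.
TEXT = {"vocab": ["the", "movie", "was", "great"]}

# Create the missing folder; exist_ok=True makes this safe to re-run.
os.makedirs(f"{PATH}models", exist_ok=True)

# A `with` block closes the file handle once the dump is done.
with open(f"{PATH}models/TEXT.pkl", "wb") as f:
    pickle.dump(TEXT, f)

# Load it back to confirm the round trip works.
with open(f"{PATH}models/TEXT.pkl", "rb") as f:
    restored = pickle.load(f)
```

In the notebook itself, only the os.makedirs line should be needed ahead of the existing pickle.dump call.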
Thanks RobG. I agree completely about the debugging of the notebooks. I watched the talk right through, then worked on the notebook. I probably need to watch the talk again while taking the first pass through the notebook.
I have a question about the way the data is arranged and fed to the GPU.
Jeremy says context matters, and it looks like he lays the sentences out down the columns.
But then it takes a slice across all 64 columns and feeds this into the GPU?
How is this not slicing up the context 64 times?
Is it "reading" down the columns even though it is getting the sentences 64 at a time?
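As far as I understand the layout, the answer is yes: each column is one contiguous stretch of text, and each batch is a block of consecutive rows, so the model reads down every column in parallel. Here is a toy sketch of that idea under those assumptions. The helper names batchify and get_chunks are my own for illustration, not the library's code, and I use 4 columns and blocks of 3 rows instead of the notebook's 64 columns.

```python
def batchify(tokens, bs):
    """Lay one long token stream out as bs columns of contiguous text."""
    n = len(tokens) // bs  # rows per column; leftover tokens are dropped
    cols = [tokens[i * n:(i + 1) * n] for i in range(bs)]
    # Transpose: rows index position in the text, columns index parallel streams.
    return [list(row) for row in zip(*cols)]


def get_chunks(grid, bptt):
    """Yield successive blocks of bptt rows, taking all columns at once."""
    for start in range(0, len(grid), bptt):
        yield grid[start:start + bptt]


tokens = list(range(24))        # stand-in for the concatenated, tokenized reviews
grid = batchify(tokens, bs=4)   # 6 rows x 4 columns
chunks = list(get_chunks(grid, bptt=3))

# Column 0 of consecutive chunks continues where the previous chunk stopped.
first = [row[0] for row in chunks[0]]
second = [row[0] for row in chunks[1]]
```

Here column 0 holds tokens 0 to 5 in order, so the first block sees 0, 1, 2 and the next block sees 3, 4, 5. The context is only cut at the boundaries between columns, not at every batch.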
When I was reading about how IMDB was split into training and test sets, it specifically mentioned keeping reviews of the same movie out of both sets to prevent leakage. Otherwise, the model could learn that “Godfather” means a high sentiment instead of learning how the English language expresses sentiment. When we train the language model on 80% of the validation set, wouldn’t we also get at least some data leakage? That is, couldn’t the language model be learning not just about the English language, but also what kind of words follow “Godfather”? Could data leakage be responsible for the extra boost instead of the language model?
If anyone has or can reproduce the performance without training the language model on the validation set, I would be very curious if the boost is robust.
FYI, in the notebook, the train/test split occurs right after “We first concat all the train(pos/neg/unsup = 75k) and test(pos/neg=25k) reviews into a big chunk of 100k reviews.”
I think you could just change:
trn_texts,val_texts = sklearn.model_selection.train_test_split(
    np.concatenate([trn_texts,val_texts]), test_size=0.1)
to:
trn_texts,val_texts = sklearn.model_selection.train_test_split(
    trn_texts, test_size=0.1)
so that the test reviews are never part of the language model's training data.
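To sanity check that kind of change, here is a small sketch on toy data, assuming scikit-learn and NumPy are installed. The arrays are made-up stand-ins, not IMDB: test_texts plays the role of the notebook's original val_texts (the 25k test reviews), and lm_trn, lm_val and random_state=0 are names and settings I picked for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the notebook's arrays (hypothetical data, not IMDB).
trn_texts = np.array([f"train review {i}" for i in range(20)])
test_texts = np.array([f"test review {i}" for i in range(10)])

# Split only the training texts; the test reviews never reach the language model.
lm_trn, lm_val = train_test_split(trn_texts, test_size=0.1, random_state=0)

# Check that nothing from the test set ended up in the language model data.
seen_by_lm = set(lm_trn) | set(lm_val)
overlap = seen_by_lm & set(test_texts)
```

If overlap is empty, any boost the classifier gets from the language model can't come from it having seen the test reviews.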