Will that be used later to determine the labels of the test file?
I was thinking this was like the last course's NLP lesson (part I), since it was imdb.ipynb?
Number of records.
In comparison to the part 1 IMDB notebook. That one took a while to train!
If it is used that way, why not just assign the label instead of setting them all to zero?
My bad, just did another git pull, forgot to update -_-.
Why can’t we simply add a new dimension in the embedding for each word denoting whether the word was uppercase or not, and fine-tune that as well?
Applause on the t_up trick.
So, are we essentially learning a (semantic) markup along with the plain text?
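For reference, a minimal sketch of the t_up idea (the exact marker literal and rules in the real preprocessing code may differ): uppercase words are lowercased and preceded by a marker token, so case information stays in the token stream as learnable "markup" rather than a separate feature:

```python
TOK_UP = "t_up"  # marker token; illustrative, not necessarily the library's literal

def add_case_markers(tokens):
    """Lowercase tokens, inserting a marker before fully-uppercase words."""
    out = []
    for tok in tokens:
        if tok.isupper() and len(tok) > 1:
            out.append(TOK_UP)
        out.append(tok.lower())
    return out

print(add_case_markers(["I", "am", "SHOUTING", "now"]))
# ['i', 'am', 't_up', 'shouting', 'now']
```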
What happens if we load a new dataset and it contains words that were removed because they had 2 or fewer occurrences?
Are numbers in the text also converted to another value (numericalized)? If not, how does the model know whether a value represents a word's token ID or an actual number originally written in the text?
Is there a way to introduce context from outside of the corpus?
Always use the same tokenizer + vocab for future text inputs.
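A minimal sketch of what that looks like in practice (the function names here are illustrative, not the library's API): build the vocab once on the training corpus, persist it, and map all future text through the same mapping, with unseen tokens falling back to a fixed unknown-token ID:

```python
import pickle
from collections import Counter, defaultdict

def build_vocab(texts, min_freq=2):
    """Index tokens seen at least min_freq times; id 0 is reserved for <unk>."""
    counts = Counter(tok for t in texts for tok in t.split())
    itos = ["<unk>"] + [tok for tok, c in counts.items() if c >= min_freq]
    stoi = defaultdict(int, {tok: i for i, tok in enumerate(itos)})
    return itos, stoi

def numericalize(text, stoi):
    """Map each token to its id; unknown tokens become 0 (<unk>)."""
    return [stoi[tok] for tok in text.split()]

train = ["the movie was good", "the movie was bad", "good good bad"]
itos, stoi = build_vocab(train)

# Persist the vocab so future inputs get the exact same ids.
with open("vocab.pkl", "wb") as f:
    pickle.dump(itos, f)

print(numericalize("the movie was terrible", stoi))  # 'terrible' maps to 0 (<unk>)
```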
What if the numpy array does not fit in memory? Is it possible to write a pytorch dataloader directly from a large csv file?
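One common answer is to stream the file instead of materializing an array. A stdlib-only sketch of the idea (in real code you would wrap a generator like this in a `torch.utils.data.IterableDataset` so a `DataLoader` can consume it; the filenames and batch size are just for the demo):

```python
import csv

def stream_rows(path, batch_size=2):
    """Yield batches of rows from a CSV without loading the whole file."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        batch = []
        for row in reader:
            batch.append(row)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:  # final partial batch
            yield batch

# Demo with a small file standing in for the "too big for memory" case.
with open("demo.csv", "w", newline="") as f:
    f.write("text,label\nfoo,0\nbar,1\nbaz,0\n")

for batch in stream_rows("demo.csv"):
    print([r["label"] for r in batch])
# ['0', '1']
# ['0']
```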
Totally valid too. I’d be interested in pros/cons.
Is it weird that I got an error when doing a recent git pull on this repo? Seemed odd. My guess is it's because I'm doing this locally on CPU, as opposed to at home with my GPU. Things were fine last week during class.
NoPackagesFoundError: Package missing in current osx-64 channels: - cuda90
For those who haven’t executed the previous IMDB notebook for some reason: you need to download the SpaCy English language model (GitHub issue link):
python -m spacy download en
You might have to run conda env update -f environment-cpu.yml on your MacBook.