Part 2 Lesson 10 wiki

Will that be used later to determine the labels of the test file?

I was thinking this was like the NLP part of the last course (Part 1), since the notebook was imdb.ipynb?

Number of records.

The notebook is here

In comparison to the Part 1 IMDB notebook. That notebook took a while to train!

If it is used that way, why not just assign the label instead of setting them all to zero?

My bad, just did another git pull, forgot to update -_-.

The data is here
http://files.fast.ai/data/aclImdb.tgz

Why can’t we simply add a new dimension to the embedding for each word, denoting whether the word was uppercased or not, and fine-tune that as well?
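
A sketch of what that could look like: concatenate a small learned case flag with the word embedding. The class and argument names here are hypothetical, not the fastai API:

import torch
import torch.nn as nn

class CaseAwareEmbedding(nn.Module):
    # Hypothetical: word embedding plus one extra learned dimension for case
    def __init__(self, vocab_size, emb_dim):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)
        self.case_emb = nn.Embedding(2, 1)  # 0 = lowercase, 1 = was uppercased

    def forward(self, tok_ids, case_flags):
        # tok_ids and case_flags have the same shape; output is emb_dim + 1 wide
        return torch.cat([self.word_emb(tok_ids), self.case_emb(case_flags)], dim=-1)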

Applause for the t_up trick.
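
For anyone who missed it, the trick is roughly: lowercase all-caps words and insert a t_up marker token in front of them, so capitalization survives as an ordinary token the model can learn from. A rough sketch, not the exact fastai implementation:

def add_case_tokens(tokens, marker='t_up'):
    # Prepend a marker before fully-uppercased words, then lowercase them,
    # so case information survives lowercasing as a normal token.
    out = []
    for t in tokens:
        if t.isupper() and len(t) > 1:
            out.append(marker)
        out.append(t.lower())
    return out

add_case_tokens(['I', 'LOVED', 'this', 'movie'])
# ['i', 't_up', 'loved', 'this', 'movie']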

So, are we essentially learning a (semantic) markup along with the plain text?

What happens if we load a new dataset and it includes words that were removed because they had two or fewer repetitions?
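
As far as I understand, any word outside the vocab just maps to the unknown token at index 0. A sketch, assuming the lesson's itos/stoi naming and with tokenized_docs standing in for your tokenized text:

import collections

min_freq, max_vocab = 2, 60000
freq = collections.Counter(t for doc in tokenized_docs for t in doc)
itos = ['_unk_', '_pad_'] + [w for w, c in freq.most_common(max_vocab) if c > min_freq]
stoi = collections.defaultdict(lambda: 0, {w: i for i, w in enumerate(itos)})

# Words dropped during vocab building (or never seen) all map to 0 = _unk_
ids = [[stoi[t] for t in doc] for doc in tokenized_docs]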

Are numbers in the text also changed to another value (numericalized)? If not, how does the model know whether a number represents a word or a real number originally written in the text?

Is there a way to introduce context from outside of the corpus?

Always use the same tokenizer + vocab for future text inputs.
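
E.g. persist the vocab once after training and rebuild the exact same mapping at inference time. A sketch, where tokenize stands in for whatever tokenizer was used in training:

import pickle, collections

# After training: save the vocab to disk
pickle.dump(itos, open('itos.pkl', 'wb'))

# Later, for new text: reload it and rebuild the exact same stoi
itos = pickle.load(open('itos.pkl', 'rb'))
stoi = collections.defaultdict(lambda: 0, {w: i for i, w in enumerate(itos)})
ids = [stoi[t] for t in tokenize(new_text)]  # tokenize = the same tokenizer as training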

What if the NumPy array does not fit in memory? Is it possible to write a PyTorch DataLoader that reads directly from a large CSV file?
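
One option on newer PyTorch versions (1.2+, so newer than what the course uses) is an IterableDataset that streams rows from the CSV instead of loading everything up front. A minimal sketch, assuming each row holds a label and space-separated token ids:

import csv
import torch
from torch.utils.data import IterableDataset, DataLoader

class CsvStream(IterableDataset):
    # Streams (token_ids, label) pairs from a large CSV without loading it all.
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path) as f:
            for label, ids in csv.reader(f):
                yield torch.tensor([int(i) for i in ids.split()]), int(label)

# batch_size=None hands items through one by one; batching variable-length
# sequences would need a custom collate function on top of this.
dl = DataLoader(CsvStream('train.csv'), batch_size=None)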

Totally valid too. I’d be interested in pros/cons.

Is it weird that I got an error when doing a recent git pull on this repo? Seemed odd. My guess is it’s because I’m doing this locally on CPU, as opposed to at home with my GPU? Things were fine last week during class.

NoPackagesFoundError: Package missing in current osx-64 channels: - cuda90

For those who haven’t run the previous IMDB notebook for some reason: you need to download the spaCy English language model (GitHub issue link):

python -m spacy download en

You might have to run conda env update -f environment-cpu.yml on your MacBook.
