A Code-First Introduction to Natural Language Processing 2019

I just got a little distracted by the sparsity pattern of the doc-term matrix in 3-logreg-nb-imdb:
I think we see the density ripples because “Elements with equal counts are ordered in the order first encountered” - so tokens are ordered by usage frequency then by when they were 1st encountered.
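
The quoted tie-breaking rule comes from the documentation of collections.Counter.most_common. A minimal sketch of how it would order a vocabulary, assuming the vocab is built from such a counter (the toy tokens and the `vocab` name are illustrative, not taken from the notebook):

```python
from collections import Counter

# Toy token stream standing in for the tokenized reviews (illustrative only).
tokens = ["the", "film", "was", "good", "the", "film", "ended"]

counts = Counter(tokens)

# most_common() sorts by count, descending; tokens with equal counts
# keep the order in which they were first encountered.
ordered = counts.most_common()

# Assign indexes in that order, as a frequency-sorted vocabulary would.
vocab = {token: idx for idx, (token, _) in enumerate(ordered)}
print(vocab)
```

Within each block of equal counts, tokens from earlier documents get lower indexes, which would match the dense leading edge of each ripple.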

If we zoom right in ↓ we can see that new tokens added by the first few docs are assigned consecutive indexes - (in the plot above ↑ this is what makes the edge of the ripple dense):
If we remove doc order from the plot with np.random.shuffle(A), we lose the ripples:
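
For reference, np.random.shuffle only reorders a multi-dimensional array along its first axis, so it permutes documents (rows) and leaves each row's token counts alone. A small sketch, assuming A is a dense NumPy array (the post does not say whether it was dense; the matrix below is a made-up stand-in):

```python
import numpy as np

np.random.seed(0)  # only so the example is repeatable

# Small dense stand-in for the doc-term matrix: rows are docs, columns are tokens.
A = np.array([[1, 0, 0, 0],
              [1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1]])

col_totals_before = A.sum(axis=0)
rows_before = sorted(map(tuple, A))

# Shuffles the rows in place; the contents of each row are untouched.
np.random.shuffle(A)

# Same rows, same per-token totals, only the document order has changed.
print(sorted(map(tuple, A)) == rows_before)
print((A.sum(axis=0) == col_totals_before).all())
```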

I’m trying to run the nn-vietnamese notebook, but the get_wiki function won’t work. It returns the following error:
FileNotFoundError: [Errno 2] No such file or directory: '/root/.fastai/data/viwiki/text/AA/wiki_00'

This is after “extracting…” gets printed, so I’d assume this is the step:
shutil.move(str(path/'text/AA/wiki_00'), str(path/name))

I’m using Google Colab, GPU-enabled.

Any idea what the error might be? I’d suspect it’s Colab running out of space, but the download and unzipping seem to work fine.
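
One way to narrow this down is to check both suspicions directly: how much disk space is left, and what the extraction step actually wrote under the text folder. A rough sketch, where `diagnose_extraction` is a hypothetical helper (not part of fastai or the course's nlputils) and `path` is the dataset folder from the traceback:

```python
import shutil
from pathlib import Path

def diagnose_extraction(path):
    """Return free disk space in GB and the files found under path/'text'."""
    path = Path(path)
    free_gb = shutil.disk_usage(path).free / 1e9
    text_dir = path / "text"
    found = []
    if text_dir.is_dir():
        # Paths are reported relative to the dataset folder, e.g. text/AA/wiki_00.
        found = sorted(p.relative_to(path) for p in text_dir.rglob("*") if p.is_file())
    return free_gb, found

# Example call, using the folder from the error message:
# free_gb, found = diagnose_extraction("/root/.fastai/data/viwiki")
```

An empty list with plenty of free space would suggest that the extractor itself failed, rather than the disk filling up.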

There is an issue with the (latest) wikiextractor version.
I manually downloaded wikiextractor from this commit: attardi/wikiextractor at e4abb4cbd019b0257824ee47c23dd163919b731b (github.com)
Then just replace the files created by the nlputils script with the ones from that commit.
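
If git is available, the same commit can be fetched with a clone followed by a checkout. A sketch of that, assuming the usual GitHub clone URL for attardi/wikiextractor; the function names and the `dest` folder are made up for illustration, and which files to copy over afterwards still depends on what nlputils created:

```python
import subprocess

REPO_URL = "https://github.com/attardi/wikiextractor.git"  # assumed clone URL
COMMIT = "e4abb4cbd019b0257824ee47c23dd163919b731b"        # commit named above

def pinned_checkout_commands(dest="wikiextractor"):
    """Build the git commands that clone the repo and check out the pinned commit."""
    return [
        ["git", "clone", REPO_URL, dest],
        ["git", "-C", dest, "checkout", COMMIT],
    ]

def fetch_pinned_wikiextractor(dest="wikiextractor"):
    """Run the commands; raises CalledProcessError if git reports a failure."""
    for cmd in pinned_checkout_commands(dest):
        subprocess.run(cmd, check=True)

# fetch_pinned_wikiextractor()  # needs git and network access
```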

Hi! If I want to run a notebook (5-nn-imdb) in the cloud, how should I deal with this “Unzip it into the .fastai/data/ folder on your computer.”? Where can I put wikitext?
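
On a cloud machine the folder is most likely the same one relative to the home directory; the Colab traceback earlier in the thread shows /root/.fastai/data, which is the root user's home plus .fastai/data. A sketch of unpacking an archive there, where `unpack_into_fastai_data` is a hypothetical helper and the archive name in the example call is a placeholder, not the real file name:

```python
import shutil
from pathlib import Path

def unpack_into_fastai_data(archive, data_dir=None):
    """Unpack an archive into ~/.fastai/data (or data_dir, if one is given)."""
    data_dir = Path(data_dir) if data_dir else Path.home() / ".fastai" / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    # unpack_archive picks the format (.zip, .tar.gz, ...) from the file extension.
    shutil.unpack_archive(str(archive), str(data_dir))
    return data_dir

# Example call with a placeholder file name:
# unpack_into_fastai_data("wikitext.zip")
```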

The wikiextractor is not working.

This does not work.

Thanks for the list. I just started running the notebooks in the course materials, and it seems some modules are deprecated (or their names have changed). For example, in 2-svd-nmf-topic-modeling.ipynb:

from sklearn.feature_extraction import stop_words

is now

from sklearn.feature_extraction import _stop_words

Is it possible to know the version of all the libraries mentioned above?
P.S. I’m using scikit-learn 0.24.2
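
Two things may help while the course's versions are unknown. The leading underscore marks _stop_words as a private module, and the stop word list itself is also importable from a public path, sklearn.feature_extraction.text.ENGLISH_STOP_WORDS. The versions installed locally can be listed with the standard library. A sketch, where `installed_versions` is a hypothetical helper and the package list is only an example, not the course's actual requirements:

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

# Public import path for the stop word list; guarded in case scikit-learn is absent.
try:
    from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
except ImportError:
    ENGLISH_STOP_WORDS = None

# Example package list (illustrative only).
report = installed_versions(["scikit-learn", "fastai", "numpy"])
print(report)
```

This reports the local environment only; it does not say which versions the notebooks were written against.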

Take a look at the Kaggle dataset giga-fren.

This dataset has a file called giga-fren.csv, which I believe is the same as questions_easy.csv.

Hope this helps!

Is this course deprecated, obsolete and/or unsupported?