Remote NLP Study Group meets Saturdays at 8 AM PST, starting 12/14/2019

jcatanza · January 4, 2020, 1:18am

vitojph · January 4, 2020, 3:47pm

Hi,

I won’t be able to join the group today, but let me share with you folks a great summary of the NLP scene in 2019: https://medium.com/dair-ai/nlp-year-in-review-2019-fb8d523bcb19

jcatanza · January 5, 2020, 5:14pm

The revised and annotated notebook 2b_odds_and_ends_jcat.ipynb that we worked from during the 1/04/2020 meetup is now available in our study group’s git repository.

I guarantee that working through it will help with your understanding of lesson 2!

By the way, if your New Year’s Resolution is to learn NLP, now is a great time to make a start!

Just Do It! Join the NLP Study Group. We are making our way through the lessons at a leisurely pace – there is plenty of time to catch up!

foobar8675 · January 6, 2020, 4:17pm

@jcatanza take 2 : sorry for the poorly worded question

I’m looking at lecture 4 (7:24) https://youtu.be/hp2ipC5pW4I and Rachel talks about the PAD special token. I copied the definition below.

PAD (xxpad) is the token used for padding, if we need to regroup several texts of different lengths in a batch

I can’t follow the definition and my googling is not helping to find another explanation. Is there another way to explain what this token means?

jcatanza · January 7, 2020, 12:24am

Hi @foobar8675 could you please restate your question a bit more clearly? I think I have the gist, but I need clarification. Thanks!

jcatanza · January 7, 2020, 3:05am

The Fastai NLP Study Group will meet
Saturday January 11, at 8 AM PST, 11 AM EST, 5 PM CET, 9:30 PM IST

Join the Zoom Meeting when it’s time!

Topic: Sentiment Classification with Naïve Bayes

Suggested homework / preparation:

1. Watch NLP video #4

Video playlist is here

2. Read and work through the notebook Sentiment Classification of Movie Reviews (using Naive Bayes, Logistic Regression, and Ngrams) up to but not including the Naïve Bayes section

Course notebooks are available on github

To join via Zoom phone
Dial US: +1 669 900 6833 or +1 646 876 9923
Meeting ID: 832 034 584

The current meetup schedule is here.

Sign up here to receive meetup announcements via email.

wyquek · January 7, 2020, 4:08am

see if this explanation by Rachel from video 18 helps. She explained the padding (watch about 15 seconds of it).
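
In case it helps alongside the video, here is a minimal sketch of what padding does. It is plain Python, not fastai's implementation: the `pad_batch` helper and the toy sentences are made up for illustration, and padding at the end rather than the front is just an assumption for this example. The idea is that a batch has to be a rectangular array, so shorter texts get filled with the pad token until every text has the same length.

```python
PAD = "xxpad"  # the padding token named in the lesson


def pad_batch(texts, pad_token=PAD):
    """Pad tokenized texts (lists of tokens) to the length of the longest one.

    Hypothetical helper for illustration only; it is not part of fastai.
    """
    max_len = max(len(t) for t in texts)
    # Append as many pad tokens as each text needs to reach max_len.
    return [t + [pad_token] * (max_len - len(t)) for t in texts]


# Three tokenized "reviews" of different lengths (toy data).
batch = [
    ["this", "movie", "was", "great"],
    ["awful"],
    ["not", "bad"],
]

padded = pad_batch(batch)
for row in padded:
    print(row)
```

Every row now has four tokens, so the batch can be stored as a 3 x 4 array. The model is meant to treat `xxpad` as filler rather than as real text.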

foobar8675 · January 7, 2020, 3:54pm

That is very helpful. Thanks @wyquek!

foobar8675 · January 8, 2020, 7:41pm

I wasn’t sure what an empty row would look like in the Compressed Sparse Row format (https://youtu.be/hp2ipC5pW4I), so I googled it, found this: https://stackoverflow.com/questions/43771387/compressed-sparse-row-csr-how-do-you-store-empty-rows and wrote an example.

so given that, if the first row in Rachel’s example, 22, 23, 25, were all empty, would the first 4 RowPtrs look like 0,3,3,6?

just thought I’d share - since I wasn’t sure myself.

jcatanza · January 10, 2020, 12:52am

Brilliant. Using a GPT-2 language model to play chess.
This is why I love AI.

jcatanza · January 10, 2020, 3:00am

No, you’re not stupid if you were confused by the explanation of CSR (Compressed Sparse Row) representation given by the Emory University website that was discussed in video #4.

The reason is that the authors gave a sloppy and incomplete definition of CSR!!!

So, here is a proper explanation of CSR based on material from this Wikipedia article.

Given a full matrix A with m rows, n columns, and N nonzero values, the CSR (Compressed Sparse Row) representation is stored using three arrays as follows:

Val[0:N] contains the values of the N nonzero elements of A, stored row by row.
Col[0:N] contains the column indices in A of the N nonzero elements.
RowPointer[0:m+1]: for each row i of A, RowPointer[i] is the index in Val (and Col) where the nonzero values of row i begin, so row i occupies positions RowPointer[i] up to but not including RowPointer[i+1]. If there are no nonzero values in the ith row, then RowPointer[i] = RowPointer[i+1], which makes that range empty. By convention, an extra entry RowPointer[m] = N is tacked on at the end.
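
To make the three arrays concrete, here is a small pure-Python sketch that builds them for a toy matrix with one empty row. The `to_csr` helper and the matrix are made up for illustration (this is not the Emory or Wikipedia example); in practice a library such as SciPy would do this for you.

```python
def to_csr(dense):
    """Build the Val, Col and RowPointer arrays for a dense matrix.

    Hypothetical helper for illustration; `dense` is a list of equal-length rows.
    """
    val, col, row_ptr = [], [], [0]
    for row in dense:
        for j, x in enumerate(row):
            if x != 0:
                val.append(x)   # the nonzero value
                col.append(j)   # its column index
        # After each row, record where the next row's entries will start.
        row_ptr.append(len(val))
    return val, col, row_ptr


A = [
    [10, 0, 0, 20],
    [0, 0, 0, 0],    # an empty row
    [0, 30, 0, 40],
    [50, 0, 0, 0],
]

val, col, row_ptr = to_csr(A)
print("Val        =", val)      # [10, 20, 30, 40, 50]
print("Col        =", col)      # [0, 3, 1, 3, 0]
print("RowPointer =", row_ptr)  # [0, 2, 2, 4, 5]
```

The empty row shows up as two equal consecutive entries in RowPointer (2 and 2), and the last entry equals the number of nonzero values.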

Question: How many floats and ints does it take to store the matrix A in CSR format?

jcatanza · January 14, 2020, 3:38am

I pushed an updated version of 3-logreg-nb-imdb_jcat.ipynb to the NLP Study Group repo. If you missed Saturday’s meetup (and even if you didn’t!), you can catch up with/review the material in video #4 (even if you haven’t watched it yet!) by reading, running, and playing with this notebook, down to but not including Section 8: Naive Bayes.

miyabhai101 · January 14, 2020, 10:59am

looks like it starts next week

jcatanza · January 14, 2020, 6:48pm

Sorry I didn’t understand – what starts next week?

foobar8675 · January 17, 2020, 6:07pm

getting an error in 3-logreg-nb-imdb with this cell in colab

m = LogisticRegression(C=0.1, dual=True)
m.fit(x, y.items.astype(int))
preds = m.predict(val_term_doc)
(preds==val_y).mean()

first error i got was

ValueError: Solver lbfgs supports only dual=False, got dual=True

so i naively set to False, since per the docs, that sounds like the right thing to do anyways

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

which is when i get a very different error

/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
0.655

@jcatanza have you seen this error?

jcatanza · January 18, 2020, 3:28am

Hi @foobar8675
I obtained good results with the liblinear and newton-cg solvers. The other solvers got poorer results and failed to converge.
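
For anyone hitting the same error, here is a minimal sketch of naming the solver explicitly. It runs on synthetic data, not the IMDb term-document matrix, so the sizes, the random counts and the resulting accuracy are all assumptions for illustration. Since scikit-learn 0.22 the default solver is lbfgs, which does not support dual=True; the dual formulation is only implemented for the liblinear solver with an l2 penalty.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy stand-in for a term-document matrix: 200 "documents", 50 "terms".
n_docs, n_terms = 200, 50
counts = rng.poisson(0.3, size=(n_docs, n_terms))
labels = rng.integers(0, 2, size=n_docs)
# Make the first five terms much more frequent in class 1 so there is signal.
counts[labels == 1, :5] += 2
x = csr_matrix(counts)

# Naming the solver avoids the "lbfgs supports only dual=False" error.
m = LogisticRegression(C=0.1, dual=True, solver="liblinear")
m.fit(x, labels)

preds = m.predict(x)
acc = (preds == labels).mean()
print(acc)
```

With solver="newton-cg" the dual argument has to be False, because only liblinear implements the dual formulation.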

jcatanza · January 18, 2020, 1:19pm

Hi all. Due to a family emergency, I must cancel today’s (Saturday 1/18) NLP class. The class will resume next week as usual.

The good news is that I have refactored and annotated the 3-logreg-nb-imdb.ipynb notebook and pushed it to github here https://github.com/jcatanza/Fastai-A-Code-First-Introduction-To-Natural-Language-Processing-TWiML-Study-Group/blob/master/3-logreg-nb-imdb_jcat.ipynb

3-logreg-nb-imdb_jcat.ipynb is a self-contained tutorial on Naive Bayes and Logistic Regression applied to the IMDb data. I think you’ll find it useful!

Today’s assignment: please get the 3-logreg-nb-imdb_jcat.ipynb notebook and use the 1.5 class hours to read, run, play with, and learn from it!

Have a great weekend, and I’ll see you next week.

foobar8675 · January 20, 2020, 8:12pm

that makes sense. thanks!

jcatanza · January 20, 2020, 8:38pm

The Fastai NLP Study Group will meet
Saturday January 25, at 8 AM PST, 11 AM EST, 5 PM CET, 9:30 PM IST

Join the Zoom Meeting when it’s time!

Topic: Sentiment Classification with Naïve Bayes and Logistic Regression

Suggested homework / preparation:

Watch video #5; Video playlist is here
Read and work through my extensively refactored and annotated version of the 3-logreg-nb-imdb.ipynb notebook
Note: in order to run my version of the notebook you’ll need to fork or clone the study group repository

To join via Zoom phone
Dial US: +1 669 900 6833 or +1 646 876 9923
Meeting ID: 832 034 584

The current meetup schedule is here.

Sign up to receive meetup announcements via email.

jcatanza · January 20, 2020, 10:39pm

You can read (but not run) the notebook for this week’s discussion in nbviewer.