Unofficial Lecture 10 Notes

Hi all, hope these notes are useful. This is from the morning version of the lecture.

  • Tim

Logistic Regression

%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.nlp import *
from sklearn.linear_model import LogisticRegression
# from torchtext import vocab, data, datasets
PATH='aclImdb/'
names = ['neg','pos']
trn,trn_y = texts_from_folders(f'{PATH}train',names)
val,val_y = texts_from_folders(f'{PATH}test',names)
veczr = CountVectorizer(tokenizer=tokenize)
trn_term_doc = veczr.fit_transform(trn)
val_term_doc = veczr.transform(val)
x=trn_term_doc
y=trn_y

Recapping the logistic regression (C is the inverse regularization strength in sklearn, so C=1e8 means essentially no regularization)

Make sure you have a lot of parameters. Linear models can sometimes work, if the underlying assumption of linearity isn't too bad; things work much better with neural networks and more hidden layers.

For sentiment, if you look at a set of words, you only need a few of them to tell whether the document is happy or sad. So: have lots of parameters and use regularization.

m = LogisticRegression(C=1e8, dual=True)
m.fit(x, y)
preds = m.predict(val_term_doc)
(preds==val_y).mean()
0.85511999999999999

Trigrams

veczr =  CountVectorizer(ngram_range=(1,3), tokenizer=tokenize, max_features=800000)
trn_term_doc = veczr.fit_transform(trn)
val_term_doc = veczr.transform(val)
x=trn_term_doc
y=trn_y

Note: each entry of the x input is either 0 or a term count:

x = [[3 0 0 1 ...]
     [0 0 0 1 ...]
     [0 2 1 0 ...]]

m = LogisticRegression(C=1e8, dual=True)
m.fit(x, y)
preds = m.predict(val_term_doc)
(preds==val_y).mean()
0.89903999999999995

From Naive Bayes

y=trn_y
x=trn_term_doc.sign()     # binarize: 1 if the word appears in the review at all, else 0
val_x = val_term_doc.sign()

p = x[y==1].sum(0)+1      # number of positive reviews containing each word (+1 smoothing)
q = x[y==0].sum(0)+1      # number of negative reviews containing each word (+1 smoothing)
r = np.log((p/p.sum())/(q/q.sum()))   # log-count ratio per word
b = np.log(len(p)/len(q))             # bias term
vocab = veczr.get_feature_names()

How happy or sad each word is:

np.exp(r)
matrix([[ 0.95208,  0.85605,  0.78485, ...,  3.01678,  0.5028 ,  0.5028 ]])
vocab[-3]
'” .'
vocab[3]
'! ! "'

All of the ones (in the binarized matrix) will be replaced with the log-count ratio r.

Now consider a different input, where each 1 has been replaced by the corresponding r value (this is x_nb below):

x_nb = [[0 0    0    0.45 ...]
        [0 0    0    0.45 ...]
        [0 0.65 0.45 0    ...]]

x_nb = x.multiply(r)
m = LogisticRegression(dual=True, C=0.1)
m.fit(x_nb, y);

val_x_nb = val_x.multiply(r)
preds = m.predict(val_x_nb)
(preds.T==val_y).mean()
0.91768000000000005

Why is the binarized version better?

The r measure is a good estimate of how positive or negative each word is. Regularization moves coefficients toward 0, and 0 is your prior, i.e. your expectation about what the world is like:

Feature Vector
[[1 1 0 0]
 [1 0 0 0]
 [1 0 0 1]]

r = [0 .4 0 -.4]

Feature Vector x r (positive / negative contributions)
[[0 .4 0   0]
 [0  0 0   0]
 [0  0 0 -.4]]

What's the difference between training on the feature vector vs. the feature vector × r?

In the multiplied version, if the coefficients all come out the same, the model is just using the r values, which is Naive Bayes. Otherwise there's a subtle difference in the starting point, and under regularization the penalty has a varying effect per feature, so the variables get treated differently.
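A minimal sketch of the first point, using the toy numbers above (made up for illustration): with x_nb = x × r, a constant coefficient vector of all ones already reproduces the Naive Bayes log-odds.

import numpy as np

# Toy numbers from the example above (illustrative only).
r = np.array([0.0, 0.4, 0.0, -0.4])      # log-count ratio per word
x = np.array([[1, 1, 0, 0],              # three binarized documents
              [1, 0, 0, 0],
              [1, 0, 0, 1]])
x_nb = x * r                             # the Naive-Bayes-scaled features

coef = np.ones(4)                        # "just do what Naive Bayes says"
print(x_nb @ coef)                       # [ 0.4  0.  -0.4]
print(x @ r)                             # same thing: the NB log-odds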

This is called NBSVM

Paper: https://nlp.stanford.edu/~sidaw/home/_media/papers:compareacl.pdf

Stronger Still

sl is the maximum number of unique words (features) kept per review:

sl = 2000
md = TextClassifierData.from_bow(trn_term_doc, trn_y, val_term_doc, val_y, sl)
learner = md.dotprod_nb_learner()
learner.fit(0.02,1,wds=1e-6, cycle_len=1)

  • 0.02 - learning rate
  • wds - weight decay


[ 0.       0.02598  0.11969  0.91618]    (epoch, training loss, validation loss, accuracy)

2nd iteration

learner.fit(0.02,1,wds=1e-6, cycle_len=1)


[ 0.       0.02027  0.1131   0.92104]                         

3rd iteration

learner.fit(0.02,1,wds=1e-6, cycle_len=1)


[ 0.       0.01779  0.11128  0.92225]                         

A peek under the hood

class DotProdNB(nn.Module):
    def __init__(self, nf, ny, w_adj=0.4, r_adj=10):
        super().__init__()
        self.w_adj,self.r_adj = w_adj,r_adj
        self.w = nn.Embedding(nf+1, 1, padding_idx=0)   # one learned weight per feature
        self.w.weight.data.uniform_(-0.1,0.1)
        self.r = nn.Embedding(nf+1, ny)                 # per-feature log-count ratios (one column per class)

    def forward(self, feat_idx, feat_cnt, sz):
        w = self.w(feat_idx)     # look up the weights for the features present in each document
        r = self.r(feat_idx)     # look up the corresponding r values
        x = ((w+self.w_adj)*r/self.r_adj).sum(1)   # sum over the features in each document
        return F.softmax(x)

This looks very similar to the work we have done before.

The weight adjustment, if not provided, defaults to w_adj=0.4, and for scaling we use r_adj=10. These constants were identified through experimentation.

def __init__(self, nf, ny, w_adj=0.4, r_adj=10):

Matrix multiplication

 x = ((w+self.w_adj)*r/self.r_adj).sum(1) <-- this is just a matrix multiplication
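A minimal sketch, with made-up sizes, of why this line is just a matrix multiplication: summing the looked-up embedding rows for the features present in a document gives the same answer as multiplying the binarized bag-of-words vector by the full (w + w_adj) * r matrix (padding is ignored here for simplicity).

import torch

nf, ny = 5, 2                          # 5 features, 2 classes (hypothetical)
w = torch.randn(nf, 1)                 # per-feature weight (the "w" embedding)
r = torch.randn(nf, ny)                # per-feature log-count ratios (the "r" embedding)
w_adj, r_adj = 0.4, 10.0

feat_idx = torch.tensor([1, 3, 4])     # features present in one document
x = torch.zeros(nf); x[feat_idx] = 1.  # the same document as a binary vector

emb_way = ((w[feat_idx] + w_adj) * r[feat_idx] / r_adj).sum(0)   # embedding lookup + sum
matmul_way = x @ ((w + w_adj) * r) / r_adj                       # plain matrix multiply

print(torch.allclose(emb_way, matmul_way))   # True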

Let's talk about what's happening underneath.

We want to construct a logistic regression, and currently what we have is a bag of words. But memory becomes an issue: even a small corpus can have a large number of documents and a large number of features (unique words).

Bad way

Every unique word is 1-hot encoded:


terms:
he is a dog
[1  0 0  0
 0  1 0  0
 0  0 1  0 
 0  0 0  1 ]

Doc representation

    he is a dog ..... more words
docs [                              ]
docs [      big matrix x big matrix ]
docs [                              ]
docs [                              ]

If we multiply the 1-hot encoded vector by the weight matrix, we are just selecting one row/column of that matrix: a 1-hot matrix multiply is equivalent to looking something up in an array.

More efficient way:

Don't store the 1-hot encoded variables. Instead, store each doc as a single vector of term indexes (see below):

doc1: a dog is he : [3,4,2,1]
doc2: he is dog : [1,2,4]
doc3: dog is dog : [4,2,4]

There is one row of weights for every feature. However many activations we need, we just look at each of those word indexes and grab the corresponding rows out of the weight matrix, in contrast to doing the long matrix multiplication. Each document pulls out the weight vectors for the words it contains, as the sketch below shows.
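A minimal sketch with made-up numbers, showing that the 1-hot matrix multiply and the index lookup give the same result:

import numpy as np

vocab_size, emb_dim = 4, 3
W = np.random.randn(vocab_size, emb_dim)   # one weight vector per word

doc = [3, 1, 0]                            # a document stored as word indexes
one_hot = np.eye(vocab_size)[doc]          # the wasteful 1-hot representation

print(np.allclose(one_hot @ W, W[doc]))    # True: matmul == row lookup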

Denoising Auto Encoder

Structured data

Covering Rossmann from Deep Learning Lesson 3.

Turning the categorical fields of the time series into embeddings: how big / how wide should each one be?

Cardinality of the features:

[('Store', 1116),
 ('DayOfWeek', 8),
 ('Year', 4),
 ('Month', 13),
 ('Day', 32),
 ('StateHoliday', 3),
 ('CompetitionMonthsOpen', 26),
 ('Promo2Weeks', 27),
 ('StoreType', 5),
 ('Assortment', 4),
 ('PromoInterval', 4),
 ('CompetitionOpenSinceYear', 24),
 ('Promo2SinceYear', 9),
 ('State', 13),
 ('Week', 53),
 ('Events', 22),
 ('Promo_fw', 7),
 ('Promo_bw', 7),
 ('StateHoliday_fw', 4),
 ('StateHoliday_bw', 4),
 ('SchoolHoliday_fw', 9),
 ('SchoolHoliday_bw', 9)]

Embedding sizes (one per field)

[(1116, 50),
 (8, 4),
 (4, 2),
 (13, 7),
 (32, 16),
 (3, 2),
 (26, 13),
 (27, 14),
 (5, 3),
 (4, 2),
 (4, 2),
 (24, 12),
 (9, 5),
 (13, 7),
 (53, 27),
 (22, 11),
 (7, 4),
 (7, 4),
 (4, 2),
 (4, 2),
 (9, 5),
 (9, 5)]
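These widths look consistent with the rule of thumb fastai used at the time, embedding width = min(50, (cardinality + 1) // 2). A minimal sketch (treating the exact rule as an assumption, though it matches every row above):

cat_sz = [('Store', 1116), ('DayOfWeek', 8), ('Year', 4), ('Month', 13)]  # etc.
emb_szs = [(c, min(50, (c + 1) // 2)) for _, c in cat_sz]
print(emb_szs)   # [(1116, 50), (8, 4), (4, 2), (13, 7)]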

But first, let's look at the feature engineering.

Feature engineering - Rossman

Bring everything you can think of!

train, store, store_states, state_names, googletrend, weather, test = tables

Turn the features into something more usable; here StateHoliday becomes a boolean (holiday or not):

train.StateHoliday = train.StateHoliday!='0'
test.StateHoliday = test.StateHoliday!='0'

Note about adding in additional data: always check whether there are rows that don't match.

join_df is a small wrapper around pandas.DataFrame.merge (a LEFT JOIN):

def join_df(left, right, left_on, right_on=None, suffix='_y'):
    if right_on is None: right_on = left_on
    return left.merge(right, how='left', left_on=left_on, right_on=right_on, 
                      suffixes=("", suffix))
  • The join could produce the wrong number of rows
  • or leave unmatched rows (nulls in the joined columns):
store = join_df(store, store_states, "Store")
len(store[store.State.isnull()])

Also check that you don't lose (or gain) rows from doing repeated left joins, and avoid cartesian joins (joining on non-unique fields), which create new rows. After the merge, be sure to remove the duplicate columns. A sketch of these checks follows below.
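A minimal sketch of those checks, assuming join_df as defined above (the table and column names here are from the Rossmann data but should be treated as illustrative rather than the exact notebook code):

n_before = len(joined)                        # row count before the next left join
joined = join_df(joined, weather, ["State", "Date"])

assert len(joined) == n_before                # a cartesian join on non-unique keys would add rows
print(joined.columns[joined.columns.str.endswith('_y')])   # duplicate columns to remove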

Memory:

If your joins take too long, consider loading the data into a SQL database for better joins. Pandas works entirely in RAM, while SQL engines are designed not to work all in memory, so they can handle larger datasets.

When working with strings, though, you may prefer to do the processing in pandas, since you have full access to Python's string-handling tools.

Data that was added by joins:

  1. state names
  2. weather
  3. parsed-out date elements, such as day of the week, month, etc.
  4. Google trend for the state
  5. Google trend for all of Germany
  6. days since the competition opened (a calculation)

Categorical vs. Continuous

If we treat something as a category, we give it an embedding matrix; we are essentially learning a descriptive vector for each level. If it's continuous, the network has to find an underlying function of it. Words naturally fall into the categorical treatment.

Some more examples of embeddings: note that the meanings of the embedding dimensions (colors, royalty, etc.) are arbitrary and have to be interpreted after the fact. The dimensionality (how many such factors) must be chosen beforehand and is then trained as part of the deep learning model.

How do you deal with dates? Turn day-of-week into an embedding.
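A minimal sketch, using the sizes from the table above: DayOfWeek has cardinality 8, so it gets an 8 x 4 embedding, and each category index just looks up one learned 4-dimensional row.

import torch
import torch.nn as nn

day_of_week_emb = nn.Embedding(8, 4)         # 8 categories, 4-dim vector each
print(day_of_week_emb(torch.tensor([2])))    # the learned vector for category index 2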

Proximity to holiday. Either upcoming or previous.

  • How long has it been since the last promotion?
  • How long till the next promotion?
def get_elapsed(fld, pre):
    day1 = np.timedelta64(1, 'D')
    last_date = np.datetime64()   # NaT until we see the first event for a store
    last_store = 0
    res = []

    # Walk the rows (sorted by Store, then Date) and record, for every row,
    # how many days it has been since fld was last True for that store.
    for s,v,d in zip(df.Store.values,df[fld].values, df.Date.values):
        if s != last_store:
            last_date = np.datetime64()   # reset when we move to a new store
            last_store = s
        if v: last_date = d
        res.append(((d-last_date).astype('timedelta64[D]') / day1).astype(int))
    df[pre+fld] = res

Using the function for SchoolHoliday and StateHoliday; sorting by descending date gives the 'Before' version:

fld = 'SchoolHoliday'
df = df.sort_values(['Store', 'Date'])
get_elapsed(fld, 'After')
df = df.sort_values(['Store', 'Date'], ascending=[True, False])
get_elapsed(fld, 'Before')

fld = 'StateHoliday'
df = df.sort_values(['Store', 'Date'])
get_elapsed(fld, 'After')
df = df.sort_values(['Store', 'Date'], ascending=[True, False])
get_elapsed(fld, 'Before')

Moving averages (take a 7-day window and create a rolling sum), a.k.a. window functions:

bwd = df[['Store']+columns].sort_index().groupby("Store").rolling(7, min_periods=1).sum()
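A toy illustration of the same idea with made-up data; reversing the sort order before grouping gives the forward-looking ('_fw') counterpart:

import pandas as pd

toy = pd.DataFrame({'Store': [1, 1, 1, 1, 1], 'Promo': [0, 1, 0, 1, 1]})
bwd = toy.groupby('Store').Promo.rolling(7, min_periods=1).sum()        # looking backward
fwd = toy[::-1].groupby('Store').Promo.rolling(7, min_periods=1).sum()  # looking forward
print(bwd.values)   # [0. 1. 1. 2. 3.]
print(fwd.values)   # [1. 2. 2. 3. 3.]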

Hi, I wonder if TextClassifierData.from_bow() still exists in the current version of fastai? Thanks!

Should it be sum(0)? Since the output of w*r is (num_features, 2), shouldn't we sum across all the rows?
Also, where do feat_idx and feat_cnt feed in?

This is a very interesting question, and I spent three days just trying to understand how this thing works. Here are my findings so far. First, feat_idx holds the inputs to the embedding layers from the DataLoader; feat_cnt and sz are not used. For the sum, I will give the short answer, and if you need more explanation of the whole thing, just tell me. Technically you are taking the sum over the "rows", because the result of the embedding layer(s) is a rank-3 tensor: in this case a tensor of size (25,000 x 2,000 x 2), so dimension 1 is the 2,000 features for each review (we chose this number earlier when we built the DataLoader). That is equivalent to taking the sum over the rows of a rank-2 tensor (25,000 x 2,000). So if you imagined it exactly like a linear layer, as I did, you would expect a rank-2 tensor, which is not the case. The concept of the embedding is the same as a linear layer, but how it is evaluated is different.

would you mind adding me on ccjobhunter2011@gmail.com ? thanks

I know this is an old thread, but I have a question regarding Lesson 10 of the 2018 ML course.
I find the Python code @jeremy uses to compute the Naive Bayes r coefficient a little confusing, and
I would like to see if I am missing something.

In the Excel sheet, as far as I understand, each of the features (features = unique words in the vocabulary, which I will call V) is considered a Bernoulli random variable, independent of the rest given the class (the + class, for example) of the document:

p(document, class = +) = p(class = +) \prod_{i=1}^{|V|} p(f_i | class = +)

So, in the Excel sheet p(f_i | class = +) is computed as the proportion of times the word i = f_i appears in the documents of the positive class. Each p(f_i | class = +) is therefore a probability between 0 and 1, and they do not add up to 1. This model is called Bernoulli Naive Bayes because each feature, given the class, is considered a Bernoulli RV, and the documents are represented as vectors whose length is the size of the vocabulary.

On the other hand, analyzing the code (video version):

x = trn_term_doc
y = trn_y
p = x[y==1].sum(0) + 1
q = x[y==0].sum(0) + 1
r = np.log((p/p.sum())/(q/q.sum()))
b = np.log(len(p)/len(q))

My confusion comes from the division p/p.sum(), which makes all the probabilities sum to 1.
This looks like computing a Multinoulli random variable, but without the encoding used by what is called Multinomial Naive Bayes. In Bernoulli Naive Bayes, each document is a sparse vector of length |V| = number of unique words in the training set, with each feature parametrised as a Bernoulli RV; in Multinomial NB, documents are represented differently, as a vector whose length is the number of tokens/words, each parametrised as a Multinoulli RV. The code is not exactly either of the two.
So, can anyone explain why it is done this way and give some intuition behind this code?

Thank you very much

You can refer to this page: Theory and example about Naive Bayes Classifier. Actually, the Excel sheet and the code are inconsistent; the latter conforms to standard probability theory, while the former does not.