Unofficial Lecture 10 Notes


(Tim Lee) #1

Unofficial Lecture 10 Notes

Hi all, hope these notes are useful. This is from the morning version of the lecture.

  • Tim

Logistic Regression

%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.nlp import *
from sklearn.linear_model import LogisticRegression
# from torchtext import vocab, data, datasets
PATH='aclImdb/'
names = ['neg','pos']
trn,trn_y = texts_from_folders(f'{PATH}train',names)
val,val_y = texts_from_folders(f'{PATH}test',names)
veczr = CountVectorizer(tokenizer=tokenize)
trn_term_doc = veczr.fit_transform(trn)
val_term_doc = veczr.transform(val)
x=trn_term_doc
y=trn_y

Recapping the Logistic Regression (C = constant)

Make sure you have a lot of parameters. Linear models can work sometimes, if the underlying assumption of linearity isn’t too bad. Works much better with neural networks and more hidden layers.

For sentiment, if you look at a set of words, only need a few words to tell us if the documents is either happy or sad. Have lots of parameters and use regularization

m = LogisticRegression(C=1e8, dual=True)
m.fit(x, y)
preds = m.predict(val_term_doc)
(preds==val_y).mean()
0.85511999999999999

Trigrams

veczr =  CountVectorizer(ngram_range=(1,3), tokenizer=tokenize, max_features=800000)
trn_term_doc = veczr.fit_transform(trn)
val_term_doc = veczr.transform(val)
x=trn_term_doc
y=trn_y

Note: the x input is all 0, or counts


x = [[3 0 0 1...  
      0 0 0 1
      0 2 1 0   ]]

m = LogisticRegression(C=1e8, dual=True)
m.fit(x, y)
preds = m.predict(val_term_doc)
(preds==val_y).mean()
0.89903999999999995

From Naive Bays

y=trn_y
x=trn_term_doc.sign()
val_x = val_term_doc.sign()

p = x[y==1].sum(0)+1
q = x[y==0].sum(0)+1
r = np.log((p/p.sum())/(q/q.sum()))
b = np.log(len(p)/len(q))
vocab = veczr.get_feature_names()

How happy or sad everything is

np.exp(r)
matrix([[ 0.95208,  0.85605,  0.78485, ...,  3.01678,  0.5028 ,  0.5028 ]])
vocab[-3]
'” .'
vocab[3]
'! ! "'

All the ones will be replaced with the log count ratio

Now consider a different input


x = [[0 0    0    0.45 ...  
      0 0    0    0.45
      0 0.65 0.45 0   ]]

x_nb = x.multiply(r)
m = LogisticRegression(dual=True, C=0.1)
m.fit(x_nb, y);

val_x_nb = val_x.multiply(r)
preds = m.predict(val_x_nb)
(preds.T==val_y).mean()
0.91768000000000005

Why is the binarized version better

the r measure is a good estimate of how positive or negative it is. Regularization moves coefficients toward 0. 0 is your prior is your expectation about what hte world is like:

Feature Vector
[1 1 0 0
 1 0 0 0
 1 0 0 1 ]

r = [0 .4 0 .4]
Feature Vector x r (positive / negative vectors)
[0 .4 0   0
 0  0 0   0
 0  0 0 -.4]

Whats the difference between training on (Feature Vector) vs. (Feature Vector x r)

Since the multiplied version. If the cofficients are the same use the R value, which is naive bayes. Otherwise there’s a subtle starting point difference. And under regularization, there’s a varying effect, and will apply variables differently.

This is called NBSVM

Paper: https://nlp.stanford.edu/~sidaw/home/_media/papers:compareacl.pdf

Stronger Still

maximum number of unique words in a review

sl = 2000
md = TextClassifierData.from_bow(trn_term_doc, trn_y, val_term_doc, val_y, sl)
  • wds - weight decay
  • learning rate : 0.02 - learning rate
learner = md.dotprod_nb_learner()
learner.fit(0.02,1,wds=1e-6, cycle_len=1)

Failed to display Jupyter Widget of type HBox.

If you're reading this message in the Jupyter Notebook or JupyterLab Notebook, it may mean that the widgets JavaScript is still loading. If this message persists, it likely means that the widgets JavaScript library is either not installed or not enabled. See the Jupyter Widgets Documentation for setup instructions.

If you're reading this message in another frontend (for example, a static rendering on GitHub or NBViewer), it may mean that your frontend doesn't currently support widgets.

[ 0.       0.02598  0.11969  0.91618]                         

2nd iteration

learner.fit(0.02,1,wds=1e-6, cycle_len=1)

Failed to display Jupyter Widget of type HBox.

If you're reading this message in the Jupyter Notebook or JupyterLab Notebook, it may mean that the widgets JavaScript is still loading. If this message persists, it likely means that the widgets JavaScript library is either not installed or not enabled. See the Jupyter Widgets Documentation for setup instructions.

If you're reading this message in another frontend (for example, a static rendering on GitHub or NBViewer), it may mean that your frontend doesn't currently support widgets.

[ 0.       0.02027  0.1131   0.92104]                         

3Rd iteration

learner.fit(0.02,1,wds=1e-6, cycle_len=1)

Failed to display Jupyter Widget of type HBox.

If you're reading this message in the Jupyter Notebook or JupyterLab Notebook, it may mean that the widgets JavaScript is still loading. If this message persists, it likely means that the widgets JavaScript library is either not installed or not enabled. See the Jupyter Widgets Documentation for setup instructions.

If you're reading this message in another frontend (for example, a static rendering on GitHub or NBViewer), it may mean that your frontend doesn't currently support widgets.

[ 0.       0.01779  0.11128  0.92225]                         

A Peak under the hood

class DotProdNB(nn.Module):
    def __init__(self, nf, ny, w_adj=0.4, r_adj=10):
        super().__init__()
        self.w_adj,self.r_adj = w_adj,r_adj
        self.w = nn.Embedding(nf+1, 1, padding_idx=0)
        self.w.weight.data.uniform_(-0.1,0.1)
        self.r = nn.Embedding(nf+1, ny)

    def forward(self, feat_idx, feat_cnt, sz):
        w = self.w(feat_idx)
        r = self.r(feat_idx)
        x = ((w+self.w_adj)*r/self.r_adj).sum(1)
        return F.softmax(x)

This looks very similar to the work we have done before

The weight adjustment, if not provided, will be w_adj=0.4, and for scaling, we will use r_ad=10. These constants have been identifyed through experimentation

def __init__(self, nf, ny, w_adj=0.4, r_adj=10):

Matrix multiplication

 x = ((w+self.w_adj)*r/self.r_adj).sum(1) <-- this is just a matrix multiplication

Let’s talk about whats happening underneath

We want to construct a logistic regression. And currently what we have is a bag of words. But memory becomes an issue now. Even a small corpus can have a large number of documents and a large number of features (unique words).

Bad way

Every unique word is 1-hot encoded:


terms:
he is a dog
[1  0 0  0
 0  1 0  0
 0  0 1  0 
 0  0 0  1 ]

Doc representation

    he is a dog ..... more words
docs [                              ]
docs [      big matrix x big matrix ]
docs [                              ]
docs [                              ]

If we multiply the 1-hot encoded by the weight matrix. Selecting that column. A 1-hot matrix multiplier. Looking up a column in an array.

More efficient way:

Don’t store the 1-hot encoded variables. Instead we store the docs in a single vector with the term_index (see below)

doc1: a dog is he : [3,4,2,1]
doc2: he is dog : [1,2,4]
doc3: dog is dog : [4,2,4]

1 row for every feature, how every many activations, lets look at each of those word indexes and grab the column out of the weights. This is in contrast to doing the long matrix multiplication

And these docs will pull the corresponding vector. Each weight vector is the

Denoising Auto Encoder

Structured data

Covering Rossman from Deep Learning 3

Turning time series into an embedding? How big / how wide?

Cardinality of the features:

[('Store', 1116),
 ('DayOfWeek', 8),
 ('Year', 4),
 ('Month', 13),
 ('Day', 32),
 ('StateHoliday', 3),
 ('CompetitionMonthsOpen', 26),
 ('Promo2Weeks', 27),
 ('StoreType', 5),
 ('Assortment', 4),
 ('PromoInterval', 4),
 ('CompetitionOpenSinceYear', 24),
 ('Promo2SinceYear', 9),
 ('State', 13),
 ('Week', 53),
 ('Events', 22),
 ('Promo_fw', 7),
 ('Promo_bw', 7),
 ('StateHoliday_fw', 4),
 ('StateHoliday_bw', 4),
 ('SchoolHoliday_fw', 9),
 ('SchoolHoliday_bw', 9)]

Embedding sizes (per each field)

[(1116, 50),
 (8, 4),
 (4, 2),
 (13, 7),
 (32, 16),
 (3, 2),
 (26, 13),
 (27, 14),
 (5, 3),
 (4, 2),
 (4, 2),
 (24, 12),
 (9, 5),
 (13, 7),
 (53, 27),
 (22, 11),
 (7, 4),
 (7, 4),
 (4, 2),
 (4, 2),
 (9, 5),
 (9, 5)]

But First lets look at the feature engineering

Feature engineering - Rossman

Bring everything you can think of!

train, store, store_states, state_names, googletrend, weather, test = tables

Turn the features into human readable

train.StateHoliday = train.StateHoliday!='0'
test.StateHoliday = test.StateHoliday!='0'

Note about adding in additional data, always check to see if there’s things that don’t match:

A small wrapper for the pandas.DataFrame.merge or LEFT JOIN

def join_df(left, right, left_on, right_on=None, suffix='_y'):
    if right_on is None: right_on = left_on
    return left.merge(right, how='left', left_on=left_on, right_on=right_on, 
                      suffixes=("", suffix))
  • Could have the wrong number of rows
  • or missing indexes
store = join_df(store, store_states, "Store")
len(store[store.State.isnull()])

Also note, check that don’t lose rows from doing repeated left joins. Also avoid cartesian joins (non-unique join fields) aka MAKE new rows. Also from the merge, be sure to remove the duplicate columns

Memory:

If your joins take too long, consider loading into a SQL database for better joins. Pandas is all in ram, and SQL is designed not to work all in memory, so can handle larger datasets.

Though when working from strings, you may consider doing hte processing in pandas, as you have full access to the python string library tools

Data that was added by joins:

  1. state names
  2. Google trend on weather
  3. parse out the different date elements, such as day of the week, month etc.
  4. Google trend for state
  5. google trend for all of germany
  6. days since last competition open since (calculation)

Categorical vs. Continuous

If we treat something as a category, we will be giving it an embedding matrix. We will essentially be developing a description vector. If its continuous, we need to find a underlying function that relates to it. Words easily fall into this categorical.

Some more examples of embeddings. Note that these different color categories are arbitrary and must be interpretted after the fact. The dimension (how many colors, or royalty etc. ) must be chosen beforehand and trained on a deep learning

How do you deal with dates?Turning day of the week data into an embedding

Proximity to holiday. Either upcoming or previous.

  • How long has it been since the last promotion?
  • How long till the next promotion?
def get_elapsed(fld, pre):
    day1 = np.timedelta64(1, 'D')
    last_date = np.datetime64()
    last_store = 0
    res = []

    for s,v,d in zip(df.Store.values,df[fld].values, df.Date.values):
        if s != last_store:
            last_date = np.datetime64()
            last_store = s
        if v: last_date = d
        res.append(((d-last_date).astype('timedelta64[D]') / day1).astype(int))
    df[pre+fld] = res

Using the function for state holiday and school holiday

fld = 'SchoolHoliday'
df = df.sort_values(['Store', 'Date'])
get_elapsed(fld, 'After')
df = df.sort_values(['Store', 'Date'], ascending=[True, False])
get_elapsed(fld, 'Before')

fld = 'StateHoliday'
df = df.sort_values(['Store', 'Date'])
get_elapsed(fld, 'After')
df = df.sort_values(['Store', 'Date'], ascending=[True, False])
get_elapsed(fld, 'Before')

Moving Averages (take a 7 day period … create a rolling sum) AKA window functions

bwd = df[['Store']+columns].sort_index().groupby("Store").rolling(7, min_periods=1).sum()

Wiki / Lesson Thread: Lesson 10
About the Intro to Machine Learning category