DeepLearning-Lec4Notes

timlee · November 21, 2017, 4:42pm

Hi All,

If you want the Jupyter notebook versions of the lecture notes, I have saved a versions on my github here.

Here’s lecture 4 notes. @vikbehal, I couldn’t edit your post, so I had to post another one. @jeremy, please turn this to a wiki thread so others can edit? Thank you!

Tim

Unofficial Deep Learning Lecture 4 Notes

Today’s topics - how to use the tools for these applications.

Structured Learning
Deep Learning for Language
Deep Learning for Recommendation

Aside

Dropout

Vocab

activations - A number that’s been calculated.

In lesson 2, Jeremy said, “An activation is a number saying the confidence of some level of, say, eyeballness.”. Activations are also sometimes referred to as parameters or features.

Looking at the Dog Breeds CNN model, let’s look at the different layers

Looking at Dogbreeds lets’ look at the layers:

(0)-BatchNorm
(1)-Dropout (p = 0.25) <----
(2)-Linear
(3)-ReLU
(4)-BatchNorm
(5)-Dropout (p = 0.5)
(6)-Linear 
(7)-LogSoftMax

That means we go through and pick some activations and delete some of the calculated numbers.

p = whats the probability that you drop some cells.

Note: Each mini-batch we throw again a different set of activations.

p = 0.00 - No drop
p = 0.01 - will drop 1% of your activations (train well, but not very general)
p = 0.99 - will drop out 99% of your activations, will kill your accuracy (super general)

If you find it overfitting, increase p.

As you start using bigger models you will probably need to increase the amount of dropout.

You may notice we have two linear layers. We’ve been adding 2 linear layers, xtra_fc=[] can handle this additional layer.

If we pass in the blank xfra_fc, we get a minimum model, which is the simplest sequential model:

(0)-BatchNorm
(1)-Linear
(2)-LogSoftMax

ps = [0, 0.2] can set the different drop out layers for the ConvLearner parameters

1. Structured and Time Series Data

%matplotlib inline
%reload_ext autoreload
%autoreload 2

import sys
sys.path.append('/home/paperspace/repos/fastai/')

import torch

from fastai.structured import *
from fastai.column_data import *
np.set_printoptions(threshold=50, edgeitems=20)

PATH='/home/paperspace/Desktop/data/rossman/'

Create datasets

We will look at the rossman kaggle competition as an example of structured data

In addition to the provided data, we will be using external datasets put together by participants in the Kaggle competition. You can download all of them here.

For completeness, the implementation used to put them together is included below.

def concat_csvs(dirname):
    path = f'{PATH}{dirname}'
    filenames=glob.glob(f"{path}/*.csv")

    wrote_header = False
    with open(f"{path}.csv","w") as outputfile:
        for filename in filenames:
            name = filename.split(".")[0]
            with open(filename) as f:
                line = f.readline()
                if not wrote_header:
                    wrote_header = True
                    outputfile.write("file,"+line)
                for line in f:
                     outputfile.write(name + "," + line)
                outputfile.write("\n")

Feature Space:

train: Training set provided by competition
store: List of stores
store_states: mapping of store to the German state they are in
List of German state names
googletrend: trend of certain google keywords over time, found by users to correlate well w/ given data
weather: weather
test: testing set

table_names = ['train', 'store', 'store_states', 'state_names', 
               'googletrend', 'weather', 'test']

We’ll be using the popular data manipulation framework pandas. Among other things, pandas allows you to manipulate tables/data frames in python as one would in a database.

We’re going to go ahead and load all of our csv’s as dataframes into the list tables.

tables = [pd.read_csv(f'{PATH}{fname}.csv', low_memory=False) for fname in table_names]

from IPython.display import HTML

We can use head() to get a quick look at the contents of each table:

train: Contains store information on a daily basis, tracks things like sales, customers, whether that day was a holdiay, etc.
store: general info about the store including competition, etc.
store_states: maps store to state it is in
state_names: Maps state abbreviations to names
googletrend: trend data for particular week/state
weather: weather conditions for each state
test: Same as training table, w/o sales and customers

for t in tables: display(t.head())

	Store	DayOfWeek	Date	Sales	Customers	Open	Promo	SchoolHoliday
0	1	5	2015-07-31	5263	555	1	1	1
1	2	5	2015-07-31	6064	625	1	1	1
2	3	5	2015-07-31	8314	821	1	1	1
3	4	5	2015-07-31	13995	1498	1	1	1
4	5	5	2015-07-31	4822	559	1	1	1

	Store	StoreType	Assortment	CompetitionDistance	CompetitionOpenSinceMonth	CompetitionOpenSinceYear	Promo2	Promo2SinceWeek	Promo2SinceYear	PromoInterval
0	1	c	a	1270.0	9.0	2008.0	0	NaN	NaN	NaN
1	2	a	a	570.0	11.0	2007.0	1	13.0	2010.0	Jan,Apr,Jul,Oct
2	3	a	a	14130.0	12.0	2006.0	1	14.0	2011.0	Jan,Apr,Jul,Oct
3	4	c	c	620.0	9.0	2009.0	0	NaN	NaN	NaN
4	5	a	a	29910.0	4.0	2015.0	0	NaN	NaN	NaN

	Store	State
0	1	HE
1	2	TH
2	3	NW
3	4	BE
4	5	SN

	StateName	State
0	BadenWuerttemberg	BW
1	Bayern	BY
2	Berlin	BE
3	Brandenburg	BB
4	Bremen	HB

	file	week	trend
0	Rossmann_DE_SN	2012-12-02 - 2012-12-08	96
1	Rossmann_DE_SN	2012-12-09 - 2012-12-15	95
2	Rossmann_DE_SN	2012-12-16 - 2012-12-22	91
3	Rossmann_DE_SN	2012-12-23 - 2012-12-29	48
4	Rossmann_DE_SN	2012-12-30 - 2013-01-05	67

	file	Date	Max_TemperatureC	Mean_TemperatureC	Min_TemperatureC	Dew_PointC	MeanDew_PointC	Min_DewpointC	Max_Humidity	Mean_Humidity	...	Max_VisibilityKm	Mean_VisibilityKm	Min_VisibilitykM	Max_Wind_SpeedKm_h	Mean_Wind_SpeedKm_h	Max_Gust_SpeedKm_h	Precipitationmm	CloudCover	Events	WindDirDegrees
0	NordrheinWestfalen	2013-01-01	8	4	2	7	5	1	94	87	...	31.0	12.0	4.0	39	26	58.0	5.08	6.0	Rain	215
1	NordrheinWestfalen	2013-01-02	7	4	1	5	3	2	93	85	...	31.0	14.0	10.0	24	16	NaN	0.00	6.0	Rain	225
2	NordrheinWestfalen	2013-01-03	11	8	6	10	8	4	100	93	...	31.0	8.0	2.0	26	21	NaN	1.02	7.0	Rain	240
3	NordrheinWestfalen	2013-01-04	9	9	8	9	9	8	100	94	...	11.0	5.0	2.0	23	14	NaN	0.25	7.0	Rain	263
4	NordrheinWestfalen	2013-01-05	8	8	7	8	7	6	100	94	...	10.0	6.0	3.0	16	10	NaN	0.00	7.0	Rain	268

5 rows × 24 columns

	Id	Store	DayOfWeek	Date	Open	Promo
0	1	1	4	2015-09-17	1.0	1
1	2	3	4	2015-09-17	1.0	1
2	3	7	4	2015-09-17	1.0	1
3	4	8	4	2015-09-17	1.0	1
4	5	9	4	2015-09-17	1.0	1

Data Cleaning / Feature Engineering

As a structured data problem, we necessarily have to go through all the cleaning and feature engineering, even though we’re using a neural network.

These are the consolidated Data processing blocks. For the originals, please refer to lesson 4 on deep learning.

train, store, store_states, state_names, googletrend, weather, test = tables

len(train),len(test)

We turn state Holidays to booleans, to make them more convenient for modeling. We can do calculations on pandas fields using notation very similar (often identical) to numpy.

Data Processing Blocks

train.StateHoliday = train.StateHoliday!='0'
test.StateHoliday = test.StateHoliday!='0'

def join_df(left, right, left_on, right_on=None, suffix='_y'):
    if right_on is None: right_on = left_on
    return left.merge(right, how='left', left_on=left_on, right_on=right_on, 
                      suffixes=("", suffix))

weather = join_df(weather, state_names, "file", "StateName")

googletrend['Date'] = googletrend.week.str.split(' - ', expand=True)[0]
googletrend['State'] = googletrend.file.str.split('_', expand=True)[2]
googletrend.loc[googletrend.State=='NI', "State"] = 'HB,NI'

add_datepart(weather, "Date", drop=False)
add_datepart(googletrend, "Date", drop=False)
add_datepart(train, "Date", drop=False)
add_datepart(test, "Date", drop=False)
add_datepart(googletrend, "Date", drop=False)

trend_de = googletrend[googletrend.file == 'Rossmann_DE']

store = join_df(store, store_states, "Store")
len(store[store.State.isnull()])

joined = join_df(train, store, "Store")
len(joined[joined.StoreType.isnull()])

joined = join_df(joined, googletrend, ["State","Year", "Week"])
len(joined[joined.trend.isnull()])

joined = joined.merge(trend_de, 'left', ["Year", "Week"], suffixes=('', '_DE'))
len(joined[joined.trend_DE.isnull()])

joined = join_df(joined, weather, ["State","Date"])
len(joined[joined.Mean_TemperatureC.isnull()])

joined_test = test.merge(store, how='left', left_on='Store', right_index=True)
len(joined_test[joined_test.StoreType.isnull()])

for c in joined.columns:
    if c.endswith('_y'):
        if c in joined.columns: joined.drop(c, inplace=True, axis=1)

joined.CompetitionOpenSinceYear = joined.CompetitionOpenSinceYear.fillna(1900).astype(np.int32)
joined.CompetitionOpenSinceMonth = joined.CompetitionOpenSinceMonth.fillna(1).astype(np.int32)
joined.Promo2SinceYear = joined.Promo2SinceYear.fillna(1900).astype(np.int32)
joined.Promo2SinceWeek = joined.Promo2SinceWeek.fillna(1).astype(np.int32)

joined["CompetitionOpenSince"] = pd.to_datetime(dict(year=joined.CompetitionOpenSinceYear, 
                                                     month=joined.CompetitionOpenSinceMonth, day=15))
joined["CompetitionDaysOpen"] = joined.Date.subtract(joined.CompetitionOpenSince).dt.days

joined.loc[joined.CompetitionDaysOpen<0, "CompetitionDaysOpen"] = 0
joined.loc[joined.CompetitionOpenSinceYear<1990, "CompetitionDaysOpen"] = 0

joined["CompetitionMonthsOpen"] = joined["CompetitionDaysOpen"]//30
joined.loc[joined.CompetitionMonthsOpen>24, "CompetitionMonthsOpen"] = 24
joined.CompetitionMonthsOpen.unique()

joined["Promo2Since"] = pd.to_datetime(joined.apply(lambda x: Week(
    x.Promo2SinceYear, x.Promo2SinceWeek).monday(), axis=1).astype(pd.datetime))
joined["Promo2Days"] = joined.Date.subtract(joined["Promo2Since"]).dt.days

joined.loc[joined.Promo2Days<0, "Promo2Days"] = 0
joined.loc[joined.Promo2SinceYear<1990, "Promo2Days"] = 0


joined["Promo2Weeks"] = joined["Promo2Days"]//7
joined.loc[joined.Promo2Weeks<0, "Promo2Weeks"] = 0
joined.loc[joined.Promo2Weeks>25, "Promo2Weeks"] = 25
joined.Promo2Weeks.unique()

joined.to_feather(f'{PATH}joined')

def get_elapsed(fld, pre):
    day1 = np.timedelta64(1, 'D')
    last_date = np.datetime64()
    last_store = 0
    res = []

    for s,v,d in zip(df.Store.values,df[fld].values, df.Date.values):
        if s != last_store:
            last_date = np.datetime64()
            last_store = s
        if v: last_date = d
        res.append(((d-last_date).astype('timedelta64[D]') / day1).astype(int))
    df[pre+fld] = res
    
    
columns = ["Date", "Store", "Promo", "StateHoliday", "SchoolHoliday"]
df = train[columns]

fld = 'SchoolHoliday'
df = df.sort_values(['Store', 'Date'])
get_elapsed(fld, 'After')
df = df.sort_values(['Store', 'Date'], ascending=[True, False])
get_elapsed(fld, 'Before')

fld = 'StateHoliday'
df = df.sort_values(['Store', 'Date'])
get_elapsed(fld, 'After')
df = df.sort_values(['Store', 'Date'], ascending=[True, False])
get_elapsed(fld, 'Before')

fld = 'Promo'
df = df.sort_values(['Store', 'Date'])
get_elapsed(fld, 'After')
df = df.sort_values(['Store', 'Date'], ascending=[True, False])
get_elapsed(fld, 'Before')

df = df.set_index("Date")

columns = ['SchoolHoliday', 'StateHoliday', 'Promo']

for o in ['Before', 'After']:
    for p in columns:
        a = o+p
        df[a] = df[a].fillna(0)
        
bwd = df[['Store']+columns].sort_index().groupby("Store").rolling(7, min_periods=1).sum()
fwd = df[['Store']+columns].sort_index(ascending=False
                                      ).groupby("Store").rolling(7, min_periods=1).sum()

bwd.drop('Store',1,inplace=True)
bwd.reset_index(inplace=True)

fwd.drop('Store',1,inplace=True)
fwd.reset_index(inplace=True)

df.reset_index(inplace=True)

df = df.merge(bwd, 'left', ['Date', 'Store'], suffixes=['', '_bw'])
df = df.merge(fwd, 'left', ['Date', 'Store'], suffixes=['', '_fw'])

df.drop(columns,1,inplace=True)
df.to_csv(f'{PATH}df.csv')
df.head()

df["Date"] = pd.to_datetime(df.Date)
df.columns

Index(['Date', 'Store', 'AfterSchoolHoliday', 'BeforeSchoolHoliday',
       'AfterStateHoliday', 'BeforeStateHoliday', 'AfterPromo', 'BeforePromo',
       'SchoolHoliday_bw', 'StateHoliday_bw', 'Promo_bw', 'SchoolHoliday_fw',
       'StateHoliday_fw', 'Promo_fw'],
      dtype='object')

joined = join_df(joined, df, ['Store', 'Date'])
joined = joined[joined.Sales!=0]
joined.reset_index(inplace=True)
joined.to_feather(f'{PATH}joined')

Check point! Save your feather file!

joined = pd.read_feather(f'{PATH}joined')

Discussion on Categorical

Which say which are categorical / continuous. This is a decision that we are going to make. Sometimes there are variables that are directly coded as categorical, there’s only one choice. But continuous in the data, you can make a decision.

For the 3rd place team, most of the continuous ones are mainly the floating point numbers. The number of levels in a category is the cardinality. We could bin variables 10 → 100 into 10-20, 20-30, etc.

cat_vars = ['Store', 'DayOfWeek', 'Year', 'Month', 'Day', 'StateHoliday', 'CompetitionMonthsOpen',
    'Promo2Weeks', 'StoreType', 'Assortment', 'PromoInterval', 'CompetitionOpenSinceYear', 'Promo2SinceYear',
    'State', 'Week', 'Events', 'Promo_fw', 'Promo_bw', 'StateHoliday_fw', 'StateHoliday_bw',
    'SchoolHoliday_fw', 'SchoolHoliday_bw']

contin_vars = ['CompetitionDistance', 'Max_TemperatureC', 'Mean_TemperatureC', 'Min_TemperatureC',
   'Max_Humidity', 'Mean_Humidity', 'Min_Humidity', 'Max_Wind_SpeedKm_h', 
   'Mean_Wind_SpeedKm_h', 'CloudCover', 'trend', 'trend_DE',
   'AfterStateHoliday', 'BeforeStateHoliday', 'Promo', 'SchoolHoliday']

n = len(joined); n

Now we can loop through and change them all into categorical

Turn all the continuous ones into 32bit floating point for pytorch’s sake

for v in cat_vars: joined[v] = joined[v].astype('category').cat.as_ordered()
for v in contin_vars: joined[v] = joined[v].astype('float32')
dep = 'Sales'
joined = joined[cat_vars+contin_vars+[dep, 'Date']]

idxs = get_cv_idxs(n, val_pct=150000/n)
joined_samp = joined.iloc[idxs].set_index("Date")
samp_size = len(joined_samp); samp_size

Get a smaller sample size

samp_size = n
joined_samp = joined.set_index("Date")

joined_samp.head(2)

	Store	DayOfWeek	Year	Month	Day	StateHoliday	CompetitionMonthsOpen	Promo2Weeks	StoreType	Assortment	...	Max_Wind_SpeedKm_h	Mean_Wind_SpeedKm_h	CloudCover	trend	trend_DE	AfterStateHoliday	BeforeStateHoliday	Promo	SchoolHoliday	Sales
Date
2015-07-31	1	5	2015	7	31	False	24	0	c	a	...	24.0	11.0	1.0	85.0	83.0	57.0	-9.223372e+18	1.0	1.0	5263
2015-07-31	2	5	2015	7	31	False	24	25	a	a	...	14.0	11.0	4.0	80.0	83.0	67.0	-9.223372e+18	1.0	1.0	6064

2 rows × 39 columns

Fastai Process Dataframe

proc_df - pulls out the target (y) and deletes from the original.
Also scales the dataframe.
Also creates another object to keep track of std and mean for changing the test set.
Also handles missing values, fills with median.

df, y, nas, mapper = proc_df(joined_samp, 'Sales', do_scale=True)
yl = np.log(y)

train_ratio = 0.75
# train_ratio = 0.9
train_size = int(samp_size * train_ratio); train_size
val_idx = list(range(train_size, len(df)))

val_idx = np.flatnonzero(
    (df.index<=datetime.datetime(2014,9,17)) & (df.index>=datetime.datetime(2014,8,1)))

Validation Set - Last two weeks

Make it as similar dataset as possible. This won’t be a random set like other times because this is time series.

The Deep Learning Version

def inv_y(a): return np.exp(a)

def exp_rmspe(y_pred, targ):
    targ = inv_y(targ)
    pct_var = (targ - inv_y(y_pred))/targ
    return math.sqrt((pct_var**2).mean())

max_log_y = np.max(yl)
y_range = (0, max_log_y*1.2)

Very similar to the image recognition process

find training dataset
find test set
set a learner
fit and go

But since we are not doing images, but structured data: ColumnarModelData

PATH - where to store everything that you save later
val_idx - which we will put in the validation set
df - dataframe
yl - target (dependent variable)
cat_flds - which things we want to treat as categorical. At this time everything is a number

md = ColumnarModelData.from_data_frame(PATH, val_idx, df, yl, cat_flds=cat_vars, bs=128)

Jumping ahead we need to set the learner …

How much drop out
How many embeddings

m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars),
                   0.04, 1, [1000,500], [0.001,0.01], y_range=y_range)
lr = 1e-3

New Topic: Embeddings

The continuous variables, we are going to grab them all.

What if you had days of the week, lets assign 4 random numbers.

0-Sunday
1-Mon
2-Tue
3-Wed
4-Thu
5-Fri
6-Sat

Where did 4 come from for Day of Week?

What’s our Cardinality for our Categorical Variables?

cat_sz = [(c, len(joined_samp[c].cat.categories)+1) for c in cat_vars]

cat_sz

[('Store', 1116),
 ('DayOfWeek', 8),
 ('Year', 4),
 ('Month', 13),
 ('Day', 32),
 ('StateHoliday', 3),
 ('CompetitionMonthsOpen', 26),
 ('Promo2Weeks', 27),
 ('StoreType', 5),
 ('Assortment', 4),
 ('PromoInterval', 4),
 ('CompetitionOpenSinceYear', 24),
 ('Promo2SinceYear', 9),
 ('State', 13),
 ('Week', 53),
 ('Events', 22),
 ('Promo_fw', 7),
 ('Promo_bw', 7),
 ('StateHoliday_fw', 4),
 ('StateHoliday_bw', 4),
 ('SchoolHoliday_fw', 9),
 ('SchoolHoliday_bw', 9)]

Rough Formula : Take your cardinality, divide by 2, but not more than 50

emb_szs = [(c, min(50, (c+1)//2)) for _,c in cat_sz]

emb_szs

[(1116, 50),
 (8, 4),
 (4, 2),
 (13, 7),
 (32, 16),
 (3, 2),
 (26, 13),
 (27, 14),
 (5, 3),
 (4, 2),
 (4, 2),
 (24, 12),
 (9, 5),
 (13, 7),
 (53, 27),
 (22, 11),
 (7, 4),
 (7, 4),
 (4, 2),
 (4, 2),
 (9, 5),
 (9, 5)]

Embedding = Distributed Representation

Jumping ahead we need to set the learner …

emb_szs Our embeddings
len(df.columns)-len(cat_vars)- how many continous variables
[1000,500] activations in linear layer
[0.001,0.01] - drop outs per linear layer
1 - our output

m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars),
                   0.04, 1, [1000,500], [0.001,0.01], y_range=y_range)
lr = 1e-3

m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars),
                   0.04, 1, [1000,500], [0.001,0.01], y_range=y_range)
lr = 1e-3

m.fit(lr, 3, metrics=[exp_rmspe])

[ 0.       0.02479  0.02205  0.19309]                          
[ 1.       0.02044  0.01751  0.18301]                          
[ 2.       0.01598  0.01571  0.17248]

m.fit(lr, 1, metrics=[exp_rmspe], cycle_len=1)

[ 0.       0.00676  0.01041  0.09711]

Step Overall

List the Categorical
List of row indexes in your dataset
Call the columnar dataset
How big you want your embedding matrix
Get learner
Fit

2. Deep Learning for Language

Language notebook

Language Modeling

Can you predict what the next word is going to be? Swiftkey actually does this. Modeling predicts the next word.

IMDb movie review dataset

http://ai.stanford.edu/~amaas/data/sentiment/

#!pip install spacy

To download the ‘en’ library needed, run:

#!python -m spacy download en

The large movie view dataset contains a collection of 50,000 reviews from IMDB. The dataset contains an even number of positive and negative reviews. The authors considered only highly polarized reviews. A negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. Neutral reviews are not included in the dataset. The dataset is divided into training and test sets. The training set is the same 25,000 labeled reviews.

The sentiment classification task consists of predicting the polarity (positive or negative) of a given text.

However, before we try to classify sentiment, we will simply try to create a language model; that is, a model that can predict the next word in a sentence. Why? Because our model first needs to understand the structure of English, before we can expect it to recognize positive vs negative sentiment.

So our plan of attack is the same as we used for Dogs v Cats: pretrain a model to do one thing (predict the next word), and fine tune it to do something else (classify sentiment).

Unfortunately, there are no good pretrained language models available to download, so we need to create our own.

import torch

%reload_ext autoreload
%autoreload 2
%matplotlib inline

import sys
sys.path.append('/home/paperspace/repos/fastai/')

from fastai.imports import *
from fastai.torch_imports import *
from fastai.core import *
from fastai.model import fit
from fastai.dataset import *

import torchtext
from torchtext import vocab, data
from torchtext.datasets import language_modeling

from fastai.rnn_reg import *
from fastai.rnn_train import *
from fastai.nlp import *
from fastai.lm_rnn import *

import dill as pickle

Torchtext - is Pytorch’s NLP library

PATH='/home/paperspace/Desktop/data/imdb/aclImdb/'

TRN_PATH = 'train/all/'
VAL_PATH = 'test/all/'
TRN = f'{PATH}{TRN_PATH}'
VAL = f'{PATH}{VAL_PATH}'

%ls {PATH}

imdbEr.txt  imdb.vocab  emodelse/  README  etest/  etmp/  etrain/

trn_files = !ls {TRN}
trn_files[:10]

['0_0.txt',
 '0_3.txt',
 '0_9.txt',
 '10000_0.txt',
 '10000_4.txt',
 '10000_8.txt',
 '1000_0.txt',
 '10001_0.txt',
 '10001_10.txt',
 '10001_4.txt']

review = !cat {TRN}{trn_files[6]}
review[0]

"I have to say when a name like Zombiegeddon and an atom bomb on the front cover I was expecting a flat out chop-socky fung-ku, but what I got instead was a comedy. So, it wasn't quite was I was expecting, but I really liked it anyway! The best scene ever was the main cop dude pulling those kids over and pulling a Bad Lieutenant on them!! I was laughing my ass off. I mean, the cops were just so bad! And when I say bad, I mean The Shield Vic Macky bad. But unlike that show I was laughing when they shot people and smoked dope.<br /><br />Felissa Rose...man, oh man. What can you say about that hottie. She was great and put those other actresses to shame. She should work more often!!!!! I also really liked the fight scene outside of the building. That was done really well. Lots of fighting and people getting their heads banged up. FUN! Last, but not least Joe Estevez and William Smith were great as the...well, I wasn't sure what they were, but they seemed to be having fun and throwing out lines. I mean, some of it didn't make sense with the rest of the flick, but who cares when you're laughing so hard! All in all the film wasn't the greatest thing since sliced bread, but I wasn't expecting that. It was a Troma flick so I figured it would totally suck. It's nice when something surprises you but not totally sucking.<br /><br />Rent it if you want to get stoned on a Friday night and laugh with your buddies. Don't rent it if you are an uptight weenie or want a zombie movie with lots of flesh eating.<br /><br />P.S. Uwe Boil was a nice touch."

!find {TRN} -name '*.txt' | xargs cat | wc -w

17486581

!find {VAL} -name '*.txt' | xargs cat | wc -w

' '.join(spacy_tok(review[0]))

"I have to say when a name like Zombiegeddon and an atom bomb on the front cover I was expecting a flat out chop - socky fung - ku , but what I got instead was a comedy . So , it was n't quite was I was expecting , but I really liked it anyway ! The best scene ever was the main cop dude pulling those kids over and pulling a Bad Lieutenant on them ! ! I was laughing my ass off . I mean , the cops were just so bad ! And when I say bad , I mean The Shield Vic Macky bad . But unlike that show I was laughing when they shot people and smoked dope . \n\n Felissa Rose ... man , oh man . What can you say about that hottie . She was great and put those other actresses to shame . She should work more often ! ! ! ! ! I also really liked the fight scene outside of the building . That was done really well . Lots of fighting and people getting their heads banged up . FUN ! Last , but not least Joe Estevez and William Smith were great as the ... well , I was n't sure what they were , but they seemed to be having fun and throwing out lines . I mean , some of it did n't make sense with the rest of the flick , but who cares when you 're laughing so hard ! All in all the film was n't the greatest thing since sliced bread , but I was n't expecting that . It was a Troma flick so I figured it would totally suck . It 's nice when something surprises you but not totally sucking . \n\n Rent it if you want to get stoned on a Friday night and laugh with your buddies . Do n't rent it if you are an uptight weenie or want a zombie movie with lots of flesh eating . \n\n P.S. Uwe Boil was a nice touch ."

We use Pytorch’s torchtext library to pre-process our data, telling it to use the wonderful spacy library to handle tokenization.

First, we create a torchtext field, which describes how to pre-process a piece of text - in this case, we tell torchtext to make everything lowercase, and tokenize it with spacy.

we will lower, then we will spacy tokenize.

TEXT = data.Field(lower=True, tokenize=spacy_tok)

fastai library works closely with torchtext. We create a ModelData object for language modeling by taking advantage of LanguageModelData, passing it our torchtext field object, and the paths to our training, test, and validation sets. In this case, we don’t have a separate test set, so we’ll just use VAL_PATH for that too.

As well as the usual bs (batch size) parameter, we also not have bptt; this define how many words are processing at a time in each row of the mini-batch. More importantly, it defines how many ‘layers’ we will backprop through. Making this number higher will increase time and memory requirements, but will improve the model’s ability to handle long sentences.

min_freq - must occur with 10 times
bs - batch size
bptt - back propagation (through time) - how much of the sentence will sit at any time

bs=64; bptt=70

WARNING: takes a long time to run

FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)
md = LanguageModelData(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)

After building our ModelData object, it automatically fills the TEXT object with a very important attribute: TEXT.vocab. This is a vocabulary, which stores which words (or tokens) have been seen in the text, and how each word will be mapped to a unique integer id. We’ll need to use this information again later, so we save it.

(Technical note: python’s standard Pickle library can’t handle this correctly, so at the top of this notebook we used the dill library instead and imported it as pickle).

Save the model for later

pickle.dump(TEXT, open(f'{PATH}models/TEXT.pkl','wb'))

Here are the: # batches; # unique tokens in the vocab; # tokens in the training set; # sentences

len(md.trn_dl), md.nt, len(md.trn_ds), len(md.trn_ds[0].text)

(4603, 34933, 1, 20626672)

This is the start of the mapping from integer IDs to unique tokens.

# 'itos': 'int-to-string'
TEXT.vocab.itos[:12]

['<unk>', '<pad>', 'the', ',', '.', 'and', 'a', 'of', 'to', 'is', 'it', 'in']

# 'stoi': 'string to int'
TEXT.vocab.stoi['the']

Note that in a LanguageModelData object there is only one item in each dataset: all the words of the text joined together.

TEXT.numericalize([md.trn_ds[0].text[:12]])

Variable containing:
   37
  866
    3
   27
  194
   67
    3
    2
  378
    3
   27
  194
[torch.cuda.LongTensor of size 12x1 (GPU 0)]

Our LanguageModelData object will create batches with 64 columns (that’s our batch size), and varying sequence lengths of around 80 tokens (that’s our bptt parameter - backprop through time).

Each batch also contains the exact same data as labels, but one word later in the text - since we’re trying to always predict the next word. The labels are flattened into a 1d array.

next(iter(md.trn_dl))

(Variable containing:
     37    409     22  ...      50     56      4
    866    232      6  ...     815     51     34
      3    584    655  ...       4     59    325
         ...            ⋱           ...         
     44     60   2332  ...      14    841   9681
   2768     10      9  ...      75      3    360
   1032    149    899  ...   12254      6     14
 [torch.cuda.LongTensor of size 81x64 (GPU 0)], Variable containing:
    866
    232
      6
   ⋮   
     32
    865
     27
 [torch.cuda.LongTensor of size 5184 (GPU 0)])

Let’s make our own Embedding off of our data

md.nt = number of tokens = 34,495
md.trn_ds[0].text = 20,621,966

Train

We have a number of parameters to set - we’ll learn more about these later, but you should find these values suitable for many problems.

em_sz = 200  # size of each embedding vector
nh = 500     # number of hidden activations per layer
nl = 3       # number of layers

Researchers have found that large amounts of momentum (which we’ll learn about later) don’t work well with these kinds of RNN models, so we create a version of the Adam optimizer with less momentum than it’s default of 0.9.

Adam Optimizer, the defaults don’t work very well

opt_fn = partial(optim.Adam, betas=(0.7, 0.99))

fastai library uses a variant of the state of the art AWD LSTM Language Model developed by Stephen Merity. A key feature of this model is that it provides excellent regularization through Dropout. There is no simple way known (yet!) to find the best values of the dropout parameters below - you just have to experiment…

However, the other parameters (alpha, beta, and clip) shouldn’t generally need tuning.

opt_fn - defaults for adam optimizer
em_sz - embedding size
nh - number of activations
nl - number of layers
dropout - how much drop out to add (won’t elaborate)
learner.clip=0.3 - don’t let the gradient go any lower than this number

This is a RNN using an LSTM

learner = md.get_model(opt_fn, em_sz, nh, nl,
               dropouti=0.05, dropout=0.05, wdrop=0.1, dropoute=0.02, dropouth=0.05)
learner.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
learner.clip=0.3

As you can see below, I gradually tuned the language model in a few stages. I possibly could have trained it further (it wasn’t yet overfitting), but I didn’t have time to experiment more. Maybe you can see if you can train it to a better accuracy! (I used lr_find to find a good learning rate, but didn’t save the output in this notebook. Feel free to try running it yourself now.)

Trying different settings and modeling

learner.fit(3e-3, 1, wds=1e-6, cycle_len=20, cycle_save_name='adam3_20')

[ 0.       6.44582  6.40362]                                    
[ 1.       4.71655  4.61361]                                  
[ 2.       4.59772  4.49624]                                  
[ 3.       4.53644  4.44157]                                  
[ 4.       4.5055   4.40163]                                  
[ 5.       4.47234  4.37386]                                  
[ 6.       4.44514  4.35307]                                  
[ 7.       4.42154  4.33165]                                  
[ 8.       4.39408  4.31588]                                  
[ 9.       4.38333  4.30152]                                  
[ 10.        4.35194   4.28591]                               
[ 11.        4.35191   4.27355]                               
[ 12.        4.32064   4.25818]                               
[ 13.        4.29965   4.2474 ]                               
[ 14.        4.28177   4.23785]                               
[ 15.        4.28475   4.23243]                               
[ 16.        4.25953   4.22698]                               
[ 17.        4.23739   4.22239]                               
[ 18.        4.24809   4.22182]                               
[ 19.        4.25874   4.22156]

learner.load_cycle('adam3_20', 0)

learner.save_encoder('adam3_20_enc')

Sentiment

We’ll need to the saved vocab from the language model, since we need to ensure the same words map to the same IDs.

TEXT = pickle.load(open(f'{PATH}models/TEXT.pkl','rb'))

sequential=False tells torchtext that a text field should be tokenized (in this case, we just want to store the ‘positive’ or ‘negative’ single label).

splits is a torchtext method that creates train, test, and validation sets. The IMDB dataset is built into torchtext, so we can take advantage of that. Take a look at lang_model-arxiv.ipynb to see how to define your own fastai/torchtext datasets.

IMDB_LABEL = data.Field(sequential=False)
splits = torchtext.datasets.IMDB.splits(TEXT, IMDB_LABEL, 'data/')

t = splits[0].examples[0]

t.label, ' '.join(t.text[:16])

('pos',
 'i remember watching this movie with my friends when we were 4 years old , but')

fastai library can create a ModelData object directly from torchtext splits.

md2 = TextData.from_splits(PATH, splits, bs)

m3 = md2.get_model(opt_fn, 1500, bptt, emb_sz=em_sz, n_hid=nh, n_layers=nl, 
           dropout=0.1, dropouti=0.4, wdrop=0.5, dropoute=0.05, dropouth=0.3)
m3.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
m3.load_encoder(f'adam3_20_enc')

Because we’re fine-tuning a pretrained model, we’ll use differential learning rates, and also increase the max gradient for clipping, to allow the SGDR to work better.

m3.clip=25.
lrs=np.array([1e-4,1e-3,1e-2])

3. Deep Learning for Recommendation

Next week, we will be learning Collaborative Filtering and making this.

jeremy · November 21, 2017, 4:44pm

Now wikified!

wgpubs · November 21, 2017, 5:02pm

It feels like Christmas when these posts arrive @timlee !

vikram · November 21, 2017, 5:35pm

This is wickedly amazing! You rock @timlee.
Thanks!

ecdrid · November 26, 2017, 3:55pm

Thank you…

kmatsuda · November 27, 2017, 8:56pm

@timlee, these notes are so amazing. Thank you for sharing!

creviera · December 2, 2017, 7:56am

Nice notes! I’m still working my way through them, but I wanted to mention that I think Jeremy mentioned that weights, not activations, are sometimes called parameters. I also wonder if it’s not better to think of weights as features, or as representing features we want to find, rather than activations as features.
(referene https://youtu.be/gbceqO8PpBg?t=5m45s)

suryatk · December 2, 2017, 9:40am

@timlee great notes! in the image for embeddings of categorical features, the embeddings are concatenated or are they summed? each categorical feature can have variable length embedding.

jeremy · December 2, 2017, 5:06pm

Yes that’s correct, and it’s an important point. Since the notes post is a wiki, perhaps you could edit it to make that correction?

Generally we create features (with feature engineering) by doing computations to the variables. It’s the output of these computations that are the features, not the parameters to them. Therefore, it does seem appropriate to describe the activations as ‘features’, at least in some contexts.

creviera · December 2, 2017, 9:57pm

+1 sounds good. made a pr on the notes.

mmr · December 3, 2017, 9:36am

Hi Jeremy - I have couple of questions - one is usage of 1d convolution instead of RNN - is it possible/advisable to make a language model using 1d convolution instead of RNN; and another one is about application of the language model.How useful or sane do you think using the language model in order kinds of data types - for example in genome sequencing - where the output would be malignant / benign tumor and the input might be short sequence of genomes considered to be as words.

jeremy · December 3, 2017, 6:59pm

The current SoTA for language modeling is an RNN, but CNNs have shown good results in similar areas.

I think a language model for genomics would be fascinating…

aptrishu · January 12, 2018, 4:05pm

@jeremy Here

df["Promo2Since"] = pd.to_datetime(df.apply(lambda x: Week(
    x.Promo2SinceYear, x.Promo2SinceWeek).monday(), axis=1).astype(pd.datetime))

Is the Week method provided by pandas? All I understand here is we are converting given year and week number to a date. Also what does monday() do here?
I didn’t find any relevant information about it on google search.

jeremy · January 27, 2018, 6:21pm

It’s coming from this package: isoweek · PyPI

tdphong · February 22, 2018, 11:18am

Hi, I just have a question. This language model is using the movie dataset from the IMDB where the review is negative or positive. But for other sentiments like whether a person like football or not, how could we get the training dataset for such that problem ?

joost · March 13, 2018, 8:56am

tells whether the datatype represents sequential data. If False, no tokenization is applied. Default: True.

MicPie · July 15, 2018, 9:57am

Gradient clipping is used to avoid to large gradients, or am I wrong?