DeepLearning-Lec10-Notes

Hi All,

Apologies for the delay, I’ve got too many side projects going on, and training the language model took me about 30 mins / epoch, which I will have to investigate later. Here are the notes from lecture 10. Looks like the reading list is growing with some good paper suggestions!

  • Tim

Lesson 10

NLP Classification and Translation

We are going to look at NLP this week

Review of Object Detection:

  • See Lesson 8 / Lesson 9

NLP

Introduction

Recall that in image classification, conv_learner gives us a standard way to stick a custom head on top of a pretrained backbone and have it do anything we like. That flexibility let us predict rotations and solve other interesting problems. Let’s take this concept and apply it to text.

NLP + Computer vision techniques Roadmap:

  • Learn to find word structures from images, aka captioning
  • Learn to find images from word structures
  • Learn to find images from images
  • Image enhancement - upscaling low-resolution photos into high resolution
  • Segmenting a picture into objects

torchtext to fastai.text

torchtext is great, but it’s slow, it doesn’t run in parallel, and it doesn’t remember what it did last time (no caching). fastai.text is a combination of torchtext and fastai.nlp. Consider fastai.nlp deprecated (outdated) at this point.

Import libraries

import sys
sys.path.append('../')
from fastai.text import *
import html

IMDB - Internet Movie Database

weblink to python notebook

Get the dataset

!wget http://files.fast.ai/data/aclImdb.tgz
!tar -xf aclImdb.tgz

Check the file directories:

!ls ~/data/aclImdb/
    ->imdbEr.txt  imdb.vocab	README	test  train

Check the file counts

!ls /home/paperspace/data/aclImdb/train/all | wc -l
75000
!ls /home/paperspace/data/aclImdb/test/all | wc -l
25000

Look at directories

!ls /home/paperspace/data/aclImdb/
imdbEr.txt  imdb.vocab	models	README	test  tmp  train
!ls /home/paperspace/data/aclImdb/train
all		 neg  unsup	     urls_neg.txt  urls_unsup.txt
labeledBow.feat  pos  unsupBow.feat  urls_pos.txt

Create paths for the classifier (CLAS) data and the language model (LM) data

# create paths to save future features
CLAS_PATH=Path('/home/paperspace/data/imdb_clas/')
CLAS_PATH.mkdir(exist_ok=True)


LM_PATH=Path('/home/paperspace/data/imdb_lm/')
LM_PATH.mkdir(exist_ok=True)
BOS = 'xbos'  # beginning-of-sentence tag
FLD = 'xfld'  # data field tag
CLASSES = ['neg', 'pos', 'unsup']


PATH=Path('/home/paperspace/data/aclImdb/')

Turning Text into Numbers

  • How do we turn sentences into numbers?

First let’s prepare the IMDB dataset by creating a pandas DataFrame from the collection of .txt files

def get_texts(path):
    """
    This function will go through the aclImdb folder
    and create the necessary datasets
    """
    
    # initializes the text and labels collections
    texts,labels = [],[]

    
    # for each sentiment
    for idx,label in enumerate(CLASSES):
    
    
        # go through every file with this label
        for fname in (path/label).glob('*.*'):
    
    
            # open the file and append the filetext
            texts.append(fname.open('r').read())
    
    
            # record the label index for this file
            labels.append(idx)
    return np.array(texts),np.array(labels)


trn_texts,trn_labels = get_texts(PATH/'train')
val_texts,val_labels = get_texts(PATH/'test')

Look at a sample of the text

trn_labels[0],trn_texts[0]
(0,
 "Basically, Cruel Intentions 2 is Cruel Intentions 1, again, only poorly done. The story is exactly the same as the first one (even some of the lines), with only a few exceptions. The cast is more unknown, and definitely less talented. Instead of being seductive and drawing me into watching it, I ended up feeling dirty because it compares to watching a soft-core porn. I'm not sure whether to blame some of the idiotic lines on the actors or the writers...and I always feel bad saying that, because I know how hard it is to do both...but it was basically a two-hour waste of my life. It literally amazes me that some movies get made, and this is no exception...I can't believe they'd make a third one.")

Prepare dataframe

col_names = ['labels','text']
np.random.seed(42)


# shuffle the indexes in place
trn_idx = np.random.permutation(len(trn_texts))
val_idx = np.random.permutation(len(val_texts))


#shuffle texts
trn_texts = trn_texts[trn_idx]
val_texts = val_texts[val_idx]


#shuffle the labels 
trn_labels = trn_labels[trn_idx]
val_labels = val_labels[val_idx]


#create dataframe
df_trn = pd.DataFrame({'text':trn_texts, 'labels':trn_labels}, columns=col_names)
df_val = pd.DataFrame({'text':val_texts, 'labels':val_labels}, columns=col_names)


# save the classifier training set (dropping the 'unsup' class, label 2) and the validation set
df_trn[df_trn['labels']!=2].to_csv(CLAS_PATH/'train.csv', header=False, index=False)
df_val.to_csv(CLAS_PATH/'test.csv', header=False, index=False)


#write the classes
(CLAS_PATH/'classes.txt').open('w').writelines(f'{o}\n' for o in CLASSES)

Create new training and validation splits (90/10) for the language model

from sklearn.model_selection import train_test_split

trn_texts,val_texts = train_test_split(np.concatenate([trn_texts,val_texts]), test_size=0.1)
len(trn_texts), len(val_texts)
(90000, 10000)

Create the language model (LM) DataFrames and save them

df_trn = pd.DataFrame({'text':trn_texts, 'labels':[0]*len(trn_texts)}, columns=col_names)
df_val = pd.DataFrame({'text':val_texts, 'labels':[0]*len(val_texts)}, columns=col_names)

df_trn.to_csv(LM_PATH/'train.csv', header=False, index=False)
df_val.to_csv(LM_PATH/'test.csv', header=False, index=False)

Language Model Tokens

  • tokenization - turning a sentence into tokens with some specific rules, handling punctuation correctly
  • fixup() - there are always strange encodings in text; this replaces a number of common outlier character sequences
  • get_texts() - pulls the labels and text out of a dataframe chunk, applies fixup(), and tokenizes the text
  • get_all() - calls get_texts() repeatedly for each chunk of the dataframe iterator, returning all tokens and labels

Reference

  • Tokenizer().proc_all_mp(partition_by_cores(texts))

proc_all_mp function

def proc_all_mp(ss, lang='en'):
    ncpus = num_cpus()//2
    with ProcessPoolExecutor(ncpus) as e:
        return sum(e.map(Tokenizer.proc_all, ss, [lang]*len(ss)), [])

partition_by_cores function

def partition_by_cores(a):
    return partition(a, len(a)//num_cpus() + 1)

ProcessPoolExecutor comes from the Python 3 standard library (concurrent.futures):
More on multiprocessing

# generalized use of ProcessPoolExecutor, based on the Python docs example
# (PRIMES and is_prime are defined here just to make the snippet self-contained)
import concurrent.futures

PRIMES = [112272535095293, 112582705942171, 115280095190773]

def is_prime(n):
    return n > 1 and all(n % i for i in range(2, int(n ** 0.5) + 1))

def main():
    with concurrent.futures.ProcessPoolExecutor() as executor:
        for number, prime in zip(PRIMES, executor.map(is_prime, PRIMES)):
            print('%d is prime: %s' % (number, prime))

if __name__ == '__main__':
    main()

# when given a chunksize, pandas won't return a full dataframe;
# it returns an iterator that yields chunks of `chunksize` rows at a time
chunksize=24000
re1 = re.compile(r'  +')

def fixup(x):
    """ Cleans up erroroneus characters"""
    x = x.replace('#39;', "'").replace('amp;', '&').replace('#146;', "'").replace(
        'nbsp;', ' ').replace('#36;', '$').replace('\\n', "\n").replace('quot;', "'").replace(
        '<br />', "\n").replace('\\"', '"').replace('<unk>','u_n').replace(' @.@ ','.').replace(
        ' @-@ ','-').replace('\\', ' \\ ')
    return re1.sub(' ', html.unescape(x))


def get_texts(df, n_lbls=1):
    
    # pull the labels out from the dataframe
    labels = df.iloc[:,range(n_lbls)].values.astype(np.int64)
    
    # pull the text column for each review
    # BOS is a flag to indicate when a new text is starting
    texts = f'\n{BOS} {FLD} 1 ' + df[n_lbls].astype(str)
    
    # sometimes a text has a title or other sub-sections; mark each extra column with FLD
    for i in range(n_lbls+1, len(df.columns)): texts += f' {FLD} {i-n_lbls} ' + df[i].astype(str)
    texts = texts.apply(fixup).values.astype(str)

    # Tokenize the data
    tok = Tokenizer().proc_all_mp(partition_by_cores(texts))
    return tok, list(labels)


def get_all(df, n_lbls):
    tok, labels = [], []
    for i, r in enumerate(df):
        print(i)
        tok_, labels_ = get_texts(r, n_lbls)
        tok += tok_
        labels += labels_
    return tok, labels
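
As a quick, hypothetical sanity check (the sample string below is made up, not taken from the IMDB data), fixup can be called on a single review to see the kind of cleanup it performs:

# made-up example: fixup turns the <br /> HTML artifacts into newlines,
# collapses repeated spaces, and unescapes HTML entities
sample = "This movie was great!<br /><br />I  loved   it."
print(fixup(sample))
# This movie was great!
#
# I loved it.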

Install Spacy if you don’t have it

!pip install spacy
Requirement already satisfied: spacy in /home/paperspace/anaconda3/envs/fastai/lib/python3.6/site-packages

Download English Model

!python -m spacy download en
Collecting https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz (37.4MB)
    100% |████████████████████████████████| 37.4MB 6.7MB/s
  Requirement already satisfied (use --upgrade to upgrade): en-core-web-sm==2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz in /home/paperspace/anaconda3/envs/fastai/lib/python3.6/site-packages

    Linking successful
    /home/paperspace/anaconda3/envs/fastai/lib/python3.6/site-packages/en_core_web_sm
    -->
    /home/paperspace/anaconda3/envs/fastai/lib/python3.6/site-packages/spacy/data/en

    You can now load the model via spacy.load('en')
import spacy
nlp = spacy.load('en')

Read in our Language model data, extract texts

 # workaround for a multiprocessing quirk in some notebook environments
 __spec__ = "ModuleSpec(name='builtins', loader=<class '_frozen_importlib.BuiltinImporter'>)"
df_trn = pd.read_csv(LM_PATH/'train.csv', header=None, chunksize=chunksize)
df_val = pd.read_csv(LM_PATH/'test.csv', header=None, chunksize=chunksize)

tok_trn, trn_labels = get_all(df_trn, 1)
tok_val, val_labels = get_all(df_val, 1)
0
1
2
3
0

Make working tmp directory & save tokens

(LM_PATH/'tmp').mkdir(exist_ok=True)

np.save(LM_PATH/'tmp'/'tok_trn.npy', tok_trn)
np.save(LM_PATH/'tmp'/'tok_val.npy', tok_val)

Load tokenization & check top tokens

tok_trn = np.load(LM_PATH/'tmp'/'tok_trn.npy')
tok_val = np.load(LM_PATH/'tmp'/'tok_val.npy')

freq = Counter(p for o in tok_trn for p in o)
freq.most_common(25)
[('the', 1208449),
 ('.', 992545),
 (',', 986614),
 ('and', 587567),
 ('a', 583520),
 ('of', 525412),
 ('to', 484871),
 ('is', 393923),
 ('it', 341485),
 ('in', 337351),
 ('i', 307751),
 ('this', 270410),
 ('that', 261107),
 ('"', 237920),
 ("'s", 222037),
 ('-', 188209),
 ('was', 180235),
 ('\n\n', 179009),
 ('as', 166145),
 ('with', 159253),
 ('for', 158601),
 ('movie', 157735),
 ('but', 150659),
 ('film', 144618),
 ('you', 123979)]

Recreate a single entry

  • xbos - start-of-stream tag
  • xfld - start-of-field tag
  • words in CAPS - how do we keep the semantic impact of CAPS vs. the normal lowercase version?
  • t_up - add a tag in front to indicate that the next word was uppercase (see the sketch after the example below)
' '.join(tok_trn[0])
'\n xbos xfld 1 i saw this movie at the dragon*con 2006 independent film festival . it was awarded 2 awards at that festival and rightfully so . this is probably the best short horror film i \'ve ever seen . the simplicity of camera usage really works.the main character is brilliant . his acting is quite good and is believable . the 3 cameras in the room with tim russel make his insanity that much more believable . i love it . i have talked with mike and he says that they are in the process of making a feature film compassing the first three chapters together . i ca n\'t wait . i will be first in line for that film . the effects of the " mirror " creatures are used so well . you do n\'t see them for very long so it scares the pants off of you when you do . i recommend this film to anyone who wants to watch a good horror movie for once . best 32 minutes of spine tingling horror i \'ve ever seen . thanks mike .'
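
A minimal sketch of the t_up idea (my own illustration, not the exact fastai Tokenizer rules): lowercase every token, but emit a t_up marker in front of words that were originally in all caps, so the model can still learn the extra emphasis of "AWFUL" vs. "awful".

def mark_caps(tokens):
    """Hypothetical helper: lowercase tokens, tagging all-caps words with t_up."""
    out = []
    for tok in tokens:
        if tok.isupper() and len(tok) > 1:
            out += ['t_up', tok.lower()]
        else:
            out.append(tok.lower())
    return out

print(mark_caps(['This', 'movie', 'was', 'AWFUL']))
# ['this', 'movie', 'was', 't_up', 'awful']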

Set up the vocabulary and add tokens for unknown and padding

max_vocab = 60000
min_freq = 2


# index to word: keep up to 60,000 tokens that appear more than min_freq times
itos = [o for o,c in freq.most_common(max_vocab) if c>min_freq]
itos.insert(0, '_pad_')
itos.insert(0, '_unk_')  # both inserted at position 0, so '_unk_' ends up at index 0 and '_pad_' at index 1


# word to index
stoi = collections.defaultdict(lambda:0, {v:k for k,v in enumerate(itos)})
print(len(itos))


# create an array of token indices for each document
trn_lm = np.array([[stoi[o] for o in p] for p in tok_trn])
val_lm = np.array([[stoi[o] for o in p] for p in tok_val])


# save the token-id arrays and the vocab
np.save(LM_PATH/'tmp'/'trn_ids.npy', trn_lm)
np.save(LM_PATH/'tmp'/'val_ids.npy', val_lm)
pickle.dump(itos, open(LM_PATH/'tmp'/'itos.pkl', 'wb'))
60002

Example of a numerical representation

' '.join([str(val) for val in trn_lm[0]])
'40 41 42 39 12 235 13 23 44 2 0 3368 1662 25 1331 3 10 18 8833 261 2474 44 14 1331 5 9383 51 3 13 9 263 2 138 364 200 25 12 159 143 129 3 2 5416 7 371 9080 83 58696 305 122 9 556 3 35 136 9 204 66 5 9 842 3 2 379 3992 11 2 655 21 1895 13095 113 35 4853 14 93 67 842 3 12 133 10 3 12 36 3515 21 1555 5 34 566 14 45 33 11 2 1648 7 251 6 820 25 0 2 105 299 8000 312 3 12 196 29 881 3 12 104 37 105 11 367 22 14 25 3 2 306 7 2 15 3006 15 2204 33 345 51 88 3 26 57 29 82 111 22 69 216 51 10 2702 2 3719 141 7 26 68 26 57 3 12 404 13 25 8 273 48 505 8 126 6 66 200 23 22 301 3 138 13558 249 7 7191 21684 200 12 159 143 129 3 1179 1555 3'
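
As a quick sanity check (using the itos list built above), the ids map straight back to the tokens:

# map the first few ids of the first review back to tokens;
# any unknown word (id 0) would come back as '_unk_'
print(' '.join(itos[i] for i in trn_lm[0][:10]))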

Load from checkpoint

trn_lm = np.load(LM_PATH/'tmp'/'trn_ids.npy')
val_lm = np.load(LM_PATH/'tmp'/'val_ids.npy')
itos = pickle.load(open(LM_PATH/'tmp'/'itos.pkl', 'rb'))

vs=len(itos)
vs,len(trn_lm)
(60002, 90000)

Instead of pretraining on ImageNet, for NLP we will use a large subset of Wikipedia

Previously, in lesson 4, we trained a language model that was state of the art. Here we start from a language model that was pretrained on wikitext-103, a pre-selected set of Wikipedia articles, and whose weights were saved.

wikitext103 model - weights

#! wget -nH -r -np -P {PATH} http://files.fast.ai/models/wt103/

How to use a pretrained model - the network sizes must match

  • em_sz - embedding size for the word vectors (400)

  • nh - number of hidden activations per layer (1150)

  • nl - number of hidden layers (3)

  • model type: AWD LSTM link

Some work has to be done to map the pre-trained vocabulary onto the vocab we are working with. For any word not found in the pre-trained vocab, we will use the mean of all the pre-trained embedding rows.

em_sz,nh,nl = 400,1150,3

# set filepaths
PRE_PATH = PATH/'models'/'wt103'
PRE_LM_PATH = PRE_PATH/'fwd_wt103.h5'


# load weights (returns a dictionary)
wgts = torch.load(PRE_LM_PATH, map_location=lambda storage, loc: storage)


# pull out embedding weights
# sized vocab x em_sz 
enc_wgts = to_np(wgts['0.encoder.weight'])
row_m = enc_wgts.mean(0)  # mean embedding row, used for words missing from the pre-trained vocab


# load pre trained vocab to index mappings
itos2 = pickle.load((PRE_PATH/'itos_wt103.pkl').open('rb'))
stoi2 = collections.defaultdict(lambda:-1, {v:k for k,v in enumerate(itos2)})


# create a pre-trained -> current corpus vocab to vocab mapping
# initialize an empty matrix
new_w = np.zeros((vs, em_sz), dtype=np.float32)


# loop through by row index and insert the correct embedding
for i,w in enumerate(itos):
    r = stoi2[w]
    new_w[i] = enc_wgts[r] if r>=0 else row_m

    
# create our torch `state` that we will load later
wgts['0.encoder.weight'] = T(new_w)
wgts['0.encoder_with_dropout.embed.weight'] = T(np.copy(new_w))
wgts['1.decoder.weight'] = T(np.copy(new_w))

Language Model Creation

  • wd - weight decay
  • bptt - back prop through time
  • bs - batchsize

We will be performing a continuous process of

    given words --> predict next word
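
In other words, the target sequence is just the input sequence shifted one token to the right. A tiny illustration with made-up tokens:

# the language-model objective in miniature: predict token t+1 from the tokens up to t
tokens = ['xbos', 'i', 'loved', 'this', 'film']
inputs, targets = tokens[:-1], tokens[1:]
print(list(zip(inputs, targets)))
# [('xbos', 'i'), ('i', 'loved'), ('loved', 'this'), ('this', 'film')]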

Lesson 4 - note the best loss

[ 0.      4.3926  4.2917]                                       
[ 1.       4.37693  4.28255]                                  
[ 2.       4.37998  4.27243]                                  
[ 3.       4.34284  4.24789]                                  
[ 4.      4.3287  4.2317]                                     
[ 5.       4.28881  4.20722]                                  
[ 6.       4.24637  4.18926]                                  
[ 7.       4.23797  4.17644]  

Pretrained Model - we already start with a better score

epoch      trn_loss   val_loss   accuracy                     
    0      4.332359   4.120674   0.289563  
    1      4.247177   4.067932   0.294281 

Comparison to Word2Vec

Word2Vec is a single embedding matrix: each word has a vector and that’s it. It’s effectively a single (input) layer from a pretrained model, built with a linear model on a co-occurrence matrix.

wd=1e-7
bptt=70
bs=52
opt_fn = partial(optim.Adam, betas=(0.8, 0.99))


trn_dl = LanguageModelLoader(np.concatenate(trn_lm), bs, bptt)
val_dl = LanguageModelLoader(np.concatenate(val_lm), bs, bptt)
md = LanguageModelData(PATH, 1, vs, trn_dl, val_dl, bs=bs, bptt=bptt)

drops = np.array([0.25, 0.1, 0.2, 0.02, 0.15])*0.7

learner= md.get_model(opt_fn, em_sz, nh, nl, 
    dropouti=drops[0], dropout=drops[1], wdrop=drops[2], dropoute=drops[3], dropouth=drops[4])

learner.metrics = [accuracy]
learner.unfreeze()

Language Model Components:

1. DataLoader

2. ModelData

3. Model

1. Language Model Loader

Key Notes:

  • Batch size here is not really the count of documents in a batch; in this context it behaves more like the number of parallel streams the corpus is split into
  • Randomness in size instead of order - if we always grab 70 tokens at a time, every epoch the data will be exactly the same and all the batches will be identical. With images we would shuffle, but that doesn’t work for a language model because it has to learn the order of the sentence. So, since the order can’t change, we randomly change the sequence length instead (see __iter__ below)
class LanguageModelLoader():
    """Returns tuples of mini-batches."""
    
    def __init__(self, nums, bs, bptt, backwards=False):
        
        # assign values
        self.bs,self.bptt,self.backwards = bs,bptt,backwards
        
        # batchify the numbers. Based on the batchsize
        # subdivide the data. Note: batchsize 64
        # 640,000 would be broken into 64 x 10,000 
        self.data = self.batchify(nums)
        
        # initialize other values
        self.i,self.iter = 0,0
        self.n = len(self.data)


    def __iter__(self):
        """ Iterator implementation"""
            
        # start from zero
        self.i,self.iter = 0,0
            
        # will continually pull data out
        while self.i < self.n-1 and self.iter<len(self):
            if self.i == 0:
                seq_len = self.bptt + 5 * 5
            else:
                bptt = self.bptt if np.random.random() < 0.95 else self.bptt / 2.
                seq_len = max(5, int(np.random.normal(bptt, 5)))
            res = self.get_batch(self.i, seq_len)
            self.i += seq_len
            self.iter += 1
                        
            # yields the value
            yield res

                        
    def __len__(self): return self.n // self.bptt - 1

                        
    def batchify(self, data):
        """splits the data into batch_size counts of sets"""
        nb = data.shape[0] // self.bs
        data = np.array(data[:nb*self.bs])
        data = data.reshape(self.bs, -1).T
        if self.backwards: data=data[::-1]
                                
        # returns the transpose
        # have batch_size number of columns 
        return T(data)

    def get_batch(self, i, seq_len):
        # returns (input, target): the target is the input shifted one token ahead
        source = self.data
        seq_len = min(seq_len, len(source) - 1 - i)
        return source[i:i+seq_len], source[i+1:i+1+seq_len].view(-1)

ModelData


class LanguageModelData():
    """
    - a training data loader
    - a validation data loader
    - a test loader
    - a saving path
    - model parameters
    """
    def __init__(self, path, pad_idx, nt, trn_dl, val_dl, test_dl=None, bptt=70, backwards=False, **kwargs):
        self.path,self.pad_idx,self.nt = path,pad_idx,nt
        self.trn_dl,self.val_dl,self.test_dl = trn_dl,val_dl,test_dl

    def get_model(self, opt_fn, emb_sz, n_hid, n_layers, **kwargs):
        m = get_language_model(self.nt, emb_sz, n_hid, n_layers, self.pad_idx, **kwargs)
        model = LanguageModel(to_gpu(m))
        return RNN_Learner(self, model, opt_fn=opt_fn)

Our Language Model extends the basic model

We are overriding one method that returns a list of all your layer groups.

class LanguageModel(BasicModel):
    def get_layer_groups(self):
        m = self.model[0]
        return [*zip(m.rnns, m.dropouths), (self.model[1], m.dropouti)]

Extend the learner class, set default to cross entropy

class RNN_Learner(Learner):
    def __init__(self, data, models, **kwargs):
        super().__init__(data, models, **kwargs)
        self.crit = F.cross_entropy

    def save_encoder(self, name): save_model(self.model[0], self.get_model_path(name))
    def load_encoder(self, name): load_model(self.model[0], self.get_model_path(name))

Review of the RNN Encoder

AWD Paper

Link to the github

  • Embedding input layer
  • one LSTM layer for each of the nl layers requested
  • the rest are places to put dropout

Forward:

  • Call the embedding layer
  • Apply input dropout
  • Call each RNN (LSTM) layer
  • Append each layer’s output to outputs
  • Apply dropout between layers
import torch.nn as nn

class RNN_Encoder(nn.Module):

    """A custom RNN encoder network that uses
        - an embedding matrix to encode input,
        - a stack of LSTM layers to drive the network, and
        - variational dropouts in the embedding and LSTM layers
        The architecture for this network was inspired by the work done in
        "Regularizing and Optimizing LSTM Language Models".
        (https://arxiv.org/pdf/1708.02182.pdf)
    """

    initrange=0.1

    def __init__(self, ntoken, emb_sz, nhid, nlayers, pad_token, bidir=False,
                 dropouth=0.3, dropouti=0.65, dropoute=0.1, wdrop=0.5):
        """ Default constructor for the RNN_Encoder class
            Args:
                bs (int): batch size of input data
                ntoken (int): number of vocabulary (or tokens) in the source dataset
                emb_sz (int): the embedding size to use to encode each token
                nhid (int): number of hidden activation per LSTM layer
                nlayers (int): number of LSTM layers to use in the architecture
                pad_token (int): the int value used for padding text.
                dropouth (float): dropout to apply to the activations going from one LSTM layer to another
                dropouti (float): dropout to apply to the input layer.
                dropoute (float): dropout to apply to the embedding layer.
                wdrop (float): dropout used for a LSTM's internal (or hidden) recurrent weights.
            Returns:
                None
          """

        super().__init__()
        self.ndir = 2 if bidir else 1
        self.bs = 1
        self.encoder = nn.Embedding(ntoken, emb_sz, padding_idx=pad_token)
        self.encoder_with_dropout = EmbeddingDropout(self.encoder)
        self.rnns = [nn.LSTM(emb_sz if l == 0 else nhid, (nhid if l != nlayers - 1 else emb_sz)//self.ndir,
             1, bidirectional=bidir, dropout=dropouth) for l in range(nlayers)]
        if wdrop: self.rnns = [WeightDrop(rnn, wdrop) for rnn in self.rnns]
        self.rnns = torch.nn.ModuleList(self.rnns)
        self.encoder.weight.data.uniform_(-self.initrange, self.initrange)

        self.emb_sz,self.nhid,self.nlayers,self.dropoute = emb_sz,nhid,nlayers,dropoute
        self.dropouti = LockedDropout(dropouti)
        self.dropouths = nn.ModuleList([LockedDropout(dropouth) for l in range(nlayers)])

    def forward(self, input):
        """ Invoked during the forward propagation of the RNN_Encoder module.
        Args:
            input (Tensor): input of shape (sentence length x batch_size)
        Returns:
            raw_outputs (tuple(list (Tensor), list(Tensor)): list of tensors evaluated from each RNN layer without using
            dropouth, list of tensors evaluated from each RNN layer using dropouth,
        """
        sl,bs = input.size()
        if bs!=self.bs:
            self.bs=bs
            self.reset()

        emb = self.encoder_with_dropout(input, dropout=self.dropoute if self.training else 0)
        emb = self.dropouti(emb)

        raw_output = emb
        new_hidden,raw_outputs,outputs = [],[],[]
        for l, (rnn,drop) in enumerate(zip(self.rnns, self.dropouths)):
            current_input = raw_output
            with warnings.catch_warnings():
                warnings.simplefilter("ignore")
                raw_output, new_h = rnn(raw_output, self.hidden[l])
            new_hidden.append(new_h)
            raw_outputs.append(raw_output)
            if l != self.nlayers - 1: raw_output = drop(raw_output)
            outputs.append(raw_output)

        self.hidden = repackage_var(new_hidden)
        return raw_outputs, outputs

    def one_hidden(self, l):
        nh = (self.nhid if l != self.nlayers - 1 else self.emb_sz)//self.ndir
        return Variable(self.weights.new(self.ndir, self.bs, nh).zero_(), volatile=not self.training)

    def reset(self):
        self.weights = next(self.parameters()).data
        self.hidden = [(self.one_hidden(l), self.one_hidden(l)) for l in range(self.nlayers)]

Choosing Dropout

If you have less data for your language model, you will need more dropout. If you have more data, you will need less dropout. Otherwise the following dropout numbers were selected by experimentation:

drops = np.array([0.25, 0.1, 0.2, 0.02, 0.15])*0.7

Note on the 0.7 multiplier: if you are overfitting, increase this number; if you are underfitting, decrease it.
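
A small sketch of how to read that knob (my own illustration, not from the notebook): keep the relative ratios of the five dropouts fixed and tune the single multiplier.

import numpy as np

# keep the ratios between the five dropouts fixed and tune one multiplier:
# raise it if the language model overfits, lower it if it underfits
base_drops = np.array([0.25, 0.1, 0.2, 0.02, 0.15])
drops = base_drops * 0.7   # 0.7 worked well for this dataset size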

Normally we look at cross-entropy loss, but cross-entropy only rewards you fully if you are right and confident, and it penalizes confident mistakes heavily. Accuracy only cares whether the answer was right or wrong, so it is often more stable and easier to track.

Back to the Language Model

wd=1e-7
bptt=70
bs=52
opt_fn = partial(optim.Adam, betas=(0.8, 0.99))


trn_dl = LanguageModelLoader(np.concatenate(trn_lm), bs, bptt)
val_dl = LanguageModelLoader(np.concatenate(val_lm), bs, bptt)
md = LanguageModelData(PATH, 1, vs, trn_dl, val_dl, bs=bs, bptt=bptt)

drops = np.array([0.25, 0.1, 0.2, 0.02, 0.15])*0.7

learner= md.get_model(opt_fn, em_sz, nh, nl, 
    dropouti=drops[0], dropout=drops[1], wdrop=drops[2], dropoute=drops[3], dropouth=drops[4])

learner.metrics = [accuracy]
learner.unfreeze()

Note to reader: this block takes a long time to run - roughly 30 minutes per epoch on an 8-core, 32 GB machine.

import time
learner.model.load_state_dict(wgts)

# fit a single cycle
lr=1e-3
lrs = lr
start = time.time()
learner.fit(lrs/2, 1, wds=wd, use_clr=(32,2), cycle_len=1)
print("time to train 1 epoch,", time.time()-start)

learner.save('lm_last_ft')
learner.load('lm_last_ft')
learner.unfreeze()


epoch      trn_loss   val_loss   accuracy                     
    0      4.354009   4.18011    0.285487  

time to train 1 epoch, 1856.5503034591675

this will take a while to run - 15 epochs! make sure you have the time / computing resources

# search for a learning rate, then run for 15 epochs
learner.lr_find(start_lr=lrs/10, end_lr=lrs*10, linear=True)
learner.sched.plot()

start = time.time()
learner.fit(lrs, 1, wds=wd, use_clr=(20,10), cycle_len=15)
print("Time to train,", time.time() - start)

After fitting the model we save it

We save the trained model weights and separately save the encoder part of the LM model as well. This will serve as our backbone in the classification task model.

# saves the model
learner.save('lm1')

# saves just the RNN encoder (rnn_enc)
learner.save_encoder('lm1_enc')
learner.sched.plot_loss()

From the FASTAI reference notebook - results of the 15 epochs

epoch      trn_loss   val_loss   accuracy
    0      4.332359   4.120674   0.289563
    1      4.247177   4.067932   0.294281
    2      4.175848   4.027153   0.298062
    3      4.140306   4.001291   0.300798
    4      4.112395   3.98392    0.302663
    5      4.078948   3.971053   0.304059
    6      4.06956    3.958152   0.305356
    7      4.025542   3.951509   0.306309
    8      4.019778   3.94065    0.30756 
    9      4.027846   3.931385   0.308232
    10     3.98106    3.928427   0.309011
    11     3.97106    3.920667   0.30989 
    12     3.941096   3.917029   0.310515
    13     3.924818   3.91302    0.311015
    14     3.923296   3.908476   0.311586

Create the Classifier Tokens

The classifier model is basically a linear layer custom head on top of the LM backbone. Setting up the classifier data is similar to the LM data setup except that we cannot use the unsup movie reviews this time.

# read in the data again
df_trn = pd.read_csv(CLAS_PATH/'train.csv', header=None, chunksize=chunksize)
df_val = pd.read_csv(CLAS_PATH/'test.csv', header=None, chunksize=chunksize)


# get the tokens
tok_trn, trn_labels = get_all(df_trn, 1)
tok_val, val_labels = get_all(df_val, 1)
(CLAS_PATH/'tmp').mkdir(exist_ok=True)

np.save(CLAS_PATH/'tmp'/'tok_trn.npy', tok_trn)
np.save(CLAS_PATH/'tmp'/'tok_val.npy', tok_val)

np.save(CLAS_PATH/'tmp'/'trn_labels.npy', trn_labels)
np.save(CLAS_PATH/'tmp'/'val_labels.npy', val_labels)
tok_trn = np.load(CLAS_PATH/'tmp'/'tok_trn.npy')
tok_val = np.load(CLAS_PATH/'tmp'/'tok_val.npy')


# We load the integer to vocab that we saved before
itos = pickle.load((LM_PATH/'tmp'/'itos.pkl').open('rb'))
stoi = collections.defaultdict(lambda:0, {v:k for k,v in enumerate(itos)})

# create all the matrices of token indices
trn_clas = np.array([[stoi[o] for o in p] for p in tok_trn])
val_clas = np.array([[stoi[o] for o in p] for p in tok_val])

# then save the matrices
np.save(CLAS_PATH/'tmp'/'trn_ids.npy', trn_clas)
np.save(CLAS_PATH/'tmp'/'val_ids.npy', val_clas)

Classifier

Now we can create our final model, a classifier which is really a custom linear head over our trained IMDB backbone. The steps to create the classifier model are similar to the ones for the LM.

# we load our numpy arrays with indexes (representing vocab)
trn_clas = np.load(CLAS_PATH/'tmp'/'trn_ids.npy')
val_clas = np.load(CLAS_PATH/'tmp'/'val_ids.npy')

# we load our labels
trn_labels = np.squeeze(np.load(CLAS_PATH/'tmp'/'trn_labels.npy'))
val_labels = np.squeeze(np.load(CLAS_PATH/'tmp'/'val_labels.npy'))


# set up our model parameters
bptt,em_sz,nh,nl = 70,400,1150,3
vs = len(itos)

# select our optimizer
# also pick a batch size as big as you can that doesn't run out of memory
opt_fn = partial(optim.Adam, betas=(0.8, 0.99))
bs = 48

# shift the labels so they start at 0; c is the number of classes
min_lbl = trn_labels.min()
trn_labels -= min_lbl
val_labels -= min_lbl
c=int(trn_labels.max())+1

In the classifier, unlike the LM, we need to read one movie review at a time and learn to predict its sentiment as pos/neg. We no longer deal with equal bptt-sized batches, so we have to pad the sequences to the same length within each batch. To create batches of similar-sized movie reviews, we use the sortish sampler method introduced by @Smerity and @jekbradbury.

The SortishSampler cuts down the overall number of padding tokens the classifier ends up seeing.

Note: If documents are different lengths, they should be padded to be the same size. Luckily fastai does this automatically

Process Optimizing Note: Put the short documents first (with some randomness)

# create basic text datasets
trn_ds = TextDataset(trn_clas, trn_labels)
val_ds = TextDataset(val_clas, val_labels)


# sort the docs by length:
# validation is sorted exactly (shortest -> longest),
# training is sorted only loosely, with some randomness
trn_samp = SortishSampler(trn_clas, key=lambda x: len(trn_clas[x]), bs=bs//2)
val_samp = SortSampler(val_clas, key=lambda x: len(val_clas[x]))

# then we create our dataloaders as before but with a [sampler] parameter
trn_dl = DataLoader(trn_ds, bs//2, transpose=True, num_workers=1, pad_idx=1, sampler=trn_samp)
val_dl = DataLoader(val_ds, bs, transpose=True, num_workers=1, pad_idx=1, sampler=val_samp)
md = ModelData(PATH, trn_dl, val_dl)

Load our Pretrained Model, and train the last Layer

We will pass hidden layer details

layers=[em_sz*3, 50, c]
  • em_sz*3 - input size (concat pooling, see below)
  • 50 - output of the first layer
  • c - output of the second layer (the number of classes)

Why x3? - Concat pooling

We take the average pooling over the sequence, the max pooling, and the final time step’s activations, and concatenate them all together, as sketched below.
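
A minimal sketch of concat pooling (my own illustration, not the exact fastai PoolingLinearClassifier code), assuming outputs is the final LSTM layer's output of shape (seq_len, batch, em_sz):

import torch

def concat_pool(outputs):                    # outputs: (seq_len, batch, em_sz)
    last = outputs[-1]                       # final time step         (batch, em_sz)
    mx = outputs.max(dim=0)[0]               # max over the sequence   (batch, em_sz)
    avg = outputs.mean(dim=0)                # mean over the sequence  (batch, em_sz)
    return torch.cat([last, mx, avg], dim=1) # (batch, 3 * em_sz)

x = torch.randn(70, 48, 400)                 # bptt x bs x em_sz
print(concat_pool(x).shape)                  # torch.Size([48, 1200])

This is why the first layer of the classifier head takes em_sz*3 = 1200 inputs.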

Pass in the dropout details

drops=[dps[4], 0.1]

pass the AWD dropout parameters:

dropouti=dps[0], wdrop=dps[1], dropoute=dps[2], dropouth=dps[3])
# setup our dropout rates
dps = np.array([0.4,0.5,0.05,0.3,0.4])*0.5
m = get_rnn_classifer(bptt, 20*70, c, vs, emb_sz=em_sz, n_hid=nh, n_layers=nl, pad_token=1,
          layers=[em_sz*3, 50, c], drops=[dps[4], 0.1],
          dropouti=dps[0], wdrop=dps[1], dropoute=dps[2], dropouth=dps[3])

opt_fn = partial(optim.Adam, betas=(0.7, 0.99))

# define our RNN learner
learn = RNN_Learner(md, TextModel(to_gpu(m)), opt_fn=opt_fn)
learn.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
learn.clip=25.
learn.metrics = [accuracy]

# set our learning rate
# we will use discriminative learning rates for different layers
lr=3e-3
lrm = 2.6
lrs = np.array([lr/(lrm**4), lr/(lrm**3), lr/(lrm**2), lr/lrm, lr])

# Now we load our language model from before
# but freeze everything except the last layer
lrs=np.array([1e-4,1e-4,1e-4,1e-3,1e-2])
wd = 1e-7
wd = 0
learn.load_encoder('lm1_enc')  # the encoder saved above as 'lm1_enc'
learn.freeze_to(-1)

# find the optimal learning rate
learn.lr_find(lrs/1000)
learn.sched.plot()

Train the last Layer

epoch      trn_loss   val_loss   accuracy                      
    0      0.365457   0.185553   0.928719  
learn.fit(lrs, 1, wds=wd, cycle_len=1, use_clr=(8,3))
learn.save('clas_0')

What if we freeze everything except the last two layers?

epoch      trn_loss   val_loss   accuracy                      
    0      0.340473   0.17319    0.933125 
learn.load('clas_0')
learn.freeze_to(-2)
learn.fit(lrs, 1, wds=wd, cycle_len=1, use_clr=(8,3))
learn.save('clas_1')

Now let’s try to train the whole model

Note that the previous state of the art is 0.941 accuracy, which is beaten within 3-4 epochs

epoch      trn_loss   val_loss   accuracy                      
    0      0.337347   0.186812   0.930782  
    1      0.284065   0.318038   0.932062                      
    2      0.246721   0.156018   0.941747                      
    3      0.252745   0.157223   0.944106                      
    4      0.24023    0.159444   0.945393                      
    5      0.210046   0.202856   0.942858                      
    6      0.212139   0.149009   0.943746                      
    7      0.21163    0.186739   0.946553                      
    8      0.186233   0.1508     0.945218                      
    9      0.176225   0.150472   0.947985                      
    10     0.198024   0.146215   0.948345                      
    11     0.20324    0.189206   0.948145                      
    12     0.165159   0.151402   0.947745                      
    13     0.165997   0.146615   0.947905 
learn.load('clas_1')
learn.unfreeze()
learn.fit(lrs, 1, wds=wd, cycle_len=14, use_clr=(32,10))

Fun thing to try: do the same thing, but reverse the document

When you combine the forward and backward models and average the results, we get about 95% accuracy!
The previous state-of-the-art result was 94.1% accuracy (5.9% error). With the bidirectional ensemble we get 95.4% accuracy (4.6% error).
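
A sketch of what that ensemble looks like (the probabilities below are made up): train one classifier on the normal token order and one on the reversed order, then average their predicted class probabilities.

import numpy as np

preds_fwd = np.array([[0.1, 0.9], [0.8, 0.2]])   # hypothetical softmax outputs, forward model
preds_bwd = np.array([[0.2, 0.8], [0.6, 0.4]])   # hypothetical softmax outputs, backward model
preds = (preds_fwd + preds_bwd) / 2              # average the two models
print(preds.argmax(axis=1))                      # [1 0] -> pos, neg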

Check out the paper

link

fastai.text performance on text datasets

discriminative learning rates - renamed from differential learning rates

gradual unfreezing - unfreezing one layer group at a time

Tweaked approach to cyclical learning rates (CLR) - based on an idea from Leslie Smith

Do only one cycle, which goes up quickly and comes back down more slowly. This is currently implemented in fastai as use_clr. The first number is the ratio of the highest learning rate to the lowest (e.g. 32). The second number is the ratio between the length of the whole cycle and the length of the upward part: with 10, the learning rate rises for the first tenth of the cycle and falls for the remaining nine tenths.

link to the paper

BPTT - Normal RNN vs. MultiBatch RNN:

The key difference is in the RNN encoder. For the language model, we can process one BPTT chunk at a time and predict the next word. But for the classifier we need to process the entire document, i.e. the entire movie review, which could be 2000 words, and we can’t fit all of that in memory.

The MultiBatch RNN encoder sends back only as many activations as we have decided to keep. If your maximum is 1000 but your document is 2000 words, it still goes through the whole document but only keeps the most recent 1000 activations, as sketched below.
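
A rough sketch of that behaviour (my own illustration, not the fastai MultiBatchRNN source; max_seq here corresponds to the 20*70 passed to get_rnn_classifer above):

# walk a 2000-token document in bptt-sized chunks, but only keep the
# activations for the most recent max_seq tokens
max_seq, bptt, doc_len = 20 * 70, 70, 2000
kept = []
for start in range(0, doc_len, bptt):
    chunk = list(range(start, min(start + bptt, doc_len)))   # stand-in for real activations
    kept += chunk
    kept = kept[-max_seq:]        # drop everything older than max_seq tokens
print(len(kept))                  # 1400 - only the last max_seq activations survive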

Results vs. customized algorithms

How do the different techniques affect the score?

  • What happens with different dataset sizes
  • What happens if different techniques are turned on and off?


That looks great! My only little issue is I don’t agree that the batch size isn’t really a batch size. I think it is - but there’s also a sequence length. I do agree it’s somehow a little different to how we have seen batches done in the past however…

@jeremy can you turn this post into a wiki so we can copy the markdown, please? :slight_smile:

Done!


Shouldn’t ‘unk’ map to a different int than ‘pad’?

Never mind. It’s just inserting it at the beginning of the list :slight_smile:

How can I use the fine-tuned language model to generate (sample) text? I could not find any appropriate methods for it. Can anyone guide me on it? (Note: I just started using fastai)