Wiki: Lesson 6

(Rachel Thomas) #1

<<< Wiki: Lesson 5 | Wiki: Lesson 7 >>>

Lesson resources

Video timelines for Lesson 6

  • 00:00:10 Review of articles and works
    “Optimization for Deep Learning Highlights in 2017” by Sebastian Ruder,
    “Implementation of AdamW/SGDW paper in Fastai”,
    “Improving the way we work with learning rate”,
    “The Cyclical Learning Rate technique”

  • 00:02:10 Review of last week “Deep Dive into Collaborative Filtering” with MovieLens, analyzing our model, ‘movie bias’, ‘@property’, ‘self.models.model’, ‘learn.models’, ‘CollabFilterModel’, ‘get_layer_groups(self)’, ‘lesson5-movielens.ipynb’

  • 00:12:10 Jeremy: “I try to use Numpy for everything, except when I need to run it on GPU, or derivatives”,
    Question: “Bring the model from GPU to CPU into production ?”, move the model to CPU with ‘m.cpu()’, ‘load_model(m, p)’, back to GPU with ‘m.cuda()’, ‘zip()’ function in Python

  • 00:16:10 Sort the movies, John Travolta Scientology worst movie of all time “Battlefield Earth”, ‘key=itemgetter()’, ‘key=lambda’

  • 00:18:30 Embedding interpretation, using ‘PCA’ from ‘sklearn.decomposition’ for Linear Algebra

  • 00:24:15 Looking at the “Rossmann Retail / Store” Kaggle competition with the ‘Entity Embeddings of Categorical Variables’ paper.

  • 00:41:02 “Rossmann” Data Cleaning / Feature Engineering, using a Test set properly, Create Features (check the Machine Learning “ML1” course for details), ‘apply_cats’ instead of ‘train_cats’, ‘pred_test = m.predict(True)’, result on Kaggle Public Leaderboard vs Private Leaderboard with a poor Validation Set. Example: Statoil/Iceberg challenge/competition.

  • 00:47:10 A mistake made by the Rossmann 3rd-place winner, more on the Rossmann model.

  • 00:53:20 “How to write something that is different than Fastai library”


  • 00:59:55 More into SGD with ‘lesson6-sgd.ipynb’ notebook, a Linear Regression problem with continuous outputs. ‘a*x+b’ & mean squared error (MSE) loss function with ‘y_hat’

  • 01:02:55 Gradient Descent implemented in PyTorch, ‘loss.backward()’, ‘’ in ‘optim.sgd’ class

  • 01:07:05 Gradient Descent with Numpy

  • 01:09:15 RNNs with ‘lesson6-rnn.ipynb’ notebook with Nietzsche, Swiftkey post on smartphone keyboard powered by Neural Networks

  • 01:12:05 a Basic NN with single hidden layer (rectangle, arrow, circle, triangle), by Jeremy,
    Image CNN with single dense hidden layer.

  • 01:23:25 Three char model, question on ‘in1, in2, in3’ dimensions

  • 01:36:05 Test model with ‘get_next(inp)’,
    Let’s create our first RNN, why use the same weight matrices ?

  • 01:48:45 RNN with PyTorch, question: “What the hidden state represents ?”

  • 01:57:55 Multi-output model

  • 02:05:55 Question on ‘sequence length vs batch size’

  • 02:09:15 The Identity Matrix (init!), a paper from Geoffrey Hinton “A Simple Way to Initialize Recurrent Networks of Rectified Linear Units”

(Hiromi Suenaga) #2

I was reviewing the RNN notebook and have a couple questions.

When we created inputs, we did:
x1 = np.stack(c1_dat[:-2])

Why do we use np.stack and not np.asarray? I experimented some and they seem to do the same thing, but wanted to make sure I’m not missing something in thinking that the purpose of np.stack here is to convert python list to numpy.ndarray:

The second question is, why did we omit the last two elements in c1_dat? I made idx shorter to see if I can figure it out:

To me, we could have predicted y[2] = 10 from

  • x1[2] = 7
  • x2[2] = 8
  • x3[2] = 9

And similarly y[3] = 13 from

  • x1[3] = 10
  • x2[3] = 11
  • x3[3] = 12

If I take -2 out, it will look like

I thought maybe it’s dealing with idx length that is not cleanly divisible by 4 but it seems to handle it fine.

Any advice would be appreciated!

(Alan O'Donnell) #3

@hiromi I was wondering about both of those too. Seems like np.stack works like np.asarray, but np.stack can also take an axis argument.
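A minimal sketch of that difference, assuming a plain Python list of equal-length NumPy arrays (the toy data is made up, not taken from the notebook). With the default axis both calls should give the same array, while a non-zero axis changes the layout:

```python
import numpy as np

# Toy stand-in for a list of per-sample arrays (not the notebook's data).
rows = [np.array([1, 2, 3]), np.array([4, 5, 6])]

as_array = np.asarray(rows)             # shape (2, 3)
stacked = np.stack(rows)                # axis=0 by default, also shape (2, 3)
stacked_cols = np.stack(rows, axis=1)   # shape (3, 2): each input becomes a column

print(as_array.shape, stacked.shape, stacked_cols.shape)
```

So for the notebook's use (default axis) the two are interchangeable, and the axis argument only matters when the new dimension should go somewhere other than the front.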

As far as the index math goes, I would have thought we’d do something like this:

xs = [idx[start:start+cs] for start in range(0, len(idx)-cs, cs)]
y = [idx[start+cs] for start in range(0, len(idx)-cs, cs)]

xs takes the place of x1_dat, x2_dat, etc, or rather their stacked version.

[Edit: Ah, whoops, forgot that we basically do that a little further down in the NB.]

Am I making an off-by-one mistake with those ranges? The largest possible start index for the contiguous character slices would be len(idx)-1-cs: len(idx)-1, since that’s the last index, minus cs, so we have space for the characters plus the target character. E.g.

#     c1   c2   c3   target
[..., 100, 101, 102, 103]
#     l-1-3  ...     l-1, l = len(idx)

So, given that range isn’t inclusive, we drop the -1: range(0, len(idx)-cs,cs).
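To check the range arithmetic, here is a small sketch that runs the two list comprehensions above on a toy idx (the values 1..13 and cs = 3 are assumptions chosen to mirror the numbers in the earlier post, not the real text data):

```python
cs = 3                       # number of input characters per sample
idx = list(range(1, 14))     # toy stand-in for the encoded text: 1..13

starts = range(0, len(idx) - cs, cs)
xs = [idx[start:start + cs] for start in starts]
y = [idx[start + cs] for start in starts]

# Each window of cs characters is paired with the character that follows it.
for window, target in zip(xs, y):
    print(window, "->", target)
```

With this toy input the last window is [10, 11, 12] with target 13, so the final element of idx is used and no index goes out of range.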

(Hiromi Suenaga) #4

I don’t see any off-by-one error. I compared the original expanded version with the compact version without -1 or -2 (I now see that this one disappears in the compact version), and it works as expected. Most of the errors occur when there are not many items in idx. By removing -1, you can train when len(idx)=4 (just enough to predict the 4th character from the first three):

I have created a PR to see what @jeremy thinks:

@cqfd, could you double check the code change for me?

Thank you!!

(Jeremy Howard) #5

Yeah I guess you’re right in this case - np.stack is going to be different for when axis!=0. I used ‘stack’ here because I think this is a better semantic match for what we’re doing.

Have you tried it without omitting them? It may just be that I didn’t think this thru clearly enough when I put the class together.

(Hiromi Suenaga) #6

Thanks for the response, @jeremy :slight_smile:

I experimented with the entire array (i.e. removed -2) and also per @cqfd’s suggestion removed -1 from:

c_in_dat = [[idx[i+j] for i in range(cs)] for j in range(len(idx)-cs-1)]

They worked better for small idx, so I created a PR and attached a gist of the experiment.
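A small sketch of the len(idx)=4 case, assuming cs = 3. The helper make_windows is hypothetical, and the target list is an assumption modelled on the c_in_dat line above:

```python
cs = 3
idx = [40, 42, 29, 30]   # toy data: just enough for one 3-char sample

def make_windows(idx, cs, n):
    """Build n overlapping input windows and their next-char targets."""
    c_in = [[idx[i + j] for i in range(cs)] for j in range(n)]
    c_out = [idx[j + cs] for j in range(n)]
    return c_in, c_out

with_minus_one = make_windows(idx, cs, len(idx) - cs - 1)    # range(0): no samples
without_minus_one = make_windows(idx, cs, len(idx) - cs)     # range(1): one sample

print(with_minus_one)
print(without_minus_one)
```

With the -1 in place the range is empty and there is nothing to train on. Without it, the single window [40, 42, 29] is paired with the target 30.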


Thanks for bringing up and discussing these points. Had been scratching my head over this :slight_smile:

(Sarada Lee) #8

Hi @jeremy, I am aware that there was an update for nlp (update ID#ade043e). I did git pull and conda env update. However, I got the following error message when running from fastai.nlp import *, which worked previously. I tried to reinstall spacy but the error persisted. Any idea how to resolve this problem?

<ipython-input-6-6070169c89a3> in <module>()
     11 from fastai.rnn_reg import *
     12 from fastai.rnn_train import *
---> 13 from fastai.nlp import *
     14 from fastai.text import *
     15 from fastai.lm_rnn import *

~/fastai/courses/dl1/fastai/ in <module>()
      5 from .dataset import *
      6 from .learner import *
----> 7 from .text import *
      8 from .lm_rnn import *

~/fastai/courses/dl1/fastai/ in <module>()
      9 def sub_br(x): return re_br.sub("\n", x)
---> 11 my_tok = spacy.load('en')
     12 my_tok.tokenizer.add_special_case('<eos>', [{ORTH: '<eos>'}])
     13 my_tok.tokenizer.add_special_case('<bos>', [{ORTH: '<bos>'}])

~/src/anaconda3/envs/fastai/lib/python3.6/site-packages/spacy/ in load(name, **overrides)
     17             "to load. For example:\nnlp = spacy.load('{}')".format(depr_path),
     18             'error')
---> 19     return util.load_model(name, **overrides)

~/src/anaconda3/envs/fastai/lib/python3.6/site-packages/spacy/ in load_model(name, **overrides)
    110     if isinstance(name, basestring_):  # in data dir / shortcut
    111         if name in set([ for d in data_path.iterdir()]):
--> 112             return load_model_from_link(name, **overrides)
    113         if is_package(name):  # installed as package
    114             return load_model_from_package(name, **overrides)

~/src/anaconda3/envs/fastai/lib/python3.6/site-packages/spacy/ in load_model_from_link(name, **overrides)
    124     path = get_data_path() / name / ''
    125     try:
--> 126         cls = import_file(name, path)
    127     except AttributeError:
    128         raise IOError(

~/src/anaconda3/envs/fastai/lib/python3.6/site-packages/spacy/ in import_file(name, loc)
    117         spec = importlib.util.spec_from_file_location(name, str(loc))
    118         module = importlib.util.module_from_spec(spec)
--> 119         spec.loader.exec_module(module)
    120         return module

~/src/anaconda3/envs/fastai/lib/python3.6/importlib/ in exec_module(self, module)

~/src/anaconda3/envs/fastai/lib/python3.6/importlib/ in get_code(self, fullname)

~/src/anaconda3/envs/fastai/lib/python3.6/importlib/ in get_data(self, path)

FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/src/anaconda3/envs/fastai/lib/python3.6/site-packages/spacy/data/en/'

(Jeremy Howard) #9

@moody here’s how to install a spacy model:

(Kevin Dewalt) #10

Passing along in case the use of Python’s * operator confuses anyone else …

it = iter(md.trn_dl)
*xs,yt = next(it) #*xs packs the arguments into xs
t = m(*V(xs)) #*V(xs) unpacks V(xs) into m functional arguments for c1, c2, c3.

see this post.
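The same two uses of * can be tried in plain Python without fastai or PyTorch. In this sketch the function model and the toy batch are hypothetical stand-ins for m and next(it):

```python
def model(c1, c2, c3):
    """Hypothetical stand-in for m(...): takes three separate inputs."""
    return c1 + c2 + c3

batch = [10, 20, 30, 99]      # toy batch: three inputs followed by a target

*xs, yt = batch               # packing: xs == [10, 20, 30], yt == 99
out = model(*xs)              # unpacking: same as model(10, 20, 30)

print(xs, yt, out)
```

On the left of an assignment the star collects everything except the last item into a list. In a call it spreads that list back out into separate positional arguments.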

(Dave Castelnuovo) #11

Hello, I’m running through the lesson 6 rnn notebook and am noticing that my Multi-output model is getting significantly different results than the same code in the video.

You’ll notice below that after the first fit call my val_loss ranges from 2.4 to 2.0, and after the last fit it goes down to 1.99 (which is worse than even the initial RNN implementation). That compares to a val_loss from 0.95 to 0.6 in the video.

Could there be something in my setup that is causing the difference in performance? It seems like a large enough variance that there must be something funny going on.


m = CharSeqRnn(vocab_size, n_fac).cuda()
opt = optim.Adam(m.parameters(), 1e-3)

it = iter(md.trn_dl)
*xst,yt = next(it)

def nll_loss_seq(inp, targ):
    sl,bs,nh = inp.size()
    targ = targ.transpose(0,1).contiguous().view(-1)
    return F.nll_loss(inp.view(-1,nh), targ)

fit(m, md, 4, opt, nll_loss_seq)
epoch      trn_loss   val_loss   
    0      2.600039   2.412278  
    1      2.295203   2.205402  
    2      2.143977   2.093502  
    3      2.046338   2.015161  


set_lrs(opt, 1e-4)

fit(m, md, 1, opt, nll_loss_seq)
epoch      trn_loss   val_loss   
    0      1.996949   1.999792  


(Even Oldridge) #12

I was just listening to this again and I was curious about the autoencoder described around 38m that won the insurance competition? I looked for a recent competition and I’m guessing it’s the Allstate claims severity based on the description and timing but I didn’t see any reference to it in the first place winner’s article:

I’m guessing I’ve got the wrong competition @jeremy? I’d love to see the link to the solution described in the lecture.

(Callum) #13

If anyone gets KeyError: 'ffmpeg' running the SGD notebook (in the animation cell) you probably need to install ffmpeg. In Ubuntu (Paperspace machine) I did:

sudo apt-get update
sudo apt install ffmpeg

Then restarted kernel and it worked. :slight_smile:
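A quick way to confirm from inside the notebook that the install worked, as a sketch using only the standard library (the helper name ffmpeg_available is made up here):

```python
import shutil

def ffmpeg_available():
    """Return True if an ffmpeg executable can be found on PATH."""
    return shutil.which("ffmpeg") is not None

print("ffmpeg found" if ffmpeg_available() else "ffmpeg not found on PATH")
```

If this still reports that ffmpeg is missing after installing, the kernel was probably started before the install and needs a restart.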

(Jeremy Howard) #15

@even Porto Seguro Winning Solution -- Representation learning

(Rodrigo Theodoro Rocha) #16

I’m facing the same issue. If anyone could pinpoint what is causing the differences in output loss values it would be of great value.


(Phani Srikanth) #17

Did anyone try the test function on the CharSeqRnn model? In Jeremy’s class, we saw the test function get_next used on the CharLoopConcat model, where there was only one output. However, for the multi-output case, say CharSeqRnn, there are 8 outputs. I adapted the get_next function to retrieve the indices of all 8 outputs by passing the result through torch.max, and generated the subsequent characters. However, I see gibberish. I’m not sure if I messed up get_next or if it is because of insufficient training.

Did anyone try this? If yes, could you point me to what I am doing wrong? The code I’m using is the following.

def get_next(inp):
    arr = T([char_indices[x] for x in inp])
    p = m(*V(arr))
    # Now, we have 8 outputs. Hence, the max is taken along the first axis to get 8 outputs (8 chars)
    i = np.argmax(to_np(torch.max(p, 1))[0], 1)
    return i

(Sarada Lee) #19

I followed this for Windows environment and then restarted kernel. :slight_smile:

(Sarada Lee) #20

Do you mind sharing the losses as well? Jeremy mentioned that the loss dropped from 1.30 to 1.25 then it started making sense. Try to train the model for another hour or two.

(Dave Luo) #21

Hi @binga,

Your adapted get_next function looks like it’s correctly retrieving and displaying the highest probability 8-char output sequence.
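The idea of picking the most likely character at every output position can be sketched in plain Python. Everything here is toy data: the 3-character vocabulary and the score rows are assumptions, and the real model returns a tensor rather than nested lists:

```python
# Toy scores: one row per output position, one column per vocabulary entry.
chars = ["a", "b", "c"]                 # hypothetical 3-character vocabulary
scores = [
    [-0.2, -2.0, -3.0],
    [-1.5, -0.3, -2.5],
    [-2.0, -1.0, -0.4],
]

def argmax(row):
    """Index of the largest score in one row."""
    return max(range(len(row)), key=row.__getitem__)

best = [argmax(row) for row in scores]          # one index per output position
decoded = "".join(chars[i] for i in best)

print(best, decoded)
```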

I think it’s actually correct that the first few characters look like gibberish because each n-th output character is trained on the first (n-1) characters in the input sequence.

Said differently, for an input sequence:

[40, 42, 29, 30, 25, 27, 29, 1]

the training label is off-set by one character:

[42, 29, 30, 25, 27, 29, 1, 1]

and the output probabilities are learned from whatever sequence of characters that have run through the RNN previously on that particular forward pass.

So for the 1st output, it’s trying to learn the label 42 from the first and only char input of 40

For the 2nd output, it’s trying to learn 29 from 40, 42

and so on…until the 8th output learning 1 from 40, 42, 29, 30, 25, 27, 29, 1

The earlier characters in the output are gibberish because they don’t have as much of the training sequence to learn from as the later characters.
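A short sketch that lists, for each output position, how much of the input the RNN has consumed and which label it is asked to predict. It uses the example sequence from this post, and the pairing logic is an illustration rather than code from the notebook:

```python
inputs = [40, 42, 29, 30, 25, 27, 29, 1]
labels = [42, 29, 30, 25, 27, 29, 1, 1]   # the offset-by-one labels quoted above

# For output n, the RNN has only consumed inputs[:n + 1] on this forward pass.
pairs = [(inputs[:n + 1], labels[n]) for n in range(len(inputs))]

for seen, target in pairs:
    print(len(seen), "chars seen ->", target)
```

The first output has a single character of context, while the last has all eight, which is why only the later outputs look sensible.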

If you look at just the last characters of each get_next() sequence in your CharSeqRnn test outputs, you’ll note that they match up with the get_next() test results earlier in the notebook from single-output models:

EDIT: sorry about the earlier delete & restore. I thought I understood this correctly but wanted to revisit the lectures to make sure…so now I’m slightly more sure? :slight_smile: Here’s the new stuff I learned:

I think this happens for this particular model because the hidden state is reset to zero at the start of each forward pass:

def forward(self, *cs):
    h = V(torch.zeros(1, bs, n_hidden))

In the lesson 6 video, someone asks a question about this very problem at the 2:06 mark and Jeremy mentions a solution is introduced in lesson 7.
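To see what resetting the hidden state does, here is a toy one-unit RNN in plain Python. The scalar weights and the input chunks are made up, and this is not the notebook's model, only an illustration of zeroing versus carrying h between chunks:

```python
import math

W_X, W_H = 0.5, 0.8   # made-up scalar weights for a one-unit toy RNN

def run_chunk(xs, h=0.0):
    """Run a one-unit tanh RNN over xs, starting from hidden state h."""
    for x in xs:
        h = math.tanh(W_X * x + W_H * h)
    return h

first, second = [1.0, -1.0, 0.5], [0.2, 0.4, -0.3]

h_reset = run_chunk(second)                       # state zeroed before the chunk
h_carried = run_chunk(second, run_chunk(first))   # state kept from the previous chunk

print(h_reset, h_carried)
```

When h starts from zero, the second chunk is processed as if nothing came before it. Carrying h over lets the earlier characters influence the later outputs.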

EDIT 2: here’s the section in lesson 7 video where Jeremy explains this problem of throwing away the accrued hidden activations (h) between each 8-char minibatch segment and introduces how Back Propagation Through Time (BPTT) solves this and its related wrinkles:

(Vishvananda Abrams) #22

I have a question about the 3 char model. I notice that the sequence generator does not generate all possible sequences. Here are the first two sequences:

40, 42, 29 -> 30
30, 25, 27 -> 29

There were two other possible sequences between the above sequences that are not used:

42, 29, 30 -> 25
29, 30, 25 -> 27

Generally I generate sequences with something like an optimized version of the following:

x = []
y = []
for i in range(len(idx)-cs + 1):

This ensures that every possible sequence is included. Is there some reason why half of the sequences were omitted? Or was this an accidental oversight?
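One way the loop above might be completed, as a sketch only: the loop body is an assumption, and the range stops at len(idx) - cs (one step earlier than written above) so that every window still has a following character to use as its target. The idx values are the first few indices quoted in this thread:

```python
cs = 3
idx = [40, 42, 29, 30, 25, 27, 29, 1]   # first few indices quoted in this thread

x, y = [], []
# Stop at len(idx) - cs so that idx[i + cs] (the target) always exists.
for i in range(len(idx) - cs):
    x.append(idx[i:i + cs])   # cs consecutive characters as input
    y.append(idx[i + cs])     # the character that follows them

for window, target in zip(x, y):
    print(window, "->", target)
```

This yields the two skipped sequences (42, 29, 30 -> 25 and 29, 30, 25 -> 27) as well as the two that the notebook's stride-of-three generator produces.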

EDIT: I may have asked the question a bit too early. It looks like the next section when generating 8 characters uses a method similar to my suggestion.