Wiki: Lesson 7

Jeremy, could you make a tutorial that covers more about data loaders for NLP tasks? Thanks :slight_smile:

I’m not sure if it’s just me, but the images were not laid out in the directories the way the library was expecting. After downloading and renaming the top-level directory to cifar10, I ran this script in each of the test and train folders to move the images into labeled directories.

declare -a classes=("plane" "automobile" "bird" "cat" "deer" "dog" "frog" "horse" "ship" "truck") && for i in "${classes[@]}"; do mkdir "${i}" && mv *"${i}".png "${i}"; done

It’s not just you, I’m also having the same issue.

This is how the dataset layout looks:

$ curl -sL | tar -tzf- | head

I’ve wanted to make a PR, but it seemed somewhat intrusive to make a pull request for the notebook.

Here is my Python solution:

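import os

def to_label_subdirs(path, subdirs, classes, labelfn):
    # Move every image in path/<subdir>/ into path/<subdir>/<label>/
    for sd in subdirs:
        for rf in os.listdir(os.path.join(path, sd)):
            af = os.path.join(path, sd, rf)
            if not os.path.isfile(af):
                continue  # skip directories, e.g. label folders that already exist
            lb = labelfn(rf)
            if not lb:
                continue  # skip files whose name yields no label
            # os.renames creates the label directory if it is missing
            os.renames(af, os.path.join(path, sd, lb, rf))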

Then, somewhere before the definition of get_data:

to_label_subdirs(PATH, 'train test'.split(), classes, lambda f: f[f.find('_')+1 : f.find('.')])

Also, it’s a good idea to make sure that all images have been moved to their corresponding label directories:

!find {PATH}train {PATH}test -maxdepth 1 -type f | wc -l

The output should be 0.


Now it’s not so nice because other threads pushed the wiki threads out of the top.

@rachel Is it possible to pin the wiki threads? They are very important for the course, but now one should search the forums to get to them. It can be quite confusing for people that are just starting the course (it certainly was confusing for me). There are links to wikis from the pages with video lectures, but I guess it would be nice to have them organized in one place.

The other option is to create an index page with links to the wikis, and pin the index page instead.

Thanks for this trick !

and use accuracy_np also !

Can someone explain it at
Why SGD will undo the normalization while BatchNorm works?

@jeremy Could you elaborate on that part a bit? After reading some articles, I get the intuition for why adding extra parameters could help in BatchNorm. But I still don’t get what you mean when you say:

  1. SGD will undo it
  2. Why adding scaling parameters addresses this “undo” issue.

You can try reading the original BatchNorm paper; it’s quite accessible. In section 2, they give a little example of a layer that adds a learned bias and then centers the result (that is, subtracts the mean). It turns out that if you write the expression for the layer output after the gradient update, the bias update term cancels out. So, even when the optimization procedure changes the bias parameter, the update changes neither the layer output nor the loss.

It is not shown explicitly, but they claim that the same thing happens if you scale the input.
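The cancellation described above can be checked numerically. This is only a minimal NumPy sketch of the "add a bias, then subtract the mean" layer, not code from the paper or from the course; the function name `centered_layer` and all the numbers are made up for illustration.

```python
import numpy as np

def centered_layer(u, b):
    """Add a bias b to the inputs u, then subtract the batch mean."""
    x = u + b
    return x - x.mean()

rng = np.random.default_rng(0)
u = rng.normal(size=8)   # hypothetical batch of layer inputs
b = 0.5                  # bias value before the update (made up)
delta_b = -0.37          # an arbitrary gradient step on b (made up)

before = centered_layer(u, b)
after = centered_layer(u, b + delta_b)

# The update shifts every x by delta_b, but the mean shifts by delta_b too,
# so the centered output is the same before and after the update.
print(np.allclose(before, after))
```

Since the output does not move, the loss does not move either, and the optimizer can keep changing `b` without any effect.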

Here is a relevant excerpt from the paper (sorry for posting it as an image, but I can’t find how to typeset math in Discourse):


Thx, I actually just printed the paper out, will have a look soon.

Note that you need to use accuracy_np(preds,y) rather than accuracy(preds,y)

Unless something has changed: during the January 2018 session I ran it with the accuracy() function and everything worked perfectly.
Even though accuracy_np() works too, I think the only difference is that accuracy() is the torch version and accuracy_np() is the NumPy version of accuracy; please look at the function here:

Since the fastai library is continuously evolving (with new improvements), there is also a new CPU-only version that I will try later, if I get a bit of free time.
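To make the torch-versus-NumPy distinction above concrete, here is a rough sketch of what a NumPy accuracy function typically computes. It is not the fastai source: the name `accuracy_numpy_sketch` and the sample arrays are hypothetical, and the real `accuracy_np` may differ in its details.

```python
import numpy as np

def accuracy_numpy_sketch(preds, targs):
    """Fraction of rows whose highest-scoring class matches the target."""
    predicted_classes = np.argmax(preds, axis=1)
    return (predicted_classes == targs).mean()

# Made-up predictions for 3 samples and 2 classes.
preds = np.array([[0.1, 0.9],
                  [0.8, 0.2],
                  [0.3, 0.7]])
y = np.array([1, 0, 0])

# Two of the three predicted classes match the targets.
print(accuracy_numpy_sketch(preds, y))
```

A function like this expects NumPy arrays, which would explain why the NumPy variant is the one to use when the predictions have already been converted from tensors.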

Here’s a go from my level of understanding… Let’s say SGD wants to push an activation up, i.e. increase its mean (while keeping its shape / variance). BN would pull it back down, and SGD would push it up again, but there’s no easy way for SGD to do so, so it would try changing lots of weights. BN would pull it down again. = fail.
Now, with the extra parameters: all the previous weights stay the same, SGD changes just the one added parameter, and up it goes. SGD happy, BN happy.
You can do a similar thought experiment for the m multiplier.
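The thought experiment above can be played out in a few lines of NumPy. This is a simplified sketch over a single feature, not the fastai or PyTorch implementation; `batchnorm_sketch`, `gamma` (the multiplier) and `beta` (the added shift) are names chosen for illustration.

```python
import numpy as np

def batchnorm_sketch(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize x to zero mean / unit variance, then scale and shift."""
    x_hat = (x - x.mean()) / np.sqrt(x.var() + eps)
    return gamma * x_hat + beta

x = np.array([1.0, 2.0, 3.0, 4.0])   # made-up activations
pushed_up = x + 10.0                  # upstream weights try to raise the mean

# Without a learnable shift, the push is normalized away again.
plain = batchnorm_sketch(x)
undone = batchnorm_sketch(pushed_up)
print(np.allclose(plain, undone))

# With the learnable shift, one parameter moves the mean directly.
shifted = batchnorm_sketch(x, beta=2.0)
print(shifted.mean())

# The multiplier plays the same role for the spread.
scaled = batchnorm_sketch(x, gamma=3.0)
print(scaled.std())
```

The upstream weights are untouched in the last two calls; only `beta` or `gamma` changes, which is the "SGD happy, BN happy" case.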

How can I draw the CAM plot for the Dog class? I’m confused where the Cat class is specified in


I have a question regarding the stateful RNNs, and the behaviour of the hidden state during training and inference.

Training phase
Within a given chunk of text (one of the 64 pieces of text if batch size = 64 for example), assuming bptt = 8 as in the notebook example.

  1. Batch 0. We take Characters 0 to 7 to predict characters 1 to 8. The hidden state at the end of batch 0 corresponds to the hidden state value to predict character 8 (result of the RNN time iteration at character 7).
  2. Batch 1. We take Characters 8 to 15 to predict characters 9 to 16. For the first time iteration, we have the hidden state corresponding to the previous prediction at step 7, which contains the history of steps 0 to 6. Therefore this allows making stateful predictions with a hidden state taking the old values into account and the new and old positions match for the hidden state.
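The indexing in the two training steps above can be sketched as follows. This is a simplification that uses a fixed bptt and a single chunk of text; `bptt_batches` is a hypothetical helper, not a fastai function, and the real data loader may choose sequence lengths differently.

```python
def bptt_batches(n_chars, bptt):
    """Yield (input_indices, target_indices) for consecutive batches of one chunk."""
    # Stop early enough that the last target index still exists.
    for start in range(0, n_chars - bptt, bptt):
        inputs = list(range(start, start + bptt))
        targets = list(range(start + 1, start + bptt + 1))
        yield inputs, targets

batches = list(bptt_batches(n_chars=24, bptt=8))
for i, (inputs, targets) in enumerate(batches):
    print(f"batch {i}: inputs {inputs[0]}..{inputs[-1]}, targets {targets[0]}..{targets[-1]}")
```

Batch 1 starts exactly where batch 0 stopped, so the hidden state carried over from the last step of batch 0 lines up with the first input of batch 1.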

Inference phase
Reading the code at the end of the notebook, the last cell generates a 400-character prediction. The text is built character by character as follows:

  1. Use characters 0 to 7 to predict character 8. Hidden state corresponds to character 7.
  2. Use characters 1 to 8 to predict character 9. For the first character we take a hidden state “representing” the history at character 7 when we start the loop and make predictions for character 2.
  3. Use characters 2 to 9 to predict character 10. For the first character we take a hidden state “representing” the history at character 8 to make predictions for character 3.

Basically I feel like the hidden state we should use as we loop through is not the hidden state at the end of a BPTT sequence, but the hidden state at the second iteration of the RNN loop.

Did I miss something?

hi, just started to watch the videos
when talking about the same thing, it’s described as
48:50 - neural net one hidden layer
49:25 - neural net no hidden layer

I’m a little confused about this line regarding stateful RNN:

m = CharSeqStatefulRnn(md.nt, n_fac, 512).cuda()

We pass in 512 as the batch size, but it should be 64, right? (We set bs=64 and pass it in as a param to LanguageModelData.from_text_files.) I logged the size of self.h, and after the first iteration the dimension corrects to [1, 64, 256], since we check whether self.h.size(1) != bs, but I was a bit confused about where 512 came from.
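The self-correcting behaviour described above can be reproduced with a small stand-in class. This is a NumPy sketch, not the notebook's CharSeqStatefulRnn: the class name, the `(batch, sequence)` input layout, and the fact that `forward` only returns the hidden-state shape are all assumptions made to isolate the size check.

```python
import numpy as np

class StatefulHiddenSketch:
    """Hypothetical stand-in that only models the hidden-state bookkeeping."""

    def __init__(self, init_bs, n_hidden):
        self.n_hidden = n_hidden
        self.init_hidden(init_bs)

    def init_hidden(self, bs):
        # Same [1, bs, n_hidden] layout as the sizes quoted in the post.
        self.h = np.zeros((1, bs, self.n_hidden))

    def forward(self, batch):
        bs = batch.shape[0]            # assumed layout: (batch, sequence)
        if self.h.shape[1] != bs:      # hidden state sized for another batch size
            self.init_hidden(bs)       # rebuild it to match the incoming batch
        return self.h.shape

m = StatefulHiddenSketch(init_bs=512, n_hidden=256)
shape_before = m.h.shape
shape_after = m.forward(np.zeros((64, 8)))
print(shape_before, shape_after)
```

Under these assumptions the constructor argument only sets the initial size of the hidden state, and the first real batch replaces it.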

@rachel @jeremy

I have written a blog post on generating your own music using RNNs. Hope you enjoy it.


Has anyone else tried to get the CAM code at the end working on specific input images? I can get a prediction for a test image (using learn.predict_array), but I’m struggling to work out how to get the heatmap that shows the different parts of the image for an individual test image.

@jeremy I have a doubt regarding ResNet.
The ResNet block simply computes x + f(x), where x is the input and f(x) is the output of BnLayer. But the dimensions of the input and output won’t be the same, so how can the two be added?
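One way to reason about the shape question: x + f(x) is only well-defined when f returns something with the same shape as x, which means the same number of channels and the same spatial size. The sketch below uses the standard convolution output-size formula to show which settings preserve the spatial size. The 3x3 kernel with padding 1 is an assumption about f, and `conv_out_size` is a hypothetical helper, so check the notebook's layer definitions for the actual values.

```python
def conv_out_size(n, kernel_size, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    return (n + 2 * padding - kernel_size) // stride + 1

# Stride 1 with a 3x3 kernel and padding 1 keeps a 32-pixel side unchanged,
# so the output could be added to the input (if the channel counts also match).
same = conv_out_size(32, kernel_size=3, stride=1, padding=1)

# Stride 2 halves the side, so this output could not be added to the input directly.
halved = conv_out_size(32, kernel_size=3, stride=2, padding=1)

print(same, halved)
```

So a residual layer can add its input only if it is configured to keep the shape; a layer that downsamples or changes the channel count would need to sit outside the x + f(x) sum.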