Lesson 8 (2019) discussion & wiki

The idea of developing entirely in Jupyter is great. I love this tool. But something still confuses me. After exporting all the dev_nb notebooks to nb.py files, one needs to reintegrate them into the library, right? How can we do that other than by copy-pasting the code?

How about testing? I read some threads about testing in fastai, and it seems that tests are run in a plain Python environment rather than in Jupyter notebooks. Does anyone use pytest in Jupyter? I searched for this and someone said it is not recommended. After reading Radek's 2 blog posts about machine learning pipelines, I now think I need to test my code. The notebook is so flexible that it sometimes makes my environment very messy.

Thank you in advance,

Yup, you copy and paste the code. @sgugger did that to create fastai v1 from the notebooks, and it took him 3 hours total. We still build our new features and tests in notebooks then paste to modules.

I showed one way to run tests in notebooks in this lesson. Check the run_notebook.py script in the course repo.

We have a more complete implementation in the docs directory of fastai, which runs the entire set of documentation notebooks as tests, using a simple pytest plugin.


For anyone who needs remedial work with torch tensors, pytorch has a nice jupyter notebook here: PyTorch Tensor Tutorial


Thanks! Could you add that to the wiki top post too?

Thank you so much for the pointer!

I found that a good ol’ stage-by-stage print helped me better understand the broadcasting matmul:

import torch
from torch import tensor

def matmul(a,b):
    ar,ac = a.shape
    br,bc = b.shape
    assert ac==br
    c = torch.zeros(ar, bc)
    print(a, "a")
    print(b, "b")
    for i in range(ar):
#       c[i,j] = (a[i,:]          * b[:,j]).sum() # previous
        c[i]   = (a[i  ].unsqueeze(-1) * b).sum(dim=0)
        print()
        print(a[i  ].unsqueeze(-1).expand_as(b))  # a[i] as a column, broadcast to b's shape
        print(b)                                   # multiplied element-wise with b
        print(c[i])                                # summed over dim 0 to give row i of c
    return c

m1 = tensor([[1., 1., 1.],
             [2., 2., 2.]])
matmul(m1, m1.t())


tensor([[1., 1., 1.],
        [2., 2., 2.]]) a
tensor([[1., 2.],
        [1., 2.],
        [1., 2.]]) b


tensor([[1., 1.],
        [1., 1.],
        [1., 1.]])
tensor([[1., 2.],
        [1., 2.],
        [1., 2.]])
tensor([3., 6.])


tensor([[2., 2.],
        [2., 2.],
        [2., 2.]])
tensor([[1., 2.],
        [1., 2.],
        [1., 2.]])
tensor([ 6., 12.])
tensor([[ 3.,  6.],
        [ 6., 12.]])

So you can easily see how a[i] is expanded into a matrix, multiplied element-wise with b, and then summed to give c[i] (the last 3 tensor printouts of each loop).

And to further understand c[None] > c[:,None]:

x = c[None,:] * c[:,None]
c[None] > c[:,None]

gives output that is easy to understand:

tensor([[10., 20., 30.],
        [10., 20., 30.],
        [10., 20., 30.]])
tensor([[10., 10., 10.],
        [20., 20., 20.],
        [30., 30., 30.]])
tensor([[0, 1, 1],
        [0, 0, 1],
        [0, 0, 0]], dtype=torch.uint8)

(I have jupyter configured to print all outputs, not just the last one - handy!)
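
A small sketch of the same comparison with the broadcast made explicit. It assumes c = tensor([10., 20., 30.]), which is inferred from the printout above, and uses torch.broadcast_tensors only to show the expanded views that the comparison works on implicitly.

```python
import torch

c = torch.tensor([10., 20., 30.])   # assumed values, read off the printout above

row = c[None, :]   # shape (1, 3): c laid out as a row
col = c[:, None]   # shape (3, 1): c laid out as a column

# Make the implicit broadcast visible: both become (3, 3) views.
row_b, col_b = torch.broadcast_tensors(row, col)
print(row_b)
print(col_b)

# Element [i, j] is c[j] > c[i]; no explicit expansion is needed.
cmp = row > col
print(cmp)
```

Recent PyTorch versions return a bool tensor for the comparison rather than uint8.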


Hello everyone !

I thought the whole time that:
y_hat = lin2(relu(lin1(x)))
and then we did:
loss = mse(y, y_hat)
to get our loss.

Am I missing something here ?

Thanks for your help !

(not sure if I can @ jeremy for that kind of thing).


Let’s go to the tape: at 1:50:40 or so Jeremy has a correction. @jeremy has a bubble which says:

“Oops! I should have said ‘gradient with respect to the parameters,’ not ‘gradient with respect to the input.’”



Your math is right for the intermediate values of y_hat computed in the forward pass (left to right, from the input layer to the output layer).

My understanding is that Jeremy meant that the FINAL y_hat is arrived at gradually through backpropagation (right to left, from output to input): at each intermediate training stage, mse compares the intermediate y_hat, which is lin2(relu(lin1(x))), with the desired output y.

So, you can read Jeremy’s expression as follows:

The final y_hat is arrived at by repeatedly applying mse to compare the intermediate y_hat values, which are lin2(relu(lin1(x))), with y.

I hope this helps.
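
To make the forward pass and the "gradient with respect to the parameters" point concrete, here is a minimal sketch. The layer sizes, the random data, and the helper names (`lin`, `relu`, `mse`) are made up for illustration and are not the lesson's notebook code.

```python
import torch

torch.manual_seed(0)
x = torch.randn(5, 3)   # 5 samples, 3 features (made-up sizes)
y = torch.randn(5)      # made-up targets

# Parameters of the two linear layers; requires_grad makes autograd track them.
w1 = torch.randn(3, 4, requires_grad=True)
b1 = torch.zeros(4, requires_grad=True)
w2 = torch.randn(4, 1, requires_grad=True)
b2 = torch.zeros(1, requires_grad=True)

def lin(x, w, b): return x @ w + b
def relu(x): return x.clamp_min(0.)
def mse(output, targ): return (output.squeeze(-1) - targ).pow(2).mean()

y_hat = lin(relu(lin(x, w1, b1)), w2, b2)   # forward pass
loss = mse(y_hat, y)                        # scalar loss
loss.backward()                             # fills .grad on the parameters

print(w1.grad.shape, w2.grad.shape)
```

After backward(), each parameter has a .grad of its own shape, while x (which does not require grad) has none.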


Wow - great article! Coding down to the GPU!

Here is the link to Sylvain’s talk: Fast.ai — An infinitely customizable training loop

Edit: The talk starts at 7:32. I’ve edited the above link to start from the same time. I didn’t add this link to the wiki since I wasn’t sure if this would be covered in a later lesson.


As Jeremy said, “We almost never actually write python code. We write code in python that gets turned into some other language or library and that’s what gets run.” (Lecture 1, 13:45)

I perused some PyTorch documentation to understand how it works under the hood, specifically with respect to low-level data management.

It turns out that the main components of Pytorch code - like Tensors - are converted to lower-level code defined here: https://pytorch.org/cppdocs/

Interestingly, it appears the ATen tensor library is the lowest-level tensor/matrix data type. Its api provides many common tensor operations (e.g. add, ones, randn).

A slightly higher level of abstraction is the torch:: library, which provides data types that seem to manage multiple ATen tensors. (I was looking for the source code module that defines this but couldn’t find it; any guidance would be appreciated :slight_smile: ) This is particularly useful when you want the library to accommodate common use cases like tensor differentiation during backpropagation, e.g. by invoking a torch:: factory function with the requires_grad flag, as in:

torch::Tensor a = torch::ones({2, 2}, torch::requires_grad());
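
For comparison, a sketch of what I believe is the Python-side counterpart of that C++ line, followed by a tiny backward pass so the effect of requires_grad is visible. The multiplication by 3 is an arbitrary example.

```python
import torch

# Python counterpart of the C++ factory call above.
a = torch.ones(2, 2, requires_grad=True)

# Any scalar computed from `a` can be differentiated back to it.
out = (a * 3).sum()
out.backward()

print(a.grad)   # d(out)/d(a): every entry contributes with weight 3
```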

I’m trying to wrap my head around how the torch library might be implemented in code. I imagine the largest memory consumption on a gpu while it is processing data must come from the sheer size of the tensors involved. In the gpu implementation of torch, does the library interface with the gpu device’s api (e.g. CUDA on Nvidia devices) for optimized memory and processing-power usage?
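
On the memory side, a rough sketch of how tensor size translates into bytes, plus the usual way to place a tensor on a CUDA device from Python when one is present. It assumes the default float32 dtype and says nothing about how the library manages that memory internally.

```python
import torch

t = torch.zeros(1000, 1000)                 # float32 by default
n_bytes = t.element_size() * t.nelement()   # bytes per element * number of elements
print(n_bytes / 1e6, "MB")

# Move to the GPU only if one is available; otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
t = t.to(device)
print(t.device)
```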

I also read through this recommended article a couple times:

I’d like to understand how exactly the higher-order matrix computations are performed logically in the gpu. In the section where the author discusses “element-wise binary operations”, it appears that there’s potentially lots of data (like millions of parameter values) that needs to be managed at one time.

Any comments or suggestions are appreciated!



In Lesson 8 Jeremy talks about making sure you normalize your validation set with the stats of your training set. Does .normalize() handle this for us automatically? I tried to track it through the source code but I couldn’t fully understand it. By experimenting in a notebook I can see that my training set, validation set, and test set are all 3 being normalized, but what stats are being used to normalize when I call .normalize() on my databunch? Is it the mean/stddev of my training set and where can I find this in the source code? I ran aground at self.add_tfm(self.norm). Thank you

It should be in the next lesson, so let’s keep the link for then.
Or maybe don’t listen to me and only to Jeremy :wink:

Yeah, if stats are not provided, it picks a single batch from your data and computes the statistics from it: https://github.com/fastai/fastai/blob/master/fastai/vision/data.py#L177 You can check the batch_stats implementation at that link.


From that code it looks like it is normalizing from the validation set? The way I’m interpreting it is “If a data loader for the validation set exists then use that, otherwise use the training set”. Then it grabs one_batch of that type and generates all the stats defined by funcs. What am I missing? Thanks

def batch_stats(self, funcs:Collection[Callable]=None)->Tensor:
        "Grab a batch of data and call reduction function `func` per channel"
        funcs = ifnone(funcs, [torch.mean,torch.std])
        ds_type = DatasetType.Valid if self.valid_dl else DatasetType.Train
        x = self.one_batch(ds_type=ds_type, denorm=False)[0].cpu()
        return [func(channel_view(x), 1) for func in funcs]
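
To see what the last line of batch_stats computes, here is a standalone sketch on a random batch. The `channel_view` below is my own stand-in (channels moved first, everything else flattened), assumed to behave like the library helper of the same name; the batch shape is made up.

```python
import torch

def channel_view(x):
    # Stand-in helper (assumption): (batch, channels, ...) -> (channels, everything else)
    return x.transpose(0, 1).contiguous().view(x.shape[1], -1)

torch.manual_seed(0)
batch = torch.randn(8, 3, 16, 16)   # (batch, channels, height, width), made-up sizes

funcs = [torch.mean, torch.std]
mean, std = [func(channel_view(batch), 1) for func in funcs]

print(mean.shape, std.shape)   # one value per channel for each statistic
```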

Maybe this can help https://github.com/pytorch/pytorch/issues/2159 and https://discuss.pytorch.org/t/why-does-the-linear-module-seems-to-do-unnecessary-transposing/6277/6


That is correct.

Hmmm I’m confused. Jeremy says in lecture here that we should be using the training set stats to normalize both the training set and the validation set. Why does the fastai library use the validation stats? I guess in most cases we probably use split_by_rand_pct and it doesn’t matter, but is there some reason to use validation instead of training?

I think the important thing is that they’re normalized in the same way.
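
A minimal sketch of "normalized in the same way", here using the training-set statistics for both sets as described in the lecture. The data is random and the `normalize` helper is illustrative, not the fastai implementation.

```python
import torch

torch.manual_seed(0)
x_train = torch.randn(100, 4) * 3 + 5   # made-up data with mean ~5, std ~3
x_valid = torch.randn(20, 4) * 3 + 5

def normalize(x, m, s):
    return (x - m) / s

# Compute the stats once, on the training set only...
train_mean, train_std = x_train.mean(), x_train.std()

# ...and apply the same stats to both sets.
x_train_n = normalize(x_train, train_mean, train_std)
x_valid_n = normalize(x_valid, train_mean, train_std)

print(x_train_n.mean(), x_train_n.std())
print(x_valid_n.mean(), x_valid_n.std())
```

The training set comes out with mean close to 0 and std close to 1; the validation set will be near those values but not exactly, which is expected.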