Fastbook Chapter 4 questionnaire solutions (wiki)

Thank you for your feedback. I think the subtle difference here is that a decorator is not an operator. So the answer should be correct.

1 Like

I think it is 10, 12, 16, 18. :slight_smile:

I think the descriptions of the L1 & L2 norms are switched; they should be L1 (MAE) and L2 (RMSE).

Actually, L2 is MSE and the square root of L2 is RMSE, right?
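For what it's worth, here is how the three quantities relate, with made-up numbers (my own sketch, not from the book):

```python
import torch

preds = torch.tensor([1.0, 2.0, 3.0])
targets = torch.tensor([1.5, 2.0, 2.0])

l1 = (preds - targets).abs().mean()    # L1 norm / mean absolute error (MAE)
mse = ((preds - targets)**2).mean()    # mean squared error (MSE)
rmse = mse.sqrt()                      # root mean squared error (RMSE)
```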

Actually, shouldn't it be [[10, 12], [16, 18]]?
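Assuming the exercise's starting tensor is [[1,2,3],[4,5,6],[7,8,9]], doubling it and slicing out the bottom-right four numbers does give exactly that:

```python
import torch

t = torch.tensor([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])
doubled = t * 2                 # every entry multiplied by 2
bottom_right = doubled[1:, 1:]  # last two rows, last two columns
print(bottom_right)             # tensor([[10, 12], [16, 18]])
```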

1 Like

Thanks for pointing that out. Sorry for the mistake!

1 Like

Both are from PyTorch. F is torch.nn.functional.

“An nn.Module is actually an OO wrapper around the functional interface that contains a number of utility methods, like eval() and parameters(), and it automatically creates the parameters of the modules for you.
You can use the functional interface whenever you want, but that requires you to define the weights by hand.”
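A minimal sketch of the difference (my own example, contrasting `nn.Linear` with `F.linear`):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(4, 3)

# Module interface: the layer object creates and tracks weight and bias itself
layer = nn.Linear(3, 2)
out_module = layer(x)

# Functional interface: you define and pass the parameters by hand
w = torch.randn(2, 3, requires_grad=True)
b = torch.zeros(2, requires_grad=True)
out_functional = F.linear(x, w, b)
```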

1 Like

Where did you find that it says “additional function that provides non-linearity”?
I thought it was the “only” function.

Hi ilovescience, hope you're having a wonderful day!

Thank you for the outstanding effort you put into answering the questionnaires.

Although I am still answering the questions myself to help my own understanding, I find your efforts a wonderful help for checking my answers when I finish.

I would help more, but you're way quicker than me.

Cheers mrfabulous1 :smiley: :smiley:

1 Like

Fixed! I got confused for a moment there :slight_smile:

I meant it is an additional function that is part of the neural network (apart from y=mx+b). Is it clearer now?
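One way to see why that extra function matters: without a nonlinearity between them, two stacked linear layers collapse into a single linear layer, so the ReLU is what lets the network express more than y = mx + b. A quick sketch with arbitrary shapes (my own example):

```python
import torch

x = torch.randn(5, 3)
w1, b1 = torch.randn(3, 4), torch.randn(4)
w2, b2 = torch.randn(4, 2), torch.randn(2)

# Two stacked linear layers with no activation in between...
two_linear = (x @ w1 + b1) @ w2 + b2
# ...are algebraically just one linear layer
collapsed = x @ (w1 @ w2) + (b1 @ w2 + b2)
print(torch.allclose(two_linear, collapsed, atol=1e-5))

# Inserting a ReLU breaks the collapse, so depth actually adds capacity
with_relu = torch.relu(x @ w1 + b1) @ w2 + b2
```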

Thanks for the feedback.

Thanks for your feedback. I am glad to hear that people are appreciative of this work :slight_smile:

Haha. I am slowing down now, and getting quite busy with other stuff. I haven’t really gotten a chance to touch the questionnaire for the previous lesson over here :slightly_frowning_face: . I will get back to it this week, but if you want to help, feel free to answer a few questions you think you know the answer to :slight_smile:

1 Like

If you are having trouble with the further research questions at the end of the chapter, I've tackled them in this blog post.



I figured out the issue: the labels I created were indexed incorrectly when the loss function was applied. I've updated the code inline and kept my mistake for others to see.

If anyone has thoughts on how I can replace the for loops with broadcasting, or whether that's even a good idea, let me know!

Original Question

Hey @davidsalazarvergara

I’m wondering if you tried to implement all the code from scratch? I tried to and I think I’m tripping up somewhere in the definition of the loss function and/or the metric. Any help will be much appreciated!

Thanks a lot!

def create_xy(path):
  inputs = []
  targets = []
  count = 0 # class index for the current digit folder
  for folder in path.ls().sorted():
    tensors = [tensor( for o in folder.ls()]
    stacked_tensor = (torch.stack(tensors).float()/255).view(-1, 28*28)
    inputs.append(stacked_tensor)
    targets.append(tensor([count]*len(folder.ls()))) # replaced num with count
    count += 1 # increment count for the next class
  x =
  y = torch.cat(targets)

  return x,y

train_x, train_y = create_xy(training) #created tensors from training data
test_x, test_y = create_xy(testing) #created tensors from test data

train_dset = list(zip(train_x,train_y)) #create training dataset
test_dset = list(zip(test_x,test_y)) #create test dataset

train_dl = DataLoader(train_dset, batch_size=256, shuffle=True) #create training dataloader
test_dl = DataLoader(test_dset, batch_size=256, shuffle=False) #create test dataloader

def init_params(size, std=1.0): 
  return (torch.randn(size)*std).requires_grad_()

# initialise weights and biases for each of the linear layers

w1 = init_params((28*28,30))

b1 = init_params(30)

w2 = init_params((30,10)) # 10 final activations

b2 = init_params(10) # 10 final activations

params = w1,b1,w2,b2

# The model: two linear layers with a ReLU in between
def simple_net(xb): 
    res = xb@w1 + b1 # first linear layer that performs matrix multiplication and creates a set of activations
    res = res.max(tensor(0.0)) # non linear ReLU layer takes activations as inputs and makes all negative values zero
    res = res@w2 + b2 # second linear layer takes inputs from ReLU and performs another matrix multiplication and creates activations
    return res

def cross_entropy_loss(predictions, targets):
  # log_softmax (rather than plain softmax) gives an actual cross-entropy
  # and avoids the precision problems discussed later in this thread
  log_probs = torch.log_softmax(predictions, dim=1)
  idx = range(len(predictions))
  return -log_probs[idx, targets].mean()

def calc_grad(xb, yb, model):
    preds = model(xb)
    loss = cross_entropy_loss(preds, yb)
    loss.backward() # populate p.grad for every parameter

lr = 0.01

def train_epoch(model, lr, params):
    for xb,yb in train_dl:
        calc_grad(xb, yb, model)
        for p in params:
   -= p.grad*lr # step on, so autograd ignores the update
            p.grad.zero_() # reset gradients so they don't accumulate

def batch_accuracy(xb, yb):
    preds = torch.softmax(xb, dim=1)
    accuracy = torch.argmax(preds, dim=1) == yb
    return accuracy.float().mean()

def validate_epoch(model):
    accs = [batch_accuracy(model(xb), yb) for xb,yb in test_dl]
    return round(torch.stack(accs).mean().item(), 4)

for i in range(20):
    train_epoch(simple_net, lr, params)
    print(validate_epoch(simple_net), end=' ')

I have spent some time working on building a model for the full MNIST problem, my full code is here.

While trying not to use fastai/pytorch built-in stuff, I built my own loss function, in which I tried to generalize what was done during the lesson:

def myloss(predictions, targets):

  if targets.ndim == 1:
    targets = targets.unsqueeze(1)
  targets_encoded = torch.zeros(len(targets), 10)
  targets_encoded.scatter_(1, targets, 1)

  return torch.where( targets_encoded==1, 1-predictions, predictions ).mean()

Here I one-hot encode the targets, e.g. 3 becomes tensor([0,0,0,1,0,0,0,0,0,0]) and then apply the same logic as in the lesson. Further down in the code I also test it on a few examples and it indeed behaves as expected.
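In case the `scatter_` call is unclear, here it is in isolation (a standalone snippet of the same one-hot step):

```python
import torch

targets = torch.tensor([3, 0, 7]).unsqueeze(1)   # column of class indices, shape (3, 1)
encoded = torch.zeros(len(targets), 10)
encoded.scatter_(1, targets, 1)                  # write a 1 at each row's target column
print(encoded[0])  # tensor([0., 0., 0., 1., 0., 0., 0., 0., 0., 0.])
```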

Nevertheless I see that when training the model, the accuracy increases at first but then drops. Here is a plot showing this behaviour, compared to an identical model using built-in cross entropy as loss:

[Plot: validation accuracy per epoch; the myloss model's accuracy rises then drops, while the built-in cross entropy model keeps improving]

Digging a bit deeper into what happens, it turns out that myloss is actually pushing all the predictions to be 0, instead of having the prediction corresponding to the target to tend towards one. See the following:

Predictions on a few 0-images from the model trained with myloss:

tensor([[3.9737e-06, 2.3754e-05, 3.7458e-06, 2.1279e-06, 3.1777e-06, 4.1798e-06, 3.5480e-06, 4.4862e-06, 2.9011e-06, 3.1170e-06],
        [3.2510e-05, 1.5322e-04, 2.9165e-05, 2.1045e-05, 2.8467e-05, 3.2954e-05, 3.0909e-05, 3.4809e-05, 2.4691e-05, 2.8036e-05],
        [1.4162e-10, 4.1921e-09, 8.6994e-11, 4.9182e-11, 9.4531e-11, 1.4529e-10, 1.0986e-10, 2.0410e-10, 9.2959e-11, 7.7468e-11],
        [4.8831e-05, 1.5990e-04, 5.0114e-05, 2.7525e-05, 3.4216e-05, 3.3996e-05, 5.0872e-05, 4.6151e-05, 2.8764e-05, 2.9847e-05],
        [1.3763e-05, 6.3028e-05, 1.2435e-05, 8.1820e-06, 1.0536e-05, 1.3688e-05, 1.3276e-05, 1.5969e-05, 8.7765e-06, 1.0267e-05]], grad_fn=<SigmoidBackward>)

predictions on the same 0-images from the model trained with the built-in cross entropy:

tensor([[9.9997e-01, 1.9660e-10, 2.8802e-05, 7.1700e-05, 3.9799e-11, 2.1466e-04, 1.3326e-05, 1.7063e-04, 6.1224e-06, 5.6696e-06],
        [9.9806e-01, 7.7187e-10, 3.2351e-04, 1.9475e-05, 2.1741e-06, 1.4926e-01, 2.7456e-04, 2.0312e-05, 7.7267e-03, 9.0754e-05],
        [7.1219e-01, 4.2656e-10, 2.6540e-09, 6.5700e-04, 9.7222e-09, 4.9841e-04, 3.9048e-07, 5.9277e-09, 6.7378e-04, 6.5973e-07],
        [9.9956e-01, 7.8313e-11, 1.4271e-01, 1.7383e-03, 2.3370e-09, 2.2956e-05, 2.3185e-03, 1.6754e-06, 4.0645e-05, 7.0746e-09],
        [9.9985e-01, 4.5725e-10, 6.3417e-03, 1.8504e-04, 3.7823e-11, 1.4808e-04, 5.6004e-05, 4.3960e-06, 6.0555e-03, 2.3748e-04]], grad_fn=<SigmoidBackward>)

As you can see, in the first batch of predictions all the numbers are basically 0, while in the second the first column (corresponding to the 0-images in one-hot encoding) is basically 1.

Now it is clear that myloss is not behaving as expected, but I can't really understand why. Can someone help? I have spent so much time looking at it and testing it that I've kinda run out of ideas…

Answer to my own post :slight_smile:

After having started the next chapter, I got to know about softmax and its details. I then implemented it in my own code in myloss2:

def myloss2(predictions, targets):

  sm = torch.softmax(predictions, dim=1)
  idx = tensor(range(len(targets)))

  return sm[idx, targets].mean()

and surprise surprise … the result was still the same as above! The model trained with myloss2 had exactly the same behaviour as the one trained with myloss!!!
That’s a shame, because I was very optimistic about using softmax.

Then I went a step further and simply replaced torch.softmax with torch.log_softmax and sm[idx, targets].mean() with F.nll_loss(...).mean() and voila! The model trained with the log version of myloss2 and the model trained with the built-in cross entropy give equivalent results!

So, also in my tiny and simple model I was already getting precision problems and the log got me out of it! Long live the log!
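The underflow is easy to reproduce in isolation (my own toy example): with a large logit gap, `softmax` rounds the small probability down to exactly 0, so taking `log` afterwards gives `-inf`, while `log_softmax` computes the same quantity without ever forming the tiny probability:

```python
import torch

logits = torch.tensor([[0.0, 200.0]])

naive = torch.log(torch.softmax(logits, dim=1))   # softmax underflows to 0, log(0) = -inf
stable = torch.log_softmax(logits, dim=1)         # stays finite

print(naive)   # tensor([[-inf, 0.]])
print(stable)  # tensor([[-200., 0.]])
```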

1 Like

I ended up writing a blog post about this problem I had. I'd say it was a huge learning experience for me :slight_smile:


Hey! Can you explain what the problem is with having 0 as an output? If we add the bias to a zero, then every output that would have been zero would basically become the bias, wouldn't it?
Also, why do we have to use the “slope-intercept form” to represent the parameters? I simply can't see the relation.

What are the “bias” parameters in a neural network? Why do we need them?

Without the bias parameters, if the input is zero, the output will always be zero. Therefore, using bias parameters adds additional flexibility to the model.
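A tiny illustration of that answer (toy numbers of my own):

```python
import torch

x = torch.zeros(1, 3)            # an all-zero input
w = torch.randn(3, 2)
b = torch.tensor([0.5, -1.0])

print(x @ w)      # tensor([[0., 0.]]) : without a bias, zero in means zero out
print(x @ w + b)  # tensor([[ 0.5000, -1.0000]]) : the bias lets the output shift
```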

Regarding Question 8: “How can you apply a calculation on thousands of numbers at once, many thousands of times faster than a Python loop?”

I am assuming the answer is broadcasting. Can someone let me know if there are other mechanisms too?

Karthikeyan Muthu.
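As far as I understand, broadcasting is the main mechanism the chapter has in mind; more generally, it's the fact that elementwise tensor operations run in optimized C/CUDA code instead of the Python interpreter. A rough comparison (my own sketch, timings will vary by machine):

```python
import time
import torch

a = torch.randn(100_000)
b = torch.randn(100_000)

# Python loop: one interpreter round-trip per element
t0 = time.perf_counter()
loop_result = torch.stack([a[i] + b[i] for i in range(len(a))])
loop_time = time.perf_counter() - t0

# Vectorized: a single call into optimized compiled code
t0 = time.perf_counter()
vec_result = a + b
vec_time = time.perf_counter() - t0

print(f"loop: {loop_time:.4f}s, vectorized: {vec_time:.6f}s")
```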


In question 27, why does it say that view changes the shape of a tensor, instead of saying that it can also change the dimension (rank) of a tensor?
The rank is the number of dimensions or axes a tensor has, while the shape is the size of each axis.

ten = torch.rand(2, 2, 4)
ten2 = ten.view(-1, 8)
# ten2:
# tensor([[0.7715, 0.2103, 0.0636, 0.5282, 0.7900, 0.3913, 0.6638, 0.5870],
#         [0.9369, 0.9811, 0.1984, 0.9920, 0.2802, 0.4329, 0.1696, 0.8414]])
# ten2.shape == torch.Size([2, 8])

It went from 3 dimensions to 2 dimensions.

So the answer should be: it changes the shape and/or the rank of a tensor without changing its contents. Right?