Fastbook Chapter 4 questionnaire solutions (wiki)

Thanks for pointing that out. Sorry for the mistake!


Both are from PyTorch. F is torch.nn.functional (usually imported as F).


“An nn.Module is actually an OO wrapper around the functional interface that contains a number of utility methods, like eval() and parameters(), and it automatically creates the parameters of the modules for you.
You can use the functional interface whenever you want, but that requires you to define the weights by hand.”
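To make the distinction concrete, here is a minimal sketch contrasting the two interfaces (not from the quoted post; it assumes the usual import of torch.nn.functional as F):

import torch
import torch.nn as nn
import torch.nn.functional as F

# Module interface: nn.Linear creates and registers its own weight and bias,
# and utility methods like .parameters() and .eval() come for free.
lin = nn.Linear(10, 5)
x = torch.randn(2, 10)
out1 = lin(x)
print(len(list(lin.parameters())))   # 2: weight and bias

# Functional interface: the same computation, but you define and
# keep track of the weights yourself.
w = torch.randn(5, 10, requires_grad=True)   # shape (out_features, in_features)
b = torch.zeros(5, requires_grad=True)
out2 = F.linear(x, w, b)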


Where did you find that it says “additional function that provides non-linearity”?
I thought it was the “only” function.

Hi ilovescience, hope you're having a wonderful day!

Thank you for the outstanding effort you put in to answering the questionnaires.

Although I am still answering the questions myself to help my own understanding, I find your efforts a wonderful help to check against when I finish my answers.

I would help more but you're way quicker than me.

Cheers mrfabulous1 :smiley: :smiley:


Fixed! I got confused for a moment there :slight_smile:

I meant it is an additional function that is part of the neural network (apart from y=mx+b). Is it clearer now?

Thanks for the feedback.

Thanks for your feedback. I am glad to hear that people are appreciative of this work :slight_smile:

Haha. I am slowing down now, and getting quite busy with other stuff. I haven’t really gotten a chance to touch the questionnaire for the previous lesson over here :slightly_frowning_face: . I will get back to it this week, but if you want to help, feel free to answer a few questions you think you know the answer to :slight_smile:


If you are having trouble with the further research questions at the end of the chapter, I’ve tackled them in this blogpost.


Update

I figured out the issue: the labels I created were being indexed incorrectly when the loss function was applied. I’ve updated the code inline and kept my mistake visible for others to see.

If anyone has thoughts on how I can replace the for loops with broadcasting, or whether that’s even a good idea, let me know!

Original Question

Hey @davidsalazarvergara

I’m wondering if you tried to implement all the code from scratch? I tried to and I think I’m tripping up somewhere in the definition of the loss function and/or the metric. Any help will be much appreciated!

Thanks a lot!
Adi

def create_xy(path):
  inputs = []
  targets = []
  for folder in path.ls().sorted():
    num = int(str(folder).split('/')[-1]) # not needed
    count = 0 #initialise count as zero  
    folder_path = path/'{}'.format(num)
    tensors = [tensor(Image.open(o)) for o in folder.ls().sorted()]
    stacked_tensor = (torch.stack(tensors).float()/255).view(-1, 28*28)
    inputs.append(stacked_tensor)
    target = tensor([count]*len(folder.ls().sorted())).unsqueeze(1) # replaced num with count
    targets.append(target)
    count += 1 # increment count
 
  x = torch.cat(inputs)
  y = torch.cat(targets)

  return x,y

train_x, train_y = create_xy(training) #created tensors from training data
test_x, test_y = create_xy(testing) #created tensors from test data

train_dset = list(zip(train_x,train_y)) #create training dataset
test_dset = list(zip(test_x,test_y)) #create test dataset

train_dl = DataLoader(train_dset, batch_size=256, shuffle=True) #create training dataloader
test_dl = DataLoader(test_dset, batch_size=256, shuffle=False) #create test dataloader

def init_params(size, std=1.0): 
  return (torch.randn(size)*std).requires_grad_()

# initialise weights and biases for each of the linear layers

w1 = init_params((28*28,30))

b1 = init_params(30)

w2 = init_params((30,10)) # 10 final activations

b2 = init_params(10) # 10 final activations

params = w1,b1,w2,b2

# The simple net: two linear layers with a ReLU non-linearity in between
def simple_net(xb): 
    res = xb@w1 + b1 # first linear layer that performs matrix multiplication and creates a set of activations
    res = res.max(tensor(0.0)) # non linear ReLU layer takes activations as inputs and makes all negative values zero
    res = res@w2 + b2 # second linear layer takes inputs from ReLU and performs another matrix multiplication and creates activations
    return res

def cross_entropy_loss(predictions, targets):
  sm_acts = torch.softmax(predictions, dim=1)
  idx = range(len(predictions))
  res = -sm_acts[idx, targets].mean()
  return res

def calc_grad(xb, yb, model):
    preds = model(xb)
    loss = cross_entropy_loss(preds, yb)
    loss.backward()

lr = 0.01

def train_epoch(model, lr, params):
    for xb,yb in train_dl:
        calc_grad(xb, yb, model)
        for p in params:
            p.data -= p.grad*lr
            p.grad.zero_()

def batch_accuracy(xb, yb):
    preds = torch.softmax(xb, dim=1)
    accuracy = torch.argmax(preds, dim=1) == yb
    return accuracy.float().mean()

def validate_epoch(model):
    accs = [batch_accuracy(model(xb), yb) for xb,yb in test_dl]
    return round(torch.stack(accs).mean().item(), 4)

for i in range(20):
    train_epoch(simple_net, lr, params)
    print(validate_epoch(simple_net), end=' ')

I have spent some time working on building a model for the full MNIST problem; my full code is here.

While trying not to use fastai/pytorch built-in stuff, I built my own loss function, in which I tried to generalize what was done during the lesson:

def myloss(predictions, targets):

  if targets.ndim == 1:
    targets = targets.unsqueeze(1)
  
  targets_encoded = torch.zeros(len(targets), 10)
  targets_encoded.scatter_(1, targets, 1)

  return torch.where( targets_encoded==1, 1-predictions, predictions ).mean()

Here I one-hot encode the targets, e.g. 3 becomes tensor([0,0,0,1,0,0,0,0,0,0]) and then apply the same logic as in the lesson. Further down in the code I also test it on a few examples and it indeed behaves as expected.
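For reference, here is a minimal check of what that scatter_ call does for the example above (variable names are just for illustration):

import torch

t = torch.tensor([[3]])        # one target, the digit 3, as a column tensor
one_hot = torch.zeros(1, 10)
one_hot.scatter_(1, t, 1)      # write a 1 at column 3 of row 0
print(one_hot)                 # tensor([[0., 0., 0., 1., 0., 0., 0., 0., 0., 0.]])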

Nevertheless I see that when training the model, the accuracy increases at first but then drops. Here is a plot showing this behaviour, compared to an identical model using built-in cross entropy as loss:

[Plot: accuracy over training for the model trained with myloss vs. an identical model trained with built-in cross entropy]

Digging a bit deeper into what happens, it turns out that myloss is actually pushing all the predictions towards 0, instead of pushing the prediction corresponding to the target towards 1. See the following:

Predictions on a few 0-images from the model trained with myloss:

tensor([[3.9737e-06, 2.3754e-05, 3.7458e-06, 2.1279e-06, 3.1777e-06, 4.1798e-06, 3.5480e-06, 4.4862e-06, 2.9011e-06, 3.1170e-06],
        [3.2510e-05, 1.5322e-04, 2.9165e-05, 2.1045e-05, 2.8467e-05, 3.2954e-05, 3.0909e-05, 3.4809e-05, 2.4691e-05, 2.8036e-05],
        [1.4162e-10, 4.1921e-09, 8.6994e-11, 4.9182e-11, 9.4531e-11, 1.4529e-10, 1.0986e-10, 2.0410e-10, 9.2959e-11, 7.7468e-11],
        [4.8831e-05, 1.5990e-04, 5.0114e-05, 2.7525e-05, 3.4216e-05, 3.3996e-05, 5.0872e-05, 4.6151e-05, 2.8764e-05, 2.9847e-05],
        [1.3763e-05, 6.3028e-05, 1.2435e-05, 8.1820e-06, 1.0536e-05, 1.3688e-05, 1.3276e-05, 1.5969e-05, 8.7765e-06, 1.0267e-05]], grad_fn=<SigmoidBackward>)

predictions on the same 0-images from the model trained with the built-in cross entropy:

tensor([[9.9997e-01, 1.9660e-10, 2.8802e-05, 7.1700e-05, 3.9799e-11, 2.1466e-04, 1.3326e-05, 1.7063e-04, 6.1224e-06, 5.6696e-06],
        [9.9806e-01, 7.7187e-10, 3.2351e-04, 1.9475e-05, 2.1741e-06, 1.4926e-01, 2.7456e-04, 2.0312e-05, 7.7267e-03, 9.0754e-05],
        [7.1219e-01, 4.2656e-10, 2.6540e-09, 6.5700e-04, 9.7222e-09, 4.9841e-04, 3.9048e-07, 5.9277e-09, 6.7378e-04, 6.5973e-07],
        [9.9956e-01, 7.8313e-11, 1.4271e-01, 1.7383e-03, 2.3370e-09, 2.2956e-05, 2.3185e-03, 1.6754e-06, 4.0645e-05, 7.0746e-09],
        [9.9985e-01, 4.5725e-10, 6.3417e-03, 1.8504e-04, 3.7823e-11, 1.4808e-04, 5.6004e-05, 4.3960e-06, 6.0555e-03, 2.3748e-04]], grad_fn=<SigmoidBackward>)

As you can see, in the first batch of predictions all the numbers are basically 0, while in the second the first column (corresponding to the 0-images in one-hot encoding) is basically 1.

Now it is clear that myloss is not behaving as expected, but I can’t really understand why. Can someone give some help? I have spent so much time looking at it and testing it that I have kinda run out of ideas …

Answer to my own post :slight_smile:

After having started the next chapter, I got to know about softmax and its details. I then implemented it in my own code in myloss2:

def myloss2(predictions, targets):

  sm = torch.softmax(predictions, dim=1)
  idx = tensor(range(len(targets)))

  return sm[idx, targets].mean()

and surprise surprise … the result was still the same as above! The model trained with myloss2 had exactly the same behaviour as the one trained with myloss!!!
That’s a shame, because I was very optimistic about using softmax.

Then I went a step further and simply replaced torch.softmax with torch.log_softmax and sm[idx, targets].mean() with F.nll_loss(...).mean() and voila! The model trained with the log version of myloss2 and the model trained with the built-in cross entropy give equivalent results!

So even in my tiny and simple model I was already running into precision problems, and the log got me out of it! Long live the log!
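For anyone following along, here is a sketch of what the log version might look like (the name myloss3 and the assumption of 1-D integer targets are mine, not the original notebook's):

import torch
import torch.nn.functional as F

def myloss3(predictions, targets):
    # log_softmax is numerically much stabler than softmax followed by log
    log_sm = torch.log_softmax(predictions, dim=1)
    # nll_loss picks out the log-probability of each sample's target class,
    # negates it, and averages over the batch (reduction='mean' is the default)
    return F.nll_loss(log_sm, targets)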


I ended up writing a blogpost about this problem I had. I’d say it was a huge learning experience for me :slight_smile:


Hey! Can you guys explain to me what’s the problem with having 0 as an output? If we add this bias to a zero, then every output that would have been zero would basically become the bias, wouldn’t it?
Also, why do we have to use the “slope-intercept form” to represent the parameters? I simply can’t see the relation.

What are the “bias” parameters in a neural network? Why do we need them?

Without the bias parameters, if the input is zero, the output will always be zero. Therefore, using bias parameters adds additional flexibility to the model.
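A minimal sketch illustrating the point (the layer sizes are arbitrary):

import torch
import torch.nn as nn

x = torch.zeros(1, 3)                    # an all-zero input
no_bias = nn.Linear(3, 2, bias=False)
with_bias = nn.Linear(3, 2, bias=True)

print(no_bias(x))    # always exactly zero, whatever the weights are
print(with_bias(x))  # equals the bias vector, so the output can be non-zero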

Regarding Question 8. How can you apply a calculation on thousands of numbers at once, many thousands of times faster than a Python loop?

I am assuming the answer is broadcasting. Can someone let me know if there are other operations too?

Thanks,
Karthikeyan Muthu.

Hello,

In question 27, why does it say that view changes the shape of the tensor instead of saying that it can also change the number of dimensions (rank) of the tensor?
Considering that the rank is the number of dimensions or axes that the tensor has, and the shape is the size of each axis of the tensor.

   ten = torch.rand(2, 2, 4)
   ten2 = ten.view(-1, 8)
   # ten2 is:
   # tensor([[0.7715, 0.2103, 0.0636, 0.5282, 0.7900, 0.3913, 0.6638, 0.5870],
   #         [0.9369, 0.9811, 0.1984, 0.9920, 0.2802, 0.4329, 0.1696, 0.8414]])
   # ten2.shape is torch.Size([2, 8])

It went from 3 dimensions to 2 dimensions.

So the answer should be: it changes the shape and/or the rank of a tensor without changing its contents. Right?

In 04mnist.ipynb Jeremy mentioned that two linear layers and a non-linearity can approximate pretty much any function. Can someone shed more light on this, as I find it hard to understand?

My 2 cents: Python is an inherently slow language compared to Rust/C/C++/Java. You can verify this yourself by writing a big loop over millions of numbers, once in C and once in Python.

Since Python is slow, libraries such as PyTorch/NumPy that are written in C provide a way to access them from Python through language bindings. So when you call the PyTorch equivalent of a Python loop, you benefit from two optimizations:

  1. The performance gain of C over Python.
  2. These libraries may also be written to exploit the GPU, which can be far faster than a CPU for this kind of work.

Together this can give a speed-up of many orders of magnitude. This is what I think @jeremy meant in his lecture.
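To make that concrete, here is a rough timing sketch (exact numbers will vary by machine, and this only shows the compiled/vectorized side, not the GPU):

import time
import torch

nums = torch.randn(1_000_000)

# plain Python loop
start = time.time()
total = 0.0
for n in nums.tolist():
    total += n * n
loop_time = time.time() - start

# the same calculation as a single vectorized tensor operation
start = time.time()
total_vec = (nums * nums).sum().item()
vec_time = time.time() - start

print(f"loop: {loop_time:.4f}s, vectorized: {vec_time:.4f}s")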

Cheers,
Chetan

Hey, regarding the 1st question I just wanted to point out that on the RGB color scale 0 represents black, not white (#000000 is black). You get white by setting all the channels to 255.

Wikipedia similarly suggests that for greyscale, black is 0 and white is 255. I suppose implementations can vary?

Hi, I wanted to post my work for question 2 of the further research question to get some general feedback on my code and process.

Learner Implementation

class MyOwnLearner:

  def __init__(self,
               data,
               model,
               optimizer,
               loss,
               error,
               val_data):

    self.data = data
    self.model = model
    self.optimizer = optimizer
    self.loss = loss
    self.error = error
    self.val_data = val_data

  def fit(self, epochs, lr):

    self.optimizer = self.optimizer(self.model.parameters(), lr)

    for e in range(epochs):

      for xb, yb in self.data:
        predictions = self.model(xb)
        loss = self.loss(predictions, yb)
        loss.backward()
        self.optimizer.step()
        self.optimizer.zero_grad()

      val_accuracy = self.accuracy(data=self.val_data)
      train_accuracy = self.accuracy(data=self.data)
      print("train accuracy:", train_accuracy, "val accuracy:", val_accuracy)

  def accuracy(self, data):
    accuracy_list = [self.error(self.model(xb), yb) for xb, yb in data]
    return round(torch.stack(accuracy_list).mean().item(), 4)

Data Loading

path = untar_data(URLs.MNIST)

digits = range(0,10)
train_x = []
train_y = []
val_x = []
val_y = []
for i in digits:
  images = (path/'training'/str(i)).ls().sorted()

  val_sample = int(len(images)*.2)
  val_images = random.sample(images, val_sample)
  train_images = [image for image in images if image not in val_images]

  val_images = torch.stack([tensor(Image.open(image)) for image in val_images]).float()/255
  train_images = torch.stack([tensor(Image.open(image)) for image in train_images]).float()/255

  # out = np.zeros(10)
  # out[i] = 1
  # val_out = torch.stack([tensor(out) for _ in range(len(val_images))])
  # train_out = torch.stack([tensor(out) for _ in range(len(train_images))])
  val_out = torch.stack([tensor(i)]*len(val_images))
  train_out = torch.stack([tensor(i)]*len(train_images))

  train_x.append(train_images)
  train_y.append(train_out)

  val_x.append(val_images)
  val_y.append(val_out)

  print(i, 'done')
  # break

x_train = (torch.cat(train_x).float()).view(-1,28*28)
y_train = torch.cat(train_y)
x_val = (torch.cat(val_x).float()).view(-1,28*28)
y_val = torch.cat(val_y)

train_dl = DataLoader(list(zip(x_train, y_train)), batch_size=256, shuffle=True)
valid_dl = DataLoader(list(zip(x_val, y_val)), batch_size=256, shuffle=True)

Modeling

nnet = nn.Sequential(
    nn.Linear(28*28, 200),
    nn.ReLU(),
    nn.Linear(200, 50),
    nn.ReLU(),
    nn.Linear(50, 10)
)

learner = MyOwnLearner(data=train_dl,
                       model=nnet,
                       optimizer=SGD,
                       loss=nn.CrossEntropyLoss(),
                       error=accuracy,
                       val_data=valid_dl)

learner.fit(epochs=10, lr=0.1)

Results

train accuracy: 0.8655 val accuracy: 0.869
train accuracy: 0.8851 val accuracy: 0.8888
train accuracy: 0.913 val accuracy: 0.914
train accuracy: 0.9288 val accuracy: 0.9276
train accuracy: 0.9389 val accuracy: 0.9365
train accuracy: 0.9449 val accuracy: 0.9412
train accuracy: 0.9472 val accuracy: 0.943
train accuracy: 0.9544 val accuracy: 0.9476
train accuracy: 0.9596 val accuracy: 0.9534
train accuracy: 0.9635 val accuracy: 0.9578

Given the model architecture I think the performance seems reasonable; obviously with a CNN architecture the performance would be much better. An interesting thing I’ve seen while reading all of these implementations of neural nets is that they start wide and then narrow down to the output layer. I kind of imagine it like a funnel. I haven’t seen anything about starting narrow, widening, and then narrowing back down, kind of like a diamond shape I guess. Any thoughts about this?
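For concreteness, the two shapes being contrasted might look like this (the layer sizes are hypothetical, just to illustrate funnel vs. diamond):

import torch.nn as nn

# "funnel": hidden layers go wide -> narrow, as in the model above
funnel = nn.Sequential(
    nn.Linear(28*28, 200), nn.ReLU(),
    nn.Linear(200, 50), nn.ReLU(),
    nn.Linear(50, 10),
)

# "diamond": hidden layers go narrow -> wide -> narrow
diamond = nn.Sequential(
    nn.Linear(28*28, 50), nn.ReLU(),
    nn.Linear(50, 200), nn.ReLU(),
    nn.Linear(200, 10),
)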

  • Typo, question 1

    • “typicall” to typically
  • Question 3

    • “if its distance to the archetypical 3 is lower than two the archetypical 7.”
      to
      if its distance to the archetypical 3 is lower than to the archetypical 7.
  • Question 7

    • and mean absolute difference (MAE)
      to
      mean absolute error (MAE)
    • According to the wiki, it is a different thing?
      https://en.wikipedia.org/wiki/Mean_absolute_error
      The book mentions both as the same thing in chapter 4,
      so… is it also an error in the book?
  • Question 14

    • Merging the information given by the original answer and the book:
  1. Initialize the weights – Random values often work best
  2. Predict using the weights – This is done on the training set, one mini-batch at a time
  3. Calculate the loss – The average loss over the mini-batch is calculated, based on the predictions
  4. Calculate the gradient – This is an approximation of how the weights need to change in order to minimize the loss function
  5. Step (that is, change) all the weights based on the calculated gradient
  6. Go back to step 2, and repeat the process
  7. Stop – In practice, this is when either the process has exceeded a time constraint or the model’s losses and metrics have stopped improving
  • Question 26
    • For clarity
def func(a,b):
   return list(zip(a,b))
  • Question 36
    • The second sentence is a bit misleading?

F.relu is a Python function for the relu activation function. On the other hand, nn.ReLU is a PyTorch module.
When using nn.Sequential, PyTorch requires us to use the module version.
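A small sketch of that distinction (the model here is only an illustration):

import torch
import torch.nn as nn
import torch.nn.functional as F

# module version: required when building a model with nn.Sequential
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

# functional version: called directly on a tensor, e.g. inside a forward pass
x = torch.randn(2, 4)
activated = F.relu(x)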
