Thanks for pointing that out. Sorry for the mistake!
Both are from PyTorch. F is torch.nn.functional.
"A nn.Module is actually a OO wrapper around the functional interface, that contains a number of utility methods, like eval() and parameters(), and it automatically creates the parameters of the modules for you.
you can use the functional interface whenever you want, but that requires you to define the weights by hand."
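To make the quote concrete, here is a minimal sketch (my own example, not from the quoted source) comparing nn.Linear, which creates its weights for you, with F.linear, where the weights are passed in by hand. The tensor sizes are arbitrary, and the hand-defined weights are simply copied from the module so the two outputs can be compared.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(4, 3)  # batch of 4 inputs with 3 features each

# Module version: nn.Linear creates and registers weight and bias for us.
layer = nn.Linear(3, 2)
out_module = layer(x)

# Functional version: we hold the weights ourselves and pass them in.
weight = layer.weight.detach().clone()  # shape (out_features, in_features)
bias = layer.bias.detach().clone()
out_functional = F.linear(x, weight, bias)

same = torch.allclose(out_module, out_functional)
n_params = len(list(layer.parameters()))  # weight and bias
```

Both calls compute the same thing; the module just keeps track of the parameters for you.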
Where did you find that it says "additional function that provides non-linearity"?
I thought it was the "only" function.
Hi ilovescience, hope you're having a wonderful day!
Thank you for the outstanding effort you put into answering the questionnaires.
Although I am still answering the questions myself to help my own understanding, I find your efforts a wonderful help for checking my answers when I finish.
I would help more, but you're way quicker than me.
Cheers mrfabulous1
Fixed! I got confused for a moment there
I meant it is an additional function that is part of the neural network (apart from y=mx+b). Is it clearer now?
Thanks for the feedback.
Thanks for your feedback. I am glad to hear that people are appreciative of this work
Haha. I am slowing down now, and getting quite busy with other stuff. I haven't really gotten a chance to touch the questionnaire for the previous lesson over here. I will get back to it this week, but if you want to help, feel free to answer a few questions you think you know the answer to.
If you are having trouble with the further research questions at the end of the chapter, I've tackled them in this blogpost.
Update
I figured out the issue: the labels I created were indexing incorrectly when the loss function was applied. I've updated the code inline and kept my mistake for others to see.
If anyone has thoughts on how I can replace the for loops with broadcasting, or on whether that's even a good idea, let me know!
Original Question
I'm wondering if you tried to implement all the code from scratch? I tried to, and I think I'm tripping up somewhere in the definition of the loss function and/or the metric. Any help will be much appreciated!
Thanks a lot!
Adi
from fastai.vision.all import *  # provides tensor, Image, DataLoader and Path.ls

def create_xy(path):
    inputs = []
    targets = []
    count = 0 # initialise count as zero, once, before the loop
    for folder in path.ls().sorted():
        num = int(str(folder).split('/')[-1]) # not needed
        folder_path = path/'{}'.format(num)
        tensors = [tensor(Image.open(o)) for o in folder.ls().sorted()]
        stacked_tensor = (torch.stack(tensors).float()/255).view(-1, 28*28)
        inputs.append(stacked_tensor)
        # replaced num with count; labels are kept rank-1 so the loss picks one value per row
        target = tensor([count]*len(tensors))
        targets.append(target)
        count += 1 # increment count
    x = torch.cat(inputs)
    y = torch.cat(targets)
    return x,y

train_x, train_y = create_xy(training) # created tensors from training data
test_x, test_y = create_xy(testing) # created tensors from test data
train_dset = list(zip(train_x,train_y)) # create training dataset
test_dset = list(zip(test_x,test_y)) # create test dataset
train_dl = DataLoader(train_dset, batch_size=256, shuffle=True) # create training dataloader
test_dl = DataLoader(test_dset, batch_size=256, shuffle=False) # create test dataloader

def init_params(size, std=1.0):
    return (torch.randn(size)*std).requires_grad_()

# initialise weights and biases for each of the linear layers
w1 = init_params((28*28,30))
b1 = init_params(30)
w2 = init_params((30,10)) # 10 final activations
b2 = init_params(10) # 10 final activations
params = w1,b1,w2,b2

# Two linear layers with a ReLU in between
def simple_net(xb):
    res = xb@w1 + b1 # first linear layer: matrix multiplication that creates a set of activations
    res = res.max(tensor(0.0)) # non-linear ReLU layer: replaces all negative activations with zero
    res = res@w2 + b2 # second linear layer: another matrix multiplication on the ReLU outputs
    return res

def cross_entropy_loss(predictions, targets):
    sm_acts = torch.softmax(predictions, dim=1)
    idx = range(len(predictions))
    res = -sm_acts[idx, targets].mean() # negative mean softmax probability of the correct class (no log is taken)
    return res

def calc_grad(xb, yb, model):
    preds = model(xb)
    loss = cross_entropy_loss(preds, yb)
    loss.backward()

lr = 0.01

def train_epoch(model, lr, params):
    for xb,yb in train_dl:
        calc_grad(xb, yb, model)
        for p in params:
            p.data -= p.grad*lr
            p.grad.zero_()

def batch_accuracy(xb, yb):
    preds = torch.softmax(xb, dim=1)
    accuracy = torch.argmax(preds, dim=1) == yb
    return accuracy.float().mean()

def validate_epoch(model):
    accs = [batch_accuracy(model(xb), yb) for xb,yb in test_dl]
    return round(torch.stack(accs).mean().item(), 4)

for i in range(20):
    train_epoch(simple_net, lr, params)
    print(validate_epoch(simple_net), end=' ')
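On the question of replacing the for loops: opening the image files probably still needs a loop, but the label tensor could be built in one call. A small sketch, assuming hypothetical per-folder image counts (the `counts` values below are made up for illustration):

```python
import torch

# Hypothetical number of images found in each digit folder (0 to 9).
counts = torch.tensor([3, 2, 4, 1, 2, 3, 2, 1, 2, 2])

# Loop version, mirroring the per-folder counter used in create_xy.
labels_loop = torch.cat(
    [torch.tensor([digit] * int(n)) for digit, n in enumerate(counts)]
)

# Vectorised version: repeat each digit as many times as its folder has images.
labels_vec = torch.repeat_interleave(torch.arange(10), counts)
```

Whether this is worth doing is a judgment call; for ten folders the loop is not a bottleneck.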
I have spent some time working on building a model for the full MNIST problem; my full code is here.
While trying not to use fastai/PyTorch built-in stuff, I built my own loss function, in which I tried to generalize what was done during the lesson:
def myloss(predictions, targets):
    if targets.ndim == 1:
        targets = targets.unsqueeze(1)
    targets_encoded = torch.zeros(len(targets), 10)
    targets_encoded.scatter_(1, targets, 1)
    return torch.where( targets_encoded==1, 1-predictions, predictions ).mean()
Here I one-hot encode the targets, e.g. 3 becomes tensor([0,0,0,1,0,0,0,0,0,0]), and then apply the same logic as in the lesson. Further down in the code I also test it on a few examples and it indeed behaves as expected.
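As a sanity check on the one-hot step, here is a small sketch (the example targets are made up) showing that the scatter_ approach gives the same encoding as PyTorch's built-in F.one_hot:

```python
import torch
import torch.nn.functional as F

targets = torch.tensor([3, 0, 9])  # made-up labels

# scatter_ writes a 1 into the column given by each target, row by row.
encoded = torch.zeros(len(targets), 10)
encoded.scatter_(1, targets.unsqueeze(1), 1)

# F.one_hot builds the same encoding (as integers, hence the .float()).
encoded_builtin = F.one_hot(targets, num_classes=10).float()

print(encoded[0])  # row for the target 3
```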
Nevertheless I see that when training the model, the accuracy increases at first but then drops. Here is a plot showing this behaviour, compared to an identical model using built-in cross entropy as loss:
Digging a bit deeper into what happens, it turns out that myloss is actually pushing all the predictions to be 0, instead of having the prediction corresponding to the target tend towards one. See the following:
Predictions on a few 0-images from the model trained with myloss:
tensor([[3.9737e-06, 2.3754e-05, 3.7458e-06, 2.1279e-06, 3.1777e-06, 4.1798e-06, 3.5480e-06, 4.4862e-06, 2.9011e-06, 3.1170e-06],
[3.2510e-05, 1.5322e-04, 2.9165e-05, 2.1045e-05, 2.8467e-05, 3.2954e-05, 3.0909e-05, 3.4809e-05, 2.4691e-05, 2.8036e-05],
[1.4162e-10, 4.1921e-09, 8.6994e-11, 4.9182e-11, 9.4531e-11, 1.4529e-10, 1.0986e-10, 2.0410e-10, 9.2959e-11, 7.7468e-11],
[4.8831e-05, 1.5990e-04, 5.0114e-05, 2.7525e-05, 3.4216e-05, 3.3996e-05, 5.0872e-05, 4.6151e-05, 2.8764e-05, 2.9847e-05],
[1.3763e-05, 6.3028e-05, 1.2435e-05, 8.1820e-06, 1.0536e-05, 1.3688e-05, 1.3276e-05, 1.5969e-05, 8.7765e-06, 1.0267e-05]], grad_fn=<SigmoidBackward>)
Predictions on the same 0-images from the model trained with the built-in cross entropy:
tensor([[9.9997e-01, 1.9660e-10, 2.8802e-05, 7.1700e-05, 3.9799e-11, 2.1466e-04, 1.3326e-05, 1.7063e-04, 6.1224e-06, 5.6696e-06],
[9.9806e-01, 7.7187e-10, 3.2351e-04, 1.9475e-05, 2.1741e-06, 1.4926e-01, 2.7456e-04, 2.0312e-05, 7.7267e-03, 9.0754e-05],
[7.1219e-01, 4.2656e-10, 2.6540e-09, 6.5700e-04, 9.7222e-09, 4.9841e-04, 3.9048e-07, 5.9277e-09, 6.7378e-04, 6.5973e-07],
[9.9956e-01, 7.8313e-11, 1.4271e-01, 1.7383e-03, 2.3370e-09, 2.2956e-05, 2.3185e-03, 1.6754e-06, 4.0645e-05, 7.0746e-09],
[9.9985e-01, 4.5725e-10, 6.3417e-03, 1.8504e-04, 3.7823e-11, 1.4808e-04, 5.6004e-05, 4.3960e-06, 6.0555e-03, 2.3748e-04]], grad_fn=<SigmoidBackward>)
As you can see, in the first bunch of predictions all the numbers are basically 0, while in the second the values in the first column (corresponding to the 0-images in one-hot encoding) are basically 1.
Now it is clear that myloss is not behaving as expected, but I can't really understand why. Can someone give some help? I have spent so much time looking at it and testing it that I have kind of run out of ideas ...
Answer to my own post
After having started the next chapter, I got to know about softmax and its details. I then implemented it in my own code in myloss2:
def myloss2(predictions, targets):
    sm = torch.softmax(predictions, dim=1)
    idx = tensor(range(len(targets)))
    return sm[idx, targets].mean()
and surprise surprise ... the result was still the same as above! The model trained with myloss2 had exactly the same behaviour as the one trained with myloss!!!
That's a shame, because I was very optimistic about using softmax.
Then I went a step further and simply replaced torch.softmax with torch.log_softmax and sm[idx, targets].mean() with F.nll_loss(...).mean(), and voila! The model trained with the log version of myloss2 and the model trained with the built-in cross entropy give equivalent results!
So, also in my tiny and simple model I was already getting precision problems and the log got me out of it! Long live the log!
I ended up writing a blogpost about this problem I had. I'd say it's a huge learning for me.
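For anyone following along, here is a minimal sketch (made-up logits and labels, not the model from the post) of how the log version lines up with the built-in loss. One thing it makes visible: F.nll_loss negates the selected log-probabilities, whereas myloss2 as posted returns the mean softmax probability with no minus sign, so the sign may be worth checking alongside precision.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 10)              # made-up raw model outputs for 5 images
targets = torch.tensor([0, 3, 9, 1, 7])  # made-up class labels
idx = torch.arange(len(targets))

# Pick each row's log-probability for its target class, then negate and average.
log_sm = torch.log_softmax(logits, dim=1)
manual = -log_sm[idx, targets].mean()

# F.nll_loss does the same pick-negate-average on log-probabilities.
via_nll = F.nll_loss(log_sm, targets)

# F.cross_entropy applies log_softmax and nll_loss in one call.
builtin = F.cross_entropy(logits, targets)
```

All three values agree, which matches the observation that the log version and the built-in cross entropy train equivalently.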
Hey! Can you guys explain to me what the problem is with having 0 as an output? If we add this bias to a zero, then every output that would have been zero would basically become the bias, wouldn't it?
Also, why do we have to use the "slope-intercept form" to represent the parameters? I simply can't see the relation.
What are the "bias" parameters in a neural network? Why do we need them? Without the bias parameters, if the input is zero, the output will always be zero. Therefore, using bias parameters adds additional flexibility to the model.
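A quick sketch of that answer (layer sizes are arbitrary): a linear layer computes y = x@w + b, the same shape as the slope-intercept form y = mx + b, with b playing the role of the intercept. With a zero input, the layer without a bias can only output zero, while the layer with a bias outputs the bias itself.

```python
import torch
import torch.nn as nn

zero_input = torch.zeros(1, 4)

no_bias = nn.Linear(4, 3, bias=False)
with_bias = nn.Linear(4, 3, bias=True)

out_no_bias = no_bias(zero_input)      # zeros times any weights is still zero
out_with_bias = with_bias(zero_input)  # equals the bias vector itself
```

So yes, the output "becomes the bias" for a zero input, and that is the point: the model is no longer forced through the origin.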
Regarding Question 8: How can you apply a calculation on thousands of numbers at once, many thousands of times faster than a Python loop?
I am assuming the answer is broadcasting. Can someone let me know if there are other operations too?
Thanks,
Karthikeyan Muthu.
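As an illustration of broadcasting for this question, here is a small sketch with a made-up stack of random "images": subtracting one (28, 28) image from a (1000, 28, 28) stack works without writing a loop, and gives the same numbers as the loop version.

```python
import torch

torch.manual_seed(0)
images = torch.rand(1000, 28, 28)  # made-up stack of 1000 "images"
mean_image = images.mean(dim=0)    # shape (28, 28)

# Python loop: one image at a time.
dists_loop = torch.stack([(img - mean_image).abs().mean() for img in images])

# Broadcasting: (1000, 28, 28) - (28, 28) expands mean_image across the stack.
dists_broadcast = (images - mean_image).abs().mean(dim=(-1, -2))
```

Elementwise tensor operations in general (not only broadcasting) run in compiled code rather than in the Python interpreter, which is where the speed comes from.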
Hello,
In question number 27, why does it say that view changes the shape of the tensor, instead of saying that it can also change the dimension (rank) of a tensor?
I ask because the rank of a tensor is its number of dimensions (axes), while the shape is the size of each axis.
ten = torch.rand(2, 2, 4)
ten2 = ten.view(-1, 8)
The output of ten2 is:
tensor([[0.7715, 0.2103, 0.0636, 0.5282, 0.7900, 0.3913, 0.6638, 0.5870],
        [0.9369, 0.9811, 0.1984, 0.9920, 0.2802, 0.4329, 0.1696, 0.8414]])
ten2.shape = torch.Size([2, 8])
It went from 3 dimensions to 2 dimensions.
So the answer should be: It changes the shape and/or the dimension of a Tensor without changing its contents. Right?
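The same check can be written out explicitly. This sketch reuses the shapes from the post and confirms that the rank changes while the contents and the underlying storage stay the same:

```python
import torch

ten = torch.rand(2, 2, 4)
ten2 = ten.view(-1, 8)

rank_before, rank_after = ten.ndim, ten2.ndim  # number of axes before and after
same_values = torch.equal(ten.flatten(), ten2.flatten())
shares_memory = ten.data_ptr() == ten2.data_ptr()  # view does not copy the data
```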
In 04mnist.ipynb, Jeremy mentioned that two linear layers and a non-linearity can approximate pretty much any function. Can someone shed more light on it, as I find it hard to understand?
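One way to get a feel for this is to try it on a small example. The sketch below is my own illustration, not code from the notebook: it fits two linear layers with a ReLU in between to sin(x). The hidden size of 32, the Adam optimizer, and the step count are all arbitrary choices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.linspace(-3, 3, 200).unsqueeze(1)  # inputs, shape (200, 1)
y = torch.sin(x)                             # a curvy target function

# Two linear layers with a ReLU in between; 32 hidden units is an arbitrary choice.
net = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=0.01)

losses = []
for step in range(500):
    loss = ((net(x) - y) ** 2).mean()  # mean squared error
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

Each ReLU unit contributes one "kink", so the network output is a piecewise-linear curve; with more hidden units it can bend in more places and follow the target more closely. Without the ReLU, the two linear layers would collapse into a single linear layer and could only produce a straight line.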
My 2 cents: Python is inherently a slow language compared to Rust/C/C++/Java. You can verify this yourself by writing a big loop over millions of numbers, once in C and once in Python.
Now, since Python is slow, libraries such as PyTorch/NumPy, which are written in C, provide a way to access their routines from Python through language bindings. So when you call the PyTorch equivalent of a Python function, you're utilising two optimizations:
- Performance gain by using C over Python.
- These libraries may be written in a way that exploits the GPU, which can be far faster than a CPU for highly parallel numeric work.
Together these can give a speedup of several orders of magnitude. This is what I think @jeremy meant in his lecture.
Cheers,
Chetan
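If you would rather measure this than take it on faith, here is a rough timing sketch (CPU only; the array size is arbitrary and the actual numbers depend heavily on the machine, so none are shown):

```python
import time
import torch

values = torch.rand(1_000_000)

# Pure Python loop over a list of floats.
as_list = values.tolist()
start = time.perf_counter()
total_loop = 0.0
for v in as_list:
    total_loop += v * v
loop_seconds = time.perf_counter() - start

# The same sum of squares as a single PyTorch expression running in compiled code.
start = time.perf_counter()
total_torch = (values * values).sum().item()
torch_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s, torch: {torch_seconds:.4f}s")
```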
Hey, regarding the 1st question, I just wanted to point out that on the RGB color scale 0s represent black, not white: #000000 is black. You get white by setting all the colors to 255.
Wikipedia seems to similarly suggest that for greyscale also black is 0 and white is 255. I suppose implementations can vary?
Hi, I wanted to post my work for question 2 of the further research question to get some general feedback on my code and process.
Learner Implementation
class MyOwnLearner:
    def __init__(self,
                 data,
                 model,
                 optimizer,
                 loss,
                 error,
                 val_data):
        self.data = data
        self.model = model
        self.optimizer = optimizer
        self.loss = loss
        self.error = error
        self.val_data = val_data

    def fit(self, epochs, lr):
        # build the optimizer locally so that fit can be called more than once
        opt = self.optimizer(self.model.parameters(), lr)
        for e in range(epochs):
            for xb, yb in self.data:
                predictions = self.model(xb)
                loss = self.loss(predictions, yb)
                loss.backward()
                opt.step()
                opt.zero_grad()
            val_accuracy = self.accuracy(data=self.val_data)
            train_accuracy = self.accuracy(data=self.data)
            print("train accuracy:", train_accuracy, "val accuracy:", val_accuracy)

    def accuracy(self, data):
        with torch.no_grad():  # gradients are not needed for evaluation
            accuracy_list = [self.error(self.model(xb), yb) for xb, yb in data]
        return round(torch.stack(accuracy_list).mean().item(), 4)
Data Loading
import random
from fastai.vision.all import *  # provides untar_data, URLs, tensor, Image, DataLoader

path = untar_data(URLs.MNIST)
digits = range(0,10)
train_x = []
train_y = []
val_x = []
val_y = []
for i in digits:
    images = (path/'training'/str(i)).ls().sorted()
    val_sample = int(len(images)*.2)  # hold out 20% of each digit for validation
    val_images = random.sample(images, val_sample)
    train_images = [image for image in images if image not in val_images]
    val_images = torch.stack([tensor(Image.open(image)) for image in val_images]).float()/255
    train_images = torch.stack([tensor(Image.open(image)) for image in train_images]).float()/255
    # out = np.zeros(10)
    # out[i] = 1
    # val_out = torch.stack([tensor(out) for _ in range(len(val_images))])
    # train_out = torch.stack([tensor(out) for _ in range(len(train_images))])
    val_out = torch.stack([tensor(i)]*len(val_images))
    train_out = torch.stack([tensor(i)]*len(train_images))
    train_x.append(train_images)
    train_y.append(train_out)
    val_x.append(val_images)
    val_y.append(val_out)
    print(i, 'done')
    # break
x_train = (torch.cat(train_x).float()).view(-1,28*28)
y_train = torch.cat(train_y)
x_val = (torch.cat(val_x).float()).view(-1,28*28)
y_val = torch.cat(val_y)
train_dl = DataLoader(list(zip(x_train, y_train)), batch_size=256, shuffle=True)
valid_dl = DataLoader(list(zip(x_val, y_val)), batch_size=256, shuffle=True)
Modeling
nnet = nn.Sequential(
    nn.Linear(28*28, 200),
    nn.ReLU(),
    nn.Linear(200, 50),
    nn.ReLU(),
    nn.Linear(50, 10)
)
learner = MyOwnLearner(data=train_dl,
                       model=nnet,
                       optimizer=SGD,
                       loss=nn.CrossEntropyLoss(),
                       error=accuracy,
                       val_data=valid_dl)
learner.fit(epochs=10, lr=0.1)
Results
train accuracy: 0.8655 val accuracy: 0.869
train accuracy: 0.8851 val accuracy: 0.8888
train accuracy: 0.913 val accuracy: 0.914
train accuracy: 0.9288 val accuracy: 0.9276
train accuracy: 0.9389 val accuracy: 0.9365
train accuracy: 0.9449 val accuracy: 0.9412
train accuracy: 0.9472 val accuracy: 0.943
train accuracy: 0.9544 val accuracy: 0.9476
train accuracy: 0.9596 val accuracy: 0.9534
train accuracy: 0.9635 val accuracy: 0.9578
Given the model architecture, I think the performance seems reasonable. Obviously, I think that with a CNN architecture the performance would be much better. An interesting thing I've seen while reading all of these implementations of neural nets is that they start wide and then narrow down to the output layer, kind of like a funnel. I haven't seen anything about starting narrow, widening, and then narrowing back down, kind of like a diamond shape I guess. Any thoughts about this?
- Typo, question 1
  - "typicall" to "typically"
- Question 3
  - "if its distance to the archetypical 3 is lower than two the archetypical 7."
    to
    "if its distance to the archetypical 3 is lower than to the archetypical 7."
- Question 7
  - "and mean absolute difference (MAE)"
    to
    "mean absolute error (MAE)". According to Wikipedia, it is a different thing?
    https://en.wikipedia.org/wiki/Mean_absolute_error
    The book mentions both as having the same meaning in chapter 4, so... is this also an error in the book?
- Question 14
  - Merge the initial information given by the original answer and the book:
    1. Initialize the weights: random values often work best.
    2. Predict using the weights: this is done on the training set, one mini-batch at a time.
    3. Calculate the loss: the average loss over the mini-batch is calculated, based on the predictions.
    4. Calculate the gradient: this is an approximation of how the weights need to change in order to minimize the loss function.
    5. Step (that is, change) all the weights based on the calculated gradient.
    6. Go back to step 2 and repeat the process.
    7. Stop: in practice, this is either when the process exceeds a time constraint or when the model's losses and metrics stop improving.
- Question 26
  - For clarity:
    def func(a,b): return list(zip(a,b))
- Question 36
  - The second sentence is a bit misleading?
    F.relu is a Python function for the relu activation function. On the other hand, nn.ReLU is a PyTorch module. When using nn.Sequential, PyTorch requires us to use the module version.
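To illustrate the Question 36 point, here is a small sketch (the input values and layer sizes are made up) showing that F.relu and nn.ReLU compute the same thing, and that it is the module form that sits inside nn.Sequential:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([[-1.0, 0.5, 2.0]])

# Functional form: a plain function call on a tensor.
out_functional = F.relu(x)

# Module form: an object that can sit inside nn.Sequential as a layer.
net = nn.Sequential(nn.Linear(3, 3), nn.ReLU())
relu_module = net[1]
out_module = relu_module(x)
```

The functional form is still usable inside the forward method of your own nn.Module; it is only nn.Sequential that expects each entry to be a module.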