Lesson 3 official topic

That would be great, and there’s been a lot of work from researchers trying to achieve this goal. It’s hard (or maybe impossible) to do in Python, but the Haskell lib hasktorch already does it IIRC. There was also a prototype made for Swift, although it never really went further.

2 Likes

I think that’s a bit too advanced for this part of the course, and it’s not really specific to deep learning – any docker tutorial should be sufficient for learning how to deploy a fastai project AFAICT.

(Also, I try to avoid docker wherever possible – and have found I almost never need it. IMO it’s over-used…)

4 Likes

This is code from @strickvl's blog post on Fashion MNIST, which closely resembles the examples in fastbook chapter 4. If I wanted to use a similar function to calculate accuracy across all 10 classes in MNIST, how would it look? Isn't this part of step 3, calculating the losses?

def is_dress(x):
    return fashion_mnist_distance(x, mean_training_dress) < fashion_mnist_distance(x, mean_training_pullover)

An idea that came to mind was to calculate all the fashion_mnist_distances and return the argmin() – the label of the class with the smallest distance. A rough sketch of what I mean is below.
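Something like this is what I have in mind (it assumes the fashion_mnist_distance function and a list of precomputed mean images, one per class in label order, as in the blog post – the helper names are my own):

import torch

def predict_class(x, class_means):
    # distance from x to each of the 10 class means -> tensor of shape (10,)
    dists = torch.stack([fashion_mnist_distance(x, m) for m in class_means])
    return dists.argmin()              # label of the smallest distance

def accuracy(xs, ys, class_means):
    # xs: stack of validation images, ys: tensor of integer labels
    preds = torch.stack([predict_class(x, class_means) for x in xs])
    return (preds == ys).float().mean()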

1 Like

Great question for which I have no answer, but extremely curious how this would work. It would seem like it’d be super inefficient to have to calculate the distance for every class separately. What if we had 10,000 classes? I’m thinking there must be some trick for this? Looking forward to the answer that emerges. I’ll maybe try to implement it for all of the 10 Fashion MNIST classes.

3 Likes

If you keep reading the book, you’ll find that question answered in the coming chapters… :smiley:

3 Likes

That is why software engineering practice suggests having coders work in pairs, one looking over the shoulder of the other and switching positions throughout the day. Another option is a code review performed by a competent person. Alas, coding alone has this feature that typos can be extremely difficult to spot (can't see the forest for the trees).

2 Likes

I was wondering if there is something like Which image models are best? for NLP models?

I am trying to run the NLP notebook on my computer using deberta-v3-large; the system stops training and doesn't run further, without any error.

The model deberta-v3-small trains with no problem. Anyone else with the same problem?

If we retain the 3-D tensor structure (for each batch of data) then we also retain the spatial relationships between the pixels. However, to operate on such a tensor we would require a convolution… Converting the data to a 2-D tensor does away with these spatial relations, degrading the overall information, but then only a linear operation followed by a non-linear activation at each stage of the NN is needed. As we are just probing NNs here, we transform our rank-3 tensors into rank-2 tensors.

I understood it to be simply so we can multiply them with the weight vector. So we have a 28x28 image which is turned into 784 numbers in a vector (inside a rank-2 tensor of a few hundred of these things strung together), and we have 784 weights that we're going to multiply each of them with. So we transpose x1 (which is the 784 numbers representing that image, out of x1, x2, …, xn) and vector-multiply it with the weights, which happen to be another vector containing 784 numbers.
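Just to make the shapes concrete, here's a throwaway sketch with random numbers (sizes chosen to match MNIST; nothing here is from the book):

import torch

imgs = torch.randn(64, 28, 28)     # a batch of 64 images, rank-3 tensor
xb   = imgs.view(-1, 28*28)        # flatten each image -> (64, 784), rank-2 tensor
w1   = torch.randn(784, 30)        # 784 weights per hidden unit
b1   = torch.randn(30)
acts = xb @ w1 + b1                # one matrix multiply handles all 64 rows at once
print(acts.shape)                  # torch.Size([64, 30])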

Re: the “res”, it seems to be carrying forward the calculation result through the two layers and the non-linear activation sandwiched between them. But it's basically a pass through the network. It's almost like an accumulator: it gets the result of the first layer, then the activation, then the second layer, and returns it at the end.

def simple_net(xb):                # xb is the incoming batch of 784-pixel rows
    res = xb@w1 + b1               # layer 1: linear
    res = res.max(tensor(0.0))     # apply activation fn(), store back in res
    res = res@w2 + b2              # layer 2: multiply by the 2nd set of weights
    return res

I have a question about class names. Hopefully it has not been asked before. What is the best way to find out the class names of a learner associated with the predicted probabilities?

class_name,_,probs = learn.predict(test_pic)

Returns:

TensorBase([0.0338, 0.9662])

I need to know which class each of these two probabilities belongs to. How can I return the class names in the exact order this tensor shows them above?

Small lighting variations should be fine. Here's the part of the docs that talks about lighting transformations, showcasing under- and overexposed images: Data augmentation in computer vision | fastai

If I properly understand your question… what you are looking for is in that middle return value you are ignoring with “,_,”, which is the index into the probs[] array of the probability matching the first return value. Try this…

prediction,pred_ndx,probs = learn.predict(test_pic)
print(f"This is a: {prediction}.")
print(f"Probability: {probs[pred_ndx]:.4f}")

You might find the following useful…

categories = learn.dls.vocab

which you can see used here…
https://huggingface.co/spaces/bencoman/WhichWatersport/blob/main/app.py

I am presuming the element order matches the probs[] array - unless someone advises that is wrong.
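For example, something like this should print every class name next to its probability (again, assuming the vocab order matches the probs[] array):

prediction, pred_ndx, probs = learn.predict(test_pic)
for name, p in zip(learn.dls.vocab, probs):
    print(f"{name}: {p:.4f}")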

3 Likes

I’d like to clear up a misconception I might have… I thought dummy variables provided for 2^N categories, such that 3 dummy variables could encode 8 categories. However, while reviewing the Lesson 3 transcript I bumped into the following, which contradicted that idea:

So when you create dummy variables for a categorical variable with three levels, like this one, you need two dummy variables. So, in general, categorical variable with n levels needs n-1 columns.

I had a hunt around for info to support my view but couldn't find anything, so I presume I got my wires crossed. So, to confirm: 2^N encoding has no place in machine learning?

Also, this article was interesting in distinguishing between One Hot Encoding and Dummy Variables…

One Hot Encoding will cause the matrix of input data to become singular, meaning it cannot be inverted and the linear regression coefficients cannot be calculated using linear algebra. For these types of models a Dummy Variable encoding must be used instead.
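For a concrete contrast, here's a tiny pandas example of my own (not from the article) showing one-hot vs dummy encoding of a variable with three levels:

import pandas as pd

df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

one_hot = pd.get_dummies(df["colour"])                   # 3 columns for 3 levels
dummies = pd.get_dummies(df["colour"], drop_first=True)  # n-1 = 2 columns

print(one_hot.shape, dummies.shape)                      # (4, 3) (4, 2)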

That is correct, pretty much. Although we’ll be learning about something called “embeddings” which give us a touch of that…

2 Likes

The Chapter 4 questionnaire asks:

  1. Why do we have to zero the gradients?

for which my answer, copied from the text is:

loss.backward adds the gradients of loss to any gradients that are currently stored. So, we have to set the current gradients to 0 first.

But are there any situations where the gradients are not zeroed? Like maybe chaining up phases in some kind of pipeline of operations?


Actually, since writing above, I found a description of accumulating gradients (top answer starting “You are not actually accumulating gradients”) which recommends Calculus on Computational Graphs: Backpropagation that I’m starting to digest.

Yup exactly - that’s where we only zero the gradients every few times through the loop.
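Here's a minimal toy sketch of that pattern (my own made-up model and data, not from the book):

import torch

model = torch.nn.Linear(784, 10)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_func = torch.nn.functional.cross_entropy
xb_all, yb_all = torch.randn(32, 784), torch.randint(0, 10, (32,))

accum = 4                                    # step/zero only every 4 mini-batches
for i in range(0, 32, 8):
    xb, yb = xb_all[i:i+8], yb_all[i:i+8]
    loss = loss_func(model(xb), yb) / accum  # scale so accumulated grads average out
    loss.backward()                          # .grad is added to, not overwritten
    if (i // 8 + 1) % accum == 0:
        opt.step()
        opt.zero_grad()                      # only here do we zero the gradients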

1 Like

In Chapter 4, reading the following code, my mind is split deciding whether “res” is “result” or “residual”. Even though I would guess it's the former, the uncertainty is a distraction. I searched a few pages up and down without finding anything definitive. Could this be clarified?
[image: the Chapter 4 code snippet that builds up res]

p.s. would a pull request incorporating the answer be useful?

I was told by various students that res stands for result. Not 100% sure, though!

1 Like