Lesson 4 official topic

Hi everyone!
I watched lesson 4 and read the Kaggle notebook but I’m still having a hard time understanding what the network is learning/doing.

As far as I understood, the steps we are taking are:

  1. The dataset has the following attributes: anchor, target, context and score, where score is what we want to predict.
  2. We rewrite the input to be in the form:

df['input'] = 'TEXT1: ' + df.context + '; TEXT2: ' + df.target + '; ANC1: ' + df.anchor
e.g.: TEXT1: A47; TEXT2: abatement of pollution; ANC1: abatement

  3. We tokenize the input using the model’s tokenizer.
  4. Each token is assigned a different ID / numerical value. I guess these IDs are sequential and have no real relationship with the word’s meaning (e.g. if “bird” is 42, “eagle” can be 43 or 1234; the fact that an eagle is also a bird plays no part in the ID assignment).
  5. We train the network on our dataset.
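The steps above can be sketched with a toy tokenizer (a minimal sketch; the real notebook uses a pretrained Hugging Face tokenizer, and the IDs below are invented purely to show that they carry no semantic relationship):

```python
# Build the input string the same way the notebook does (step 2).
row = {"context": "A47", "target": "abatement of pollution", "anchor": "abatement"}
inp = "TEXT1: " + row["context"] + "; TEXT2: " + row["target"] + "; ANC1: " + row["anchor"]
print(inp)  # TEXT1: A47; TEXT2: abatement of pollution; ANC1: abatement

# Toy tokenizer (steps 3-4): split on whitespace and map each token to an
# arbitrary integer ID. The mapping encodes identity, not meaning - related
# words get unrelated IDs.
vocab = {}

def tokenize(text):
    ids = []
    for tok in text.lower().split():
        if tok not in vocab:
            vocab[tok] = len(vocab)  # next free ID, assigned in order seen
        ids.append(vocab[tok])
    return ids

ids = tokenize(inp)
print(ids)  # the repeated "abatement" maps to the same ID both times
```

Real subword tokenizers are smarter about splitting (e.g. “abatement” may become several pieces), but the ID-assignment idea is the same.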

Here are my main concerns:

  • Why do we have to rewrite the input like that? Couldn’t we just pass the three strings as three different inputs, or, if a single input is needed, as a shorter string? For example:

df['input'] = df.context + '; ' + df.target + '; ' + df.anchor
e.g.: A47; abatement of pollution; abatement

In my understanding, the constant part is going to be neglected by the model anyway. (! read the edit at the bottom of the page)

  • How does the model work? Does it take all of a word’s token IDs at once, or does it examine them one by one and then average the results?
  • If I were to test on two words that are not in the dataset (I still have to understand how to do it, sorry for not trying it by myself), would the results be any good?

Sorry for all these questions, but I’m really struggling with this lesson. Any answer is welcome, even “search [topic name] on google”. Right now I’m feeling just lost.

EDIT: In this notebook Jeremy Howard uses a different sentence and still obtains good results. I think it doesn’t matter what the sentence is as long as it contains all the information needed.

Hi there!
You practically asked the same questions that I had in mind, which is great. I was thinking that if I tried answering your questions and we bounced ideas back and forth, we’d find in each other the answers we were seeking.
Now, I don’t think the way the input is provided truly changes anything. Since everything is tokenized, then numericalized, the constant part just doesn’t matter.
I know we’re using the Pearson correlation coefficient to measure whether two sets of variables are correlated. We’re probably comparing lists of numbers, but it’s hard to say without any additional info.
What do you think?

I’m not sure I’d pick ; as my delimiter. You need something that doesn’t normally appear in regular texts – something that the model can learn represents the start/end of a new field. I guess in this particular case that ; might not actually appear inside any of the fields, so it might be OK, but it’s safer to use a token that is likely to be unique.
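One way to see why a unique delimiter matters: if a field’s own text can contain the delimiter, the field boundaries become ambiguous. A toy illustration (the field values here are made up):

```python
# Hypothetical row where the target field itself contains "; " - splitting
# back on "; " now yields the wrong number of fields.
context, target, anchor = "A47", "abatement; reduction", "abatement"

joined = "; ".join([context, target, anchor])
print(joined.split("; "))   # 4 pieces - the field boundary is ambiguous

# A separator string that never appears in the data keeps boundaries unambiguous.
SEP = " [FIELD] "           # stand-in for a delimiter unique to the dataset
joined = SEP.join([context, target, anchor])
print(joined.split(SEP))    # exactly 3 pieces, as intended
```

The model never splits the string back like this, of course, but the same ambiguity makes it harder for it to learn where one field ends and the next begins.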


We’re not really up to the “how does it work” bit yet. The focus at this stage is on using it, not understanding it. As you proceed through the course you’ll build the understanding of how it works.

In short: the tokens are considered all at once, but with respect to their absolute position in the sentence and their relative position to each other.
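To give a flavour of how position can be injected while still processing all tokens in parallel, here is the original Transformer’s sinusoidal positional encoding (a sketch of one classic scheme only; DeBERTa itself handles positions differently, with relative position embeddings):

```python
import math

# Each position gets a fixed vector that is added to the token's embedding,
# so the model can look at all tokens at once yet still distinguish order.
def positional_encoding(position, d_model=8):
    vec = []
    for i in range(d_model // 2):
        angle = position / (10000 ** (2 * i / d_model))
        vec += [math.sin(angle), math.cos(angle)]
    return vec

# Different positions get different vectors, even for the same token ID.
print(positional_encoding(0))
print(positional_encoding(1))
```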


Thanks Jeremy!

I was stalled out on my local machine with a protobuf error. I was only able to keep moving by using your ‘last resort’ solution of 'microsoft/deberta-base'.

how did you know this would work?! why does it work?

I probably read some tutorials on the Hugging Face website, and they were using this model.

Sorry for the late answer, I’ve been quite busy lately.

Now, I don’t think the way the input is provided truly changes anything. Since everything is tokenized, then numericalized, the constant part just doesn’t matter.

Yeah, seems logical to me. But maybe the constant part could help the network perform better in some way or another.

I know we’re using the Pearson correlation coefficient to measure whether two sets of variables are correlated. We’re probably comparing lists of numbers, but it’s hard to say without any additional info.
What do you think?

I don’t really know what to think here. I have an earworm that keeps saying: “If you are using the Pearson correlation coefficient, why don’t you just use it and forget about ML?”. I think I need to rewatch that part, though. I’ll edit this answer when it’s all clearer.

So the “black box approach” was intended here? Good to know! Is it the same for the following lectures of part 1?
Using my previous knowledge about ML (mainly YouTube videos) I was able to write a small homework project for lessons 1 and 2. I don’t think I’d be able to do it for lesson 5. Is that normal, or should I rewatch the lesson?

Hi everyone!
I’m currently on Lesson 4 and working through the “Linear model and neural net from scratch” notebook. I’m trying to use the data from the “Titanic” competition, but it seems that when I make the API call it doesn’t work.

I understand the file or directory doesn’t exist, so do I have to move the files into a different directory? Any help on this would be much appreciated!

@liftcookcode You don’t mention that you’ve accepted competition rules, so have you done that?

I believe so.
Yup, just double-checked, so we’re good on that end.

The path indicates that you are on Kaggle. Just to make sure: is the data actually attached to the notebook, i.e. does it appear in the right sidebar under Data?

Yup that was the problem!

Thank you!

Just a quick tip regarding the Pearson correlation coefficient.

If you want a better understanding of Pearson correlation, you should check out the “Guess the Correlation” game:
http://guessthecorrelation.com/

You can also try my version of the game here: https://kap-pch.shinyapps.io/CorrelationGame/


Hi!
Great lesson, one thing tripped me up.
How does the brain do text tokenization, and why do we do that part with ‘classical lexer/procedural’ code? Is there a way to do the tokenization with deep learning as well? What is the ‘speed/size/other’ trade-off that I’m missing?

Thanks!

Hi, I’d like your help please.

I’ve been working with Jeremy’s notebook Iterate like a grandmaster! | Kaggle.

The only line I’ve changed was the model, from “microsoft/deberta-v3-small” to “AI-Growth-Lab/PatentSBERTa”.

However, this change brings up the following error while trying to train the model:

ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 1 and the array at index 1 has size 9116

Why is this happening? What should I change?

Thanks!


Can’t run NLP tutorials due to memory errors.

Hi,

I have been trying to follow the ULMFiT tutorial and keep running into RuntimeError: CUDA out of memory, and can’t complete the tutorial. I’ve tried it both locally and on a Paperspace fastai setup. Here is the tutorial I’m following: fastai - Text transfer learning

I’m getting this out of memory error on this line:

learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
learn.fine_tune(4, 1e-2)

actual error message:
RuntimeError: CUDA out of memory. Tried to allocate 132.00 MiB (GPU 0; 7.94 GiB total capacity; 6.94 GiB already allocated; 124.38 MiB free; 7.14 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

any pointers are greatly appreciated.

Maybe reducing the batch size and using gradient accumulation will help… Lesson 7: Practical Deep Learning for Coders 2022 - YouTube
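A sketch of those two knobs, assuming the variable names from the fastai text tutorial (adjust to your own notebook; not tested here):

```python
# 1. Smaller batch size when building the DataLoaders (the default is 64):
dls = TextDataLoaders.from_folder(path, valid='test', bs=32)

# 2. Gradient accumulation: fastai's GradientAccumulation callback only takes
# an optimizer step once n_acc samples have been seen, so a small per-batch
# memory footprint behaves like a larger effective batch size:
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy,
                                cbs=GradientAccumulation(n_acc=64))
learn.fine_tune(4, 1e-2)
```

If it still runs out of memory, halve `bs` again; gradient accumulation lets you keep the effective batch size the same while doing so.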


Same here – I was only able to avoid this error on Paperspace by setting model_nm = 'microsoft/deberta-base'
