It’s explained in the linked article we’re discussing - you can’t really use cross-validation for something like testing against the last week here:
Cross-validation requires some way to randomly split your data into multiple non-overlapping folds while still keeping the ability to test on out-of-distribution data.
True - that can be a useful approach. Personally I wouldn’t call it “cross-validation” (since the subsets overlap), but I can see how it roughly fits that pattern.
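(For anyone following along, scikit-learn’s TimeSeriesSplit does roughly what’s being described: each fold’s training window grows forward in time, so the training subsets overlap while each test window stays strictly in the “future”. This is just an illustration of the splitting pattern, not code from the article.)

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 time-ordered observations; each successive fold trains on a longer
# prefix of the series and tests on the next block of time steps.
X = np.arange(12).reshape(-1, 1)
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print("train:", train_idx, "test:", test_idx)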
While Jeremy builds and flies the F-16 Fighting Falcon (fast.ai), I am puttering along in my homemade, Wright-brothers-style twin-engine airplane, but I am proud of my baby. There are so many lessons learned when you code from scratch (without Jeremy’s notebooks).
Please ping me if you want to compare notebooks. I would like to see how you do it.
Here are a few highlights.
By the end of step #9, I had created the first train method, for regression:
def train_linear_reg(self, y_train, y_param, y, learn_r=0.03, epoch=5):
I use the same “y_param” (the weights) with different values of “learn_r” and “epoch”. The images below show all of the runs collected together; every new run starts with a spike in the loss and then settles at a lower loss.
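In case it helps to picture it, here is a simplified sketch of the idea (plain gradient descent on mean squared error for a linear model); this is an illustration, not my exact notebook code:

import torch

def train_linear_reg(y_train, y_param, y, learn_r=0.03, epoch=5):
    # Gradient descent on mean squared error for a linear model.
    for _ in range(epoch):
        preds = y_train @ y_param                # linear prediction
        loss = ((preds - y) ** 2).mean()         # MSE loss
        loss.backward()
        with torch.no_grad():
            y_param -= learn_r * y_param.grad    # take one gradient step
            y_param.grad.zero_()
        print(f"loss: {loss.item():.4f}")
    return y_param

# Tiny usage example with random data.
x = torch.randn(100, 3)
w_true = torch.tensor([2.0, -1.0, 0.5])
y = x @ w_true + 0.1 * torch.randn(100)
w = torch.randn(3, requires_grad=True)
train_linear_reg(x, w, y, learn_r=0.03, epoch=5)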
By the end of step #13, I had created the final train method. The only differences from the method above are the call to self.predict_y_nn(y_train, y_param) and the new self.fetch_hidden_param(y_train, hidden_param):
def train_nn(self, y_train, y, learn_r=0.03, epoch=5, hidden_param=2):
I ran it many times with different parameters. The images below show all of the runs together; again, every new run starts with a spike in the loss and then settles lower.
Notice that it does not matter how many randomly initialized hidden parameters (hidden-layer size) you use: keep training for enough epochs and the error rate keeps going down.
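A similarly simplified sketch of the neural-net version (one ReLU hidden layer trained with plain gradient descent; the real predict_y_nn and fetch_hidden_param methods in my notebook differ, so treat this as an illustration only):

import torch

def train_nn(y_train, y, learn_r=0.03, epoch=5, hidden_param=2):
    # One hidden layer of size hidden_param with ReLU, trained by gradient descent.
    n_in = y_train.shape[1]
    w1 = torch.randn(n_in, hidden_param, requires_grad=True)
    w2 = torch.randn(hidden_param, requires_grad=True)
    for _ in range(epoch):
        hidden = torch.relu(y_train @ w1)        # hidden activations
        preds = hidden @ w2                      # output layer
        loss = ((preds - y) ** 2).mean()
        loss.backward()
        with torch.no_grad():
            for w in (w1, w2):
                w -= learn_r * w.grad
                w.grad.zero_()
        print(f"loss: {loss.item():.4f}")
    return w1, w2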
I am running out of space on Kaggle and I can’t figure out how to remove old checkpoints mid-training. Any advice? I assumed it had something to do with the Trainer callbacks, but I haven’t found the right setting.
For example, I want to remove checkpoint 500 to free space.
I use JupyterLab on Paperspace to run the lesson 4 notebook. I encountered this error when importing kaggle:
ValueError: Error: Missing username in configuration.
I found that the following approach solves the issue:
Install the kaggle package.
Upload the kaggle.json file to the current working directory.
Run the following code:
# Make a .kaggle directory in the home folder and copy kaggle.json into it.
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
# Restrict the file permissions (the Kaggle API expects this).
! chmod 600 ~/.kaggle/kaggle.json
# Quick check that the credentials work.
! kaggle datasets list
Old bump, but this answer was very helpful. Basically, with a few Hugging Face TrainingArguments you can save only the best/last model; playing with these args lets you keep more or different combinations of checkpoints:
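For example (a sketch using standard TrainingArguments options; argument names can vary slightly between transformers versions, so check the docs for yours):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="outputs",
    evaluation_strategy="epoch",    # evaluate once per epoch
    save_strategy="epoch",          # save a checkpoint once per epoch
    save_total_limit=1,             # keep at most one checkpoint on disk; older ones get deleted
    load_best_model_at_end=True,    # also retain (and reload) the best-scoring checkpoint
    metric_for_best_model="loss",
)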
The only tokenized and numericalized column in our data is the new input field we created (using the tok_func function, which I’ve omitted here).
The other fields, like context etc., are still in text format.
When we pass our train_dataset to the Trainer, we don’t specify the input field; we just pass in dds['train'], which is the train split of our data.
So my question is: how does the Trainer know which field it should treat as the input? Is it like the target, which it expects to be called labels (so we should always call our input field inputs)? Does it just look for the numerical field and use that as the input? Or is there something I am missing?
I have been researching this question and finally found an answer! I will leave the answer here in case someone else has the same question as I did.
The answer is actually quite simple: in the Hugging Face ecosystem, when fine-tuning a pretrained transformer, we can check the names of the fields that the model expects in its forward pass using the tokenizer.
As you may remember, we can automatically load a model’s tokenizer using the AutoTokenizer class:
from transformers import AutoTokenizer
model_nm = "microsoft/deberta-v3-small"
tokenizer = AutoTokenizer.from_pretrained(model_nm)
We can then check which fields the model expects using the following code:
tokenizer.model_input_names
For our example above, this outputs several fields, including the input_ids field we got after tokenizing. So, basically, after tokenization your inputs will already carry the fields the model requires.
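For instance (the exact list depends on the tokenizer, but for deberta-v3-small it should look something like this):

print(tokenizer.model_input_names)
# e.g. ['input_ids', 'token_type_ids', 'attention_mask']

tok = tokenizer("How does the Trainer find its inputs?")
print(list(tok.keys()))  # the tokenized output carries these same fields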
Has anyone tried changing the default fastai tokenizer to use a subword tokenizer instead of a word tokenizer like spaCy? If you have, where did you fit it into the ULMFiT process?
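To make the question concrete, I am imagining something like this untested sketch, where a SentencePiece subword tokenizer is passed to TextBlock.from_folder in place of the default spaCy word tokenizer (the tok argument and SentencePieceTokenizer are real fastai pieces, but I have not verified this exact combination):

from fastai.text.all import *

path = untar_data(URLs.IMDB)
dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True,
                                 tok=SentencePieceTokenizer(vocab_sz=10000)),
    get_items=get_text_files,
    splitter=RandomSplitter(0.1)
).dataloaders(path, bs=64, seq_len=72)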
Also, fit_one_cycle on the language model for the IMDB dataset is taking very long on Google Colab: after about 2 hours the run is still at 8%. Is there any way to speed it up?
I have a question about seq_len in DataBlock.dataloaders() in the ‘Creating the Classifier DataLoaders’ section of 10_nlp.ipynb.
seq_len is set to 72, but in the output of the following show_batch(), each of the 3 documents has 150 tokens, not 72 (please see the screenshot below). Why is this?
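For context, the cell I mean is roughly this (paraphrased from 10_nlp.ipynb, not copied verbatim):

dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab), CategoryBlock),
    get_y=parent_label,
    get_items=partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)

dls_clas.show_batch(max_n=3)  # each document shows ~150 tokens, not 72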
@ns8wcny, I probably don’t have the knowledge to help, but just a tip… You’ll likely get a better response if you reduce the burden on readers trying to understand your question, i.e. not having to guess where to hunt down the notebook you are referencing. A link to the full notebook would be useful, plus code extracts and the output of code execution.