Lesson 4 official topic

I’m running this on kaggle and set my path to

path = Path('../input/us-patent-phrase-to-phrase-matching')

However, when I tried to view what’s in the directory, I got a “cannot access” error:

ls: cannot access ‘/kaggle/input/us-patent-phrase-to-phrase-matching’: No such file or directory

I also tried changing my path to ‘/kaggle/input…’, but it still didn’t work.

1 Like

I actually figured it out: I clicked the add button and it works now. Is this what everyone is supposed to do as their first step?

2 Likes

Hi @silience ! It does seem like everyone’s using Transformers now. That’s probably why Jeremy decided to show us how to use them in this lesson. I personally found the Hugging Face framework very cumbersome compared to fastai, though.

Regarding RNNs being “discarded”, I wouldn’t say so! Take a look at this project: GitHub - BlinkDL/RWKV-LM: RWKV is a RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable). So it's combining the best of RNN and transformer - great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding.

Revisiting NLP and trying to do some fastai-based experiments, I found these two resources useful:

  1. Tutorial on how to do text transfer learning for classification (e.g. sentiment analysis): fastai - Text transfer learning
  2. Tutorial on how to do language modeling (i.e. text generation, or predicting the next token) using transformers + fastai: fastai - Transformers

However, I haven’t been able to find any resources on using transformers for classification, and I haven’t been able to successfully combine the above tutorials. Has anyone else had any luck with using huggingface transformers for classification instead of language modeling? Thanks in advance!
(for the record, I’m playing with Natural Language Processing with Disaster Tweets | Kaggle)

Have you looked at blurr - Getting Started?

1 Like

There’s this beginner notebook that’s part of the new fastai course, on how to use HF Transformers for classifying patents into 3 categories: Getting started with NLP for absolute beginners | Kaggle

I was re-working lesson 4 by doing the Kaggle competition on disaster tweets. In analogy to what we did in lesson 2, is there also a way to get the top losses from training with a Hugging Face Trainer?

In fast.ai there is

interp = ClassificationInterpretation.from_learner(learn)
interp.plot_top_losses(5, nrows=1)

Does something like this also exist for a Hugging Face Trainer? If not, is there a way to at least get the loss for each individual text that was shown to the model for classification? Somehow I could not find anything…

Thanks, Christian

2 Likes

You should be able to discover your answer with liberal use of:

!pwd
!ls
!cd ..
!find / -name input

Re-working Fast.AI lesson 4, I was transferring the approach from Jeremy’s notebook “Getting started with NLP for absolute beginners” to the Kaggle competition “Natural Language Processing with Disaster Tweets”.

When I started this project, I did not expect it to become such an extended endeavor. It introduced me to many different aspects of natural language processing in particular and machine learning in general. To share what I learned with the community, I recorded my approach and the key learnings in this blog post.

In the spirit of producing results quickly and training models early in the development process:

  • I started by creating a baseline notebook in which I used the same approach as presented in the lecture, porting it pretty much 1:1.
  • In the final iteration (so far), I have incorporated quite a few “upgrades”, which resulted in a score of 0.84676 and put me almost at the top of the leaderboard.

The key learnings:

  • Cleaning the data helps, both syntactically and semantically.
  • While cleaning the data, keep a close eye on what is noise and what is signal.
  • Using special tokens to help the model understand the data helps.
  • Using bigger models helps. However, for training large models on Kaggle, you need to apply some tricks to avoid running out of memory (see the sketch after this list).
  • Small batch sizes help.
  • Showing the model more data than just the initial training set helps.
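
As an illustration of the memory tricks referred to above, here is a sketch of the usual Hugging Face knobs (not necessarily exactly what my notebook does; the concrete numbers are made up):

from transformers import TrainingArguments

bs = 8  # small per-device batch size so a large model fits into GPU memory

args = TrainingArguments('outputs', learning_rate=8e-5, warmup_ratio=0.1,
    lr_scheduler_type='cosine',
    fp16=True,                       # half precision roughly halves activation memory
    per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    gradient_accumulation_steps=16,  # effective batch size bs*16 without the memory cost
    gradient_checkpointing=True,     # recompute activations in the backward pass to save memory
    evaluation_strategy='epoch', num_train_epochs=4, weight_decay=0.01, report_to='none')

Gradient accumulation lets you trade per-device batch size against effective batch size without increasing the memory footprint.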

More details are in my blog post.

4 Likes

Thank you for such a detailed and informative lesson! I really enjoyed the data cleansing details you added and learned a lot from that. It feels like each data cleaning function had a story of looking at the data and wondering “how do I handle this?”. Very nice.

I did get a bit lost towards the end, but that is due to me being a total beginner with this field. I’ll go over it again to make sense of it.

Thanks again!

2 Likes

Just when I thought I was done with disaster tweets, I realized I forgot a topic I wanted to cover. In a new notebook version, I implemented a confusion matrix to find tweets which are incorrectly labeled in the training set - basically the same approach as looking for top losses (for example in lesson 2).

I was indeed successful in finding quite a few incorrectly labeled tweets, but (surprisingly) this did not result in a better overall competition result - from my understanding this is a limitation of the dataset. I summarized the full story and my learnings in this blog post.
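
For anyone looking for the “top losses” equivalent asked about further up the thread: per-example losses can be pulled out of a Hugging Face Trainer with trainer.predict and then sorted. A sketch of the idea (assuming a classification head with num_labels=2 and a tokenized split dds['train'] with a labels column; the names are illustrative, not necessarily exactly what my notebook uses):

import torch
import torch.nn.functional as F

preds = trainer.predict(dds['train'])   # PredictionOutput with .predictions and .label_ids
logits = torch.tensor(preds.predictions)
labels = torch.tensor(preds.label_ids)

# per-example cross-entropy loss; sorting it descending gives a "top losses" view,
# which surfaces candidates for mislabeled tweets
losses = F.cross_entropy(logits, labels, reduction='none')
top_idx = losses.argsort(descending=True)[:20]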

1 Like

Hello guys, I have tried to reproduce Jeremy’s notebook for lesson 4; it has my explanations, which may be helpful for beginners like me.
Here’s the Kaggle notebook link: NLP for absolute beginners -My explanation | Kaggle

Please let me know if you have any feedback.

1 Like

Tanishq, I was wondering if you are aware of, and if so could you please share, any sources that implement papers in a tutorial format with at least some brief explanation of why each piece of code is being executed? I am trying to learn to understand those papers, but even when provided along with code, such papers appear to be inaccessible for beginners.

In the Kaggle notebook (Getting started with NLP for absolute beginners), below the 17th line of code it is noted that “Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.” I got stuck there. Could someone please help me understand how to make sure the word embeddings are fine-tuned?
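
As far as I understand, that note is just a reminder: the embedding layer is trainable, so the embeddings of those special tokens are fine-tuned automatically along with the rest of the model when trainer.train() runs; nothing extra is needed in the notebook. If you ever add your own special tokens, the usual pattern looks roughly like this sketch (the token strings below are made up):

from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_nm = 'microsoft/deberta-v3-small'
tokz = AutoTokenizer.from_pretrained(model_nm)
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)

# hypothetical extra tokens; grow the embedding matrix to match the tokenizer,
# the new rows are randomly initialised and get trained during fine-tuning
tokz.add_special_tokens({'additional_special_tokens': ['[ANCHOR]', '[TARGET]']})
model.resize_token_embeddings(len(tokz))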

I have assigned the model name ‘microsoft/deberta-v3-small’ to a variable model_nm but have not done any installation or training. Do I need to first install or download this model before using it?

Below is the error stack trace

ValueError                                Traceback (most recent call last)
Cell In[41], line 1
----> 1 tokz = AutoTokenizer.from_pretrained(model_nm)

File ~/Desktop/code/ml/venv/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py:676, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    674 tokenizer_class_py, tokenizer_class_fast = TOKENIZER_MAPPING[type(config)]
    675 if tokenizer_class_fast and (use_fast or tokenizer_class_py is None):
--> 676     return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    677 else:
    678     if tokenizer_class_py is not None:

File ~/Desktop/code/ml/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1804, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
   1801     else:
   1802         logger.info(f"loading file {file_path} from cache at {resolved_vocab_files[file_id]}")
-> 1804 return cls._from_pretrained(
   1805     resolved_vocab_files,
   1806     pretrained_model_name_or_path,
   1807     init_configuration,
   1808     *init_inputs,
   1809     use_auth_token=use_auth_token,
   1810     cache_dir=cache_dir,
   1811     local_files_only=local_files_only,
   1812     _commit_hash=commit_hash,
   1813     **kwargs,
   1814 )

File ~/Desktop/code/ml/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1959, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, use_auth_token, cache_dir, local_files_only, _commit_hash, *init_inputs, **kwargs)
   1957 # Instantiate tokenizer.
   1958 try:
-> 1959     tokenizer = cls(*init_inputs, **init_kwargs)
   1960 except OSError:
   1961     raise OSError(
   1962         "Unable to load vocabulary from file. "
   1963         "Please check that the provided vocabulary is accessible and not corrupted."
   1964     )

File ~/Desktop/code/ml/venv/lib/python3.10/site-packages/transformers/models/deberta_v2/tokenization_deberta_v2_fast.py:133, in DebertaV2TokenizerFast.__init__(self, vocab_file, tokenizer_file, do_lower_case, split_by_punct, bos_token, eos_token, unk_token, sep_token, pad_token, cls_token, mask_token, **kwargs)
    118 def __init__(
    119     self,
    120     vocab_file=None,
   (...)
    131     **kwargs
    132 ) -> None:
--> 133     super().__init__(
    134         vocab_file,
    135         tokenizer_file=tokenizer_file,
    136         do_lower_case=do_lower_case,
    137         bos_token=bos_token,
    138         eos_token=eos_token,
    139         unk_token=unk_token,
    140         sep_token=sep_token,
    141         pad_token=pad_token,
    142         cls_token=cls_token,
    143         mask_token=mask_token,
    144         split_by_punct=split_by_punct,
    145         **kwargs,
    146     )
    148     self.do_lower_case = do_lower_case
    149     self.split_by_punct = split_by_punct

File ~/Desktop/code/ml/venv/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py:120, in PreTrainedTokenizerFast.__init__(self, *args, **kwargs)
    118     fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
    119 else:
--> 120     raise ValueError(
    121         "Couldn't instantiate the backend tokenizer from one of: \n"
    122         "(1) a `tokenizers` library serialization file, \n"
    123         "(2) a slow tokenizer instance to convert or \n"
    124         "(3) an equivalent slow tokenizer class to instantiate and convert. \n"
    125         "You need to have sentencepiece installed to convert a slow tokenizer to a fast one."
    126     )
    128 self._tokenizer = fast_tokenizer
    130 if slow_tokenizer is not None:

ValueError: Couldn't instantiate the backend tokenizer from one of: 
(1) a `tokenizers` library serialization file, 
(2) a slow tokenizer instance to convert or 
(3) an equivalent slow tokenizer class to instantiate and convert. 
You need to have sentencepiece installed to convert a slow tokenizer to a fast one.


Are you trying this on kaggle or on your own machine?
Maybe it is something as simple as a missing internet connection in your notebook?

I was doing it on a local machine. There was a problem with sentencepiece and, in addition to that, a problem related to protobuf. I have resolved both of them. One interesting thing that also delayed me for a few hours is the ‘_’-looking sign: I thought it was the underscore you can type by holding Shift and hitting the minus key, but it seems to be a different symbol. I could reproduce the notebook’s result only when I copied the value of that symbol.

In general, I try to reimplement the notebook from scratch - that way it is easier to catch all the things that might take you off guard in the future.
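
For anyone hitting the same ValueError: the install step the error message points to (plus the protobuf issue mentioned above) looks roughly like this in a notebook cell (drop the leading ! in a terminal; the exact versions you need may depend on your transformers version):

# DeBERTa's fast tokenizer is converted from a slow (sentencepiece-based) one,
# so sentencepiece must be available; a compatible protobuf may also be needed
!pip install sentencepiece protobuf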

General question: what is the difference between passing a string to tokz as an argument and passing a string to the tokenize method of tokz? And what is the type of tokz - is it an object or a method? When I try type(tokz), the following is printed as output:
transformers.models.deberta_v2.tokenization_deberta_v2_fast.DebertaV2TokenizerFast
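
A short sketch of the difference (the example sentence is arbitrary): tokz is an object, an instance of DebertaV2TokenizerFast, which defines __call__, so you can call it like a function. Calling the object runs the full encoding pipeline, while .tokenize() only splits the text into sub-word strings:

from transformers import AutoTokenizer

tokz = AutoTokenizer.from_pretrained('microsoft/deberta-v3-small')

# calling the tokenizer object returns the inputs the model needs
enc = tokz("Hello, how are you?")
print(enc.keys())   # e.g. dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

# .tokenize() only returns the sub-word token strings, without ids or special tokens
print(tokz.tokenize("Hello, how are you?"))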

I’m training the DeBERTa v3 model on my local machine and have made it through to the point of calling trainer.train(). However, I then receive this error:

/Users/yakimoff/git/fastbook/lesson_4_nlp.ipynb Cell 49 in <cell line: 1>()
----> 1 trainer.train()

File ~/miniconda3/envs/fastai_course/lib/python3.10/site-packages/transformers/trainer.py:1527, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1522     self.model_wrapped = self.model
   1524 inner_training_loop = find_executable_batch_size(
   1525     self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
   1526 )
-> 1527 return inner_training_loop(
   1528     args=args,
   1529     resume_from_checkpoint=resume_from_checkpoint,
   1530     trial=trial,
   1531     ignore_keys_for_eval=ignore_keys_for_eval,
   1532 )

File ~/miniconda3/envs/fastai_course/lib/python3.10/site-packages/transformers/trainer.py:1775, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   1773         tr_loss_step = self.training_step(model, inputs)
   1774 else:
-> 1775     tr_loss_step = self.training_step(model, inputs)
   1777 if (
   1778     args.logging_nan_inf_filter
   1779     and not is_torch_tpu_available()
   1780     and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
...
   2571         )
   2572     # We don't use .loss here since the model may return tuples instead of ModelOutput.
   2573     loss = outputs["loss"] if isinstance(outputs, dict) else outputs[0]

ValueError: The model did not return a loss from the inputs, only the following keys: logits. For reference, the inputs it received are input_ids,token_type_ids,attention_mask.

Apart from some import statements that are in other cells of my notebook, my code in its entirety is this:

df = pd.read_csv(path/"train.csv")
df['input'] = "ANC: " + df['anchor'] + "; X1: " + df['target'] + "; X2: " + df['context']
ds = Dataset.from_pandas(df)

model_nm = 'microsoft/deberta-v3-small'

tokz = AutoTokenizer.from_pretrained(model_nm)
def tok_func(x): return tokz(x["input"])
tok_ds = ds.map(tok_func, batched=True)


eval_df = pd.read_csv(path/'test.csv')
eval_df['input'] = "ANC: " + eval_df['anchor'] + "; X1: " + eval_df['target'] + "; X2: " + eval_df['context']
eval_ds = Dataset.from_pandas(eval_df).map(tok_func, batched=True)


def corr_d(eval_pred): return {'pearson': corr(*eval_pred)}

from transformers import TrainingArguments, Trainer
bs = 128
epochs = 4 
lr = 8e-5

# boilerplate apparently
args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=False,
    evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=0.01, report_to='none')

model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)
trainer = Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'],
                  tokenizer=tokz, compute_metrics=corr_d)

Everything else evaluates correctly.
Am I missing something obvious?
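
One hedged guess: the error says each batch contained only input_ids, token_type_ids and attention_mask, i.e. no labels, so the model has nothing to compute a loss against. The code above also uses dds without showing how it was built. In the lesson notebook the score column is renamed to labels before splitting, so the missing step may look roughly like this:

# rename the target column to 'labels' so the Trainer passes it to the model,
# then create the train/validation split used as dds above
tok_ds = tok_ds.rename_columns({'score': 'labels'})
dds = tok_ds.train_test_split(0.25, seed=42)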

I have a question regarding SGD. When the function is x^2 and we choose a random starting point, it makes sense that the gradient tells us whether to increase or decrease our weights to move towards the lowest point. But what if our function is much more complicated - what stops SGD from settling at some local optimum, since it cannot see that after a brief worsening the loss would get better again? It feels like the learning rate would be “confused” because in each direction things sometimes get better and sometimes worse, and the gradient at the randomly chosen point would not account for that. Terrible picture for reference - red = random point; how would it get to green and not blue? Sorry if this is explored in more detail later on, but it is currently bugging me.
[image: sketch of a loss curve with a red starting point and green and blue minima]

That’s indeed a known problem with gradient-based optimization methods.
One cure is to choose a learning rate that is high enough that the optimizer just “jumps over” these local minima. Another is a concept called momentum, which is implemented for example in the Adam optimizer. There, the update is based on a running weighted average of the previous gradients, which means the “red point” is inclined to keep moving in the same direction, at least to a certain degree.
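
A minimal numeric sketch of the momentum idea (just the running-average part, not the full Adam update; the toy loss w**2 and the constants are made up for illustration):

beta, lr = 0.9, 0.1   # momentum coefficient and learning rate
velocity = 0.0

def grad(w):          # gradient of the toy loss w**2
    return 2 * w

w = 5.0               # the "red point"
for _ in range(50):
    velocity = beta * velocity + (1 - beta) * grad(w)   # running weighted average of gradients
    w -= lr * velocity

Because velocity remembers previous gradients, a short stretch where the gradient points the other way does not immediately reverse the direction of travel.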

1 Like