Lesson 4 official topic

Hi @duchaba, this is amazing!! :clap: :clap: I think I’m going to try this. Seems daunting actually, but I think definitely worth trying! Thanks for making this notebook and thanks for sharing!

1 Like

I tried the “NLP for beginners” and was able to submit fine. However, when working through “iterate like a grandmaster” I run into issues with generating predictions.

After I train my model, and right before “Improving the Model”, I want to make predictions, so I borrowed the code from NLP for beginners. It seems that preds = trainer.predict(eval_ds).predictions.astype(float) causes an error because predictions is None (a NoneType). The model itself seems to be training fine. I feel I am missing something basic, even after reviewing the documentation and trying some other solutions.

So in that notebook eval_ds was not processed in the same way as ds was. It also needs to be tokenized and such. I think you get None because the model was never actually applied; it didn’t see any token_ids in the dataset.

If you replicate the processing appropriately for the eval_ds it should work.
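
For example, a minimal sketch of what I mean, assuming tokz is the tokenizer you used for ds and the text column is called "input" (adjust the names to match your notebook):

def tok_func(x):
    return tokz(x["input"])

eval_ds = eval_ds.map(tok_func, batched=True)  # tokenize eval_ds the same way as ds
preds = trainer.predict(eval_ds).predictions.astype(float)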

1 Like

Yep, that’s the issue. As I was comparing values it looked like everything done to ds was also being done to eval_ds… but that was not the case.

1 Like

With the Getting Started with NLP for absolute beginners notebook, when run on Paperspace, did anyone run into this error:

‘SentencePieceProcessor’ object has no attribute ‘encode’

I have tried both pip install datasets and conda install -c huggingface -c conda-forge datasets, in case it is some module conflict.

The code gets as far as the sample tokenization function and throws this error there.

Searched on the web, haven’t got past this on Paperspace.

1 Like

When you get an error like that, try searching the forum. When I searched, I found this:

2 Likes

Thank you. It turned out the sentencepiece module was at 0.1.86; the fix was pip install sentencepiece==0.1.96

2 Likes

I am trying a project using HuggingFace. I have been able to follow the Patent Notebook regression classification and tried it with a couple of other models. All good.

I am now trying out a Twitter binary classification project using vinai/bertweet-base . (I don’t want to provide more information than needed if my problem is simple.)

I believe my model to be training as reflected by the decreasing loss function, but this is not being reflected in the Accuracy Metric. Below is the definition of compute_metrics and the results of testing it. (Borrowed from the Hugging Face forums.)

I’ve looked through the Hugging Face Forums and consulted their online course (hosted by Sylvain). I’ve also posted a similar question there, but I’m a new user and am quarantined. I can’t seem to figure it out.

Any suggestions on why my Accuracy is not changing?

I have really come to appreciate how much I like the fastai libraries. The default behaviors and tolerance for different data types are so appreciated!

import numpy as np
from datasets import load_metric

def compute_metrics(eval_pred):
    # eval_pred is a (predictions, labels) pair; predictions are the raw model outputs
    metric = load_metric("accuracy")
    preds, labels = eval_pred
    preds = np.argmax(preds, axis=1)  # pick the highest-scoring class per row
    return metric.compute(predictions=preds, references=labels)

x = np.array([[.2,.8],[.3,.7],[.6,.4]])
y = np.array([1,1,1])

compute_metrics([x,y])

Out[103]:
{'accuracy': 0.6666666666666666}

Here are the training results:

Execute Training
In [109]:

trainer.train();
executed in 2m 54s, finished 17:36:39 2022-05-31
The following columns in the training set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: handle, __index_level_0__, input. If handle, __index_level_0__, input are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
/home/cdaniels/mambaforge/envs/fastai/lib/python3.9/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
***** Running training *****
  Num examples = 12827
  Num Epochs = 4
  Instantaneous batch size per device = 128
  Total train batch size (w. parallel, distributed & accumulation) = 256
  Gradient Accumulation steps = 1
  Total optimization steps = 204
/home/cdaniels/mambaforge/envs/fastai/lib/python3.9/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
 [204/204 02:51, Epoch 4/4]
Epoch	Training Loss	Validation Loss	Accuracy
1	No log	0.154567	0.586997
2	No log	0.127953	0.586997
3	No log	0.125629	0.586997
4	No log	0.124032	0.586997

I appreciate all thoughts and suggestions.

My guess is that your model is always predicting a single label. Try generating some predictions and take a look at them to see.

In general, the best way to debug a model is to carefully look at the inputs and outputs.
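
For example (a minimal sketch, assuming eval_ds has been tokenized like the training set):

preds = trainer.predict(eval_ds).predictions
print(preds[:10])                   # raw model outputs
print(preds.argmax(axis=-1)[:20])   # predicted labels; watch for a single repeated value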

I’m a little behind with the course and just watching Lesson 4. I stumbled upon the part where Jeremy cautions us about using cross-validation, saying that “cross-validation is explicitly not about building a good validation set”. :thinking:

In the blog post from Rachel, I also found this paragraph which may explain better the point.

However, the problem with cross-validation is that it is rarely applicable to real world problems, for all the reasons described in the above sections. Cross-validation only works in the same cases where you can randomly shuffle your data to choose a validation set.

I don’t agree with this point though. Those examples listed in the “New people, new boats, new…” section of the blog are data leaks, as much as many other famous examples in other tabular Kaggle competitions. Some of these data leaks are hard to identify and only become apparent when comparing performance between CV score and Private Leaderboard (or the production environment in the industry).

As a matter of fact, in competitions and real-world applications, we often use stratification and grouping (among other techniques) to guarantee our validation strategy does not suffer from these types of leaks.
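
For concreteness, here is a minimal sketch of the kind of stratification I mean, using scikit-learn (the DataFrame df and its "target" column are hypothetical names):

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, valid_idx in skf.split(df, df["target"]):
    train_df, valid_df = df.iloc[train_idx], df.iloc[valid_idx]
    # each fold preserves the class balance of "target"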

I would rephrase this sentence as

Simple k-fold cross-validation only works in the same cases where you can randomly shuffle your data to choose a validation set. In all other cases, special attention must be paid to remove possible data leaks.

The role of cross-validation is to validate the performance of our model on data and conditions similar to the one it will see in production.

Perhaps I completely misunderstood Jeremy and Rachel’s message. After all, fastai goes a step further and requires practitioners to always have a validation set.

Related to the post above, but IMHO deserving its own post, I have a question: why don’t we automatically report the training score in fastai? The validation score is definitely what we are most interested in, I can see that. However, comparing the training and validation scores would allow us to assess the variance (overfitting) problem.

Being able to measure whether we have a bias or a variance problem would help practitioners identify the best next step for improving model performance.

No, that’s not really correct - when you expect your inference-time data to be from a different distribution than your training-time data (which is very common), you need to reflect this in your validation set. It’s not a data leak, it’s an actual design issue.

2 Likes

Overfitting occurs when the validation error starts getting worse, and isn’t related to what’s happening in the training set.

Solved:

Ok finally figured it out. Here are my notes for the next person :

  • Specify num_labels=2, i.e., model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=2)
  • The label column needs to be named exactly labels. (Jeremy mentioned this in his lecture.)
  • Only use integers as labels. I had strings as labels, and for the longest time 0.0 and 1.0, which I had needed as floats to fix another problem.
  • Transformers are very picky about data types!

I tried all of these variations independently (floats, strings, or integers as labels; num_labels=1 or num_labels=2), but it took me two days to hit the working combination of integer labels and num_labels=2.
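
For the next person, here is a minimal sketch of the setup that ended up working for me; model_nm and the DataFrame column name are placeholders for whatever your notebook uses:

from transformers import AutoModelForSequenceClassification

# Binary classification head: two labels, not a single regression-style output.
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=2)

# The label column must be named "labels" and hold integers (0/1), not floats or strings.
df["labels"] = df["labels"].astype(int)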

This really makes me better appreciate the thought and care that has gone into fixing all of these sharp edges and corners in the fastai library!

My Challenge with Binary NLP Classification using Transformers

Thanks! That suggestion really helped me figure out what was happening. Per the attached, I am getting back preds of shape 512x1. Since this is a binary classifier, the shape of preds should be 512x2. Accordingly, the reason the Accuracy wasn’t changing in my model was that there was only one class, not two as expected. (The np.argmax function had only one column to choose from.)

I implemented my model based upon the notebook you discussed in class: getting-started-with-nlp.ipynb. I now realize that it was a regression model, and what I’m doing is classification, which is inherently different. I believe I need to have a second class of labels. Does this mean I should be doing one-hot encoding, with one column for each of the two classes?

Thanks,

In [223]:
preds = trainer.predict(eval_ds).predictions.astype(float)
preds

Out[223]:
array([[-0.01423645],
       [ 0.02276611],
       [ 0.02436829],
       ...,
       [ 1.02539062],
       [ 1.02832031],
       [ 1.02441406]])
2 Likes

Congrats on figuring it out! Great tenacity!

It’s true that using fastai can make us have overly high expectations about the behaviour of other libs, which will then let us down… :wink:

3 Likes

No that shouldn’t be necessary - just having 0/1 in your dependent column and num_labels=2 should be sufficient.

1 Like

Yup, that’s what ended up working.

In the boat example, IMHO, we could have alleviated the problem by masking the boats, or even replacing them with another boat at random, so that the model pays less attention to them. This is actually a very good example of why we should always visualize what the model is focusing on, with something like Grad-CAM or attention maps.

The new people example could also be resolved by using Group K-Fold cross-validation, in order to replicate the situation of having to make inference on people we haven’t seen in training.
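
For example, scikit-learn’s GroupKFold keeps each person entirely in either the training or the validation fold (the "person_id" column is a hypothetical name):

from sklearn.model_selection import GroupKFold

gkf = GroupKFold(n_splits=5)
for train_idx, valid_idx in gkf.split(df, groups=df["person_id"]):
    train_df, valid_df = df.iloc[train_idx], df.iloc[valid_idx]
    # no person appears in both train_df and valid_df for this fold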

If we are talking about inference time data coming from a different distribution (e.g., time series where inference time data includes the COVID period or other shocks, having to make predictions on pictures taken with a smartphone but having training data coming from high-resolution web pictures, etc.), I see your point.

Sometimes it may not be clear how your test data will differ.

My argument is that, when we don’t know how inference-time data will differ, the best thing we can do is trust our cross-validation strategy: a model that was able to generalize well during cross-validation is our best bet for new, unseen data.

If we know how inference time data will differ, we could either make our validation data look more like it (e.g., data augmentation, adversarial validation, etc.) or, when applicable, use Bayesian inference to include our prior knowledge about the future into the model.

Anyway, my point here is that I strongly believe cross-validation is a cornerstone of ML, and people should do their best to develop a robust validation strategy that closely resembles what the model will see at inference time. I think we both agree on that :slight_smile:

Another situation that is also very hard to validate is when our predictions (and consequent actions) are going to influence future records. There are plenty of examples of that in real life and industry.

You can’t do that with cross-validation, only with regular validation.

1 Like

Why is that?

I guess my confusion is about why normal validation is better than cross-validation.

Is that only when we don’t know how inference time data will differ?

I totally get the value of having a test set (e.g., train, dev, test) to avoid overfitting on the dev/validation set.

But what is the problem with cross-validation that a train-test split would solve?