Lesson 4 official topic

In the Chapter 4 notebook and video, Jeremy mentions that we can frame the similarity problem as a classification problem.

It turns out that this can be represented as a classification problem. How? By representing the question like this:

For the following text…: “TEXT1: abatement; TEXT2: eliminating process” …choose a category of meaning similarity: “Different; Similar; Identical”.

In the above statement there’d be 3 discrete categories/labels: Different | Similar | Identical

But in the code we treat the score column (which can take any float value from 0 to 1) as labels:

# score == labels
tok_ds = tok_ds.rename_columns({'score':'labels'})

Question

My expectation is that for classification tasks the labels should be a discrete set of values; however, in the notebook the labels are continuous values between 0 and 1.

  • Is the expectation correct?
  • If yes, how does treating score as labels work for this case?

Your expectation is correct that this is a classification problem with discrete categories. However, the 3 discrete categories/labels of Different | Similar | Identical are just an example given to illustrate how one would phrase this problem as classification. In the actual dataset, the score column takes a small set of discrete values (0, 0.25, 0.5, 0.75 and 1.0) that grade how similar the two phrases are.

Near the end of the lesson video, one of the students asks a question related to yours, and Jeremy explains that in HuggingFace, when using AutoModelForSequenceClassification with num_labels=1 (meaning the output is a single column), it automatically turns the task into a regression problem.

In this case, since the metric used in the competition is the Pearson correlation coefficient, it works out okay: the metric only measures how correlated the predictions are with the actual values, so a continuous 0 to 1 output works just fine and a strictly categorical output is not needed.
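
To make that concrete, here is a minimal sketch. The checkpoint is, if I remember right, the small DeBERTa model from the notebook (any sequence-classification checkpoint would do), and the text and score are made-up placeholders; the point is just that with num_labels=1 and a float label, HuggingFace infers a regression problem and (as far as I know) computes an MSE loss rather than a cross-entropy loss.

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = 'microsoft/deberta-v3-small'
tokz = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)

# One made-up example in the competition's phrasing, with a float similarity score
batch = tokz(['TEXT1: abatement; TEXT2: eliminating process'], return_tensors='pt')
labels = torch.tensor([[0.5]])

# num_labels=1 + float labels -> the problem type is inferred as regression
out = model(**batch, labels=labels)
print(out.logits.shape, out.loss)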

2 Likes

When I read this part in Chapter 10 of the book:
Going back to our previous example with 6 batches of length 15, if we chose a sequence length of 5, that would mean we first feed the following array:
I have a question:
Why is the sentence order spread across mini-batches when training the language model with an RNN in this case?
For example, if there are three mini-batches here:

  • The first sentence is in the first data row of the first mini-batch. (xxbos xxmaj in this chapter)
  • The second sentence is in the first data row of the second mini-batch (why isn’t it in the second data point of the first mini-batch?). (, we will go back)
  • The third sentence is in the first data row of the third mini-batch. (over the example of classifying)
  • The fourth sentence is in the second data row of the first mini-batch. (movie reviews we studied in)
  • etc…

The crucial point I may not be understanding is this: when mini-batches are used to parallelize RNN training, which segments of which mini-batches count as adjacent, i.e. as continuing one another, during training?

2 Likes

Good question! It’s something I was thinking about too.

I guess the parallelization is carried out over the sequences, or rows, of the mini-batches. That is, the rows of a mini-batch are processed in parallel, and row i of one mini-batch is continued by row i of the next.
I don’t know if my answer is entirely accurate, but it does make sense to me. Do let me know your thoughts :slight_smile:
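
A minimal NumPy sketch of the layout the book describes may help (toy numbers: a stream of 90 tokens, batch size 6, sequence length 5, matching the “6 batches of length 15” example):

import numpy as np

stream = np.arange(90)                  # stand-in for the tokenized text stream
bs, seq_len = 6, 5

rows = stream.reshape(bs, -1)           # shape (6, 15): each row is one contiguous slice of the stream
mini_batches = [rows[:, i:i + seq_len]  # cut along the sequence dimension
                for i in range(0, rows.shape[1], seq_len)]

print(mini_batches[0][0])   # [0 1 2 3 4]
print(mini_batches[1][0])   # [5 6 7 8 9] -> row 0 of the second mini-batch continues row 0 of the first

That is why consecutive pieces of text land in the same row of consecutive mini-batches rather than in consecutive rows of the same mini-batch: the RNN can then carry each row's hidden state across mini-batches.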

Hi everyone,

I am going through chapter 4 of the book and am stumbling over a very fundamental (probably stupid) question. Maybe somebody can share their intuition on the subject.

If I look at the linear model that is used for the 3’s and 7’s example

def linear1(xb): return xb@weights + bias
preds = linear1(train_x)

it is stated that to decide if an output represents a 3 or a 7, we can just check whether the function’s output is greater than 0.0.

Why is this the case?

Thanks!

P.S.:
There is one place in chapter 4 that checks for preds > 0.5:

def batch_accuracy(xb, yb):
    preds = xb.sigmoid()
    correct = (preds>0.5) == yb
    return correct.float().mean()

This looks like a typo as the preceding paragraph was fixed in the latest version of the book to read

“We also want to check how we’re doing, by looking at the accuracy of the validation set. To decide if an output represents a 3 or a 7, we can just check whether it’s greater than 0. So our accuracy for each item can be calculated (using broadcasting, so no loops!) with:”

Whereas in my hardcopy, it reads “whether it’s greater than 0.5”.

This forum post has a good explanation for this. In short, the predictions coming out of linear1(train_x) are centered around 0 with both negative and positive values, so 0 is a good threshold for the binary classification. Later on, once predictions are passed through sigmoid, the threshold is then 0.5.
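
In other words, since sigmoid(0) = 0.5 and sigmoid is monotonically increasing, thresholding the raw outputs at 0 and thresholding the sigmoid outputs at 0.5 select exactly the same items. A tiny sketch with made-up raw outputs:

import torch

raw = torch.tensor([-2.0, -0.1, 0.0, 0.3, 4.0])   # hypothetical outputs of linear1

print(raw > 0)               # tensor([False, False, False,  True,  True])
print(raw.sigmoid() > 0.5)   # tensor([False, False, False,  True,  True])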

1 Like

Thank you for your answer. I have to say, though, that I am still not 100% convinced. I totally understand that the midpoint of the weights is 0 after using torch.randn, and that multiplying these values with the image data (where each pixel is a value between 0 and 1) keeps that distribution roughly centred around 0. I also understand how the midpoint becomes 0.5 after the sigmoid function is applied. But that still doesn’t explain why, at the end, once the model has been trained and the weights look completely different, the predictions are still distributed around 0 like that.

Let me ask differently then: how come the model produces values > 0 in the case of a 3 and < 0 in the case of a 7? If we knew that, the accuracy calculation would follow naturally from the way it is defined.

Thanks for helping understand this better!

1 Like

Great question, it’s helping me think more thoroughly about my understanding of what’s going on.

I’ll try and answer your question—I think that’s where the loss function comes into play:

def mnist_loss(predictions, targets):
    predictions = predictions.sigmoid()
    return torch.where(targets==1, 1-predictions, predictions).mean()

By minimizing the loss, which is 1-predictions when the target is 1 (the digit 3) and predictions when the target is 0 (the digit 7), the model learns to make predictions that are large and positive when the digit is 3 (large positive values get close to 1.0 after passing through sigmoid) and large and negative when the digit is 7 (large negative values get close to 0.0 after passing through sigmoid).
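
A quick numeric illustration with made-up raw predictions: for a target of 1, the loss term is 1 - sigmoid(prediction), so a large positive raw prediction is rewarded and a large negative one is penalized.

import torch

for raw in (4.0, -4.0):
    loss = 1 - torch.tensor(raw).sigmoid()
    print(f'raw prediction {raw:+.1f} -> loss {loss.item():.3f}')
# raw prediction +4.0 -> loss 0.018
# raw prediction -4.0 -> loss 0.982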

2 Likes

Thank you very much for this!

I like that explanation a lot. If I look at the distribution of predictions after training with the above loss function, I see

If I now tweak the loss function to something else (complete nonsense, of course!)

def mnist_loss(predictions, targets):
    predictions = predictions.sigmoid()
    # mind the change of the second argument from 1-predictions to 1+predictions
    return torch.where(targets==1, 1+predictions, predictions).mean() 

I see a different distribution:

As expected, changing the loss function will lead to the predictions being “optimized” differently.

2 Likes

Hey gang!
I published my summary and quiz responses for lesson 4 on my blog.
This post includes another persistent animal for your enjoyment :grin:

This is in the lesson 4 notebook. This is probably a Python newbie question - definitely a pandas newbie question - but I’m trying to fundamentally understand how pandas modifies all records in this assignment and adds a new column:

import pandas as pd
df = pd.read_csv('train.csv')
print(df.head())
df['input'] = 'TEXT1: ' + df.context + ' TEXT2: ' + df.target + ' TEXT3: ' + df.anchor
print(df.head())

With what looks like a single concatenated string assignment, pandas has created a new column for all rows and filled the column according to the logic in the string concatenation. I’m coming to Python from other languages, so is this a Pythonic thing? Or has pandas overridden object property assignment and they’re using the expression as a shortcut to modify all rows? In another language you’d just end up with df['input'] as a property holding a single string value - or it might throw an error because e.g. df.context isn’t a variable that can be concatenated.

I’ve had a look at the pandas docs with no luck, and poked around regarding Python overriding assignment operators; it is possible to do this for object properties, but I didn’t immediately see that in pandas or an explanation of how it works. I just need a pointer on what to look up to understand this behavior.

Thanks in advance,

Mark.

Python is weird!

You can override operators using what are called “dunder” methods. In this case the pandas DataFrame defines __add__, which overrides the + operator.

Here’s the API: pandas.DataFrame.__add__ — pandas 2.1.3 documentation
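
To see both pieces in action (the element-wise + and the broadcasting of the plain strings), here is a small self-contained sketch with made-up rows standing in for train.csv:

import pandas as pd

# Made-up rows standing in for train.csv
df = pd.DataFrame({'context': ['A47', 'B62'],
                   'anchor':  ['abatement', 'forest region'],
                   'target':  ['eliminating process', 'woodland']})

# '+' on a Series is element-wise (pandas overrides __add__/__radd__), and the plain
# strings are broadcast to every row, so this builds one concatenated string per row.
df['input'] = 'TEXT1: ' + df.context + ' TEXT2: ' + df.target + ' TEXT3: ' + df.anchor

print(df['input'].iloc[0])
# TEXT1: A47 TEXT2: eliminating process TEXT3: abatement

The assignment itself goes through DataFrame.__setitem__, which is what creates the new column for every row.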

Thanks very much Zander, that’s incredibly helpful.

1 Like

I don’t have a sufficient concise answer to elaborate on what zander posted, but here are some concepts that may help:

  • You can access columns in a DataFrame either by indexing (df['input']) or as its attribute (df.context).
  • pandas is built on top of NumPy, and the pandas Series is built on NumPy’s ndarray.
  • From the NumPy docs:

At the core of the NumPy package, is the ndarray object. This encapsulates n-dimensional arrays of homogeneous data types, with many operations being performed in compiled code for performance.

  • NumPy has the ufunc (universal function) which is:

a function that operates on ndarrays in an element-by-element fashion, supporting array broadcasting, type casting, and several other standard features. That is, a ufunc is a “vectorized” wrapper for a function that takes a fixed number of specific inputs and produces a fixed number of specific outputs.

  • Here are their docs on ufunc basics, API reference, NumPy C Code Explanation and a guide on how to write your own ufunc (the introductory text is somewhat helpful, the rest of it I don’t understand).
    • As an example, here is the source code for the add method for character arrays, which returns the numpy.add function which is a ufunc. I can’t spell out exactly how this translates to the line of code you are referencing, but conceptually it’s related.
  • Here is NumPy’s description of broadcasting, which comes up a lot when working with pandas (and also PyTorch). The single strings 'TEXT1: ', 'TEXT2: ' and 'TEXT3: ' are “broadcast” to all elements in the columns df.context, df.target and df.anchor when they are concatenated with the + operator (see the small example right after this list).
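
As a tiny illustration of the ufunc/broadcasting idea, with toy numbers only:

import numpy as np

arr = np.array([1, 2, 3])

# np.add is a ufunc: it operates element-by-element and broadcasts the scalar 10 across
# every entry, much like the single strings are broadcast across the columns above.
print(np.add(arr, 10))   # [11 12 13]
print(arr + 10)          # same result: '+' dispatches to the ufunc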

I also prompted ChatGPT with some questions around this topic and reading the responses may spark further inquiry.

I have a question regarding the MNIST deep learning model that was built in Chapter 4. As I understand it, our intent was to classify the input into one of two categories, threes and sevens. Why, then, did we decide to use a linear model for that purpose? Personally, the first idea that came to my mind was a classification model such as logistic regression.

I attempted to run the chapter 10 notebook on Paperspace using one of the free machine configurations including GPU. It failed on the model tuning step, complaining about running out of GPU memory:

My question isn’t primarily about the advice the error text provided (although I would not turn down advice on the advice), but about the claim that PyTorch reserved 5.37 GiB of the 7.79 GiB GPU memory capacity. Is that usual, and if so, doesn’t that define a floor on GPU requirements that makes free resources not so useful?
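
(For reference, the numbers in that message can be inspected directly from PyTorch; a quick sketch, assuming a CUDA device is available:)

import torch

if torch.cuda.is_available():
    total     = torch.cuda.get_device_properties(0).total_memory
    reserved  = torch.cuda.memory_reserved(0)   # held by PyTorch's caching allocator
    allocated = torch.cuda.memory_allocated(0)  # actually occupied by live tensors
    print(f'total {total/2**30:.2f} GiB | reserved {reserved/2**30:.2f} GiB | '
          f'allocated {allocated/2**30:.2f} GiB')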

NLP to unmask Satoshi Nakamoto?

I had this half-baked idea after seeing a news piece on bitcoin. Could NLP help to identify Satoshi Nakamoto, the author of the original bitcoin whitepaper? Satoshi Nakamoto is a pseudonym; the real identity of the author remains a mystery.

Could NLP help identify the author’s real name? A couple of half-baked approaches:

Use the abstract as the known, the author as the unknown, and the Journal of Cryptography up to the publication date of the paper as the dataset. The test dataset is just the single bitcoin paper. This relies on the assumptions that the author published in the Journal of Cryptography before the bitcoin paper, and that abstracts are distinctive enough to separate authors. Multiple authors on a paper would be a sticky point; treat them as one author/token in the vocabulary? Use ULMFiT due to the size/number of tokens required for abstracts. Getting all of the abstracts and authors quickly and easily is another sticky spot - does the Journal of Cryptography offer an API?

Or, use the authors in the References as the known and the paper’s author as the unknown, again using the Journal of Cryptography. This assumes that researchers tend to reference certain other researchers more frequently. The same challenge applies here of getting all the authors and references.

1 Like

No such file or directory: '/root/.fastai/data/imdb/models/finetuned.pth'

I’m not sure if this line: learn.save_encoder('finetuned')
is working properly, because later in the notebook this line: learn = learn.load_encoder('finetuned')
throws the error shown above.

Does anyone have an idea of what is going on, where my mistake may be?
Thanks,
Chris

Hi all, this is a silly question but I’ve not been able to resolve it on my own. How do I save/load the encoder while in Kaggle?

I am trying out the ULMFiT method on a dataset using Kaggle, and I’m having issues saving and loading the fine-tuned encoder in Kaggle. I’m quite certain it’s a directory issue, because the error shows it’s trying to pull from fastai’s data location. I tried to set my model directory to /kaggle/working using the code below.

code snippet:
learn_lm.model_dir = '/kaggle/working/'
learn_lm.save_encoder('finetuned')

When I run the line below:
learn_lm = learn_lm.load_encoder('finetuned')

I get the following error:
FileNotFoundError: [Errno 2] No such file or directory: '/root/.fastai/data/imdb_sample/models/finetuned.pth'

link to my notebook:
IMDB ULMFit

Thank you in advance!

Thanks. I just made sure I copied Jeremy’s notebook (which is linked to the patent competition already).