Lesson 4 official topic

For anyone interested: I found the answer to one of the questions I had, about NaN values.

As we learned in the How does a neural net really work notebook, we’re going to want to multiply each column by some coefficients. But we can see in the Cabin column that there are NaN values, which is how Pandas refers to missing values. We can’t multiply something by a missing value!

Let’s check which columns contain NaN values. Pandas’ isna() function returns True (which is treated as 1 when used as a number) for NaN values, so we can just add them up for each column:
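A minimal sketch of that check (assuming the Titanic train.csv is loaded into a dataframe, as in the notebook):

import pandas as pd

df = pd.read_csv('train.csv')  # the Titanic training data
df.isna().sum()                # True counts as 1, so this totals the NaNs per column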

I found this info in the Titanic notebook; I assume it means that there is a problem with the dataset as built.

Folks, if you need more VRAM you have two ways to get it:

  1. Use a GPU with more VRAM (obviously). Or…
  2. Look into gradient accumulation (see the sketch below).
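Here’s a minimal sketch of what gradient accumulation does, in plain PyTorch with a toy model (all names here are illustrative, not from a specific library API):

import torch
from torch import nn

# toy setup: the point is the accumulation pattern, not the model
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
batches = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(8)]

accum_steps = 4  # 4 batches of 8 behave like one batch of 32
optimizer.zero_grad()
for i, (xb, yb) in enumerate(batches):
    loss = loss_fn(model(xb), yb) / accum_steps  # scale so the summed grads average out
    loss.backward()                              # gradients accumulate in .grad
    if (i + 1) % accum_steps == 0:
        optimizer.step()        # one weight update per accum_steps batches
        optimizer.zero_grad()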

Hope this helps.

3 Likes

I guess I was just thinking about how to leverage some of the ‘ask questions of your documents’ NLP use cases, and wondering if aggressively fine-tuning on your dataset could make a kind of useful search engine for some specific set of documents?

1 Like

I’ll answer my own question. The mistake I made was with the definition of dls_test:

dls_test = TextDataLoaders.from_df(df_2022,text_col='text', label_col="party")

This creates two DataLoaders, a train and a valid one. It can be corrected as follows:

dl_test = dls_test[0]

Since I spent a bunch of time figuring out how to use a test data set with ‘learner.get_preds’, I’ll share my conclusions here:

How to Create a Test Data Set:

  • It’s probably easiest to create a DataLoader, NOT a Dataset
  • It is NOT necessary to indicate which are the x,y elements, i.e., no need to indicate the label; this will be handled by the learner
  • Set valid_pct = 0 to avoid splitting the data into train and valid DataLoaders
  • The DataLoaders.from_ factory methods create two DataLoaders, i.e., train and valid
  • dls[0] is the train DataLoader and dls[1] is the valid DataLoader
  • When using the from_ factory methods to create a test DataLoader, select a single DataLoader, i.e., dl_test = dls[0]
  • dl_test is a single DataLoader containing multiple randomly ordered batches of paired x,y

Using the ‘get_preds’ method

  • The learner can be created directly through training, or else loaded by load_learner("some-export.pkl")
  • learner.get_preds(dl=dl_test) returns two tensor arrays → predicts, actuals
  • predicts is a TensorText array, itself composed of TensorText arrays, where each array contains the predicted probabilities of each category
  • actuals is a TensorText array, where each element is the actual category
  • In other words, predicts holds the model’s predictions, whereas actuals is the original label for each item
  • predicts and actuals are all that is needed to determine the accuracy of the predictions

Sample Code

df_2022 = pd.read_csv("blue_red_training_valid.csv")
learner = load_learner("blue-or-red-2022.pkl")
dl_test = TextDataLoaders.from_df(df_2022, text_col='text', label_col='party', valid_pct=0)[0] # important: choose the first DataLoader
predicts, actuals = learner.get_preds(dl=dl_test)
sum(predicts.argmax(axis=1).numpy() == actuals.numpy()) / len(predicts)

Out[76]:
0.9300945262695098
5 Likes

That’s an interesting thought. I would assume that if we have a use case where we have no need of “new” documents, then DL doesn’t apply, as there is no need to “learn” anything. If we want a search engine, we could simply index the documents like a crawler does.

I think we could overfit on two different documents to compare how similar they are, but even here I think a simpler approach would do.

It could make an interesting experiment to check how much we need to overfit before the model “memorises” the document. Maybe this could be used for a plagiarism detector? I could be way off here, just thinking out loud.

1 Like

The main reason I’d want the DL part would be to enable a smarter kind of searching and querying of the documents. I wouldn’t just want keyword search, but rather some way to interrogate and ask questions of the fixed set of documents. I.e., there would be things that perhaps weren’t named explicitly, or concepts expressed that you could understand but that weren’t stated directly, etc. I was wondering if this kind of approach ever gets used.

1 Like

I made a separate thread for this, but perhaps it’s best discussed here.

Jeremy mentioned that a CSV file can work well for smaller datasets, but for larger datasets you’ll be working with text files in a folder. This reminds me of the IMDB dataset from an earlier Fast.ai course, so I’ve been trying to figure out how I can follow along with this notebook but using text files in a folder instead, like how the IMDB dataset works:

I can’t seem to figure it out. I haven’t found any examples of this working with the Huggingface library that’s used in the above notebook. I found a Fast.ai/Huggingface integration called blurr which has an example (it doesn’t work out of the box, but I made some adjustments to get it running), but even when I only load a few text files my 16GB GPU runs out of memory:

Does anyone know of a way to use text files in a folder as a data source for the NLP training Jeremy showed us in Lesson 4?

1 Like

I used the fastai from_folder methods for that. You can use either the TextBlock version or the TextDataLoaders version. The docs have examples of how they work.
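For reference, a minimal sketch of the TextDataLoaders version, using the IMDB folder layout from the fastai docs (adjust the folder names for your own data):

from fastai.text.all import *

# expects path/train/<label>/*.txt and path/test/<label>/*.txt
path = untar_data(URLs.IMDB)
dls = TextDataLoaders.from_folder(path, valid='test')
dls.show_batch(max_n=2)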

For a way that works with the Hugging Face library, check out my blog, where I include code for how to process and access a folder of text files (split by category names and NOT by a “train”/“test” split). Note that I defined a function to access the labels, similar to how we did it in lesson 1.

1 Like

Yes, I love the from_folder methods in fastai for this. They allow you to use datasets bigger than the RAM on your machine, as they only store pointers to the text files. This lets you train on 40GB of text files on a laptop with 16GB of RAM (system RAM, not GPU RAM), for example, because not everything is loaded into memory at once.

Your blog post (great work, btw!) seems to read in all of the text files and create a dataframe out of them. You’d need at least as much RAM as the dataset takes up, and in my testing, usually a bit more. Have you seen anything like the from_folder methods, but with the Hugging Face library?

Maybe this part of the HF Datasets docs might help? Stream

It seems the key property you’re after is some kind of streaming / lazily loaded generator that gives you parts of the dataset one after another. Perhaps this is the way to do it with HF?
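Something like this might be a starting point (a sketch assuming a folder of .txt files; the “text” loading script and streaming=True are standard HF Datasets features):

from datasets import load_dataset

# streaming=True returns an IterableDataset, so files are read lazily
ds = load_dataset("text", data_files={"train": "texts/**/*.txt"}, streaming=True)
for example in ds["train"].take(3):   # peek at a few records without loading everything
    print(example["text"][:80])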

1 Like

That sounds like blurr.

The GPU problems you mentioned aren’t related to where the items are stored, but to your choice of model and batch size. There’s another topic in this category that already discusses solutions to that issue. (Sorry, I don’t have time to find the link for you right now.)

1 Like

We should be able to switch it on using this callback, right? I didn’t try it, but it seems like it should be easy to integrate into a pipeline.

1 Like

Yep, the GradientAccumulation callback is pretty easy to use :slightly_smiling_face:
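For example, a minimal sketch (assuming an existing dls with a small per-step batch size):

from fastai.text.all import *

# GradientAccumulation(n_acc) delays the optimizer step until n_acc samples
# have been seen, so small batches act like one larger effective batch
learn = text_classifier_learner(dls, AWD_LSTM, metrics=accuracy,
                                cbs=GradientAccumulation(n_acc=64))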

2 Likes

We use data augmentation while working with images to increase accuracy and reduce overfitting. For text, are there any techniques or algorithms to augment the input text?

1 Like

I’m trying to use the ULMFiT model based on FastBook for multi-class classification and build the learner to see precision and recall as follows:

learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=[accuracy, FBeta(beta=1, average='micro'), Precision(average='micro')]).to_fp16()

But when I fine-tune using fit_one_cycle, I see that all the metrics (accuracy, precision, and F1 score) come out the same:

Am I missing something other than setting average='micro' while defining the metrics for a multi-class problem?

If you are a Google Colab diehard fan, then here is the NLP lesson for Google Colab (only a few library imports added beforehand, no other changes).

Enjoy hacking.

1 Like

Maybe I misunderstood something here. The notebook talks about text classification and the model is called AutoModelForSequenceClassification, but the predictions (clipped) seem to be numbers from 0 to 1. Isn’t this regression?

EDIT: never mind, I was missing the explanation at the end of the video. The key here, as far as I understand, is num_labels=1, which actually turns it into a regression problem.
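For anyone else who hits this, the relevant bit looks like the snippet below (model name as used in the lesson notebook; with num_labels=1 Transformers treats the task as regression and uses MSE loss):

from transformers import AutoModelForSequenceClassification

# a single output "label" makes the sequence-classification head a regressor
model = AutoModelForSequenceClassification.from_pretrained(
    'microsoft/deberta-v3-small', num_labels=1)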

1 Like

Linear Model and Neural Network from Scratch Challenge

:cowboy_hat_face: The big challenge is converting Jeremy’s spreadsheet to Python code independently.

You may have attended or watched session #4, so you know how Jeremy recreates it step by step with Python, Pandas, and NumPy. That is OK, but for the challenge, you can’t use the “Linear Model and Neural Network from Scratch” Jupyter notebook on Kaggle.

If Jeremy can do it in Excel, surely you (and I) can do it with Python. The layout below makes this a step-by-step challenge. Re-watching session #3 is allowed (but not session #4). It is OK to use Stack Overflow and any resource on the Net.

:+1: WHY?

  • If you can do this, then you are Top Gun, i.e., you understand the core concepts of neural networks, aka deep learning.
  • The goal is NOT to write the most compact and elegant code, but it is for YOU to understand how to code it.
  • …and because it is a fun brain teaser.

I am interested in seeing your posted solutions (I think we all are). I am almost done; my notebook is full of Pandas code and graphs.

Happy hacking.

4 Likes