For anyone interested: I found the answer to one of the questions I had, about NaN values.
As we learned in the How does a neural net really work notebook, we're going to want to multiply each column by some coefficients. But we can see in the Cabin column that there are NaN values, which is how Pandas refers to missing values. We can’t multiply something by a missing value!
Let’s check which columns contain NaN values. Pandas’ isna() function returns True (which is treated as 1 when used as a number) for NaN values, so we can just add them up for each column:
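The counting trick described above boils down to something like this (a minimal sketch with a made-up DataFrame standing in for the Titanic data):

```python
import pandas as pd
import numpy as np

# Made-up stand-in for the Titanic data: Cabin has missing values.
df = pd.DataFrame({
    "Age":   [22.0, np.nan, 26.0],
    "Cabin": [np.nan, "C85", np.nan],
    "Fare":  [7.25, 71.28, 7.92],
})

# isna() returns a boolean frame; True counts as 1 when summed,
# so sum() gives the number of missing values per column.
missing = df.isna().sum()
print(missing)
```

Any column with a non-zero count needs its NaNs handled (e.g. filled or dropped) before multiplying by coefficients.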
I found this info in the Titanic notebook; I assume it means there would be a problem with the dataset as built.
I guess I was just thinking about how to leverage some of the ‘ask questions of your documents’ NLP use cases, and wondering if aggressively fine-tuning on your dataset could make a kind of useful search engine for some specific set of documents?
That’s an interesting thought. I would assume that if we have a use case where we have no need for “new” documents, then DL doesn’t apply, as there is no need to “learn” anything. If we want a search engine, we could simply index the documents like a crawler does.
I think we could overfit two different documents to compare how similar they are, but even here I think a simpler approach would do.
It could make an interesting experiment to check how much we need to overfit before the model “memorises” the document. Maybe this could be used for a plagiarism detector? I could be way off here, just thinking out loud.
The main reason I’d want the DL part would be to enable a smarter kind of searching and querying of the documents. I wouldn’t just want keyword search, but rather some way to interrogate and ask questions of the fixed set of documents. I.e., there could be concepts expressed that you could understand but that weren’t explicitly named, etc. I was wondering if this kind of approach ever gets used.
I made a separate thread for this, but perhaps it’s best discussed here.
Jeremy mentioned that a CSV file can work well for smaller datasets, but for larger datasets you’ll be working with text files in a folder. This reminds me of the IMDB dataset from an earlier Fast.ai course. As such, I’ve been trying to figure out how I can follow along with this notebook, but using text files in a folder instead, like how the IMDB dataset works:
I can’t seem to figure it out. I haven’t found any examples of this working with the Huggingface library that’s used in the above notebook. I found a Fast.ai/Huggingface integration called blurr which has an example (it doesn’t work out of the box, but I made some adjustments to get it running), but even when I only load a few text files my 16GB GPU runs out of memory:
Does anyone know of a way to use text files in a folder as a data source for the NLP training Jeremy showed us in Lesson 4?
I used the fastai from_folder methods for that. You can either use the TextBlock version or the TextDataLoaders version. The docs have examples of how they work.
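For reference, the folder layout those methods expect is the IMDB-style one: a `train/` and `test/` folder, each with one subfolder per label. A minimal sketch (made-up sample files; the fastai call at the end is shown in comments since the exact arguments can vary by version):

```python
from pathlib import Path
import tempfile

# Build a tiny IMDB-style tree: train/ and test/, one subfolder per label,
# one text file per document. (Made-up sample data, just for illustration.)
root = Path(tempfile.mkdtemp()) / "imdb_sample"
for split in ("train", "test"):
    for label in ("pos", "neg"):
        d = root / split / label
        d.mkdir(parents=True)
        (d / "0.txt").write_text(f"a {label} review in {split}")

# With fastai installed, a DataLoaders can then be built straight from the
# tree, e.g. (see the fastai docs for the exact signature in your version):
#
#   from fastai.text.all import TextDataLoaders
#   dls = TextDataLoaders.from_folder(root, valid="test")
```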
For a way that works with the HuggingFace library, check out my blog, where I include code for how to process and access a folder of text files (split by category names and NOT by a “train”/“test” split). Note that I defined a function to be able to access the labels, similar to how we did it in lesson 1.
Yes, I love the from_folder methods in Fast.ai for this. They allow you to use datasets bigger than the RAM on your machine, as they only store pointers to the text files. This lets you train on 40GB of text files with a 16GB RAM laptop (system RAM, not GPU RAM), for example, because it doesn’t load everything into memory at once.
Your blog post (great work btw!) seems to read in all of the text files, and creates a dataframe out of them. You’d need to have at least as much RAM as the dataset takes up, and in my testing, usually a bit more. Have you seen anything like the from_folder methods, but with the HuggingFace library?
Maybe this part of the HF Datasets docs might help: Stream
It seems the key property you’re after is having some kind of streaming / lazily loaded generator which gives you parts of the dataset, one after another. Perhaps this is the way to do it with HF?
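To make the idea concrete with nothing but the standard library: a generator that lists file paths up front but only reads one file at a time gives you exactly this lazy behaviour (a sketch of the concept, not HF's actual implementation):

```python
from pathlib import Path
import tempfile

def stream_texts(folder):
    """Yield (label, text) pairs one file at a time.

    Only the paths are held in memory; each file body is read lazily
    as the consumer iterates, so the corpus can be far larger than RAM.
    The label is taken from the file's parent folder name."""
    for path in sorted(Path(folder).rglob("*.txt")):
        yield path.parent.name, path.read_text()

# Tiny demo corpus, one folder per label.
root = Path(tempfile.mkdtemp())
for label, text in [("pos", "great film"), ("neg", "dull film")]:
    d = root / label
    d.mkdir()
    (d / "review.txt").write_text(text)

pairs = list(stream_texts(root))
```

In practice you'd feed such a generator to your tokenizer/training loop in batches rather than materialising the whole list.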
The GPU problems you mentioned aren’t related to where the items are stored, but to your choice of model and batch size. There’s another topic in this category that already discusses solutions to that issue. (Sorry, I don’t have time to find the link right now.)
We use data augmentation when working with images to increase accuracy and reduce overfitting. For text, are there any techniques or algorithms to augment the input text?
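To make the question concrete: one of the simplest text-augmentation techniques is random deletion (one of the four operations in the EDA paper), which could be sketched like this:

```python
import random

def random_deletion(text, p=0.1, seed=None):
    """Drop each word with probability p (random deletion augmentation).

    If every word would be dropped, keep one at random so the
    result is never empty."""
    rng = random.Random(seed)
    words = text.split()
    kept = [w for w in words if rng.random() > p]
    return " ".join(kept) if kept else rng.choice(words)
```

Other common options in the same spirit are synonym replacement, random swap, random insertion, and back-translation through another language.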
Maybe I misunderstood something here. The notebook talks about text classification, and the model is called AutoModelForSequenceClassification, but the predictions (clipped) seem to be numbers from 0 to 1. Isn’t this regression?
EDIT: never mind, I missed the explanation of this at the end of the video. The key, as far as I understand, is num_labels=1, which actually turns it into a regression problem.
Linear Model and Neural Network from Scratch Challenge
The big challenge is converting Jeremy’s spreadsheet to Python code independently.
You may have attended or watched session #4, so you know how Jeremy recreates it step by step with Python, Pandas, and NumPy. That’s OK, but for the challenge you can’t use the “Linear Model and Neural Network from Scratch” Jupyter notebook on Kaggle.
If Jeremy can do it in Excel, surely you (and I) can do it with Python. Below is the challenge laid out step by step. Re-watching session #3 is allowed (but not session #4). It is OK to use Stack Overflow and any resource on the Net.
WHY?
If you can do this, then you are Top Gun, i.e., you understand the core concepts of neural networks, a.k.a. deep learning.
The goal is NOT to write the most compact and elegant code, but it is for YOU to understand how to code it.
…and because it is a fun brain teaser.
I am interested in your posted solution (I think we all are). I am almost done, with Pandas and graphs everywhere.
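For anyone comparing solutions, the core of the spreadsheet exercise, fitting coefficients by gradient descent on squared error, could be sketched in a few lines of NumPy (toy random data here, not the actual Titanic columns):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-in for the spreadsheet rows: 100 samples, 3 feature columns,
# with targets generated from known coefficients plus a little noise.
X = rng.normal(size=(100, 3))
true_coeffs = np.array([2.0, -1.0, 0.5])
y = X @ true_coeffs + rng.normal(scale=0.1, size=100)

coeffs = np.zeros(3)   # start from all-zero coefficients, like the spreadsheet
lr = 0.1               # learning rate
for _ in range(200):
    preds = X @ coeffs
    # Gradient of mean squared error with respect to the coefficients.
    grad = 2 * X.T @ (preds - y) / len(y)
    coeffs -= lr * grad
```

After the loop, `coeffs` should land close to `true_coeffs`; the neural-net version adds a hidden layer and a non-linearity on top of the same update rule.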