Lesson 4 In-Class Discussion

@vikbehal
I might not be correct with this one, but I'm giving it a shot… (I can't try it myself because my end-semester exams run until next week.)

What we can do is sample some random rows from the training set, create a new data frame from them, and thus a new CSV file to act as the validation set…
Or,
pass the test set itself as the validation one:
validation=test

And after that, for the final round, just train your whole model on the training set plus the test set, joining/concatenating them, for the last tweak to the hyperparameters…
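
Something like this rough pandas sketch of the random-split idea (the file names and the 20% fraction are just placeholders, not from the lesson):

import pandas as pd
train = pd.read_csv('train.csv')                 # placeholder file name
valid = train.sample(frac=0.2, random_state=42)  # random 20% of rows as validation
train = train.drop(valid.index)                  # remaining 80% stays as training
valid.to_csv('valid.csv', index=False)
train.to_csv('train_split.csv', index=False)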

It's like when Jeremy said that, before submitting, he trained the model on all the data available…

Can someone confirm?

Many academic datasets only contain a train and a test, no separate validation. In these cases, researchers often treat the test set as if it were a validation set - although obviously that’s far from ideal!

Thank you for clarifying. I was creating a split for one of the ongoing problems. For this I was referring to:

  1. https://github.com/pytorch/text/blob/master/torchtext/datasets/imdb.py#L9:13
  2. https://github.com/fastai/fastai/blob/master/courses/dl1/lang_model-arxiv.ipynb

After exploration, it seems both files use labelled data even for the test set. Is my understanding correct?

cc: @Moody @KevinB

Yes, exactly. Kaggle comps are a bit unusual - in practice you'll always have labels for your test set; you should just avoid using them until right at the end of your project!


So, I was using a similar approach to the IMDb or arXiv examples to create splits. The constructor labels items in the test folder as well, because of the loop that goes through the labels and adds label and text properties to every example.

What I've tried so far is putting all the items from test into a single folder, say 'all'; doing so assigns a dummy label to the test data. My split then gets the items, but for some reason I get 4 probabilities instead of 3 from predict, and the probabilities don't add up to 1.

To keep test examples unlabelled, should I extend the constructor to check whether it's the test folder and, if so, add only the text property and not the label? Something like the sketch below.
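
A hypothetical sketch of what I mean (the class name, folder names, and label names are made up, and this isn't the actual fastai/torchtext source):

import glob, io, os
from torchtext import data

class MyTextDataset(data.Dataset):
    def __init__(self, path, text_field, label_field, is_test=False, **kwargs):
        examples = []
        if is_test:
            # test folder: text only, no label property
            fields = [('text', text_field)]
            for fname in glob.iglob(os.path.join(path, 'all', '*.txt')):
                with io.open(fname, encoding='utf-8') as f:
                    examples.append(data.Example.fromlist([f.read()], fields))
        else:
            # train/validation folders: one subfolder per label
            fields = [('text', text_field), ('label', label_field)]
            for label in ['pos', 'neg', 'neutral']:  # placeholder label folders
                for fname in glob.iglob(os.path.join(path, label, '*.txt')):
                    with io.open(fname, encoding='utf-8') as f:
                        examples.append(data.Example.fromlist([f.read(), label], fields))
        super().__init__(examples, fields, **kwargs)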

@jeremy, could you please advise?

@rsrivastava @Moody Perhaps this is where our (my) approach is failing? The splits, if created the way the arXiv or IMDB examples do it, don't handle the test data smoothly. Any thoughts?

@jeremy Similar to the previous question, something that has made it a bit challenging to truly grasp the last two lessons is the concept of input as a rank-1 tensor versus input as a matrix. In Lesson 4 you say "a fully connected network takes a rank-1 tensor and passes it to a linear layer", then to an activation layer (ReLU), etc.

  1. If the input is an image, and an image is represented as a matrix, but the model requires a vector rather than a matrix, how can the image be treated as a vector (rank-1 tensor) before being passed to the model? Are you multiplying out the matrix to get a vector representation of the image and then using that vector to classify it? We didn't use this technique in the previous image classifications (dog breeds / cats vs dogs) - is this just specific to fully connected networks?

thanks!

Or is it that, since fully connected networks are mostly meant to deal with structured data or time-series data - where there are continuous and categorical variables - an image wouldn't make sense in this type of model?

If that's the case, in the Rossmann store example, would each vector provided as input be treated as the attributes of a single store? For example, suppose our Rossmann data model has only 10 continuous variables and we have data for only ten stores - would we then pass 10 vectors of size 10 into the fully connected network when making a prediction? (Rough sketch of what I mean below.)
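
Something like this (my own illustration, not from the lesson):

import torch
import torch.nn as nn

batch = torch.randn(10, 10)  # 10 stores, each with 10 continuous variables
layer = nn.Linear(10, 1)     # fully connected layer: 10 features in, 1 prediction out
preds = layer(batch)         # shape (10, 1): one prediction per store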

Exactly, for images we use convolutions, not fully connected layers, generally speaking.

I generally don't think in terms of "rank-1 tensor", but rather of how many dimensions the input has. If it's an image, it's typically Channels x Height x Width (a rank-3 tensor, i.e. 3 dimensions). This is the input to a convolutional network.

When you need to pass it to a dense (fully connected, or linear) layer, you need to "flatten" this rank-3 tensor into a rank-1 tensor per example, so a batch becomes Batch x Features. You can think of it like tabular feature data in Pandas. The PyTorch way of flattening a tensor (from rank 3 to rank 1) is called view, because you're re-projecting the rank-3 tensor into rank 1, or any other shape you want. It also doesn't make a copy, so view may be a better name than flatten (the Keras term). But they're equivalent when thinking of Image -> CNN -> Flatten -> Fully Connected Layer -> Softmax. This is not the only architecture for images, but it's one way to set it up.
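
A minimal sketch of what view does here (the sizes are arbitrary illustrative values; with the batch dimension included, the input is rank 4, one rank-3 image per example):

import torch
import torch.nn as nn

x = torch.randn(64, 3, 32, 32)   # batch of 64 RGB 32x32 images (each image is rank 3)
flat = x.view(x.size(0), -1)     # reshape to (64, 3072) without copying the data
fc = nn.Linear(3 * 32 * 32, 10)  # linear layer wants one feature vector per example
out = fc(flat)                   # shape (64, 10)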


What's the intuition behind using a language model to improve a text classifier? And how is that different from using word vectors like GloVe and word2vec?

I know it was somewhat answered in Lesson 4, but I still find it confusing.

It’s exactly the same intuition as using a pre-trained imagenet model to improve an image classifier. Word vectors, however, are the result of just the first layer of a model (effectively).


Thanks, Jeremy. I'm currently working on a negative-reviews classifier without a large dataset to train on.

Going by that intuition, instead of pre-training a "predict the next word" model and later connecting a classifier to it, would it be better to build a text classifier on a big dataset (like the ones from Project Detox, since the text and problem are more relevant) and connect that to another text classifier, which would then be trained on my small dataset?

Yes, that might be better - or better still, pre-train a language model, then fine-tune that on Project Detox et al., then fine-tune on your dataset.


Just noting that the ACL IMDB file linked in the notebook (http://files.fast.ai/data/aclImdb.tgz) isn't actually a .tgz file - it's just a straight .tar.

I noticed this too. I used the command

$ tar -xvf aclImdb.tgz

to extract it, in case anyone has trouble.

You'll also need to create a "models" subdirectory in the data/aclImdb/ directory.
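
For example (assuming PATH is set to data/aclImdb/ as in the notebook):

import os
os.makedirs(f'{PATH}models', exist_ok=True)  # create data/aclImdb/models if it doesn't exist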

What's the best way to save and reload the prepared and tokenized LanguageModelData (since preparing it is pretty slow)?

I see:

pickle.dump(TEXT, open(f'{PATH}models/TEXT.pkl','wb'))

but I can't see how to reload that into the LanguageModelData. My first thought was to pickle LanguageModelData itself, but that is a generator.

load_model = pickle.load(open(filename, 'rb'))

No, that doesn’t work:

> My first thought was to pickle LanguageModelData itself, but that is a generator.

To expand on that a bit - you can’t pickle generators, unless I’m missing something:

pickle.dump(md, open(f'{PATH}models/lang_model.pkl','wb'))

gives:

TypeError: can't pickle generator objects
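
What seems to work instead - this is my reading of the lesson notebook, not an official answer - is to pickle only the TEXT field and rebuild the LanguageModelData from it when reloading; the vocab stored in TEXT is what the saved weights depend on. A sketch, assuming PATH, FILES, bs, and bptt are defined as in the notebook:

import pickle
from fastai.nlp import *  # as in the lesson notebook

TEXT = pickle.load(open(f'{PATH}models/TEXT.pkl', 'rb'))  # restores the field, vocab included
# The exact constructor may differ between fastai versions (later ones use
# LanguageModelData.from_text_files with the same arguments).
md = LanguageModelData(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)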