Lesson 4 Advanced Discussion ✅

@paul my thoughts:

  1. A neural network is fine! You can either do (user, movie) -> (user_embed, movie_embed) -> dot product -> out,
    or (user, movie) -> (user_embed, movie_embed) -> [concat] -> linear layer -> relu -> … -> out.
    Either way, the training data for m2 will be [user_features] -> [user_embed], and if you get that working well enough, you can have the rest of your m1 model take it from there. (Both variants are sketched in the code after this list.)

  2. If you can’t situate a new user in the embedding dimension, you cannot use m1. If m2 is not working, you could try the following: apply some clustering technique to the existing users in embedding space, find some number of clusters you are happy with, and, instead of training m2 as a regressor, train it as a classifier and assign new users the cluster mean, for example (see the clustering sketch after this list). Generally, though, if m2 is not working it means the features you are trying to use are bad, and you can think about getting better features (in my case, things like sales volume and type of business do end up determining to a pretty good extent what a customer’s cash-management needs are, so we find m2 works out okay). Not sure if this is what you mean or answers your question.

  3. You definitely can include the m2 features when training m1 and use (say) the fast.ai columnar data module (which takes all the categorical variables like user and movie into an embedding dimension anyway), and you will probably get a better model, but you will still have a cold-start problem: given a new user, we still have to situate them in the embedding dimension to use your model. One approach: say we have a new user and we know some of the additional features (say the business is a small retail business, sales volume is 300-500k, and 30-50 employees); we can look at the sample of known users in embedding space that have these same feature values, try to approximate that distribution, and sample from there (or just use the sample mean). But how do you do this if your additional features are continuous? In the end, you will have to find some map that takes the continuous features into embedding space to use the augmented m1. So yeah, you can use m2 features to make m1 stronger, but you still need some notion of m2 to welcome a new user.
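A minimal PyTorch sketch of the two variants from point 1 (class names and layer sizes are placeholders of mine, not fastai’s actual code):

```python
import torch
import torch.nn as nn

class DotProductCF(nn.Module):
    """Variant A: (user, movie) -> (user_embed, movie_embed) -> dot product -> out."""
    def __init__(self, n_users, n_movies, n_factors=50):
        super().__init__()
        self.u = nn.Embedding(n_users, n_factors)
        self.m = nn.Embedding(n_movies, n_factors)

    def forward(self, users, movies):
        # elementwise product summed over the factor dimension = dot product
        return (self.u(users) * self.m(movies)).sum(dim=1)

class ConcatCF(nn.Module):
    """Variant B: embeddings -> [concat] -> linear -> relu -> ... -> out."""
    def __init__(self, n_users, n_movies, n_factors=50, n_hidden=100):
        super().__init__()
        self.u = nn.Embedding(n_users, n_factors)
        self.m = nn.Embedding(n_movies, n_factors)
        self.layers = nn.Sequential(
            nn.Linear(2 * n_factors, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, 1),
        )

    def forward(self, users, movies):
        x = torch.cat([self.u(users), self.m(movies)], dim=1)
        return self.layers(x).squeeze(1)
```

Either model then trains on (user, movie, rating) batches with something like MSE loss.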
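And a rough sketch of the clustering fallback from point 2, here with scikit-learn stand-ins for the embeddings and for m2 (the shapes, cluster count, and random-forest choice are all illustrative, not a prescription):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
user_embeds = rng.normal(size=(1000, 50))  # stand-in for m1's learned user embeddings
user_feats = rng.normal(size=(1000, 8))    # stand-in for the side-information features

# cluster the existing users in embedding space
km = KMeans(n_clusters=20, random_state=0).fit(user_embeds)

# m2 as a classifier: user features -> cluster id
m2 = RandomForestClassifier().fit(user_feats, km.labels_)

# cold start: give a new user the mean embedding of their predicted cluster
new_user_feats = rng.normal(size=(1, 8))
new_embed = km.cluster_centers_[m2.predict(new_user_feats)]
```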

As for retraining m2 every time you add or remove data and retrain m1: maybe you have to and maybe you don’t. For whatever users you keep from m1(before) to m1(after), you can check (i) how much they shifted in embedding space and (ii) how much m2’s error has increased after the shift (a quick version of this check is sketched below). It could be you’re content with the change.
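Something along these lines, say (the arrays below are made-up stand-ins for the kept users’ embeddings and the old m2’s predictions):

```python
import numpy as np

rng = np.random.default_rng(0)
embeds_before = rng.normal(size=(500, 50))                            # kept users in m1(before)
embeds_after = embeds_before + rng.normal(scale=0.1, size=(500, 50))  # same users in m1(after)
m2_preds = rng.normal(size=(500, 50))                                 # old m2's predicted embeddings

# (i) how far each kept user moved in embedding space
shift = np.linalg.norm(embeds_after - embeds_before, axis=1)

# (ii) how much the old m2's error grows against the new embeddings
err_before = np.mean((m2_preds - embeds_before) ** 2)
err_after = np.mean((m2_preds - embeds_after) ** 2)
print(shift.mean(), err_after - err_before)
```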


Do you have any code examples of how you would do this? Is this done when you are defining the layers, like with nn.Sequential?

Thanks very much for the prompt and clear response. I’ll have to think a little more about question 3 and your answer before I can say something useful. My intuition tells me that there might exist other ways to train m1 and m2 that make it more likely for m2 to provide embeddings that work in m1, and that even if such a solution results in somewhat weaker models for m1 in isolation, it might generalize better for the intended m2 -> m1 predictions.

Hi, I have found a translation dataset from the EU with about 100-150 MB of pure text per language.
Would that be enough to train a language model like wikitext-103 from scratch?
Also, would that be a feasible endeavour at this stage?

Where can we find add_datepart, which was part of fastai v0.7, to split dates into years, months, days, etc.?

I created this separate thread for Reinforcement Learning, so if someone knows something about it, could they answer my questions? Thank you.


Jeremy mentioned in the lecture that processors are sort of like one-time transforms over the dataset (tokenizing, numericalizing, etc.) that are done before any training begins…

I was wondering if something like this exists in ImageDataBunch objects as well, i.e. if I want to apply a particular transformation to my data ahead of time instead of augmenting dynamically while training is going on?

Problem: I have a bunch of images which I am trying to classify, and during training it seems like the transforms are taking way too long while the GPU sits idle… this is happening even after experimenting with different batch sizes. Very small batch sizes (like 4 and 8) are not giving as good results as bigger bs values (32, 64, etc.).

Can we do some form of preprocessing/precomputing of the transforms on my data so that the transform part running on the CPU doesn’t become my bottleneck? I remember fastai v0.7 having something like this, but I am not sure about this version.

  • I have tried removing transforms completely, but it starts throwing dimension-mismatch exceptions once I do that.
  • I took inspiration from the imagenet-fast source and resized ground-truth images beforehand to smaller dimensions (sketched below), and this reduced my transform time a bit, but the CPU is still a bottleneck for me nevertheless. Not sure how I can improve my GPU utilization here.
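For reference, a minimal one-off resize pass of the kind described above, assuming a flat folder of JPEGs (the paths and the 256 px target are illustrative):

```python
from pathlib import Path
from PIL import Image

src, dst = Path("data/train"), Path("data/train_small")  # hypothetical paths
dst.mkdir(exist_ok=True)
for p in src.glob("*.jpg"):
    img = Image.open(p)
    img.thumbnail((256, 256))  # shrink in place, preserving aspect ratio
    img.save(dst / p.name)
```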

Has anyone applied the “activation maps” kind of concept to tabular data? A way in which we can find out which variable in the table was responsible for activating a class. That could be super interesting for the interpretability that a lot of businesses look for.


Not sure if you want something more advanced. But if you want something very basic, try https://playground.tensorflow.org/

Yes exactly.


No, our ULMFiT algorithm easily beat all 1D-CNN classifiers. The paper you link to is not current.


Be careful not to confuse output activation functions with between-layer functions. The thing I drew was for output - where we certainly don’t want N(0,1).

As we discussed, for between-layer activation functions we generally use ReLU, and we let batchnorm handle the distribution for us.


Yes, normally I add a little to the max and min for this reason.


Yes, you can add a preprocessor to any ItemList. We’ve only just added this functionality, so if you’re interested in trying it, please open a topic with some details and sample code on #fastai-users and at-mention @sgugger and me.


Thanks for your response @jeremy. However, the talk of “target values” means that they were writing precisely about output activation layers, not between-layer functions.

Let me rephrase my question: Looks like they’re saying to scale the output activation function so it extends beyond the target output values (0 to 1, or 0 to 5). Is that something you recommend as well?

Yes, I mentioned that here:


Is it possible to learn word or sentence embeddings using a language model trained on wiki?

I was thinking about the language model and how it is able to predict the next word. The idea that struck me was: would it be possible to get a score for a sentence out of the model, for use in sentence comparison?

Ideally:
sentence[w1…wn] -> language model -> wn+1
and
sentence[w1…wn] -> language model -> classifier + sigmoid -> 0,1

Could it be something like:
sentence[w1…wn] -> language model -> ??? -> sentence representation [1212, 1521515, 0212, 451]

I know this is an advanced topic, and I found a discussion going on about the same in the advanced forum here, but I just wanted to ask: is it worth pursuing?
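One common way to fill in the "???" is to drop the language model’s next-word decoder head and pool its hidden states into a single vector. A toy sketch of the idea (the encoder below is a stand-in of mine, not fastai’s actual API):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Stand-in for a trained LM minus its next-word decoder head."""
    def __init__(self, vocab_sz=10000, emb_sz=100, n_hid=200):
        super().__init__()
        self.emb = nn.Embedding(vocab_sz, emb_sz)
        self.rnn = nn.LSTM(emb_sz, n_hid, batch_first=True)

    def forward(self, tokens):                 # tokens: (bs, seq_len)
        out, _ = self.rnn(self.emb(tokens))    # out: (bs, seq_len, n_hid)
        return out

def sentence_vector(enc, tokens):
    # mean-pool the hidden states over time; the last hidden state is another option
    with torch.no_grad():
        return enc(tokens.unsqueeze(0)).mean(dim=1).squeeze(0)

enc = Encoder()
a = sentence_vector(enc, torch.randint(0, 10000, (12,)))
b = sentence_vector(enc, torch.randint(0, 10000, (15,)))
print(F.cosine_similarity(a, b, dim=0))  # compare two sentences
```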

Is the label_delim parameter in TextDataBunch functional? I get an error when trying:

data_clas = TextDataBunch.from_df(path, train_df=df_trn, valid_df=df_val, 
                                  vocab=data_lm.vocab, 
                                  text_cols='Narrative', 
                                  label_cols='Contributing Factors / Situations',
                                  label_delim='|',
                                  bs=bs)

Error: iterator should return strings, not float (did you open the file in text mode?)
I also get an error when the delimiter is two characters, e.g. '; '.


I’m wondering about this myself. I’m setting batch-size values as I load my data_lm (previously created with default values), but the GPU load seems to remain the same no matter what.


Yes, normally I add a little to the max and min for this reason.

Do you refer to the + self.min_score after passing the result through a sigmoid, or to something else done afterwards? Also, I can understand that adding something helps for values close to 5 (as the sigmoid will likely reduce them), but why is it the case for values close to 0? In my understanding, we should subtract something, shouldn’t we?
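For concreteness, the scaled-sigmoid output being discussed looks roughly like this (a sketch; pad, y_min, and y_max are my illustrative names, not the library’s exact code), where “adding a little to the max and min” means widening the range at both ends:

```python
import torch

def scaled_sigmoid(x, y_min=0.0, y_max=5.0, pad=0.5):
    # widen the target range at both ends: raise the max and lower the min,
    # so the sigmoid doesn't have to saturate to reach ratings of 0 or 5
    lo, hi = y_min - pad, y_max + pad
    return torch.sigmoid(x) * (hi - lo) + lo
```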