Lesson 4 Advanced Discussion ✅


(vibhor sood) #58

Belated happy birthday, Jeremy :smile:


#60

They’re just features; no need for the word “identifying”, haha, besides the fact that we know them about businesses, as in you can look them up. And they determine, to a large extent, cash-management needs (which allows one to train the regressor that maps from these features into the embedding dimension well).


(Paul M) #61

I’m very interested in this subject. I think it has close analogues to applications in drug discovery that I’ve been contemplating since I saw Jeremy first explain collaborative filtering almost two years ago. I have a couple of minor questions. In what follows, m1 is your recommender and m2 is your regression to the embedded space.

  1. Have you tried this idea with a (somewhat) deep neural net for m1? I’d imagine that a bit of nonlinearity wouldn’t change anything, but I’d be curious to know if you have data, or a reason to think, that this two-step cold-start solution wouldn’t work in that case.

  2. What can you do if the m2 prediction is very uncertain and a typical validation of the m1-m2 cycle wouldn’t work very well? Is there a way to use the density of the current embeddings from m1 to help localize or nudge the predicted embedding from m2?

  3. From a practical perspective, you’d have to train both models again every time you get new m1 data. Could it be that there exists a better collaborative filtering model to be had by somehow including the data from m2 in the process for creating and training m1? Intuitively it feels that these two are a single coupled problem, so one might as well train m1 and m2 in a coupled way on the available data, but I might be totally wrong here.


#62

@paul my thoughts:

  1. Neural network is fine! You can either do (user,movie) -> (user_embed, movie_embed) -> dot product -> out
    or (user,movie) -> (user_embed, movie_embed) -> [concat] -> linear layer -> relu -> … -> out
    Either way, the training data for m2 will be [user_features] -> [user_embed] and if you get that working well enough, you can have the rest of your m1 model take it from there.
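Either shape is quick to sketch in plain PyTorch (a minimal illustration, not fastai code; the class and parameter names here are made up):

```python
import torch
import torch.nn as nn

class DotProductCF(nn.Module):
    """m1, shape 1: (user, item) -> embeddings -> dot product -> out."""
    def __init__(self, n_users, n_items, n_factors=50):
        super().__init__()
        self.u = nn.Embedding(n_users, n_factors)
        self.i = nn.Embedding(n_items, n_factors)

    def forward(self, users, items):
        # elementwise product of the two embeddings, summed per row
        return (self.u(users) * self.i(items)).sum(dim=1)

class ConcatMLPCF(nn.Module):
    """m1, shape 2: (user, item) -> embeddings -> concat -> linear -> relu -> out."""
    def __init__(self, n_users, n_items, n_factors=50, n_hidden=64):
        super().__init__()
        self.u = nn.Embedding(n_users, n_factors)
        self.i = nn.Embedding(n_items, n_factors)
        self.head = nn.Sequential(
            nn.Linear(2 * n_factors, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, 1),
        )

    def forward(self, users, items):
        x = torch.cat([self.u(users), self.i(items)], dim=1)
        return self.head(x).squeeze(1)
```

In both shapes, m2 only ever has to hit the user embedding table, so the rest of the network is untouched by the cold-start fix.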

  2. If you can’t situate a new user in the embedding dimension, you cannot use m1. You could try the following if m2 is not working: apply some clustering technique to the existing users in the embedding dimension, find some number of clusters you are happy with, and instead of training m2 as a regressor, train it as a classifier and assign new users the cluster mean, for example. Generally, though, if m2 is not working it means the features you are trying to use are bad, and you can think about getting better features (in my case, things like sales volume and type of business do end up determining, to a pretty good extent, what a customer’s cash-management needs are, so we find m2 works out okay). Not sure if this is what you mean or answers your question.
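A sketch of that cluster-then-classify fallback, with toy random data and scikit-learn estimators standing in for whatever you prefer:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
user_embeds = rng.normal(size=(500, 16))   # existing users' m1 embeddings (placeholder data)
user_feats = rng.normal(size=(500, 8))     # their known business features (placeholder data)

# cluster the existing users in embedding space
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(user_embeds)

# m2 as a classifier: features -> cluster id
clf = RandomForestClassifier(random_state=0).fit(user_feats, km.labels_)

# a new user gets the centre of their predicted cluster as their embedding
new_feats = rng.normal(size=(1, 8))
cluster = clf.predict(new_feats)[0]
new_embed = km.cluster_centers_[cluster]
```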

  3. You definitely can include the m2 features when training m1, using (say) the fastai columnar data module (which takes all the categorical variables like user and movie into an embedding dimension anyway), and you will probably get a better model, but you will still have a cold-start problem: given a new user, we still have to situate them in the embedding dimension to use the model. One approach: say we have a new user and we know some of the additional features (say the business is a small retail business, sales volume is 300-500k, and 30-50 employees); we can look at the sample of known users in embedding space that have these same feature values, try to approximate that distribution, and sample from it (or just use the sample mean). But how do you do this if your additional features are continuous? In the end, you will have to find some map that takes the continuous features into the embedding space to use the augmented m1. So yes, you can use the m2 features to make m1 stronger, but you still need some notion of m2 to welcome a new user.
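For purely categorical features, the sample-mean placement is only a few lines (toy data; the column names are made up):

```python
import numpy as np
import pandas as pd

# known users: categorical features plus their learned m1 embedding (toy data)
df = pd.DataFrame({
    "biz_type":   ["retail",   "retail",   "wholesale", "retail"],
    "sales_band": ["300-500k", "300-500k", "100-300k",  "300-500k"],
})
embeds = np.array([[0.1, 0.2], [0.3, 0.4], [1.0, 1.0], [0.2, 0.0]])

# new user shares biz_type="retail", sales_band="300-500k":
# place them at the mean embedding of the matching known users
mask = (df["biz_type"] == "retail") & (df["sales_band"] == "300-500k")
new_embed = embeds[mask.to_numpy()].mean(axis=0)
```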

As for retraining m2 every time you add or remove data and retrain m1: maybe you have to and maybe you don’t. For whatever users you keep from m1(before) to m1(after), you can see (i) how much they shifted in embedding space and (ii) how much the m2 error has increased after the shift. It could be you’re content with the change.
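That check is cheap to sketch, with random arrays standing in for the real embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
before = rng.normal(size=(100, 16))                  # retained users in m1(before)
after = before + 0.01 * rng.normal(size=(100, 16))   # the same users in m1(after)

# (i) how far each retained user moved in embedding space
shift = np.linalg.norm(after - before, axis=1)

# (ii) how much the old m2's error grows if you skip retraining it:
# its predictions still point at the m1(before) positions
stale_error = np.linalg.norm(before - after, axis=1).mean()
```

If `stale_error` is small relative to the spread of the embeddings, the old m2 is probably still usable.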


(Daniel Armstrong ) #63

Do you have any code examples of how you would do this? Is this done when you are defining the layers? Like with nn.Sequential?


(Paul M) #64

Thanks very much for the prompt and clear response. I’ll have to think a little more about question 3 and your answer before I can say something useful. My intuition tells me that there might exist some other ways to train m1 and m2 in a way that makes it more likely for m2 to provide embeddings that work in m1, and that even if such a solution results in somewhat weaker models for m1 in isolation it might work better for generalizing in the intended m2->m1 predictions.


(Kaspar Lund) #65

Hi, I have found a translation dataset from the EU, about 100-150 MB of pure text per language.
Would that be enough to train a language model like wikitext-103 from scratch?
Also, would that be a feasible endeavour at this stage?


(Chris Oosthuizen) #66

Where can we find add_datepart, which was part of fastai v0.7, to split dates into years, months, days, etc.?
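In the meantime, a rough pandas stand-in for part of what the old add_datepart produced might be enough (the function name and the column subset here are mine):

```python
import pandas as pd

def add_datepart_sketch(df, col):
    """Rough stand-in for v0.7's add_datepart: expand a date column
    into year/month/day-style feature columns (only a subset shown)."""
    d = pd.to_datetime(df[col])
    df[col + "_Year"] = d.dt.year
    df[col + "_Month"] = d.dt.month
    df[col + "_Day"] = d.dt.day
    df[col + "_Dayofweek"] = d.dt.dayofweek
    df[col + "_Is_month_end"] = d.dt.is_month_end
    return df
```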


(Lankinen) #67

I created this separate thread for Reinforcement Learning, so if someone knows something about it, could they answer my questions? Thank you.


(Bharadwaj Srigiriraju) #68

Jeremy mentioned in the lecture that processors are sort of like one-time transforms over the dataset (tokenizing, numericalizing, etc.) that are done before any training begins…

I was wondering if something like this exists for ImageDataBunch objects as well? Like if I want to apply a particular transformation to my data ahead of time, instead of augmenting dynamically while the training is going on.

Problem: I have a bunch of images which I am trying to classify, and during training it seems like the transforms are taking way too long while the GPU sits idle… this is happening even after experimenting with different batch sizes. Very small batch sizes (like 4 and 8) are not giving as good results as bigger bs values seem to (32, 64, etc.).

Can we do some form of preprocessing/precomputing of the transforms on my data so that the transform part running on the CPU doesn’t become my bottleneck? I remember fastai v0.7 having something like this, but I am not sure about this version.

  • I have tried removing the transforms completely, but it starts throwing dimension-mismatch exceptions once I do that.
  • I took inspiration from the imagenet-fast source and resized the ground-truth images beforehand to smaller dimensions, and this reduced my transform time a bit, but the CPU is still a bottleneck for me. Not sure how I can improve my GPU utilization here.
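For reference, the offline resize pass I mean is roughly this (a sketch with PIL; paths and the target size are placeholders):

```python
from pathlib import Path
from PIL import Image

def presize_images(src_dir, dst_dir, size=256):
    """One-time pass: resize every image on disk so the per-batch CPU
    transform work shrinks to a cheap crop/flip."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for p in Path(src_dir).glob("*.jpg"):
        img = Image.open(p)
        # resize the shortest side to `size`, keeping the aspect ratio
        w, h = img.size
        scale = size / min(w, h)
        img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
        img.save(dst / p.name, quality=90)
```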

(Arkid) #70

Has anyone applied the “activation maps” kind of concept to tabular data? A way in which we can find out which variable in the table was responsible for activating a class. That could be super interesting for interpretability, which a lot of businesses look for.
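Something like an input-gradient saliency is what I am imagining; a toy sketch (the model here is a made-up stand-in, not fastai’s TabularModel):

```python
import torch
import torch.nn as nn

# stand-in tabular classifier: 6 input columns, 3 classes
model = nn.Sequential(nn.Linear(6, 16), nn.ReLU(), nn.Linear(16, 3))

x = torch.randn(1, 6, requires_grad=True)  # one (normalized) table row
logits = model(x)
cls = logits.argmax(dim=1).item()

# gradient of the predicted class score w.r.t. each input column:
# a crude per-variable "activation map" for this row
logits[0, cls].backward()
saliency = x.grad.abs().squeeze(0)  # one importance score per column
```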


(Joseph Catanzarite) #71

Not sure if you want something more advanced. But if you want something very basic, try https://playground.tensorflow.org/


(Jeremy Howard (Admin)) #72

Yes exactly.


(Jeremy Howard (Admin)) #73

No, our ULMFiT algorithm easily beat all 1D-CNN classifiers. The paper you link to is not current.


(Jeremy Howard (Admin)) #74

Be careful not to confuse output activation functions with between-layer functions. The thing I drew was for output - where we certainly don’t want N(0,1).

As we discussed, for between-layer activation functions we generally use ReLU, and we let batchnorm handle the distribution for us.


(Jeremy Howard (Admin)) #75

Yes, normally I add a little to the max and min for this reason.
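For reference, the scaled sigmoid being described looks roughly like this (a sketch; the padding amount is a judgment call):

```python
import torch

def sigmoid_range(x, lo, hi, pad=0.5):
    """Sigmoid scaled to (lo - pad, hi + pad): widening the range a little
    keeps the true min/max reachable without saturating the sigmoid."""
    lo, hi = lo - pad, hi + pad
    return torch.sigmoid(x) * (hi - lo) + lo
```

For example, with a 0-5 ratings target, an activation of 0 lands in the middle of the widened range rather than at an extreme.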


(Jeremy Howard (Admin)) #76

Yes, you can add a preprocessor to any ItemList. We’ve only just added this functionality, so if you’re interested in trying it, please open a topic with some details and sample code on #fastai-users and at-mention @sgugger and I.


(Scott H Hawley) #77

Thanks for your response @jeremy. However, the talk of “target values” means that they were writing precisely about output activation layers, not between-layer functions.

Let me rephrase my question: Looks like they’re saying to scale the output activation function so it extends beyond the target output values (0 to 1, or 0 to 5). Is that something you recommend as well?


(Jeremy Howard (Admin)) #78

Yes I mentioned that here:


(chandan) #79

Is it possible to learn word or sentence embeddings using a language model trained on wiki?

I was thinking about the language model and how it was able to predict the next word. The idea that struck me: would it be possible to get a score for a sentence out of the model, for use in sentence comparison?

Ideally:
sentence[w1…wn] -> language model -> wn+1
and
sentence[w1…wn] -> language model -> classifier+sigmoid -> 0,1

Could it be something like:
sentence[w1…wn] -> language model -> +??? -> sentence representation [1212,1521515,0212,451]

I know this is an advanced topic, and I found a discussion going on in the advanced forum on the same here, but I just wanted to ask: is it worth pursuing?
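Concretely, I imagine something like this, with a toy stand-in for the trained LM and concat pooling as one guess for the “+???” step:

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Toy stand-in for a trained language model (embedding + LSTM)."""
    def __init__(self, vocab_size=1000, emb=64, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb)
        self.rnn = nn.LSTM(emb, hidden, batch_first=True)
        self.decoder = nn.Linear(hidden, vocab_size)  # next-word head

    def forward(self, tokens):
        out, _ = self.rnn(self.emb(tokens))
        return self.decoder(out)  # logits for the next word at each position

    def sentence_repr(self, tokens):
        # drop the decoder and pool the hidden states into one fixed vector
        out, _ = self.rnn(self.emb(tokens))
        return torch.cat([out.mean(dim=1), out.max(dim=1).values, out[:, -1]], dim=1)
```

The mean/max/last concat is the pooling ULMFiT’s classifier head uses, so it seems a natural first guess for a sentence representation.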