FitLaM: What I've been working on recently

(Jeremy Howard (Admin)) #22

It’s all there in the repo now. fastai.text is the new processing framework, or just use torchtext. Lesson 4 takes you through a complete example (IMDb).

(Roy) #23

Thanks Jeremy !! You rock !!

(maddogS) #24

You da best!

(Justin Ellis) #25

I have a question about how to set up the embedding matrix for the fine tuned task. Since the embedding matrix needs to be based on the same vocabulary (I believe) as the data used to train the network, how does one deal with new words in the data set used for fine tuning?

(Jeremy Howard (Admin)) #26

Personally, I set them to the average of all the embedding vectors.
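
A minimal sketch of that kind of initialization, assuming NumPy and hypothetical names (`extend_embeddings`, `old_itos`, `new_itos`); it is not the fastai implementation. Rows for words already in the pretrained vocabulary are copied over, and every unseen word starts from the mean of all pretrained vectors.

```python
import numpy as np

def extend_embeddings(old_emb, old_itos, new_itos):
    """Build an embedding matrix for a new vocabulary (hypothetical helper).

    old_emb  : (len(old_itos), emb_dim) pretrained embedding matrix
    old_itos : index -> token list for the pretrained vocabulary
    new_itos : index -> token list for the fine-tuning vocabulary
    """
    old_stoi = {tok: i for i, tok in enumerate(old_itos)}
    mean_vec = old_emb.mean(axis=0)
    # Start every row at the mean vector, then overwrite the known words.
    new_emb = np.tile(mean_vec, (len(new_itos), 1)).astype(old_emb.dtype)
    for i, tok in enumerate(new_itos):
        j = old_stoi.get(tok)
        if j is not None:
            new_emb[i] = old_emb[j]
    return new_emb

# Toy data, for illustration only.
old_itos = ["the", "movie", "was"]
old_emb = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
new_itos = ["movie", "unwatchable", "the"]
new_emb = extend_embeddings(old_emb, old_itos, new_itos)
```

Here "unwatchable" is not in the old vocabulary, so its row is the mean of the three pretrained rows; "movie" and "the" keep their pretrained vectors at their new indices.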

(Pavel Surmenok) #27

@Jeremy, awesome paper!
The Lesson 4 code looks very similar to the algorithm described in the paper, but some steps (e.g. pre-training on the Wikitext-103 dataset) were not used in the lesson. The paper also mentions things like gradual unfreezing of the language model layer by layer and warm-up reverse annealing. Some of these tricks were not used in Lesson 4 (or I couldn’t find them in the code).
Do you plan to share the code that can be used to reproduce results described in the paper?

(Pavel Surmenok) #28

Is it the code in file? and have some code duplication. Which one should be used as a primary now?

(Jeremy Howard (Admin)) #29

Yup fastai.text is I’m hoping to replace fastai.nlp with fastai.text by the time we teach part 2 :slight_smile: And to have a walkthrough of all the tricks…

(Karanbir) #30

Hey @jeremy,
I want to fine-tune a neural translation model (seq2seq) using a pretrained language model.
But the vocabularies of the two datasets are not as similar as I would like.
Should I train both models on a combination of the vocabularies (adding both vocabs together),
or train a character-level model?

What do you suggest when running into these issues (dealing with different vocab sizes) while fine-tuning NLP models?

(Even Oldridge) #31

Very excited for that. :slight_smile:

(Himanshu) #32

Hi @jeremy,
Do you have any example of Concat pooling from the FitLaM paper? Is it available in the videos?


This is the relevant code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class PoolingLinearClassifier(nn.Module):
    def __init__(self, layers, drops):
        super().__init__()
        # LinearBlock comes from fastai
        self.layers = nn.ModuleList([
            LinearBlock(layers[i], layers[i + 1], drops[i]) for i in range(len(layers) - 1)])

    def pool(self, x, bs, is_max):
        # x: (seq_len, bs, n) -> (bs, n, seq_len), pooled over time to (bs, n)
        f = F.adaptive_max_pool1d if is_max else F.adaptive_avg_pool1d
        return f(x.permute(1,2,0), (1,)).view(bs,-1)

    def forward(self, input):
        raw_outputs, outputs = input
        output = outputs[-1]  # outputs of the last RNN layer
        sl,bs,_ = output.size()
        avgpool = self.pool(output, bs, False)
        mxpool = self.pool(output, bs, True)
        # concat pooling: last time step, max pool and average pool
        x = torch.cat([output[-1], mxpool, avgpool], 1)
        for l in self.layers:
            l_x = l(x)
            x = F.relu(l_x)
        return l_x, raw_outputs, outputs

It lives in

The idea is very elegant. Say you have an RNN with a bptt of 10. At each step a hidden state is generated, the last one being the final output. Each hidden state is a vector of length n. We take the final output of shape (1, n), average each index across all ten hidden states to obtain another vector of shape (1, n), and take the max across the same indices to get a third. That gives us 3 vectors of shape (1, n), which we concatenate into a single vector of shape (1, 3n).

This is my best understanding but it might be wrong - I haven’t gotten around to experimenting with the model yet.
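
To make the shapes concrete, here is a small sketch in plain NumPy rather than the fastai code. The function name `concat_pool` and the toy sizes are assumptions chosen for illustration; the concatenation order follows the classifier code quoted earlier in this post (last step, max pool, average pool).

```python
import numpy as np

def concat_pool(outputs):
    """outputs: (seq_len, batch_size, n) array of per-step RNN outputs."""
    last = outputs[-1]               # (batch_size, n): final time step
    max_pool = outputs.max(axis=0)   # (batch_size, n): max over time
    avg_pool = outputs.mean(axis=0)  # (batch_size, n): mean over time
    return np.concatenate([last, max_pool, avg_pool], axis=1)

# Toy sizes: bptt of 10, batch of 4, hidden size n = 5.
rng = np.random.default_rng(0)
outputs = rng.standard_normal((10, 4, 5))
pooled = concat_pool(outputs)
print(pooled.shape)  # (4, 15), i.e. (batch_size, 3n)
```

Each row of `pooled` is the (1, 3n) vector described above, one per item in the batch.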

(Himanshu) #34

Thanks @radek for the prompt reply and good explanation.

(Kousik) #35

Hey @jeremy just wondering about the pre-processing stage, how do we pre-process the docs ?

(Debashish Panigrahi) #36

Looking forward to trying fastai.text! My experience with torchtext has been slightly bitter due to its overall sequential tokenization. That makes it slow and memory-inefficient. To work around it I had to play a few tricks. Maybe I’ll post a thread on that for comments sometime.

(Jeremy Howard (Admin)) #37

Me too! Hence fastai.text :slight_smile:

(יונתן מדן) #38

Looking at the code, it looks like the outputs are the outputs of the RNN and not the hidden states of the RNN, unlike in the paper.

Relevant code from fastai (the RNN_Encoder forward method):

def forward(self, input):
    """ Invoked during the forward propagation of the RNN_Encoder module.
    Args:
        input (Tensor): input of shape (sentence length x batch_size)

    Returns:
        raw_outputs (tuple(list (Tensor), list(Tensor)): list of tensors evaluated from each RNN layer without using
        dropouth, list of tensors evaluated from each RNN layer using dropouth,
    """
    sl,bs = input.size()
    # reset the hidden state when the batch size changes
    if bs!=self.bs:
        self.bs=bs
        self.reset()

    emb = self.encoder_with_dropout(input, dropout=self.dropoute if self.training else 0)
    emb = self.dropouti(emb)

    raw_output = emb
    new_hidden,raw_outputs,outputs = [],[],[]
    for l, (rnn,drop) in enumerate(zip(self.rnns, self.dropouths)):
        current_input = raw_output
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            raw_output, new_h = rnn(raw_output, self.hidden[l])
        new_hidden.append(new_h)
        raw_outputs.append(raw_output)
        if l != self.nlayers - 1: raw_output = drop(raw_output)
        outputs.append(raw_output)

    self.hidden = repackage_var(new_hidden)
    return raw_outputs, outputs

Am I correct or am I missing something?


I am not sure, but I suspect the issue here is that the naming gets overloaded. The RNN produces some output for each time step. We can treat it as a black box that just gives us the output vector. Inside the black box many things might happen (including it having multiple layers), and it might produce some activations that could be referred to as its ‘hidden state’.

I was referring to the ‘hidden state’ on a more macro level: the hidden state of the entire model, i.e. what the Encoder produces at each time step. At each time step we get a vector of length n, and we can stack them to get something of shape (<num_time_steps>, n). The pooling layer then tries to figure out what to do with this information. The simplest approach would be to just grab the last RNN output and call it a day. But this is problematic: useful information from earlier steps may escape us, and neither gradient propagation nor remembering information from many time steps back is ideal in any RNN architecture (LSTM, GRU, etc.). So here we are doing something quite smart: we grab the last hidden state and also take the max and mean of each of the n activations across time steps.

(יונתן מדן) #40

wow thanks for the very quick response!

(Jeremy Howard (Admin)) #41

Check out lesson 6, where you’ll learn that the outputs of an RNN are its hidden states :slight_smile:
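
As a rough illustration of that point, here is a hand-rolled vanilla (Elman) RNN in NumPy. It is a sketch under simplifying assumptions, not the lesson 6 code or fastai's encoder, and the names `run_rnn`, `w_x`, `w_h` and `b` are made up for this example. The value emitted at each step is simply the hidden state at that step.

```python
import numpy as np

def run_rnn(x, w_x, w_h, b):
    """x: (seq_len, input_dim). Returns (outputs, final_hidden)."""
    h = np.zeros(w_h.shape[0])
    outputs = []
    for x_t in x:
        # The new hidden state is also what the RNN emits for this step.
        h = np.tanh(w_x @ x_t + w_h @ h + b)
        outputs.append(h)
    return np.stack(outputs), h

rng = np.random.default_rng(0)
seq_len, input_dim, n = 10, 3, 4
x = rng.standard_normal((seq_len, input_dim))
w_x = rng.standard_normal((n, input_dim))
w_h = rng.standard_normal((n, n))
b = np.zeros(n)
outputs, final_hidden = run_rnn(x, w_x, w_h, b)
```

`outputs` has shape (seq_len, n), and its last row is identical to `final_hidden`.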