FitLaM: What I've been working on recently

jeremy · January 24, 2018, 5:46am

It’s all there in the repo now. fastai.text is the new processing framework, or just use torchtext. Lesson 4 takes you through a complete example (IMDb).

taposh · January 24, 2018, 6:38am

Thanks Jeremy !! You rock !!

maddogS · January 25, 2018, 5:52am

You da best!

jellis11 · January 30, 2018, 10:53pm

I have a question about how to set up the embedding matrix for the fine tuned task. Since the embedding matrix needs to be based on the same vocabulary (I believe) as the data used to train the network, how does one deal with new words in the data set used for fine tuning?

jeremy · January 30, 2018, 10:56pm

Personally, I set them to the average of all the embedding vectors.
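
For illustration only, here is a minimal sketch of that idea in plain Python. The helper name `extend_embeddings`, the toy vocabularies and the 2-dimensional vectors are all made up for this example and are not fastai code: words already in the pretrained vocabulary keep their vector, and unseen words get the mean of all pretrained vectors.

```python
def extend_embeddings(old_vocab, old_vectors, new_vocab):
    """Build an embedding table for new_vocab from pretrained vectors.

    Words present in old_vocab keep their pretrained vector; words that
    were never seen get the element-wise mean of all pretrained vectors.
    (Hypothetical helper, not part of fastai.)
    """
    dim = len(old_vectors[0])
    count = len(old_vectors)
    # Element-wise mean over every pretrained embedding vector.
    mean_vec = [sum(vec[d] for vec in old_vectors) / count for d in range(dim)]
    index = {word: i for i, word in enumerate(old_vocab)}
    return [
        list(old_vectors[index[word]]) if word in index else list(mean_vec)
        for word in new_vocab
    ]


# Toy data, assumed purely for the example.
old_vocab = ["the", "movie", "was"]
old_vectors = [[1.0, 0.0], [0.0, 2.0], [2.0, 4.0]]
new_vocab = ["the", "was", "blockbuster"]  # "blockbuster" is a new word

new_vectors = extend_embeddings(old_vocab, old_vectors, new_vocab)
```

In a real model the same rows would then be copied into the embedding weight matrix (and any tied decoder weights) before fine-tuning.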

surmenok · February 4, 2018, 6:53pm

@Jeremy, awesome paper!
Lesson 4 code looks very similar to the algorithm described in the paper, but some steps (e.g. pre-training on the Wikitext-103 dataset) were not used in the lesson. The paper mentions things like gradual unfreezing of the language model layer by layer and warm-up reverse annealing. Some of these tricks were not used in lesson 4 (or I couldn’t find them in the code).
Do you plan to share the code that can be used to reproduce results described in the paper?

surmenok · February 5, 2018, 3:46am

Is it the code in text.py file?
text.py and nlp.py have some code duplication. Which one should be used as a primary now?

jeremy · February 5, 2018, 3:58am

Yup fastai.text is text.py. I’m hoping to replace fastai.nlp with fastai.text by the time we teach part 2, and to have a walkthrough of all the tricks…

karanchahal · February 6, 2018, 6:23am

Hey @jeremy ,
I want to fine-tune a neural translation model (seq2seq) using a pretrained language model.
But the vocabularies of the two datasets are not as similar as I would like.
Should I train both models on a combined vocabulary (adding both vocabs together),
or train a character-level model?

What do you suggest when encountering these issues (dealing with different vocab sizes) while fine-tuning NLP models?

Even · February 7, 2018, 10:01pm

Very excited for that.

himanshu · February 21, 2018, 7:00am

Hi @jeremy,
Do you have any example of Concat pooling from the FitLaM paper? Is it available in the videos?

radek · February 21, 2018, 7:32am

This is the relevant code:

class PoolingLinearClassifier(nn.Module):
    def __init__(self, layers, drops):
        super().__init__()
        self.layers = nn.ModuleList([
            LinearBlock(layers[i], layers[i + 1], drops[i]) for i in range(len(layers) - 1)])

    def pool(self, x, bs, is_max):
        f = F.adaptive_max_pool1d if is_max else F.adaptive_avg_pool1d
        return f(x.permute(1,2,0), (1,)).view(bs,-1)

    def forward(self, input):
        raw_outputs, outputs = input
        output = outputs[-1]
        sl,bs,_ = output.size()
        avgpool = self.pool(output, bs, False)
        mxpool = self.pool(output, bs, True)
        x = torch.cat([output[-1], mxpool, avgpool], 1)
        for l in self.layers:
            l_x = l(x)
            x = F.relu(l_x)
        return l_x, raw_outputs, outputs

It lives in lm_rnn.py.

The idea is very elegant. Say you have an RNN with bptt of 10. At each step a hidden state is generated, with the last one being the final output. Each hidden state is a vector of length n. We take the final output of shape (1, n), take the average across all ten hidden states element by element to obtain another vector of shape (1, n), and do the same with max. As a result we have 3 vectors of shape (1, n). All we do then is concatenate them to get a vector of shape (1, 3n).

This is my best understanding but it might be wrong - I haven’t gotten around to experimenting with the model yet.
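
To make the shapes concrete, here is a rough sketch of concat pooling for a single example, written with plain Python lists rather than tensors so it runs without PyTorch. The function name `concat_pool` and the toy hidden states are assumptions made for this illustration; the order of the three pieces (last state, max, mean) follows the `torch.cat` call in the code above.

```python
def concat_pool(hidden_states):
    """Concat pooling for one sequence.

    hidden_states: list of T vectors (one per time step), each of length n.
    Returns one vector of length 3n: [last state, max over time, mean over time].
    (Hypothetical helper for illustration, not the fastai implementation.)
    """
    last = list(hidden_states[-1])
    # zip(*...) regroups the values so each tuple holds one activation across all time steps.
    columns = list(zip(*hidden_states))
    max_pool = [max(col) for col in columns]
    avg_pool = [sum(col) / len(col) for col in columns]
    return last + max_pool + avg_pool


# Three time steps, hidden size n = 2 (toy values).
states = [[1.0, -2.0], [3.0, 0.0], [2.0, 8.0]]
pooled = concat_pool(states)  # length 3 * n = 6
```

The real classifier does the same thing per item in the batch, using adaptive max and average pooling over the time dimension.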

himanshu · February 21, 2018, 7:39am

Thanks @radek for the prompt reply and good explanation.

codeck · February 23, 2018, 10:21am

Hey @jeremy just wondering about the pre-processing stage, how do we pre-process the docs ?

Deb · March 11, 2018, 9:57pm

Looking forward to trying fastai.text! My experience with torchtext has been slightly bitter due to its overall sequential tokenization, which makes it slow and memory inefficient. To work around it I had to play a few tricks. Maybe I’ll post a thread on that for comments sometime.

jeremy · March 14, 2018, 2:36pm

Me too! Hence fastai.text

yonatanMedan · April 10, 2018, 7:15am

Looking at the code, it looks like the outputs are the outputs of the RNN and not the hidden states of the RNN,
unlike in the paper.

Relevant code from fastai lm_rnn.py (the RNN_Encoder forward method):

def forward(self, input):
    """ Invoked during the forward propagation of the RNN_Encoder module.
    Args:
        input (Tensor): input of shape (sentence length x batch_size)
    Returns:
        raw_outputs (tuple(list (Tensor), list(Tensor)): list of tensors evaluated from each RNN layer without using
        dropouth, list of tensors evaluated from each RNN layer using dropouth,
    """
    sl,bs = input.size()
    if bs!=self.bs:
        self.bs=bs
        self.reset()

    emb = self.encoder_with_dropout(input, dropout=self.dropoute if self.training else 0)
    emb = self.dropouti(emb)

    raw_output = emb
    new_hidden,raw_outputs,outputs = [],[],[]
    for l, (rnn,drop) in enumerate(zip(self.rnns, self.dropouths)):
        current_input = raw_output
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            raw_output, new_h = rnn(raw_output, self.hidden[l])
        new_hidden.append(new_h)
        raw_outputs.append(raw_output)
        if l != self.nlayers - 1: raw_output = drop(raw_output)
        outputs.append(raw_output)

    self.hidden = repackage_var(new_hidden)
    return raw_outputs, outputs

Am I correct or am I missing something?

radek · April 10, 2018, 8:21am

I am not sure, but I suspect the issue here is that the naming gets overloaded. The RNN produces some output for each time step. We can treat it as a black box that just gives us the output vector. Inside the black box many things might happen (including it having multiple layers) and it might be producing some activations that might be referred to as its ‘hidden state’.

I was referring to the ‘hidden state’ on a more macro level, as in the hidden state of the entire model being what is produced at each time step by the Encoder. At each time step we get some vector of length n, and we can stack them together to get something of the shape (<num_time_steps>, n). The pooling layer then tries to figure out what to do with this information. The simplest approach would be to just grab the last RNN output and call it a day. But this is problematic: some information that might be useful will escape us, and gradient propagation and remembering information from many time steps back is not ideal in any of the RNN archs (lstm / gru etc). So here we are doing something quite smart: we are grabbing the last hidden state and also taking the max and mean of each activation in n across time steps.

yonatanMedan · April 10, 2018, 8:58am

wow thanks for the very quick response!

jeremy · April 10, 2018, 4:59pm

Check out lesson 6, where you’ll learn that the outputs of an RNN are its hidden states
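
As a rough illustration of that point, here is a tiny scalar RNN written from scratch. It is a sketch, not lesson 6 code: the function `rnn_forward`, the weights and the inputs are all made up for this example. The list of per-step outputs is simply the hidden state recorded after each step, so the last output and the final hidden state are the same value.

```python
import math


def rnn_forward(inputs, w_x, w_h, h0):
    """Scalar Elman-style RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1}).

    Returns (outputs, final_hidden), where outputs[t] is the hidden state
    after consuming inputs[t]. (Hypothetical toy model for illustration.)
    """
    h = h0
    outputs = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        outputs.append(h)  # the "output" at this step is just the hidden state
    return outputs, h


# Arbitrary toy weights and inputs.
outputs, final_hidden = rnn_forward([0.5, -1.0, 2.0], w_x=0.8, w_h=0.3, h0=0.0)
```

For a stacked RNN the per-step outputs are the hidden states of the top layer, which is what the pooling classifier earlier in the thread consumes.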