Use pretrained AWD_LSTM for caption generation

I am trying to implement the “Show, Attend and Tell” paper. I am thinking about using the pre-trained AWD_LSTM language model for caption generation.

First, I want to know whether my approach to using the pre-trained model is correct. Here is the code:

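from fastai.text import *  # fastai v1 API

## language data bunch
# NOTE: the original chain was cut off after from_df; the split/label/databunch
# calls below are the usual fastai v1 steps and are an assumption.
data_lm = (TextList.from_df(df=metadata, path='.', cols='labels')
           .split_by_rand_pct(0.1)
           .label_for_lm()
           .databunch())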

# create learner object
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3, pretrained=True)

# fine tune model
learn.fit_one_cycle(8, 1e-2, moms=(0.8,0.7))

# save model parameters
learn.save('fine_tuned_LM')

## load model parameters
pretrained_lstm = AWD_LSTM(vocab_size, emb_sz=812, n_hid=512, n_layers=1)

wgts = torch.load('models/fine_tuned_LM.pth')
params = list(zip(wgts.items(),pretrained_lstm.state_dict().items()))

for p in params:
    name = p[1][0]
    pretrained_lstm.state_dict()[name] = p[0][1]
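
One thing that may be worth checking about the loop above: in PyTorch, `state_dict()` builds a new dictionary on every call, so assigning into it only rebinds a key in that temporary dictionary and leaves the module's parameters untouched. The sketch below uses a plain `nn.LSTM` as a stand-in for `AWD_LSTM` (an assumption, to keep it self-contained; the sizes are made up) and compares that assignment with `load_state_dict`, which copies the values and raises on a shape mismatch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-ins for the saved model and the freshly built ones (hypothetical sizes).
saved = nn.LSTM(input_size=8, hidden_size=16)
fresh = nn.LSTM(input_size=8, hidden_size=16)
mismatched = nn.LSTM(input_size=12, hidden_size=16)

before = fresh.weight_ih_l0.detach().clone()

# Same pattern as the loop above: write into the dict returned by state_dict().
for name, tensor in saved.state_dict().items():
    fresh.state_dict()[name] = tensor

# True only if the assignment actually altered the module's weights.
assignment_changed_weights = not torch.equal(fresh.weight_ih_l0, before)

# load_state_dict copies the tensors into the module's parameters.
fresh.load_state_dict(saved.state_dict())
load_changed_weights = torch.equal(fresh.weight_ih_l0, saved.weight_ih_l0)

# With different shapes, load_state_dict raises instead of passing silently.
try:
    mismatched.load_state_dict(saved.state_dict())
    mismatch_raised = False
except RuntimeError:
    mismatch_raised = True

print(assignment_changed_weights, load_changed_weights, mismatch_raised)
```

If the same holds for `AWD_LSTM`, the loop would run without error for any hyperparameters simply because nothing is being copied into the model.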

I have a few questions on this:

  1. I have to pass a modified hidden state, the output of the attention model. How do I do that?

  2. Since I am using a pre-trained language model, it is supposed to store a vocabulary (word-to-index mapping). How is this vocabulary information stored?

  3. Is it enough to load just ‘fine_tuned_LM_Encoder’, or should I load the entire ‘fine_tuned_LM’? In the IMDB sentiment analysis tutorial, the encoder was used for classification, but in my case the goal is text generation. I believe the entire model should be loaded.

  4. Since I used AWD_LSTM loaded with the WikiText-103 weights, I had to use the same architecture. But when I instantiated it with different hyperparameters (emb_sz=812, n_hid=512) and loaded my fine-tuned weights, it did not throw an error. I am confused about what is going on inside it.
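
To make question 1 more concrete, here is a minimal sketch of the kind of decoding step I have in mind: stepping through time manually with `nn.LSTMCell` and feeding the attention context in next to the word embedding, which is roughly how Show, Attend and Tell conditions its decoder. This is plain PyTorch, not fastai's `AWD_LSTM`; the class name `AttnDecoderStep`, the additive attention layers, and all sizes are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnDecoderStep(nn.Module):
    """One decoding step: attend over image features, then update the LSTM state."""
    def __init__(self, vocab_size, emb_sz, feat_sz, n_hid):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_sz)
        self.feat_proj = nn.Linear(feat_sz, n_hid)
        self.hid_proj = nn.Linear(n_hid, n_hid)
        self.score = nn.Linear(n_hid, 1)
        self.cell = nn.LSTMCell(emb_sz + feat_sz, n_hid)
        self.out = nn.Linear(n_hid, vocab_size)

    def forward(self, prev_word, feats, state):
        h, c = state
        # Additive attention: score each image region against the current hidden state.
        e = self.score(torch.tanh(self.feat_proj(feats) + self.hid_proj(h).unsqueeze(1)))
        alpha = F.softmax(e.squeeze(-1), dim=1)             # (batch, regions)
        context = (alpha.unsqueeze(-1) * feats).sum(dim=1)  # (batch, feat_sz)
        # The context enters the LSTM as extra input next to the word embedding.
        h, c = self.cell(torch.cat([self.emb(prev_word), context], dim=1), (h, c))
        return self.out(h), alpha, (h, c)

# Hypothetical sizes, just to exercise the step once.
batch, regions, feat_sz, n_hid, vocab_size = 2, 49, 32, 16, 100
step = AttnDecoderStep(vocab_size, emb_sz=24, feat_sz=feat_sz, n_hid=n_hid)
feats = torch.randn(batch, regions, feat_sz)
state = (torch.zeros(batch, n_hid), torch.zeros(batch, n_hid))
prev_word = torch.zeros(batch, dtype=torch.long)
logits, alpha, state = step(prev_word, feats, state)
```

In this sketch the cell's input size is `emb_sz + feat_sz`, so the input-to-hidden weights of a pretrained language model would not fit without some adaptation. That mismatch is part of what I am unsure how to handle with `AWD_LSTM`.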