I am trying to implement the “Show, Attend and Tell” paper, and I am thinking about using the pre-trained AWD_LSTM language model for caption generation.
First, I want to know whether my approach to using the pre-trained model is correct. Here is the code:
from fastai.text import *

## language model data bunch built from the caption column
data_lm = (TextList.from_df(df=metadata, path='.', cols='labels')
           .split_by_rand_pct(0.1)
           .label_for_lm()
           .databunch(bs=100))
# create learner with the pre-trained AWD_LSTM
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3, pretrained=True)
# fine-tune the language model on the captions
learn.fit_one_cycle(8, 1e-2, moms=(0.8, 0.7))
# save the full model and the encoder separately
learn.save('fine_tuned_LM')
learn.save_encoder('fine_tuned_LM_Encoder')
## load the fine-tuned parameters into a freshly constructed AWD_LSTM
vocab_size = len(data_lm.vocab.itos)
pretrained_lstm = AWD_LSTM(vocab_size, emb_sz=812, n_hid=512, n_layers=1)
wgts = torch.load('models/fine_tuned_LM.pth')
# if the optimizer state was saved alongside the model, the weights sit under the 'model' key
if 'model' in wgts: wgts = wgts['model']
# pair each saved weight with a parameter name of the new model and write it into the state dict
params = list(zip(wgts.items(), pretrained_lstm.state_dict().items()))
for (old_name, old_weight), (new_name, _) in params:
    pretrained_lstm.state_dict()[new_name] = old_weight
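To see what that zip is actually pairing up, I also print the parameter names and shapes side by side (just a quick diagnostic on the objects defined above):

# diagnostic: which saved weight gets paired with which new parameter,
# and whether their shapes actually agree
for (old_name, old_weight), (new_name, new_weight) in zip(
        wgts.items(), pretrained_lstm.state_dict().items()):
    print(f'{old_name} {tuple(old_weight.shape)}  ->  {new_name} {tuple(new_weight.shape)}')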
I have a few questions on this:
- In the decoder I have to pass a modified hidden state, i.e. the output of the attention model, into the LSTM at each step. How do I do that with AWD_LSTM? (A rough sketch of what I have in mind is below, after this list.)
- Since I am using a pre-trained language model, it is supposed to store a vocabulary (word-to-index mapping). How and where is this vocabulary information stored? (See the snippet after this list for where I assume it lives.)
- Is it enough to load just ‘fine_tuned_LM_Encoder’, or should I load the entire ‘fine_tuned_LM’? In the IMDB sentiment analysis tutorial the encoder alone was reused for classification, but in my case the goal is text generation, so I believe the entire model should be loaded. (Both loading options are shown after this list.)
- Since I used AWD_LSTM loaded with the WikiText-103 weights, I expected to have to keep the same architecture, but when I instantiated it with different hyperparameters (emb_sz=812, n_hid=512) and loaded my fine-tuned weights, it did not throw any error. I am confused about what is going on inside (the diagnostic print above was my attempt to figure this out).
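For the first question, here is a rough sketch of what I have in mind. Everything in it is hypothetical (the class name, the sizes, and the use of a plain nn.LSTMCell are my own assumptions, not fastai API); in this sketch I concatenate the attention context with the word embedding, but I could also imagine injecting it into the hidden state directly. My question is how to do either with the fine-tuned AWD_LSTM instead of a plain LSTM cell:

import torch
import torch.nn as nn

class AttentionDecoderStep(nn.Module):
    """Hypothetical single decoding step: concatenate the attention context
    with the word embedding before feeding an LSTM cell."""
    def __init__(self, vocab_size, emb_sz=812, n_hid=512, ctx_sz=512):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_sz)      # would ideally reuse the fine-tuned embedding
        self.cell = nn.LSTMCell(emb_sz + ctx_sz, n_hid)  # plain LSTMCell standing in for AWD_LSTM
        self.out = nn.Linear(n_hid, vocab_size)

    def forward(self, prev_word, context, state):
        h, c = state
        x = torch.cat([self.emb(prev_word), context], dim=1)  # attention context joins the input
        h, c = self.cell(x, (h, c))
        return self.out(h), (h, c)

The open point is how to replace the nn.LSTMCell part with the fine-tuned AWD_LSTM encoder and still feed in the attention output at every step.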
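For the vocabulary question, this is where I assume the word-to-index mapping lives in fastai (data_lm is the databunch created above; the pickle filename is just my own choice):

import pickle

# itos: list mapping index -> token, stoi: dict mapping token -> index
itos = data_lm.vocab.itos
stoi = data_lm.vocab.stoi
print(len(itos), itos[:10])

# keep a copy of the mapping next to the saved weights so the decoder can reuse it later
with open('models/itos_fine_tuned_LM.pkl', 'wb') as f:
    pickle.dump(itos, f)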
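And for the third question, these are the two loading options I am comparing (both called on the learn object defined above):

# option 1: restore the whole language model (encoder plus output layer)
learn.load('fine_tuned_LM')

# option 2: restore only the encoder, as in the IMDB classification tutorial
learn.load_encoder('fine_tuned_LM_Encoder')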