My objective is to cluster similar documents. I want to use the hidden state of the encoder of a finetuned ULMFiT model as input for a clustering algorithm. I’m trying it out with the IMDB notebook.
To obtain the hidden state, I think I managed to get a callback working with snippets from the forum, but I'm struggling to understand the output of the callback.
`learn.model[0]`:

```
AWD_LSTM(
  (encoder): Embedding(60000, 400, padding_idx=1)
  (encoder_dp): EmbeddingDropout(
    (emb): Embedding(60000, 400, padding_idx=1)
  )
  (rnns): ModuleList(
    (0): WeightDropout(
      (module): LSTM(400, 1152, batch_first=True)
    )
    (1): WeightDropout(
      (module): LSTM(1152, 1152, batch_first=True)
    )
    (2): WeightDropout(
      (module): LSTM(1152, 400, batch_first=True)
    )
  )
  (input_dp): RNNDropout()
  (hidden_dps): ModuleList(
    (0): RNNDropout()
    (1): RNNDropout()
    (2): RNNDropout()
  )
)
```
The callback:
```python
class StoreHook(HookCallback):
    def on_train_begin(self, **kwargs):
        super().on_train_begin(**kwargs)
        self.acts = []

    def hook(self, m, i, o):
        return o

    def on_train_end(self, train, **kwargs):
        # change into on_batch_end once I understand the output of on_train_end
        self.acts += self.hooks.stored
```
The code:

```python
path = untar_data(URLs.IMDB, force_download=False)
path.ls()
data_lm = load_data(path, 'tmp_lm2', bs=48)
learn = language_model_learner(data_lm, pretrained=True, drop_mult=0.5, arch=AWD_LSTM)
cb = [StoreHook(learn, modules=flatten_model(learn.model[0]))]
learn.callbacks += cb
learn.model.eval()   # note: eval() only disables dropout; fit() still updates the weights
learn.model.reset()  # not sure if this is necessary
learn.fit(1)
```
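One caveat with the approach above: `model.eval()` only changes the behaviour of dropout and batch-norm layers; calling `fit()` afterwards still updates the weights. To extract activations without training at all, the forward pass can be wrapped in `torch.no_grad()`. A minimal PyTorch sketch (the single `nn.LSTM` here is only a stand-in for the encoder, not fastai's actual model):

```python
import torch
import torch.nn as nn

model = nn.LSTM(400, 400, batch_first=True)  # stand-in for the encoder
model.eval()                                 # disables dropout, does NOT freeze weights

with torch.no_grad():                        # no gradients -> no weight updates possible
    out, _ = model(torch.randn(48, 70, 400))
print(out.shape)                             # torch.Size([48, 70, 400])
```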
Understanding the output:

```python
print(f'{len(learn.callbacks[1].acts)} items in the output\n')
for i, act in enumerate(learn.callbacks[1].acts):
    print(f'item {i}')
    if act is not None:
        print(act.shape)
```
Yields:
```
12 items in the output

item 0
item 1
item 2
item 3
item 4
item 5
item 6
item 7
item 8
torch.Size([48, 70, 400])
item 9
torch.Size([48, 70, 1152])
item 10
torch.Size([48, 70, 1152])
item 11
```
The output of one batch is a list of 12 items, most of them `None`.

- Why 12 items? It doesn't seem to correspond to bs=48. Edit: I figured this out, more or less: it has nothing to do with the batch size, but corresponds to the modules of `learn.model[0]` passed via `flatten_model`. The batch size shows up as the first dimension of the stored tensors.
- Why are most of the items `None`?
- The items that are tensors have different shapes; which layer is which? The decoder takes 400 features as input, so is `[48, 70, 400]` the one I'm looking for? Edit: I tried to hook the separate modules individually to figure this out, but AWD_LSTM does not support indexing when I try to hook e.g. `learn.model[0][8]`.
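One way to map activations to layers without relying on fastai's indexing is to attach plain PyTorch forward hooks to the modules of interest. Below is a toy stack mirroring the 400 → 1152 → 1152 → 400 LSTM sizes from the model repr above (the sizes and names are assumptions based on that repr, not fastai's actual objects):

```python
import torch
import torch.nn as nn

# Toy stack mirroring the AWD_LSTM sizes from the model repr above
rnns = nn.ModuleList([
    nn.LSTM(400, 1152, batch_first=True),
    nn.LSTM(1152, 1152, batch_first=True),
    nn.LSTM(1152, 400, batch_first=True),
])

stored = {}

def make_hook(name):
    # nn.LSTM returns a tuple (output, (h_n, c_n)); keep only the output tensor
    def hook(module, inp, out):
        stored[name] = out[0].detach()
    return hook

for i, rnn in enumerate(rnns):
    rnn.register_forward_hook(make_hook(f'rnn{i}'))

x = torch.randn(48, 70, 400)  # (batch, seq_len, features)
with torch.no_grad():
    for rnn in rnns:
        x, _ = rnn(x)

for name, act in stored.items():
    print(name, tuple(act.shape))
# rnn2 has shape (48, 70, 400): the layer whose output feeds the 400-input decoder
```

So under these assumptions, the `[48, 70, 400]` tensor is the output of the last LSTM, and the two `[48, 70, 1152]` tensors come from the first two.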
Ideally I want to piece the output back together into one encoding per document. I'm currently at a level of understanding matching part 1 of the course, but my pet project requires me to get this working. The videos have helped me understand callbacks, but I could surely use your help applying them to AWD_LSTM.
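For the per-document encodings, one common approach is to pool the `[bs, seq_len, features]` tensor over the time dimension to get one fixed-size vector per document. A minimal numpy sketch, with shapes assumed from the hooked output above (real code would also mask padding tokens before pooling):

```python
import numpy as np

# Assumed shapes from the hooked output: 48 documents, 70 timesteps, 400 features
acts = np.random.randn(48, 70, 400)

# Mean-pool over the sequence dimension to get one vector per document;
# max pooling or taking the final timestep are common alternatives.
doc_vectors = acts.mean(axis=1)
print(doc_vectors.shape)  # (48, 400)
```

The resulting `(48, 400)` matrix can then be fed to any clustering algorithm, e.g. k-means.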