ULMFiT Language Model encoding

(Raj) #1

Does anyone know an easy way to get the encoding of a new document with a Language Model trained in ULMFiT?

learn.predict() outputs the last layer/score, but how do I extract the encoding before that?

I have tried to run input through just the encoder, but it is not working at all. Here is what my code looks like:

#load data
val_sent = np.load('match_lm_data/tmp/val_ids.npy')
val_lbls = np.load('match_lm_data/tmp/lbl_val.npy')
val_ds = TextDataset(val_sent, val_lbls)
val_samp = SortSampler(val_sent, key=lambda x: len(val_sent[x]))
val_lbls_sampled = val_lbls[list(val_samp)]
val_dl = DataLoader(val_ds, bs, transpose=True, num_workers=1, pad_idx=1, sampler=val_samp)
md = ModelData('match_lm_data', None, val_dl)

m = get_rnn_classifer(bptt, 20*70, c, vs, emb_sz=em_sz, n_hid=nh, n_layers=nl, pad_token=1,
            layers=[em_sz*3, 50, c], drops=[0., 0.])

m2 = m[0]  #take just encoder

learn = RNN_Learner(md, TextModel(to_gpu(m2)))


When I do this, it gives me the following error: "AttributeError: ‘MultiBatchRNN’ object has no attribute ‘hidden’".

How do I solve this? Is there any easier/more elegant way to extract the encoding?



@rajicon I ran into the same issue, was able to figure it out eventually. Wrote up the solution here.

Short answer is:

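import torch

def process_doc(learn, doc):
    # Tokenize and numericalize a single raw document into a one-item batch
    xb, yb = learn.data.one_item(doc)
    return xb

def encode_doc(learn, doc):
    xb = process_doc(learn, doc)
    awd_lstm = learn.model[0]
    # Switch to eval mode (disables dropout); reset initializes the hidden state
    awd_lstm.eval()
    awd_lstm.reset()
    with torch.no_grad():
        out = awd_lstm(xb)
    # Return final output, for last RNN (index 2 of 3 layers), on last token in sequence
    # .cpu() is needed before .numpy() if the model runs on the GPU
    return out[0][2][0][-1].detach().cpu().numpy()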
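
The nested indexing in the return line is easier to follow with a toy example. The sketch below is not fastai code: it uses plain Python lists as a stand-in for the tensors, and it assumes the encoder returns a (raw_outputs, outputs) pair with one entry per LSTM layer, each laid out as [batch][token][hidden unit]. The helper names fake_layer and last_token_encoding are made up for illustration, and the exact return structure may differ between fastai versions.

```python
# Stand-in for the encoder output (assumed layout, not the real fastai objects):
# a (raw_outputs, outputs) pair, each a list with one entry per LSTM layer,
# and each entry indexed as [batch][token][hidden_unit].
def fake_layer(layer_idx, batch=1, seq_len=4, n_hid=3):
    # Encode the position in the value so the selected slice is easy to recognize
    return [[[layer_idx * 100 + t * 10 + h for h in range(n_hid)]
             for t in range(seq_len)]
            for _ in range(batch)]

n_layers = 3
raw_outputs = [fake_layer(i) for i in range(n_layers)]
outputs = [fake_layer(i) for i in range(n_layers)]  # stand-in for the second element of the pair
out = (raw_outputs, outputs)

def last_token_encoding(out):
    raw = out[0]               # first element of the pair: per-layer outputs
    last_layer = raw[-1]       # final LSTM layer (index 2 in a 3-layer model)
    first_doc = last_layer[0]  # the only document in the batch
    return first_doc[-1]       # hidden vector at the last token

encoding = last_token_encoding(out)
print(encoding)  # [230, 231, 232]: layer 2, token 3, hidden units 0..2
```

With real tensors the same four index steps apply, so out[0][2][0][-1] gives one vector of hidden-size length per document.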