Using Language model from ULMFiT as an auto-encoder

Hey @youcefjd,

I don’t know for sure, but process_doc appears to be intended for a single string, given its use of the one_item method. If speed isn’t a constraint in your use case, you could just use a list comprehension:
[encode_doc(doc) for doc in my_dataset]

I’m sure there’s a much more efficient way if you can get all of your docs into one batch and pass that through the AWD-LSTM; you’d have to change encode_doc to return the entire batch rather than just the first item (in my case I was assuming the batch had a single item).
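To illustrate the batched idea, here is a minimal sketch with a toy PyTorch LSTM standing in for the AWD-LSTM. ToyEncoder and encode_batch are illustrative names of my own, not fastai API; the real encoder would come from the learner as shown further down the thread.

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Stand-in for the AWD-LSTM encoder: embed tokens, run an LSTM."""
    def __init__(self, vocab=100, emb=8, hid=16):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hid, batch_first=True)

    def forward(self, x):
        out, _ = self.lstm(self.emb(x))
        return out  # (batch, seq_len, hidden)

def encode_batch(model, xb):
    # one vector per document in the batch: the last timestep's hidden state
    with torch.no_grad():
        out = model(xb)
    return out[:, -1].cpu().numpy()

enc = ToyEncoder()
xb = torch.randint(0, 100, (4, 10))  # a batch of 4 "documents", 10 tokens each
vecs = encode_batch(enc, xb)
print(vecs.shape)  # (4, 16)
```

The point is just that the encoder is called once per batch, not once per document, and returns all the rows at once.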

Not sure if anyone is still struggling with this; I was, and couldn’t find what I was looking for on the forums.

This is what I’ve come up with, and it made processing a fairly large dataset pretty painless (under 3 minutes when sending data in batches vs. ~2 hours sending one document at a time).

I load my classification model, then add the data I want scored as a test set:

full_TL = TextList.from_df(df=newUtt_PD, path=path, cols=['Word'])
learn = load_learner('classification_model', test=full_TL)  # test= attaches the data as a test set

I then pull the test data loader back out (this feels very ‘hacky’, I’m sure there is a better way):

dl1 = learn.data.test_dl

I then altered the functions above to now just be:

def getembs(mod, btch):
    mod.reset()  # reset hidden state between batches
    with torch.no_grad():  # fastai v1's encoder returns (raw_outputs, outputs, mask)
        res = mod(btch)[1][-1][:, -1].cpu().numpy()  # final layer, last timestep
    return res

awd_lstm = learn.model[0]

Now I can just call all of it and then stack the results:

batches = [getembs(awd_lstm, i[0]) for i in dl1]  # i[0] is the x batch; y is a dummy for test sets
encodings = np.vstack(batches)

I then ran the encodings through t-SNE and got some nice visualizations (image omitted).
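For anyone wanting to reproduce the t-SNE step, here is a minimal sketch using scikit-learn; the random encodings array is a stand-in for the real np.vstack(batches) output, and the perplexity value is just an illustrative choice.

```python
import numpy as np
from sklearn.manifold import TSNE

# stand-in for the stacked document encodings from np.vstack(batches)
encodings = np.random.rand(50, 400).astype(np.float32)

# perplexity must be smaller than the number of samples
tsne = TSNE(n_components=2, perplexity=10, random_state=0)
coords = tsne.fit_transform(encodings)
print(coords.shape)  # (50, 2)
```

The resulting 2-D coordinates can be scatter-plotted (e.g. with matplotlib), coloring points by class to see whether the encoder separates the documents.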
