I’m interested in using LanguageLearner for extracting features from text. For example, for one text I can use:

```python
def get_model_outputs(learner, text):
    input_tensor, __ = learner.data.one_item(text)
    return learner.model[0](input_tensor)
```

How can I encode several texts at once?
The problem with using `one_item` is that I can’t just concatenate the results for several texts, because they will have different lengths. Should I use another `dataset_type`, as in `one_item`'s code? Or should I pad the values before I feed them to `one_item`?
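For context on the padding option: the length mismatch can be handled by padding all token-id sequences to a common length before stacking them into a batch. A minimal sketch in plain Python — the pad id of 1 and left-side padding mirror what fastai v1 language models conventionally use, but treat both as assumptions rather than confirmed API behavior:

```python
def pad_batch(token_id_lists, pad_id=1):
    """Left-pad variable-length token-id lists to a common length.

    pad_id=1 is assumed to be the padding token's index (fastai's
    default `xxpad`); adjust it to match your actual vocab.
    """
    max_len = max(len(ids) for ids in token_id_lists)
    # prepend pad tokens so every row reaches max_len
    return [[pad_id] * (max_len - len(ids)) + ids for ids in token_id_lists]

# two tokenized texts of different lengths become one rectangular batch
batch = pad_batch([[5, 6, 7], [8, 9]])
```

Once the rows have equal length, the batch can be turned into a single tensor and passed through the encoder in one call.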
To encode several texts at once, you should put them in a test dataset and then loop through `learn.data.test_dl`.
That sounds confusing. Does that mean I have to override the learner’s dataset? Could you by any chance point me to a documentation page that might be helpful for doing that?
I got this far:

```python
def encode_texts(texts, vocab):
    df = pd.DataFrame({'text': texts + texts})  # fastai breaks when I tried to use validation size 0...
    df.to_csv('/tmp/df.csv')
    lm_data_bunch = fastai.text.TextLMDataBunch.from_csv(
        '/tmp', 'df.csv', valid_pct=0.5, val_bs=len(texts), vocab=vocab)
    return lm_data_bunch.one_batch('Test')[0][:len(texts)]
```

But it seems the results are nondeterministic, and I don’t see a way to disable shuffling…
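Since the data loader may reorder items, one workaround (a sketch, not fastai-specific) is to control the batching order yourself: sort the texts by length for efficient padding, encode them in that fixed order, then invert the permutation so each result lands back at its caller-visible index. The `encode_fn` argument here is a hypothetical stand-in for whatever per-item encoder you use:

```python
def encode_in_order(items, encode_fn):
    """Encode items sorted by length, then restore the original order.

    encode_fn is a hypothetical per-item encoder; sorting by length is
    only an efficiency choice, determinism comes from the un-sort step.
    """
    order = sorted(range(len(items)), key=lambda i: len(items[i]))
    encoded_sorted = [encode_fn(items[i]) for i in order]
    # invert the permutation: place each result at its original index
    results = [None] * len(items)
    for pos, i in enumerate(order):
        results[i] = encoded_sorted[pos]
    return results
```

Because the output order matches the input order regardless of how the items were batched internally, repeated calls with the same inputs give the same result layout.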