I'm trying to vectorize texts against an existing vocab with this function:
import pandas as pd
from fastai.basic_data import DatasetType
from fastai.text import TextLMDataBunch

def encode_texts(texts, vocab):
    # fastai breaks when I tried to use a validation size of 0, so
    # duplicate the texts and give half the rows to the validation set
    df = pd.DataFrame({'text': texts + texts})
    df.to_csv('/tmp/df.csv')
    lm_data_bunch = TextLMDataBunch.from_csv(
        '/tmp', 'df.csv', valid_pct=0.5, val_bs=len(texts), vocab=vocab)
    # val_bs is sized so a single validation batch holds all the texts
    return lm_data_bunch.one_batch(DatasetType.Valid)[0][:len(texts)]
But each time I call it with the same texts, I get a different result, as if the data had been shuffled.
Is there a way to disable this shuffling? Or a simpler way to encode sequences of tokens?
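
For context, what I'm ultimately after is a deterministic mapping from texts to token ids. I imagine something like the sketch below, assuming my vocab is a fastai Vocab and that Tokenizer.process_all and Vocab.numericalize behave the way I think they do (encode_texts_directly is just a name I made up):

from fastai.text import Tokenizer

def encode_texts_directly(texts, vocab):
    # Tokenize with fastai's default (spaCy-backed) tokenizer, then map
    # each token list to ids via the vocab -- no DataBunch, no shuffling
    tokens = Tokenizer().process_all(texts)
    return [vocab.numericalize(t) for t in tokens]

# e.g. encode_texts_directly(['hello world'], learn.data.vocab)

Is that a reasonable approach, or am I missing something?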