Can I somehow keep a linked copy of the original untokenized input data for the text_classifier_learner?

Hi,
I’m using fastai 1.0.61.
My use case is that I do some text classification and present the text inputs with the biggest losses (from TextClassificationInterpretation.top_losses()) to the user, so they can recheck whether those inputs are labeled correctly.
However, TextClassificationInterpretation.data, as well as the classifier's DataBunch, only holds the tokenized versions of the inputs, so to change an input's label I need to somehow link the tokenized text back to the original. There are some ways I could do this, but they seem very inefficient to me, e.g.:

  • Retokenize all the original inputs and check which one matches the tokenized version
  • Save the tokenized version in a database along with the original and match from there; this would require updating the DB each time I retrain the model, which I expect to do often
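For clarity, the first workaround would look roughly like the sketch below. It uses a stand-in whitespace tokenizer instead of fastai's actual Tokenizer pipeline, and the function names (build_token_index, find_original) are just made up for illustration:

```python
# Sketch of "retokenize and match": rebuild a tokenized form of every
# original input, then look up the tokenized text from top_losses in it.
# simple_tokenize is a stand-in for fastai's tokenization pipeline.

def simple_tokenize(text):
    # Placeholder tokenizer; the real one would be fastai's Tokenizer.
    return text.lower().split()

def build_token_index(originals):
    # Map each tokenized form (as a hashable tuple) back to its original.
    return {tuple(simple_tokenize(t)): t for t in originals}

def find_original(tokens, index):
    # Look up the original untokenized text for a tokenized input.
    return index.get(tuple(tokens))

originals = ["This movie was great", "Terrible acting, dull plot"]
index = build_token_index(originals)
print(find_original(["this", "movie", "was", "great"], index))
# → This movie was great
```

This re-does the tokenization work for every input on each lookup pass, which is exactly the inefficiency I'd like to avoid.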

Is there any easier way to achieve this in fastai?