Hello all. When using the
LanguageLearner.predict method to generate text from a trained language model, I'd like to be able to generate formatted text instead of sequences of raw tokens (e.g. so the text can eventually be displayed to an end user in some way). In other words, I'd like to de-tokenize the generated tokens. Here's an example:
Current: `xxmaj the quick brown fox jumps over the xxup lazy dog xxrep 3 .`
Want: `The quick brown fox jumps over the LAZY dog…`
While I can certainly just implement a wrapper over the predict method to do what I want (which is what I'm currently doing), I think others and I might find it useful for the predict method itself to support this, perhaps via a flag (e.g.
formatted=True). I can fairly easily implement this myself if nobody else wants to take it up. Thoughts?
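For reference, here's a rough sketch of the kind of wrapper I mean. The special-token names (xxbos, xxmaj, xxup, xxrep, etc.) follow fastai's default tokenizer, but the exact rule set and the `detokenize` helper name are just my own assumptions for illustration:

```python
import re

# Hypothetical post-processing for fastai's default special tokens.
# This is a sketch, not fastai API: the rules below are my own guesses
# at reasonable behavior for each marker.
def detokenize(tokens):
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in ("xxbos", "xxeos", "xxpad", "xxfld", "xxunk"):
            # Drop structural/unknown markers entirely.
            i += 1
        elif tok == "xxmaj" and i + 1 < len(tokens):
            # xxmaj: capitalize the next token.
            out.append(tokens[i + 1].capitalize())
            i += 2
        elif tok == "xxup" and i + 1 < len(tokens):
            # xxup: uppercase the next token.
            out.append(tokens[i + 1].upper())
            i += 2
        elif tok == "xxrep" and i + 2 < len(tokens):
            # xxrep n tok: repeat the token n times, e.g. "xxrep 3 ." -> "..."
            out.append(tokens[i + 2] * int(tokens[i + 1]))
            i += 3
        else:
            out.append(tok)
            i += 1
    text = " ".join(out)
    # Close up the space the tokenizer inserts before punctuation.
    return re.sub(r"\s+([.,;:!?])", r"\1", text)
```

With the example above, `detokenize("xxmaj the quick brown fox jumps over the xxup lazy dog xxrep 3 .".split())` gives `"The quick brown fox jumps over the LAZY dog..."`. A `formatted=True` flag on predict could apply something like this before joining the tokens.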