Recommendation: Add arguments to "_join_texts" that dictate the inclusion of BOS and EOS tokens in text

This would make the function more reusable for seq2seq tasks where I’m often seeing examples where the “source” sequence does not includes the BOS token while the “target” sequence does.

This would be a nice addition to the mark_fields argument that already exists and dictates the includes of FLD tokens. My recommendation is that the signature be:

def _join_texts(texts:Collection[str], mark_fields:bool=False, 
                include_bos:bool=True, include_eos:bool=False):

Note that we don’t have an eos tag since it’s redundant with bos. We can certainly add the bos flag, and if you want to make a PR to add this bool, it would be most welcome.

Will do.

Regarding the EOS tag, do you think it’s redundant even in the case of seq2seq datasets for things like NMT?

Oh yes, it’s true we haven’t tackled seq2seq yet, but that will require it.

Yah that is what I’m working on … I didn’t want to step on anything you all were planning/doing but if you like, I’ll submit the PR for both dictating BOS and EOS tags if that works for you all. lmk.

Don’t hesitate to submit something. I’ll build on it if I need to add other things on top of it.