Recommendation: Add arguments to "_join_texts" that dictate the inclusion of BOS and EOS tokens in text

wgpubs · March 11, 2019, 5:49pm

This would make the function more reusable for seq2seq tasks where I’m often seeing examples where the “source” sequence does not includes the BOS token while the “target” sequence does.

This would be a nice addition to the mark_fields argument that already exists and dictates the includes of FLD tokens. My recommendation is that the signature be:

def _join_texts(texts:Collection[str], mark_fields:bool=False, 
                include_bos:bool=True, include_eos:bool=False):

sgugger · March 11, 2019, 9:24pm

Note that we don’t have an eos tag since it’s redundant with bos. We can certainly add the bos flag, and if you want to make a PR to add this bool, it would be most welcome.

wgpubs · March 11, 2019, 10:03pm

Will do.

Regarding the EOS tag, do you think it’s redundant even in the case of seq2seq datasets for things like NMT?

sgugger · March 11, 2019, 10:07pm

Oh yes, it’s true we haven’t tackled seq2seq yet, but that will require it.

wgpubs · March 11, 2019, 10:09pm

Yah that is what I’m working on … I didn’t want to step on anything you all were planning/doing but if you like, I’ll submit the PR for both dictating BOS and EOS tags if that works for you all. lmk.

sgugger · March 11, 2019, 10:37pm

Don’t hesitate to submit something. I’ll build on it if I need to add other things on top of it.