xxbos: does it mark the beginning of a sentence or the beginning of a text?

Generally, BOS stands for “Beginning Of Sentence” and EOS for “End Of Sentence”.

But I found that the docs describe these special tokens as follows:

  • UNK (xxunk) is for an unknown word (one that isn’t present in the current vocabulary)
  • BOS (xxbos) represents the beginning of a text in your dataset
  • FLD (xxfld) is used if you set mark_fields=True in your TokenizeProcessor to separate the different fields of texts (if your texts are loaded from several columns in a dataframe)
  • TK_MAJ (xxmaj) is used to indicate the next word begins with a capital in the original text
  • TK_UP (xxup) is used to indicate the next word is written in all caps in the original text
  • TK_REP (xxrep) is used to indicate the next character is repeated n times in the original text (usage xxrep n {char})
  • TK_WREP (xxwrep) is used to indicate the next word is repeated n times in the original text (usage xxwrep n {word})
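
To make the sentence-vs-text distinction concrete, here is a rough sketch in plain Python. It is not fastai's actual code: the function names (mark_repeats, mark_case, tokenize_text) are hypothetical, splitting on whitespace is a simplification, and the "4 or more repeats" threshold is an assumption. It only illustrates the behaviour the list above describes, where xxbos is added once per text even if that text contains several sentences.

```python
import re

# Token strings taken from the docs quoted above.
BOS, TK_MAJ, TK_UP, TK_REP = "xxbos", "xxmaj", "xxup", "xxrep"

def mark_repeats(text):
    """Replace a run of 4+ identical non-space characters with 'xxrep n char'.
    The threshold of 4 is an assumption made for this sketch."""
    def _sub(match):
        char, run = match.group(1), match.group(0)
        return f" {TK_REP} {len(run)} {char} "
    return re.sub(r"(\S)\1{3,}", _sub, text)

def mark_case(words):
    """Lowercase each word and put xxup / xxmaj in front of it when needed."""
    out = []
    for w in words:
        if w.isupper() and len(w) > 1:
            out += [TK_UP, w.lower()]      # word written in all caps
        elif w[0].isupper():
            out += [TK_MAJ, w.lower()]     # word starting with a capital
        else:
            out.append(w)
    return out

def tokenize_text(text):
    """Tokenize ONE text (one dataset item); xxbos is prepended once per text."""
    words = mark_repeats(text).split()     # naive whitespace split
    return [BOS] + mark_case(words)

# One text containing two sentences.
tokens = tokenize_text("Hello there. This is GREAT!!!!!")
print(tokens)
```

In this sketch the two-sentence input gets a single xxbos at the very start and nothing before "This", which matches the docs' wording of "beginning of a text" rather than "beginning of a sentence".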