TK_REP (xxrep) is used to indicate the next character is repeated n times in the original text (usage xxrep n {char})
Could someone explain the purpose of this special token? I haven’t found any explanation of why it is used, or any topic on why using xxrep is better than leaving the text as it is. Let’s say we have a text generation problem. My guess is that xxrep is used to restrict the model so that it won’t generate long repetitions such as “aaaaaa”. In other words, it memorizes fewer character repetitions during training.
However, I have one more question. How does the same reasoning apply to xxwrep? Repeated words map to the same token anyway and do not add anything to the embedding matrix, or am I wrong?
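To make the quoted rule concrete, here is a minimal sketch of what a character-level xxrep rule could look like. It is not fastai's actual implementation: the function name `replace_char_rep` and the `min_count` threshold are assumptions chosen for illustration.

```python
import re


def replace_char_rep(text, min_count=3):
    """Collapse runs of one repeated non-space character into 'xxrep n char'.

    Hypothetical helper for illustration only; the threshold is an assumption.
    """
    # (\S) captures one character; \1{k,} requires at least k further copies.
    pattern = re.compile(r"(\S)\1{%d,}" % (min_count - 1))

    def _collapse(match):
        char = match.group(1)
        count = len(match.group(0))  # total length of the run
        return f" xxrep {count} {char} "

    return pattern.sub(_collapse, text)


print(replace_char_rep("sooooo good!!!").split())
# ['s', 'xxrep', '5', 'o', 'good', 'xxrep', '3', '!']
```

With a rule like this, "sooooo" and "soooooooo" share the tokens xxrep and o and differ only in the count token, instead of each run length producing a distinct, rare string.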
Jim had had, had had. Had had, had had the master’s approval.
So would you really want that to be Jim (1), had (xxwrep 2), had (xxwrep 2). Had (xxwrep 2), had (xxwrep 2), assuming punctuation is a valid token?
First of all, it converts to xxwrep only if the same word appears 3 or more times in a row. But that doesn’t really matter for now; let’s say it does the same when there are 2 words.
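The threshold idea can be sketched as follows. This is only an illustration under stated assumptions, not the library's own code: `replace_word_rep` and its `min_count` parameter are hypothetical names, and the match is case-sensitive, so "Had had" is not treated as a repetition.

```python
import re


def replace_word_rep(text, min_count=3):
    """Collapse runs of one repeated word into 'xxwrep n word'.

    Hypothetical helper for illustration only. Repeats must be separated
    by whitespace alone, so punctuation ends a run.
    """
    # (\w+) captures a word; the group after it requires at least
    # min_count - 1 further copies, each preceded by whitespace.
    pattern = re.compile(r"\b(\w+)(?:\s+\1\b){%d,}" % (min_count - 1))

    def _collapse(match):
        word = match.group(1)
        count = len(match.group(0).split())  # number of words in the run
        return f"xxwrep {count} {word}"

    return pattern.sub(_collapse, text)


print(replace_word_rep("very very very good"))           # xxwrep 3 very good
print(replace_word_rep("had had, had had"))              # had had, had had
print(replace_word_rep("had had, had had", min_count=2)) # xxwrep 2 had, xxwrep 2 had
```

Note that with the default threshold of 3 the "had had" pairs are left alone, and that the comma splits the four occurrences into two separate runs of 2.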
Your example “Jim had had, had had. Had had, had had the master’s approval” will be (xxmaj is omitted):
['xxbos', 'jim', 'had', 'had', ',', 'had', 'had', '.', 'had', 'had', ',', 'had', 'had', 'the', 'master', "'s", 'approval'] when xxwrep is not allowed