The purpose of xxrep

rob8271 · October 19, 2021, 9:40am

Hi everyone.

From fastai’s docs:

TK_REP (xxrep) is used to indicate the next character is repeated n times in the original text (usage xxrep n {char})
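As a rough illustration of the rule quoted above, here is a minimal sketch in plain Python. It is not fastai's actual implementation: the function name `mark_char_repeats` is made up for this example, and the threshold of three or more identical characters is an assumption.

```python
import re

TK_REP = "xxrep"  # token name taken from the docs quoted above

# Assumption: a run counts as a repetition once the same non-space
# character appears three or more times in a row.
_rep_pattern = re.compile(r"(\S)\1{2,}")

def mark_char_repeats(text):
    """Replace each run of a repeated character with 'xxrep n char'."""
    def _sub(match):
        char = match.group(1)        # the repeated character
        count = len(match.group(0))  # total length of the run
        return f" {TK_REP} {count} {char} "
    return _rep_pattern.sub(_sub, text)

print(mark_char_repeats("why???").split())  # ['why', 'xxrep', '3', '?']
print(mark_char_repeats("ok?").split())     # ['ok?'] (no run, left unchanged)
```

Under these assumptions, "?", "???" and "??????" all end in the same "?" token, and only the count token differs.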

Could someone explain the purpose of this special token? I haven’t found any explanation of why this token is used, or any topics on why it’s better to use xxrep. Let’s say we have a text generation problem. My guess is that xxrep is used to restrict the model so that it won’t generate too many repetitions such as “aaaaaa”. In other words, it memorizes fewer character repetitions during training.

Any answer would be really helpful. Thank you.

Conwyn · October 19, 2021, 5:30pm

Hi Rob

It is to consolidate people who write ??? with people who write ?.

Regards Conwyn

rob8271 · October 19, 2021, 6:18pm

I don’t think so. If that were the reason, it would be applied to every pair: “zzz” vs “z”, “999” vs “9”, “shhh” vs “sh”, etc.

Conwyn · October 20, 2021, 6:41pm

Hi Rob
Please see the final paragraph on page 334 of the book.
Regards Conwyn

rob8271 · October 20, 2021, 7:59pm

Thank you, now it’s clearer.

However, I have one more question. How does this apply to xxwrep? I mean, repeated words are the same token anyway and do not affect the embedding matrix. Am I wrong?

Conwyn · October 20, 2021, 8:25pm

Hi Rob.

Jim had had, had had. Had had, had had the master’s approval.
So would you really want that to be Jim (1), had (xxwrep 2), had (xxwrep 2). Had (xxwrep 2), had (xxwrep 2), assuming punctuation is a valid token?

Regards Conwyn

rob8271 · October 21, 2021, 8:54am

I didn’t get it.

First of all, it converts to xxwrep only if the same word appears 3 or more times in a row. But that doesn’t really matter for now; let’s say it does the same when a word appears twice.

Your example “Jim had had, had had. Had had, had had the master’s approval” will be (xxmaj is omitted):

['xxbos', 'jim', 'had', 'had', ',', 'had', 'had', '.', 'had', 'had', ',', 'had', 'had', 'the', 'master', "'s", 'approval'] when xxwrep is not allowed
['xxbos', 'jim', 'xxwrep', '2', 'had', ',', 'xxwrep', '2', 'had', '.', 'xxwrep', '2', 'had', ',', 'xxwrep', '2', 'had', 'the', 'master', "'s", 'approval'] when xxwrep is allowed

I don’t see how it could really help here.
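To make the comparison reproducible, here is a minimal sketch that works on a list of tokens. It is not fastai's code: `mark_word_repeats` and its `min_run` parameter are hypothetical names for this example. The default of 3 mirrors the threshold mentioned above, and `min_run=2` corresponds to the hypothetical case used for the second token list.

```python
from itertools import groupby

TK_WREP = "xxwrep"  # token name as used in this thread

def mark_word_repeats(tokens, min_run=3):
    """Collapse runs of one repeated word into 'xxwrep n word'.

    min_run is a hypothetical parameter: the shortest run that is collapsed.
    """
    out = []
    for word, run in groupby(tokens):  # groups consecutive equal tokens
        n = len(list(run))
        if n >= min_run and word.isalnum():  # leave punctuation runs alone
            out += [TK_WREP, str(n), word]
        else:
            out += [word] * n
    return out

tokens = ["jim", "had", "had", ",", "had", "had", "."]
print(mark_word_repeats(tokens))             # unchanged, no run of 3 or more
print(mark_word_repeats(tokens, min_run=2))  # each "had had" is collapsed
```

In this sketch, a run of two becomes three tokens (`xxwrep`, `2`, `had`), so the marked sequence is longer than the plain one. It only gets shorter for runs of four or more.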

rob8271 · October 30, 2021, 8:31am

My question about xxwrep is still unresolved. Could anyone help, please?