I’m trying to better understand Transformers by applying them to a few toy problems. One such problem is: given a sequence of numbers, output `1` for every token that is part of an adjacent-duplicate pair, and `0` otherwise. For example:

```
input = 0 3 5 9 3 3 5 2 5
output = 0 0 0 0 1 1 0 0 0
```
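Concretely, the labeling rule is: a position gets `1` if its token equals either its left or its right neighbor (so both members of a duplicate pair are marked). A minimal Python sketch of the rule, using an illustrative function name:

```python
def label_adjacent_duplicates(seq):
    """Label each position 1 if it matches its left or right neighbor, else 0."""
    n = len(seq)
    return [
        1 if ((i > 0 and seq[i] == seq[i - 1]) or
              (i < n - 1 and seq[i] == seq[i + 1])) else 0
        for i in range(n)
    ]

print(label_adjacent_duplicates([0, 3, 5, 9, 3, 3, 5, 2, 5]))
# → [0, 0, 0, 0, 1, 1, 0, 0, 0]
```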

For short sequences (e.g. 10 items), I can solve this problem with a BERT model.

However, for long sequences (e.g. 100 items) with very few (e.g. 2) adjacent duplicates, my model only outputs `0` for all tokens. I imagine this is because most (98%) of the true outputs should be `0`. This reminds me of class-imbalance issues, but it’s not quite the same. Does anyone have any insight into how I might handle this?
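The obvious imbalance-style fix would be up-weighting the positive class in the per-token loss. A NumPy sketch of weighted binary cross-entropy to illustrate the idea (not my actual training code; the 49:1 weight just mirrors the ~98%/2% split above):

```python
import numpy as np

def weighted_bce(probs, targets, pos_weight):
    """Binary cross-entropy where positive tokens count pos_weight times more."""
    probs = np.clip(probs, 1e-7, 1 - 1e-7)
    loss = -(pos_weight * targets * np.log(probs)
             + (1 - targets) * np.log(1 - probs))
    return loss.mean()

targets = np.array([0, 0, 0, 0, 1, 1, 0, 0, 0], dtype=float)
all_zero = np.full_like(targets, 0.01)  # a model that always predicts "0"
# With pos_weight ~ 49, collapsing to all-zeros is penalized much more heavily
# than under the unweighted loss (pos_weight = 1).
print(weighted_bce(all_zero, targets, pos_weight=49.0))
```

This is essentially what `pos_weight` does in PyTorch’s `BCEWithLogitsLoss`.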

Things I’ve tried that didn’t work:

- Applying focal loss
- Training on short sequences and then applying the model to longer sequences
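By focal loss I mean the standard formulation, `FL(p_t) = -(1 - p_t)^γ log(p_t)`, which down-weights tokens the model already classifies confidently. A minimal NumPy sketch of that formulation (for reference, not my actual training code):

```python
import numpy as np

def focal_loss(probs, targets, gamma=2.0):
    """Focal loss: scale cross-entropy by (1 - p_t)^gamma so easy,
    confidently-classified tokens contribute less to the total loss."""
    probs = np.clip(probs, 1e-7, 1 - 1e-7)
    p_t = np.where(targets == 1, probs, 1 - probs)  # prob assigned to the true class
    return (-((1 - p_t) ** gamma) * np.log(p_t)).mean()
```

With `gamma = 0` it reduces to plain binary cross-entropy; larger `gamma` focuses the loss on the rare, hard positives.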

An approach I’ve tried that worked:

- Train a model on sequences of length 10
- Train the same model on sequences of length 20
- Train the same model on sequences of length 40
- Train the same model on sequences of length 100
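In code, that curriculum is just a loop over increasing lengths that reuses the same model object. A structural sketch rather than my actual script (`make_batch` and `train_on_length` are hypothetical names, and the real update step is elided):

```python
import random

def make_batch(length, batch_size=32):
    """Hypothetical data generator: random digit sequences plus duplicate labels."""
    seqs = [[random.randint(0, 9) for _ in range(length)] for _ in range(batch_size)]
    labels = [[1 if (i > 0 and s[i] == s[i - 1]) or (i < length - 1 and s[i] == s[i + 1])
               else 0 for i in range(length)] for s in seqs]
    return seqs, labels

def train_on_length(model, length, steps):
    """Stand-in for a real training loop over sequences of one fixed length."""
    for _ in range(steps):
        seqs, labels = make_batch(length)
        # model.train_step(seqs, labels)  # real gradient update would go here
    return model

model = object()  # placeholder for the BERT model
for length in (10, 20, 40, 100):  # same model carried through each stage
    model = train_on_length(model, length, steps=100)
```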

While this approach works in this particular case, I’m still left wondering whether there’s a good way to train a Transformer directly against a sparse output. I can imagine problems where this comes up and where I can’t train on a smaller version of the problem and scale up (e.g. given a sequence of DNA bases (ACTG), output `0` if nothing interesting happens at a position and `1` if it’s a methylation site, transcription start site, etc.).