Dealing with sparse outputs and transformers

I’m trying to better understand Transformers by applying them to a few toy problems. One such problem is: given a sequence of numbers, output 1 at every position that is part of an adjacent-duplicate pair, and 0 everywhere else. For example:

input  = 0 3 5 9 3 3 5 2 5
output = 0 0 0 0 1 1 0 0 0
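For concreteness, the labeling rule can be written in a few lines of plain Python (the function name is just illustrative):

```python
def adjacent_dup_labels(seq):
    """Return 1 for every element equal to its left or right neighbour, else 0."""
    n = len(seq)
    return [int((i > 0 and seq[i] == seq[i - 1]) or
                (i + 1 < n and seq[i] == seq[i + 1]))
            for i in range(n)]

adjacent_dup_labels([0, 3, 5, 9, 3, 3, 5, 2, 5])
# → [0, 0, 0, 0, 1, 1, 0, 0, 0]
```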

For short sequences (e.g. 10 items) I can solve this problem with a BERT model.

However, for long sequences (e.g. 100 items) with very few (e.g. 2) adjacent duplicates, my model just outputs 0 for every token. I imagine this is because the vast majority (~98%) of the true outputs are 0. This reminds me of a class imbalance problem, but it’s not quite the same. Does anyone have any insight into how I might handle this?
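(For reference, the closest standard class-imbalance lever here would be up-weighting the rare positive class in the loss. A sketch, assuming a per-token sigmoid head and a ~2% positive rate; all shapes and values are illustrative:)

```python
import torch

# With ~2 positives per 100 tokens, weight the positive class so it
# contributes roughly as much to the loss as the negatives do.
pos_frac = 2 / 100                                       # assumed positive rate
loss_fn = torch.nn.BCEWithLogitsLoss(
    pos_weight=torch.tensor((1 - pos_frac) / pos_frac)   # ≈ 49x weight on the 1s
)

logits = torch.randn(32, 100)    # per-token logits from the model (illustrative)
targets = torch.zeros(32, 100)   # mostly-zero labels
targets[:, 40:42] = 1.0          # one adjacent-duplicate pair per sequence
loss = loss_fn(logits, targets)
```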

Things I’ve tried that didn’t work:

  • Applying focal loss to counter the imbalance
  • Training on short sequences and then applying the model to longer sequences
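For context, a typical binary focal loss for this per-token setup looks roughly like the sketch below (`gamma`/`alpha` are the usual defaults, not tuned; `torchvision.ops.sigmoid_focal_loss` is an off-the-shelf equivalent):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.75):
    """Binary focal loss: down-weights easy (confidently correct) tokens so the
    few positives are not drowned out by the many easy negatives."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                              # prob. of the true class
    a_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (a_t * (1 - p_t) ** gamma * bce).mean()
```

With `gamma=0` and `alpha=0.5` this reduces to half the ordinary BCE, which is a handy sanity check.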

An approach that did work (essentially a length curriculum):

  • Train a model on sequences of length 10
  • Train the same model on sequences of length 20
  • Train the same model on sequences of length 40
  • Train the same model on sequences of length 100
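The steps above can be sketched as a single loop (assuming a model with a per-token logit head, a `make_batch` helper, and a per-token loss; all names are illustrative):

```python
import torch

def train_curriculum(model, make_batch, loss_fn, lengths=(10, 20, 40, 100),
                     steps_per_length=1000, lr=1e-4):
    """Train the same model on progressively longer sequences."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for seq_len in lengths:
        for _ in range(steps_per_length):
            x, y = make_batch(batch_size=32, seq_len=seq_len)
            logits = model(x)              # (batch, seq_len) per-token logits
            loss = loss_fn(logits, y.float())
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```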

While this approach works in this particular case, I’m still left wondering whether there’s a good way to train a Transformer directly against a sparse output. I can imagine problems where this sparsity occurs but where training on smaller instances and scaling up isn’t possible. (E.g., given a sequence of DNA data (ACTG), output 0 if nothing interesting happens at this position, and 1 if it’s a methylation site, transcription start site, etc.)