I’m trying to better understand Transformers by applying them to a few toy problems. One such problem is: given a sequence of numbers, output `1` for every token that is part of an adjacent-duplicate pair, and `0` otherwise. For example:

```
input = 0 3 5 9 3 3 5 2 5
output = 0 0 0 0 1 1 0 0 0
```
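Concretely, the labeling rule is: a position gets `1` if its token equals either its left or its right neighbor (so both members of a duplicate pair are marked). A minimal Python sketch of the rule, using an illustrative function name:

```python
def label_adjacent_duplicates(seq):
    """Label each position 1 if it matches its left or right neighbor, else 0."""
    n = len(seq)
    return [
        1 if ((i > 0 and seq[i] == seq[i - 1]) or
              (i < n - 1 and seq[i] == seq[i + 1])) else 0
        for i in range(n)
    ]

print(label_adjacent_duplicates([0, 3, 5, 9, 3, 3, 5, 2, 5]))
# → [0, 0, 0, 0, 1, 1, 0, 0, 0]
```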

For short sequences (e.g. 10 items), I can solve this problem with a BERT model.

However, for long sequences (e.g. 100 items) with very few (e.g. 2) adjacent duplicates, my model only outputs `0` for all tokens. I imagine this is because most (98%) of the true outputs should be `0`. This reminds me of class-imbalance issues, but it’s not quite the same. Does anyone have any insight into how I might handle this?
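The obvious imbalance-style fix would be up-weighting the positive class in the per-token loss. A NumPy sketch of weighted binary cross-entropy to illustrate the idea (not my actual training code; the 49:1 weight just mirrors the ~98%/2% split above):

```python
import numpy as np

def weighted_bce(probs, targets, pos_weight):
    """Binary cross-entropy where positive tokens count pos_weight times more."""
    probs = np.clip(probs, 1e-7, 1 - 1e-7)
    loss = -(pos_weight * targets * np.log(probs)
             + (1 - targets) * np.log(1 - probs))
    return loss.mean()

targets = np.array([0, 0, 0, 0, 1, 1, 0, 0, 0], dtype=float)
all_zero = np.full_like(targets, 0.01)  # a model that always predicts "0"
# With pos_weight ~ 49, collapsing to all-zeros is penalized much more heavily
# than under the unweighted loss (pos_weight = 1).
print(weighted_bce(all_zero, targets, pos_weight=49.0))
```

This is essentially what `pos_weight` does in PyTorch’s `BCEWithLogitsLoss`.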

Things I’ve tried that didn’t work:

- Applying focal loss
- Training on short sequences and then applying the model to longer sequences
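By focal loss I mean the standard formulation, `FL(p_t) = -(1 - p_t)^γ log(p_t)`, which down-weights tokens the model already classifies confidently. A minimal NumPy sketch of that formulation (for reference, not my actual training code):

```python
import numpy as np

def focal_loss(probs, targets, gamma=2.0):
    """Focal loss: scale cross-entropy by (1 - p_t)^gamma so easy,
    confidently-classified tokens contribute less to the total loss."""
    probs = np.clip(probs, 1e-7, 1 - 1e-7)
    p_t = np.where(targets == 1, probs, 1 - probs)  # prob assigned to the true class
    return (-((1 - p_t) ** gamma) * np.log(p_t)).mean()
```

With `gamma = 0` it reduces to plain binary cross-entropy; larger `gamma` focuses the loss on the rare, hard positives.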

An approach I’ve tried that worked:

- Train a model on sequences of length 10
- Train the same model on sequences of length 20
- Train the same model on sequences of length 40
- Train the same model on sequences of length 100
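In code, that curriculum is just a loop over increasing lengths that reuses the same model object. A structural sketch rather than my actual script (`make_batch` and `train_on_length` are hypothetical names, and the real update step is elided):

```python
import random

def make_batch(length, batch_size=32):
    """Hypothetical data generator: random digit sequences plus duplicate labels."""
    seqs = [[random.randint(0, 9) for _ in range(length)] for _ in range(batch_size)]
    labels = [[1 if (i > 0 and s[i] == s[i - 1]) or (i < length - 1 and s[i] == s[i + 1])
               else 0 for i in range(length)] for s in seqs]
    return seqs, labels

def train_on_length(model, length, steps):
    """Stand-in for a real training loop over sequences of one fixed length."""
    for _ in range(steps):
        seqs, labels = make_batch(length)
        # model.train_step(seqs, labels)  # real gradient update would go here
    return model

model = object()  # placeholder for the BERT model
for length in (10, 20, 40, 100):  # same model carried through each stage
    model = train_on_length(model, length, steps=100)
```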

While this approach works in this particular case, I’m still left wondering whether there’s a good way to train a Transformer directly against a sparse output. I can imagine problems where this comes up and where I can’t train on a smaller version of the problem and scale up (e.g. given a sequence of DNA bases (ACTG), output `0` if nothing interesting happens at a position and `1` if it’s a methylation site, transcription start site, etc.).