Text Classification on Every Token

I’m trying to create a dataloader for text classification where every token is labeled. For example, you would have an input sequence of tokens

T1 T2 T3 T4 T5 ...

for which the model would predict a sequence of labels, one per token

0 1 0 2 0 ...

I’m trying to think of what would be the best way to structure the ItemList/dataloader to plug into fastai. Currently I have the labels as a list with one item per token in the corresponding input sequence.

I’m also wondering what would be the best way to manage padding for batching inputs of different lengths. If I had two sequences of different lengths with front padding:

T1 T2 T3 T4 T5
00 00 00 T1 T2

The model output would have predictions for every token, including padding. I wouldn’t want predictions from padding to incur any loss, but I’m not sure what would be the best way to structure that. One idea would be to pass a padding mask to the loss function to zero out losses at padding sites. Another would be to have a “padding output” value as a legitimate y output and hope the model learns to map padding input to that particular class.
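To make the masking idea concrete, here is a minimal sketch in plain PyTorch rather than anything fastai-specific. It assumes the model returns logits of shape (batch, seq_len, n_classes) and that padded positions in the targets carry a sentinel label. The names `PAD_LABEL` and `masked_token_loss` are made up for illustration. Instead of building an explicit mask, it relies on the `ignore_index` argument of `F.cross_entropy`, which drops those positions from both the loss and the average.

```python
import torch
import torch.nn.functional as F

PAD_LABEL = -100  # assumed sentinel for padding positions (hypothetical name)

def masked_token_loss(logits, targets, pad_label=PAD_LABEL):
    """Cross-entropy over every token, skipping positions labeled pad_label.

    logits:  (batch, seq_len, n_classes) float tensor
    targets: (batch, seq_len) long tensor
    """
    n_classes = logits.size(-1)
    # Flatten so each token is one classification example.
    return F.cross_entropy(
        logits.reshape(-1, n_classes),
        targets.reshape(-1),
        ignore_index=pad_label,  # ignored positions add no loss and no gradient
    )

# Two sequences of 5 positions; the second has 3 front-padded positions.
torch.manual_seed(0)
logits = torch.randn(2, 5, 3)
targets = torch.tensor([
    [0, 1, 0, 2, 0],
    [PAD_LABEL, PAD_LABEL, PAD_LABEL, 1, 0],
])
loss = masked_token_loss(logits, targets)
```

With the default mean reduction, the average is taken over the seven real tokens only, so heavily padded batches are not diluted. The "padding class" alternative would skip `ignore_index` and add one more output class instead, at the cost of spending model capacity on predicting padding.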

I also need to figure out a way to pad the output values the same as the input values to batch things together.
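One way to pad inputs and labels together is a custom collate function. The sketch below is plain PyTorch, not the fastai API, and assumes each sample is a (token_ids, labels) pair of equal-length lists. `PAD_TOKEN`, `PAD_LABEL`, and `pad_collate` are hypothetical names, and the token id 0 for padding is an assumption. The value -100 is chosen only because it matches the default `ignore_index` of `nn.CrossEntropyLoss`.

```python
import torch

PAD_TOKEN = 0     # assumed padding token id
PAD_LABEL = -100  # assumed sentinel label for padded positions

def pad_collate(samples, pad_token=PAD_TOKEN, pad_label=PAD_LABEL):
    """Front-pad token ids and per-token labels to the longest sample.

    samples: list of (token_ids, labels) pairs, each a pair of equal-length lists.
    Returns two long tensors of shape (batch, max_len).
    """
    max_len = max(len(tokens) for tokens, _ in samples)
    xs = torch.full((len(samples), max_len), pad_token, dtype=torch.long)
    ys = torch.full((len(samples), max_len), pad_label, dtype=torch.long)
    for i, (tokens, labels) in enumerate(samples):
        assert len(tokens) == len(labels), "need one label per token"
        # Right-align the real tokens so the padding ends up at the front.
        start = max_len - len(tokens)
        xs[i, start:] = torch.tensor(tokens, dtype=torch.long)
        ys[i, start:] = torch.tensor(labels, dtype=torch.long)
    return xs, ys

samples = [
    ([11, 12, 13, 14, 15], [0, 1, 0, 2, 0]),
    ([21, 22], [1, 0]),
]
xs, ys = pad_collate(samples)
```

A function like this can be passed as `collate_fn` to `torch.utils.data.DataLoader`. Because inputs and labels are padded by the same offset, every real token stays aligned with its label.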

I’m going to be playing around with this over the next few days, but any thoughts or advice would be appreciated.