Question about padding in NLP

From what I understand, TextBlock uses the seq_len argument to add padding by passing it along to pad_input_chunk. Is the functionality the same when creating a DataBlock?

Also, for a task where I need my text rows to be the same size (e.g., a classification problem), does that mean I should pass the length of the longest text row as seq_len? Or am I mistaken?

Thanks in advance!

Each mini-batch needs to be a rectangular matrix, and you'll normally see padding applied at that point rather than across the entire dataset. This uses less memory and puts more relevant tokens in each batch (i.e., fewer padding tokens), assuming you're sorting by something close to sequence length. With this approach, padding tokens are added so that each sequence is the same length as the longest sequence in its mini-batch.
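For illustration, here's a minimal sketch of per-batch ("dynamic") padding as a PyTorch collate function. The PAD_ID value and the toy token sequences are made up; this isn't fastai's exact implementation, just the general idea:

```python
# Minimal sketch of per-mini-batch padding: each batch is padded only to
# the length of its own longest sequence, not a global maximum.
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_ID = 0  # hypothetical padding token id

def collate_batch(batch):
    """Pad every sequence in the batch to the longest one in that batch."""
    seqs, labels = zip(*batch)
    padded = pad_sequence([torch.tensor(s) for s in seqs],
                          batch_first=True, padding_value=PAD_ID)
    return padded, torch.tensor(labels)

# Toy batch: three sequences of different lengths.
batch = [([5, 8, 2], 1), ([7, 3], 0), ([9, 4, 6, 1], 1)]
x, y = collate_batch(batch)
print(x.shape)  # torch.Size([3, 4]) -- padded to the longest sequence (4)
```

If the sequences handed to each batch are roughly sorted by length, the amount of padding per batch stays small, which is the memory/relevance win described above.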

If you are using HF, there are a variety of options you can explore with respect to padding, sequence length, and truncation.
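For example, the HF tokenizers let you choose between padding to the longest sequence in the current call and padding to a fixed length (the model name below is just an example):

```python
# Sketch of the Hugging Face padding/truncation options.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

texts = ["a short example", "a somewhat longer example sentence"]

# Pad to the longest sequence in this batch (dynamic padding):
enc = tok(texts, padding="longest", truncation=True, return_tensors="pt")

# Or pad everything to a fixed length:
enc_fixed = tok(texts, padding="max_length", max_length=32,
                truncation=True, return_tensors="pt")

print(enc["input_ids"].shape)        # padded to the batch's longest sequence
print(enc_fixed["input_ids"].shape)  # padded to max_length (32)
```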

If you're interested in seeing how I do it in blurr, this is a good place to start.
