Is there a Data Block-ish approach in place for seq2seq models?

I thought I’d check to see if there was a way of creating the requisite Datasets/DataLoaders via the Data Block API before embarking on a custom approach.

Thanks

I’m also trying to wrap my head around this. I have no working code at the moment, but I guess the target sequence should end up in the labels. I’m still digging into the ItemList and fastai.text.* code, trying to understand how things fit together.

Yeah.

A LabelList derives from Dataset and accepts its x and y arguments as ItemList objects … so I think you’re right to look there.
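
A quick way to convince yourself of that in fastai v1 (just a sanity check; nothing seq2seq-specific):

```python
from torch.utils.data import Dataset
from fastai.text import *

# LabelList subclasses torch's Dataset, and its constructor takes the
# inputs and targets as two ItemLists: LabelList(x=<ItemList>, y=<ItemList>)
assert issubclass(LabelList, Dataset)
```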

I’m going to play with this today, though I fear there will probably be an easier way to do this in a few weeks :). Essentially, I’m going to try the following:

  1. Pre-split my training data into two separate DataFrames, one for training and one for validation.

  2. For each DataFrame, create two TextList objects: one for my input sequences and one for my output sequences.

  3. Create a LabelList for each pair of TextList objects, from which I’ll create a LabelLists object.

From there, I’m hoping to use the Data Block API mechanism to build my DataLoaders via a call to .databunch on my LabelLists object.
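
Here’s a rough sketch of that plan, to make it concrete. Caveats: it assumes fastai v1; the train_df/valid_df DataFrames and the source_text/target_text columns are made-up stand-ins; and the seq2seq_collate function is my own addition, needed because the stock text collate pads the inputs but assumes the targets are single labels, which doesn’t hold for seq2seq. I also believe .databunch() on a TextList-backed LabelLists routes to the text-classification DataBunch, which installs its own padding collate, so below I call DataBunch.create directly to keep control of the collate:

```python
import pandas as pd
import torch
from fastai.text import *

def seq2seq_collate(samples, pad_idx=1):
    "Pad source AND target sequences up to the longest one in the batch."
    max_x = max(len(s[0].data) for s in samples)
    max_y = max(len(s[1].data) for s in samples)
    xs = torch.full((len(samples), max_x), pad_idx, dtype=torch.long)
    ys = torch.full((len(samples), max_y), pad_idx, dtype=torch.long)
    for i, (x, y) in enumerate(samples):
        xs[i, :len(x.data)] = torch.as_tensor(x.data)
        ys[i, :len(y.data)] = torch.as_tensor(y.data)
    return xs, ys

# Step 1: stand-ins for the pre-split DataFrames; column names are made up.
train_df = pd.DataFrame({'source_text': ['hello world', 'how are you'],
                         'target_text': ['bonjour monde', 'comment allez vous']})
valid_df = pd.DataFrame({'source_text': ['good night'],
                         'target_text': ['bonne nuit']})

# Steps 2-3: one TextList per column per split, paired up in LabelLists.
train_ll = LabelList(x=TextList.from_df(train_df, cols='source_text'),
                     y=TextList.from_df(train_df, cols='target_text'))
valid_ll = LabelList(x=TextList.from_df(valid_df, cols='source_text'),
                     y=TextList.from_df(valid_df, cols='target_text'))

lls = LabelLists(path='.', train=train_ll, valid=valid_ll)
lls.process()  # runs the default text processors (tokenize + numericalize) on x AND y

# Build the DataBunch with our own collate so the targets get padded too.
data = DataBunch.create(lls.train, lls.valid, bs=2, collate_fn=seq2seq_collate)
xb, yb = data.one_batch()  # two padded LongTensors of shape (bs, seq_len)
```

Note that built this way, the source and target sides get independent vocabularies; if you want a shared vocab, I believe you can build a Vocab once and pass it to both TextLists via the vocab argument of TextList.from_df.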

Will update when I find success.

SOLVED

There may already be a better way, or something in the framework’s pipeline that will make preparing datasets/DataLoaders for sequence-to-sequence tasks more generic (mine here is specific to text), but this seems to work just fine.

Create seq2seq friendly datasets using the fast.ai DataBlock API

If folks find any issues or can suggest any improvements, I’d love to hear them. I’m sure such insights would benefit everyone else here on the forums as well.

Thanks, will check this