Seq2Seq - DataBlock?

Hello.

I would like to use LSTMs for a seq2seq task. The book does not cover seq2seq and the NLP course (https://github.com/fastai/course-nlp/blob/master/7-seq2seq-translation.ipynb) does not work with the v2 API.

How would I go about inputting my seq2seq data using a DataBlock? The DataBunch class is not used in v2, and there are no clear examples of how to accomplish this. I’ve accomplished this task using TorchText and PyTorch, but I want to take advantage of the fastai Learner class since it seems to produce better results.

Thanks


So I guess I’ll try to answer my own question. (This post helped a ton!)

Here is a Jupyter Notebook that contains all the code.

The TLDR:

  1. Put your train & valid sequences into a single Pandas DataFrame that contains an is_valid column.

  2. When specifying your DataBlock blocks, use two TextBlocks.

  3. Both your get_x (source sequence) and get_y (target sequence) read from the ‘text’ column that tokenization adds automatically.

  4. Use ColSplitter as your splitter. It reads the “is_valid” column to decide which rows of the df go to the validation set and which to the training set.
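For reference, here is a toy DataFrame in that shape. The column names from_txt/to_txt match the code below; the data itself is made up:

```python
import pandas as pd

# Toy seq2seq data: each row pairs a source sequence (from_txt) with a
# target sequence (to_txt). ColSplitter sends rows with is_valid == True
# to the validation set and the rest to the training set.
df = pd.DataFrame({
    "from_txt": ["hello world", "good morning", "thank you"],
    "to_txt":   ["bonjour le monde", "bonjour", "merci"],
    "is_valid": [False, False, True],
})

print(df[df["is_valid"]].shape[0])  # number of validation rows -> 1
```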

logs = DataBlock(

    # blocks specify what type of data we are going to be loading.
    # In this case both are text columns contained in the same df.
    blocks=(TextBlock.from_df('from_txt', is_lm=False),
            TextBlock.from_df('to_txt', is_lm=False)),

    # The TextBlock tokenization process puts tokenized inputs into a column called text.
    # The ColReader for get_x will always reference text, even if the original text inputs
    # were in a column with another name in the dataframe.
    get_x=ColReader('text'),
    get_y=ColReader('text'),

    # The dataframe needs to have an is_valid column for this to work.
    splitter=ColSplitter()

)

I do have some questions:

  1. How do I control the tokenization of the sequences from here?

  2. Is there a high level data flow document that tells us what exactly happens when we feed in a df to a TextBlock?

  3. How do I specify the dataframe to be processed? Right now I’m assuming that fastai just looks for a dataframe named ‘df’?


I think I’ve figured out question 1 (“How do I control the tokenization of the sequences?”).

TLDR: pass a tok argument to each TextBlock.

logs = DataBlock(

    # blocks specify what type of data we are going to be loading.
    # In this case both are text files contained in the same df.
    # specify your tokenizer using the tok argument
    blocks=(
        TextBlock.from_df('from_txt', is_lm=False, tok=SubwordTokenizer(vocab_sz=200)),
        TextBlock.from_df('to_txt'  , is_lm=False, tok=SubwordTokenizer(vocab_sz=200))),

    # The TextBlock tokenization process puts tokenized inputs into a column called text. 
    # The ColReader for get_x will always reference text, even if the original text inputs 
    # were in a column with another name in the dataframe.
    get_x=ColReader('text'),
    get_y=ColReader('text'),

    # The dataframe needs to have an is_valid column for this to work.
    splitter=ColSplitter()

)

I still have these questions, if anyone knows:

  1. Is there a high level data flow document that tells us what exactly happens when we feed in a df to a TextBlock?

  2. How do I specify the dataframe to be processed? Right now I’m assuming that fastai just looks for a dataframe named ‘df’?

  3. Is there a full list of tokenizers? I don’t see anything in the documentation for SubwordTokenizer; I only found it by poking around in the code.

How do I translate from English to a local language like Wolof using seq2seq?

You would need paired examples of text in your source language and your target language. Then you could train an encoder + decoder architecture.
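As a rough sketch of the encoder + decoder idea (plain PyTorch rather than fastai, and all sizes are made-up toy values): the encoder LSTM compresses the source sequence into its final hidden state, and the decoder LSTM generates the target sequence starting from that state.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Toy encoder-decoder LSTM; illustrative only, sizes are arbitrary."""
    def __init__(self, src_vocab, trg_vocab, emb=32, hid=64):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.trg_emb = nn.Embedding(trg_vocab, emb)
        self.encoder = nn.LSTM(emb, hid, batch_first=True)
        self.decoder = nn.LSTM(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, trg_vocab)

    def forward(self, src, trg):
        # Encode the whole source sequence; keep only the final (h, c) state.
        _, state = self.encoder(self.src_emb(src))
        # Decode the target sequence, initialised from the encoder state
        # (teacher forcing: the ground-truth target is fed in during training).
        dec_out, _ = self.decoder(self.trg_emb(trg), state)
        return self.out(dec_out)  # (batch, trg_len, trg_vocab) logits

model = Seq2Seq(src_vocab=100, trg_vocab=120)
src = torch.randint(0, 100, (4, 7))   # batch of 4 source sequences, length 7
trg = torch.randint(0, 120, (4, 9))   # corresponding target sequences, length 9
logits = model(src, trg)
print(logits.shape)  # torch.Size([4, 9, 120])
```

At inference time you would instead feed the decoder its own previous prediction token by token, but the training-time shape above shows the core wiring.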

I recommend this course to gain an understanding: https://www.fast.ai/2019/07/08/fastai-nlp/

Thank you

How can I use a .txt file to build a translation model with the fastai library, just like in the PyTorch seq2seq tutorial?
And is there a way to tokenize languages that have no spaCy tokenizer model?