Seq2Seq - DataBlock?

Hello.

I would like to use LSTMs for a seq2seq task. The book does not cover seq2seq and the NLP course (https://github.com/fastai/course-nlp/blob/master/7-seq2seq-translation.ipynb) does not work with the v2 API.

How would I go about inputting my seq2seq data using a DataBlock? The DataBunch class is not used in v2, and there are no clear examples of how to accomplish this. I’ve accomplished this task using TorchText and PyTorch, but I want to take advantage of the fastai Learner class since it seems to produce better results.

Thanks


So I guess I’ll try to answer my own question. (This post helped a ton!)

Here is a Jupyter Notebook that contains all the code.

The TLDR:

  1. Put your train & valid sequences into a single Pandas DataFrame that contains an is_valid column.

  2. When specifying your DataBlock blocks, use two TextBlocks.

  3. Both your get_x (source sequence) and get_y (target sequence) read from the ‘text’ column that tokenization adds automatically.

  4. Use ColSplitter as your splitter. It reads the “is_valid” column to decide which rows of the df go to the validation set and which to the training set.
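For reference, here is a toy DataFrame in that shape. The column names from_txt/to_txt match the code below; the data itself is made up:

```python
import pandas as pd

# Toy seq2seq data: each row pairs a source sequence (from_txt) with a
# target sequence (to_txt). ColSplitter sends rows with is_valid == True
# to the validation set and the rest to the training set.
df = pd.DataFrame({
    "from_txt": ["hello world", "good morning", "thank you"],
    "to_txt":   ["bonjour le monde", "bonjour", "merci"],
    "is_valid": [False, False, True],
})

print(df[df["is_valid"]].shape[0])  # number of validation rows -> 1
```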

logs = DataBlock(

    # blocks specify what type of data we are going to be loading.
    # In this case both are text columns contained in the same df.
    blocks=(TextBlock.from_df('from_txt', is_lm=False),
            TextBlock.from_df('to_txt', is_lm=False)),

    # The TextBlock tokenization process puts tokenized inputs into a column called text.
    # The ColReader for get_x will always reference text, even if the original text inputs
    # were in a column with another name in the dataframe.
    get_x=ColReader('text'),
    get_y=ColReader('text'),

    # The dataframe needs to have an is_valid column for this to work.
    splitter=ColSplitter()

)

I do have some questions:

  1. How do I control the tokenization of the sequences from here?

  2. Is there a high level data flow document that tells us what exactly happens when we feed in a df to a TextBlock?

  3. How do I specify the dataframe to be processed? Right now I’m assuming that fastai just looks for a dataframe named ‘df’?


I think I’ve figured out question 1 (“How do I control the tokenization of the sequences?”).

TLDR: pass a tok argument to each TextBlock.

logs = DataBlock(

    # blocks specify what type of data we are going to be loading.
    # In this case both are text files contained in the same df.
    # specify your tokenizer using the tok argument
    blocks=(
        TextBlock.from_df('from_txt', is_lm=False, tok=SubwordTokenizer(vocab_sz=200)),
        TextBlock.from_df('to_txt'  , is_lm=False, tok=SubwordTokenizer(vocab_sz=200))),

    # The TextBlock tokenization process puts tokenized inputs into a column called text. 
    # The ColReader for get_x will always reference text, even if the original text inputs 
    # were in a column with another name in the dataframe.
    get_x=ColReader('text'),
    get_y=ColReader('text'),

    # The dataframe needs to have an is_valid column for this to work.
    splitter=ColSplitter()

)

I still have these questions, if anyone knows:

  1. Is there a high level data flow document that tells us what exactly happens when we feed in a df to a TextBlock?

  2. How do I specify the dataframe to be processed? Right now I’m assuming that fastai just looks for a dataframe named ‘df’?

  3. Is there a full list of tokenizers? I don’t see anything in the documentation for SubwordTokenizer; I only found it by poking around in the code.

How do I translate from English to a local language like Wolof using seq2seq?

You would need paired examples of text in your source language and your target language. Then you could train an encoder + decoder architecture.
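As a rough sketch of the encoder + decoder idea (plain PyTorch rather than fastai, and all sizes are made-up toy values): the encoder LSTM compresses the source sequence into its final hidden state, and the decoder LSTM generates the target sequence starting from that state.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Toy encoder-decoder LSTM; illustrative only, sizes are arbitrary."""
    def __init__(self, src_vocab, trg_vocab, emb=32, hid=64):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.trg_emb = nn.Embedding(trg_vocab, emb)
        self.encoder = nn.LSTM(emb, hid, batch_first=True)
        self.decoder = nn.LSTM(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, trg_vocab)

    def forward(self, src, trg):
        # Encode the whole source sequence; keep only the final (h, c) state.
        _, state = self.encoder(self.src_emb(src))
        # Decode the target sequence, initialised from the encoder state
        # (teacher forcing: the ground-truth target is fed in during training).
        dec_out, _ = self.decoder(self.trg_emb(trg), state)
        return self.out(dec_out)  # (batch, trg_len, trg_vocab) logits

model = Seq2Seq(src_vocab=100, trg_vocab=120)
src = torch.randint(0, 100, (4, 7))   # batch of 4 source sequences, length 7
trg = torch.randint(0, 120, (4, 9))   # corresponding target sequences, length 9
logits = model(src, trg)
print(logits.shape)  # torch.Size([4, 9, 120])
```

At inference time you would instead feed the decoder its own previous prediction token by token, but the training-time shape above shows the core wiring.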

I recommend this course to gain an understanding: https://www.fast.ai/2019/07/08/fastai-nlp/

Thank you

How can I use a .txt file to build a translation model with the fastai library, just like in the PyTorch seq2seq tutorial?
And is there a way to tokenize languages that have no spaCy tokenizer model?