How would I go about inputting my seq2seq data using a DataBlock? The DataBunch class is not used in v2, and there are no clear examples of how to accomplish this. I’ve done this with TorchText and PyTorch, but I want to take advantage of the fastai Learner class since it seems to produce better results.
Put your train & valid sequences into one Pandas DataFrame that contains an is_valid column.
When specifying your DataBlock blocks, use two TextBlocks.
Your get_x (source sequence) and get_y (target sequence) read from the ‘text’ column, which the tokenization process adds automatically.
Use ColSplitter as your splitter. The is_valid column tells it which rows in the df belong to the validation set and which to the training set.
logs = DataBlock(
# blocks specify what type of data we are going to be loading.
# In this case both are text files contained in the same df
blocks=(TextBlock.from_df('from_txt',is_lm=False),TextBlock.from_df('to_txt',is_lm=False)),
# The TextBlock tokenization process puts tokenized inputs into a column called text.
# The ColReader for get_x will always reference text, even if the original text inputs
# were in a column with another name in the dataframe.
get_x=ColReader('text'),
get_y=ColReader('text'),
# The dataframe needs to have a is_valid column for this to work.
splitter=ColSplitter()
)
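For reference, here is a sketch of the DataFrame shape the above expects (the from_txt/to_txt pairs below are made-up toy data). As far as I can tell, the DataBlock is just a blueprint, and the actual DataFrame gets passed when you call dataloaders on it:

```python
import pandas as pd

# Toy stand-in data -- the source/target pairs here are invented.
df = pd.DataFrame({
    'from_txt': ['hello world', 'good morning', 'see you later'],
    'to_txt':   ['bonjour le monde', 'bonjour', 'a plus tard'],
    'is_valid': [False, False, True],  # ColSplitter reads this column
})

# The DataFrame is supplied when you actually build the DataLoaders:
# dls = logs.dataloaders(df, bs=2)
```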
I do have some questions:
How do I control the tokenization of the sequences from here?
Is there a high level data flow document that tells us what exactly happens when we feed in a df to a TextBlock?
How do I specify the dataframe to be processed? Right now I’m assuming that fastai just looks for a dataframe named ‘df’?
I think I’ve figured out question 1. How do I control the tokenization of the sequences from here?
TL;DR: pass a tok argument to each TextBlock.
logs = DataBlock(
# blocks specify what type of data we are going to be loading.
# In this case both are text files contained in the same df.
# specify your tokenizer using the tok argument
blocks=(
TextBlock.from_df('from_txt', is_lm=False, tok=SubwordTokenizer(vocab_sz=200)),
TextBlock.from_df('to_txt' , is_lm=False, tok=SubwordTokenizer(vocab_sz=200))),
# The TextBlock tokenization process puts tokenized inputs into a column called text.
# The ColReader for get_x will always reference text, even if the original text inputs
# were in a column with another name in the dataframe.
get_x=ColReader('text'),
get_y=ColReader('text'),
# The dataframe needs to have a is_valid column for this to work.
splitter=ColSplitter()
)
I still have these questions, if anyone knows:
Is there a high level data flow document that tells us what exactly happens when we feed in a df to a TextBlock?
How do I specify the dataframe to be processed? Right now I’m assuming that fastai just looks for a dataframe named ‘df’?
Is there a full list of tokenizers? I don’t see anything in the documentation about SubwordTokenizer; I only found it by poking around in the code.
How can I use the .txt file to build a translation model, just like the PyTorch Seq2Seq model, with the fastai library?
And is there a way to tokenize languages that have no spaCy tokenizer model?
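On that last question, my understanding is that the tok argument just needs a callable that takes a batch of texts and yields token lists, so you could fall back to a custom tokenizer. A rough, untested sketch with a naive whitespace tokenizer (the class name and the TextBlock wiring are my own guesses):

```python
class WhitespaceTokenizer:
    "Naive fallback tokenizer for languages without a spaCy model."
    def __call__(self, items):
        # fastai's Tokenizer calls tok with a batch of texts and
        # expects an iterable of token lists back.
        for txt in items:
            yield txt.split()

tok = WhitespaceTokenizer()
print(list(tok(['hello world', 'foo bar baz'])))

# If my reading of the API is right, you'd plug it in like:
# TextBlock.from_df('from_txt', is_lm=False, tok=WhitespaceTokenizer())
```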