Applying ULMFit to genomic sequences - help with TextBlock.from_df needed

Hello!

I am trying to apply the ULMFit approach to genomic sequences in order to compete here: [https://www.drivendata.org/competitions/63/genetic-engineering-attribution/page/165/]

I thought it would be a good way to practice the concept showed here: [https://github.com/fastai/fastbook/blob/master/10_nlp.ipynb]

Some context about the problem: engineered DNA sequences are stored in a CSV file along with the id of the lab they come from. It is a classification problem (given a sequence, predict the lab id).

I'm trying to use SubwordTokenizer and TextBlock.from_df, but I get an error and I haven't been able to find a solution.

Here’s an example to reproduce:

import pandas as pd

train = pd.DataFrame([['catgcattagttattaatagtgatgcntg'],
                      ['gctggatggtttgggacatgatggtttgggacatgatggtttgggacatg'],
                      ['nnccgggctgtagctacacatacataca'],
                      ['gcggagatgaagagccctac']],
                     columns=['sequence'])

That’s how I’m trying to define the DataLoaders:

dls_lm = DataBlock(
    blocks=TextBlock.from_df('sequence', is_lm=True, tok=SubwordTokenizer(vocab_sz=20)),
    splitter=RandomSplitter(0.1)
).dataloaders(train[['sequence']])

The error I get:

/usr/local/lib/python3.6/dist-packages/fastai/text/data.py in <listcomp>(.0)
     46             self.o2i = defaultdict(int, {v:k for k,v in enumerate(self.vocab) if v != 'xxfake'})
     47 
---> 48     def encodes(self, o): return TensorText(tensor([self.o2i[o_] for o_ in o]))
     49     def decodes(self, o): return L(self.vocab[o_] for o_ in o if self.vocab[o_] != self.pad_tok)

TypeError: unhashable type: 'L'

When I go into data.py and print the object 'o', I see this:

text           [▁xxbos, ▁g, c, tg, g, a, tg, g, ▁xxrep, ▁, 3, ▁, t, ▁xxrep, ▁, 3, ▁g, ▁, a, c, a, tg, a, tg, g, ▁xxrep, ▁, 3, ▁, t, ▁xxrep, ▁, 3, ▁g, ▁, a, c, a, tg, a, tg, g, ▁xxrep, ▁, 3, ▁, t, ▁xxrep, ▁, 3, ▁g, ▁, a, c, a, tg, +]
text_length                                                                                                                                                                                                                           57
Name: 1, dtype: object

When I pass o['text'] to the encodes function instead of just o, it works… but I'm clearly not doing it right from the beginning…
Any help would be appreciated!

You also need a get_x to grab the column. See the text portion of the DataBlock tutorial: https://docs.fast.ai/tutorial.datablock#Text
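A minimal sketch of what the corrected DataBlock could look like, assuming (as in the tutorial) that the column produced by the tokenizer is named 'text', so ColReader('text') is what get_x should read. The fastai import is guarded so the pandas part still runs without fastai installed:

```python
import pandas as pd

# tiny repro DataFrame, as in the original post
train = pd.DataFrame([['catgcattagttattaatagtgatgcntg'],
                      ['gctggatggtttgggacatgatggtttgggacatgatggtttgggacatg'],
                      ['nnccgggctgtagctacacatacataca'],
                      ['gcggagatgaagagccctac']],
                     columns=['sequence'])

try:
    from fastai.text.all import (DataBlock, TextBlock, SubwordTokenizer,
                                 ColReader, RandomSplitter)

    # TextBlock.from_df writes its tokenized output to a column called
    # 'text', so get_x must read that column, not the original 'sequence'
    dblock_lm = DataBlock(
        blocks=TextBlock.from_df('sequence', is_lm=True,
                                 tok=SubwordTokenizer(vocab_sz=20)),
        get_x=ColReader('text'),
        splitter=RandomSplitter(0.1),
    )
    # dls_lm = dblock_lm.dataloaders(train[['sequence']], bs=2)
except ImportError:
    pass  # fastai not available in this environment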


You could also take a look at this excellent notebook by @marcossantana


Great, I was able to get up to the classifier step and run fit_one_cycle on my learner, so now I can start improving my process. Very cool.
Thank you!


@muellerzr Hi! Do you know if TextBlock.from_df would work for an Image to Text dataloader?

I’m getting RuntimeError: stack expects each tensor to be equal size, but got [2] at entry 0 and [3] at entry 18 for the following:

db = DataBlock(blocks=(ImageBlock, TextBlock.from_df('text')), get_x=get_specs_from_df, get_y=attrgetter('text'))
dls = db.dataloaders(df, bs=64)
dls.one_batch()

It seems like a padding issue, but I'm not sure how to use the fastai methods here.

dls.show_batch(nrows=2, ncols=3) does return some data, probably because there are no stacking errors in that particular batch. However, I'm confused by the text: all I get is xxbos xxunk for every data point. Am I doing something wrong?
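The stack error happens because the tokenized text targets have different lengths, so they can't be stacked into a single batch tensor. Conceptually, the fix is to pad every sequence in a batch to the batch's longest length before stacking. A plain-Python sketch of that idea (not fastai's actual implementation; the pad index 1 mirrors fastai's default xxpad id, an assumption here):

```python
def pad_batch(seqs, pad_idx=1):
    """Pad variable-length token-id lists to the length of the longest,
    so they can be stacked into one rectangular batch."""
    max_len = max(len(s) for s in seqs)
    return [s + [pad_idx] * (max_len - len(s)) for s in seqs]

batch = pad_batch([[2, 5, 7], [2, 9], [2, 4, 6, 8]])
# every row now has length 4 and the batch can be stacked
```

In fastai, this per-batch padding is what transforms like pad_input do; passing one as a before_batch argument to dataloaders may resolve the stacking error, but check the current docs for the exact signature. The xxbos xxunk output is a separate symptom: it suggests the vocabulary was built from the wrong column, so every real token maps to xxunk.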