Applying ULMFit to genomic sequences - help with TextBlock.from_df needed


I am trying to apply the ULMFit approach to genomic sequences in order to compete here: []

I thought it would be a good way to practice the concepts shown here: []

Some context about the problem: engineered DNA sequences are stored in a CSV file along with the lab id they come from. This is a classification problem: given a sequence, determine the lab id.

I’m trying to use SubwordTokenizer and TextBlock.from_df but I get an error and I haven’t been able to find a solution.

Here’s an example to reproduce:


Here's how I'm trying to define the DataLoaders:

dls_lm = DataBlock(
    blocks=TextBlock.from_df('sequence', is_lm=True, tok=SubwordTokenizer(vocab_sz=20)),

The error I get:

/usr/local/lib/python3.6/dist-packages/fastai/text/ in <listcomp>(.0)
     46             self.o2i = defaultdict(int, {v:k for k,v in enumerate(self.vocab) if v != 'xxfake'})
---> 48     def encodes(self, o): return TensorText(tensor([self.o2i  [o_] for o_ in o]))
     49     def decodes(self, o): return L(self.vocab[o_] for o_ in o if self.vocab[o_] != self.pad_tok)

TypeError: unhashable type: 'L'

When I go in and print the object `o`, I see this:

text           [▁xxbos, ▁g, c, tg, g, a, tg, g, ▁xxrep, ▁, 3, ▁, t, ▁xxrep, ▁, 3, ▁g, ▁, a, c, a, tg, a, tg, g, ▁xxrep, ▁, 3, ▁, t, ▁xxrep, ▁, 3, ▁g, ▁, a, c, a, tg, a, tg, g, ▁xxrep, ▁, 3, ▁, t, ▁xxrep, ▁, 3, ▁g, ▁, a, c, a, tg, +]
text_length                                                                                                                                                                                                                           57
Name: 1, dtype: object

So when I put `o['text']` in the `encodes` function instead of just `o`, it works… but I'm clearly not doing it right from the beginning…
Any help would be appreciated!
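For anyone who hits the same traceback, the cause can be shown without fastai: `Numericalize.encodes` iterates over its input and looks each element up in the `o2i` vocab dict. Without a `get_x`, it receives the whole row rather than the token list, so the elements being looked up are themselves list-like (fastai's `L`), and dict keys must be hashable. A minimal pure-Python sketch of the failure (toy vocab, no fastai):

```python
# Toy stand-in for fastai's Numericalize.o2i vocab mapping
o2i = {'xxbos': 0, 'g': 1, 'c': 2, 'tg': 3, 'a': 4}

# What encodes() should receive: a flat list of tokens, so the lookup works
tokens = ['xxbos', 'g', 'c', 'tg']
ids = [o2i[t] for t in tokens]
print(ids)  # [0, 1, 2, 3]

# What it actually receives without get_x: a row whose first element is
# itself a token list, and lists are not hashable dict keys
row = [['xxbos', 'g', 'c', 'tg'], 57]  # token list + text_length
try:
    [o2i[o_] for o_ in row]
except TypeError as e:
    print(e)  # unhashable type: 'list'
```

That is exactly why indexing with `o['text']` (the flat token list) made the lookup succeed.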

You also need a get_x to grab the column. See the text portion of the DataBlock tutorial:


You could also take a look at this excellent notebook by @marcossantana


Great, I was able to get up to the classifier step and run fit_one_cycle on my learner, so now I can start improving my process. Very cool!
Thank you!


@muellerzr Hi! Do you know if TextBlock.from_df would work for an Image to Text dataloader?

I’m getting RuntimeError: stack expects each tensor to be equal size, but got [2] at entry 0 and [3] at entry 18 for the following:

db = DataBlock(blocks=(ImageBlock, TextBlock.from_df('text')), get_x=get_specs_from_df, get_y=attrgetter('text'))
dls = db.dataloaders(df, bs=64)

It seems like a padding issue, but I'm not sure how to use the fastai methods here.

dls.show_batch(nrows=2, ncols=3) does return some data, probably because there are no stacking errors in that particular batch. However, I'm confused by the text: all I get is xxbos xxunk for every data point. Am I doing something wrong?
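In case it helps anyone debugging the same thing, the stack error itself is generic: the default collate tries to stack per-sample tensors whose lengths differ, so variable-length texts must be padded to a common length within each batch. A minimal numpy sketch (made-up token ids; not the fastai API, which handles this through its own padding transforms on the text side):

```python
import numpy as np

def pad_batch(seqs, pad_id=1):
    """Right-pad variable-length id sequences so they stack cleanly."""
    max_len = max(len(s) for s in seqs)
    return np.stack([
        np.pad(s, (0, max_len - len(s)), constant_values=pad_id)
        for s in seqs
    ])

# Made-up numericalized captions of different lengths
batch = [np.array([2, 5]), np.array([2, 7, 9]), np.array([2, 4, 6, 8])]

# np.stack(batch) alone would fail: shapes (2,), (3,), (4,) don't match
padded = pad_batch(batch)
print(padded.shape)  # (3, 4)
```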