Applying ULMFit to genomic sequences - help with TextBlock.from_df needed

Hello!

I am trying to apply the ULMFit approach to genomic sequences in order to compete here: [https://www.drivendata.org/competitions/63/genetic-engineering-attribution/page/165/]

I thought it would be a good way to practice the concept showed here: [https://github.com/fastai/fastbook/blob/master/10_nlp.ipynb]

Some context about the problem: engineered DNA sequences are stored in a CSV file along with the id of the lab they come from. It is a classification problem (given a sequence, predict the lab id).

I'm trying to use SubwordTokenizer and TextBlock.from_df, but I get an error and I haven't been able to find a solution.

Here’s an example to reproduce:

import pandas as pd

train = pd.DataFrame([['catgcattagttattaatagtgatgcntg'],
                      ['gctggatggtttgggacatgatggtttgggacatgatggtttgggacatg'],
                      ['nnccgggctgtagctacacatacataca'],
                      ['gcggagatgaagagccctac']],
                     columns=['sequence'])

That’s how I’m trying to define the DataLoaders:

dls_lm = DataBlock(
    blocks=TextBlock.from_df('sequence', is_lm=True, tok=SubwordTokenizer(vocab_sz=20)),
    splitter=RandomSplitter(0.1)
).dataloaders(train[['sequence']])

The error I get:

/usr/local/lib/python3.6/dist-packages/fastai/text/data.py in <listcomp>(.0)
     46             self.o2i = defaultdict(int, {v:k for k,v in enumerate(self.vocab) if v != 'xxfake'})
     47 
---> 48     def encodes(self, o): return TensorText(tensor([self.o2i[o_] for o_ in o]))
     49     def decodes(self, o): return L(self.vocab[o_] for o_ in o if self.vocab[o_] != self.pad_tok)

TypeError: unhashable type: 'L'

When I go into data.py and print the object 'o', I see this:

text           [▁xxbos, ▁g, c, tg, g, a, tg, g, ▁xxrep, ▁, 3, ▁, t, ▁xxrep, ▁, 3, ▁g, ▁, a, c, a, tg, a, tg, g, ▁xxrep, ▁, 3, ▁, t, ▁xxrep, ▁, 3, ▁g, ▁, a, c, a, tg, a, tg, g, ▁xxrep, ▁, 3, ▁, t, ▁xxrep, ▁, 3, ▁g, ▁, a, c, a, tg, +]
text_length                                                                                                                                                                                                                           57
Name: 1, dtype: object

When I pass o['text'] to the encodes function instead of just o, it works… but I'm clearly not doing it right from the beginning…
Any help would be appreciated!

You also need a get_x to grab the column. See the text portion of the DataBlock tutorial: https://docs.fast.ai/tutorial.datablock#Text
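A minimal sketch of what the corrected DataBlock could look like, assuming (as in the tutorial) that the column produced by the tokenizer is named 'text', so ColReader('text') is what get_x should read. The fastai import is guarded so the pandas part still runs without fastai installed:

```python
import pandas as pd

# tiny repro DataFrame, as in the original post
train = pd.DataFrame([['catgcattagttattaatagtgatgcntg'],
                      ['gctggatggtttgggacatgatggtttgggacatgatggtttgggacatg'],
                      ['nnccgggctgtagctacacatacataca'],
                      ['gcggagatgaagagccctac']],
                     columns=['sequence'])

try:
    from fastai.text.all import (DataBlock, TextBlock, SubwordTokenizer,
                                 ColReader, RandomSplitter)

    # TextBlock.from_df writes its tokenized output to a column called
    # 'text', so get_x must read that column, not the original 'sequence'
    dblock_lm = DataBlock(
        blocks=TextBlock.from_df('sequence', is_lm=True,
                                 tok=SubwordTokenizer(vocab_sz=20)),
        get_x=ColReader('text'),
        splitter=RandomSplitter(0.1),
    )
    # dls_lm = dblock_lm.dataloaders(train[['sequence']], bs=2)
except ImportError:
    pass  # fastai not available in this environment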


You could also take a look at this excellent notebook by @marcossantana


Great, I was able to get up to the classifier step and run fit_one_cycle on my learner, so now I can start improving my process. Very cool.
Thank you!


@muellerzr Hi! Do you know if TextBlock.from_df would work for an Image to Text dataloader?

I’m getting RuntimeError: stack expects each tensor to be equal size, but got [2] at entry 0 and [3] at entry 18 for the following:

db = DataBlock(blocks=(ImageBlock, TextBlock.from_df('text')), get_x=get_specs_from_df, get_y=attrgetter('text'))
dls = db.dataloaders(df, bs=64)
dls.one_batch()

It seems like a padding issue, but I'm not sure how to use the fastai methods here.

dls.show_batch(nrows=2, ncols=3) does return some data, probably because there are no stacking errors in that particular batch. However, I'm confused by the text: all I get is xxbos xxunk for every data point. Am I doing something wrong?
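The stack error happens because the tokenized text targets have different lengths, so they can't be stacked into a single batch tensor. Conceptually, the fix is to pad every sequence in a batch to the batch's longest length before stacking. A plain-Python sketch of that idea (not fastai's actual implementation; the pad index 1 mirrors fastai's default xxpad id, an assumption here):

```python
def pad_batch(seqs, pad_idx=1):
    """Pad variable-length token-id lists to the length of the longest,
    so they can be stacked into one rectangular batch."""
    max_len = max(len(s) for s in seqs)
    return [s + [pad_idx] * (max_len - len(s)) for s in seqs]

batch = pad_batch([[2, 5, 7], [2, 9], [2, 4, 6, 8]])
# every row now has length 4 and the batch can be stacked
```

In fastai, this per-batch padding is what transforms like pad_input do; passing one as a before_batch argument to dataloaders may resolve the stacking error, but check the current docs for the exact signature. The xxbos xxunk output is a separate symptom: it suggests the vocabulary was built from the wrong column, so every real token maps to xxunk.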