Hello!
I am trying to apply the ULMFit approach to genomic sequences in order to compete here: [https://www.drivendata.org/competitions/63/genetic-engineering-attribution/page/165/]
I thought it would be a good way to practice the concepts shown here: [https://github.com/fastai/fastbook/blob/master/10_nlp.ipynb]
Some context about the problem: engineered DNA sequences are stored in a CSV file along with the ID of the lab they come from. It is a classification problem: given a sequence, determine the lab ID.
I’m trying to use SubwordTokenizer and TextBlock.from_df but I get an error and I haven’t been able to find a solution.
Here’s an example to reproduce:
import pandas as pd

# Four toy sequences, one per row, in a single 'sequence' column
train = pd.DataFrame([['catgcattagttattaatagtgatgcntg'],
                      ['gctggatggtttgggacatgatggtttgggacatgatggtttgggacatg'],
                      ['nnccgggctgtagctacacatacataca'],
                      ['gcggagatgaagagccctac']],
                     columns=['sequence'])
That’s how I’m trying to define the DataLoaders:
from fastai.text.all import *

dls_lm = DataBlock(
    blocks=TextBlock.from_df('sequence', is_lm=True, tok=SubwordTokenizer(vocab_sz=20)),
    splitter=RandomSplitter(0.1)
).dataloaders(train[['sequence']])
The error I get:
/usr/local/lib/python3.6/dist-packages/fastai/text/data.py in <listcomp>(.0)
     46         self.o2i = defaultdict(int, {v:k for k,v in enumerate(self.vocab) if v != 'xxfake'})
     47
---> 48     def encodes(self, o): return TensorText(tensor([self.o2i [o_] for o_ in o]))
     49     def decodes(self, o): return L(self.vocab[o_] for o_ in o if self.vocab[o_] != self.pad_tok)

TypeError: unhashable type: 'L'
When I go into data.py and print the object 'o', I see this:
text           [▁xxbos, ▁g, c, tg, g, a, tg, g, ▁xxrep, ▁, 3, ▁, t, ▁xxrep, ▁, 3, ▁g, ▁, a, c, a, tg, a, tg, g, ▁xxrep, ▁, 3, ▁, t, ▁xxrep, ▁, 3, ▁g, ▁, a, c, a, tg, a, tg, g, ▁xxrep, ▁, 3, ▁, t, ▁xxrep, ▁, 3, ▁g, ▁, a, c, a, tg, +]
text_length    57
Name: 1, dtype: object
So when I pass o['text'] to the 'encodes' function instead of just 'o', it works. That suggests 'encodes' is receiving the whole row rather than only the token list, so I'm clearly doing something wrong from the start.
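To show what I think is happening, here is a minimal sketch in plain Python (no fastai or pandas). The vocab, the 'row' dict and the 'encode' helper are hypothetical stand-ins I made up for illustration: the row mimics the printed object (a 'text' field holding the token list plus a 'text_length' field), and the lookup copies the pattern from the 'encodes' line in the traceback. Iterating over the whole row's values hands the token list itself to the dictionary lookup, which fails because a list (like fastai's L) is unhashable, while passing only the 'text' field works.

```python
from collections import defaultdict

# Hypothetical tiny vocab, standing in for the real subword vocab
vocab = ['xxunk', '▁xxbos', '▁g', 'c', 'tg']
o2i = defaultdict(int, {tok: idx for idx, tok in enumerate(vocab)})

# Hypothetical row shaped like the printed object: token list + its length
row = {'text': ['▁xxbos', '▁g', 'c', 'tg'], 'text_length': 4}

def encode(items):
    # Same lookup pattern as the `encodes` line in the traceback
    return [o2i[item] for item in items]

# Case 1: iterate over the whole row's values.
# The first value is the token list itself, which cannot be a dict key.
try:
    encode(row.values())
    whole_row_error = None
except TypeError as err:
    whole_row_error = str(err)

# Case 2: pass only the token list, as in my o['text'] workaround
ids_from_text = encode(row['text'])

print(whole_row_error)
print(ids_from_text)
```

If this is the right picture, the real question is how to tell the DataBlock to pass only the 'text' column to the numericalization step, rather than patching data.py.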
Any help would be appreciated!