Help request: show_batch and learn.predict errors with TextBlock.from_df (NLP)

I’m trying to build my own NLP project using a Kaggle dataframe, and for the most part things seem to be building out OK. I have run into two errors that I think are related.

First, after I build my datablock:
dls_block = DataBlock(
    blocks=(TextBlock.from_df('blurb', is_lm=True), CategoryBlock),
    get_x=ColReader('text'),
    get_y=ColReader('state'),
    splitter=RandomSplitter(0.1)
).dataloaders(df, path=path, bs=16, seq_len=80)

I try and show a batch using dls_block.show_batch()

but receive this error:

IndexError                                Traceback (most recent call last)
<ipython-input-10-e3057ceec2e6> in <module>()
----> 1 dls_block.show_batch()

19 frames
/usr/local/lib/python3.7/dist-packages/fastcore/foundation.py in <listcomp>(.0)
    117         return (self.items.iloc[list(i)] if hasattr(self.items,'iloc')
    118                 else self.items.__array__()[(i,)] if hasattr(self.items,'__array__')
--> 119                 else [self.items[i_] for i_ in i])
    120 
    121     def __setitem__(self, idx, o):

IndexError: list index out of range

I can, however, create a language_model_learner with this datablock and successfully call fit_one_cycle. But things go wrong again when I try and call learn.predict with

TEXT = "A new"
N_WORDS = 40
N_SENTENCES = 2
preds = [learn.predict(TEXT, N_WORDS, temperature=0.75) 
         for _ in range(N_SENTENCES)]

and I get this error message:

ValueError                                Traceback (most recent call last)
<ipython-input-14-7599bd22d406> in <module>()
      3 N_SENTENCES = 2
      4 preds = [learn.predict(TEXT, N_WORDS, temperature=0.75) 
----> 5          for _ in range(N_SENTENCES)]

1 frames
/usr/local/lib/python3.7/dist-packages/fastai/text/learner.py in predict(self, text, n_words, no_unk, temperature, min_p, no_bar, decoder, only_last_word)
    157         self.model.reset()
    158         idxs = idxs_all = self.dls.test_dl([text]).items[0].to(self.dls.device)
--> 159         if no_unk: unk_idx = self.dls.vocab.index(UNK)
    160         for _ in (range(n_words) if no_bar else progress_bar(range(n_words), leave=False)):
    161             with self.no_bar(): preds,_ = self.get_preds(dl=[(idxs[None],)])

ValueError: 'xxunk' is not in list

Any help or advice would be appreciated!
Thanks,
Alex

Your DataBlock is not built correctly. If you are building a DataBlock for a language model, there should be only one block, i.e. TextBlock. A language model has no separate targets: the target is the next token in the sequence, which comes from the text itself. So the CategoryBlock is not required.
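To make the "no separate targets" point concrete, here is a minimal sketch in plain Python (no fastai involved). The helper name lm_inputs_and_targets and the sample tokens are made up for illustration; the idea is only that the targets are the input sequence shifted by one position.

```python
def lm_inputs_and_targets(tokens):
    """Split a token sequence into inputs and next-token targets.

    Hypothetical helper for illustration: the target at each position
    is simply the token that follows it in the same text.
    """
    if len(tokens) < 2:
        raise ValueError("need at least two tokens to form an input/target pair")
    return tokens[:-1], tokens[1:]


# Sample tokens are invented; 'xxbos' is fastai's beginning-of-stream marker.
tokens = ["xxbos", "a", "new", "board", "game"]
inputs, targets = lm_inputs_and_targets(tokens)

# Each input token is paired with the token that comes right after it.
for x, y in zip(inputs, targets):
    print(f"{x!r} -> {y!r}")
```

Because the labels come from the text itself, there is nothing for a second block such as CategoryBlock to supply.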

Similarly, get_y is not required here, since a language model has no separate y (target).
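As a rough sketch of what the corrected DataBlock could look like, here is the original code with the CategoryBlock and get_y removed. The column name 'blurb' and the bs and seq_len values are taken from the question; the wrapper function build_lm_dataloaders is a hypothetical name added only to keep the example self-contained, and it assumes fastai v2 is installed.

```python
def build_lm_dataloaders(df, path, bs=16, seq_len=80):
    """Build language-model DataLoaders from a dataframe with a 'blurb' column.

    Hypothetical wrapper for illustration; assumes fastai v2 is installed.
    """
    # Imported here so that defining the function does not require fastai.
    from fastai.text.all import DataBlock, TextBlock, ColReader, RandomSplitter

    lm_block = DataBlock(
        # A single TextBlock: with is_lm=True the targets come from the text itself.
        blocks=TextBlock.from_df('blurb', is_lm=True),
        # TextBlock.from_df writes the tokenized output to a column named 'text'.
        get_x=ColReader('text'),
        # Hold out 10% of the rows for validation; no get_y is needed.
        splitter=RandomSplitter(0.1),
    )
    return lm_block.dataloaders(df, path=path, bs=bs, seq_len=seq_len)
```

With the dataframe from the question, you would call dls_block = build_lm_dataloaders(df, path) and then try dls_block.show_batch() again.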

Refer to the tutorial and try to build your DataBlock the way it is done there. Check whether this helps.

Thank you! This solved the problem and helped me understand better how the language model works.

My pleasure. I am glad that I was able to help :grinning: