Fastai v2 text

I think an extension module is best suited (I will work on that example tomorrow). The ClassificationInterpretation class is mostly for plot_top_losses or confusion_matrix, so I think it’s fine for this one to have its own class.

2 Likes

Sounds good, thanks!

Thanks for the quick turnaround!!

In this code:

def intrinsic_attention(learn, text, class_id=None):
  "Calculate the intrinsic attention of the input w.r.t to an output `class_id`, or the classification given by the model if `None`."
  learn.model.train()
  _eval_dropouts(learn.model)
  learn.model.zero_grad()
  learn.model.reset()
  dl = dls.test_dl([text])
  ids = dl.one_batch()[0]
  emb = learn.model[0].module.encoder(batch).detach().requires_grad_(True)
  lstm = learn.model[0].module(emb, True)
  learn.model.eval()
  cl = learn.model[1]((lstm, torch.zeros_like(batch).bool(),))[0].softmax(dim=-1)
  if class_id is None: class_id = cl.argmax()
  cl[0][class_id].backward()
  attn = emb.grad.squeeze().abs().sum(dim=-1)
  attn /= attn.max()
  tok, _ = learn.dls.decode_batch((*tuplify(batch), *tuplify(cl)))[0]
  return tok, attn

I got this error:

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-62-139f078376c4> in <module>
----> 1 show_intrinsic_attention(learn,"Superman is the best superhero! No one will ever defeat him!")

<ipython-input-61-ec8033bf013f> in show_intrinsic_attention(learn, text, class_id, **kwargs)
     55 
     56 def show_intrinsic_attention(learn, text:str, class_id:int=None, **kwargs)->None:
---> 57     text, attn = intrinsic_attention(learn, text, class_id)
     58     show_piece_attn(text.split(), to_np(attn), **kwargs)

<ipython-input-61-ec8033bf013f> in intrinsic_attention(learn, text, class_id)
     15   learn.model.zero_grad()
     16   learn.model.reset()
---> 17   dl = dls.test_dl([text])
     18   ids = dl.one_batch()[0]
     19   emb = learn.model[0].module.encoder(batch).detach().requires_grad_(True)

NameError: name 'dls' is not defined

Which I fixed by changing this line:

dl = dls.test_dl([text])

to:

dl = learn.dls.test_dl([text])

Then I got:

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-64-139f078376c4> in <module>
----> 1 show_intrinsic_attention(learn,"Superman is the best superhero! No one will ever defeat him!")

<ipython-input-63-86ff1b978f06> in show_intrinsic_attention(learn, text, class_id, **kwargs)
     55 
     56 def show_intrinsic_attention(learn, text:str, class_id:int=None, **kwargs)->None:
---> 57     text, attn = intrinsic_attention(learn, text, class_id)
     58     show_piece_attn(text.split(), to_np(attn), **kwargs)

<ipython-input-63-86ff1b978f06> in intrinsic_attention(learn, text, class_id)
     17   dl = learn.dls.test_dl([text])
     18   ids = dl.one_batch()[0]
---> 19   emb = learn.model[0].module.encoder(batch).detach().requires_grad_(True)
     20   lstm = learn.model[0].module(emb, True)
     21   learn.model.eval()

NameError: name 'batch' is not defined

Which I fixed by changing line 18:

ids = dl.one_batch()[0]

to:

batch = dl.one_batch()[0]

Since I didn’t see “ids” used anywhere else. Now the output has all the words highlighted the same, and it’s showing nan outputs.

Here’s how I created the learners for both (I only use one of the DataBlocks at a time):

#This is for a normal category prediction, where only one can be correct.

imdb_clas = DataBlock(blocks=(TextBlock.from_df(['names'], vocab=dbunch.vocab), CategoryBlock),
                      get_x=attrgetter('text'),
                      get_y=attrgetter('number'),
                      splitter=TrainTestSplitter(test_size = 0.2, stratify=df_numbers['number'], random_state = 12))

#This is a regression. Use this to predict a floating point number.

imdb_clas = DataBlock(blocks=(TextBlock.from_df(['names'], vocab=dbunch.vocab), RegressionBlock),
                      get_x=attrgetter('text'),
                      get_y=attrgetter('number'),
                      splitter=TrainTestSplitter(test_size = 0.1, stratify=df_scores['number'], df=df_numbers, random_state = 24)
                      )

#For regressions

callbacks = [SaveModelCallback(),EarlyStoppingCallback(patience=3)]

learn = text_classifier_learner(dbunch_class, AWD_LSTM, drop_mult=0.5, loss_func=MSELossFlat(), wd = 0.1, y_range=(-3,3), cbs=callbacks).to_fp16()
learn = learn.load_encoder('finetuned6_208.pkl')
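
One guess I still need to rule out (nothing in this thread confirms it): the learner is in half precision via .to_fp16(), and fp16 overflow is a classic source of nans, so I may retry the attention pass in full precision:

learn = learn.to_fp32()  # undo to_fp16() for this experiment
tok, attn = intrinsic_attention(learn, "Superman is the best superhero! No one will ever defeat him!")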

Thanks for the help!

If y’all will be working on an official implementation of this tomorrow, we’d also love to see this brought over from https://github.com/fastai/fastai/blob/master/fastai/text/interpret.py :slight_smile:


    def show_top_losses(self, k:int, max_len:int=70)->None:
        """
        Create a tabulation showing the first `k` texts in top_losses along with their prediction, actual, loss, and probability of
        the actual class. `max_len` is the maximum number of tokens displayed.
        """
        from IPython.display import display, HTML
        items = []
        tl_val,tl_idx = self.top_losses()
        for i,idx in enumerate(tl_idx):
            if k <= 0: break
            k -= 1
            tx,cl = self.data.dl(self.ds_type).dataset[idx]
            cl = cl.data
            classes = self.data.classes
            txt = ' '.join(tx.text.split(' ')[:max_len]) if max_len is not None else tx.text
            tmp = [txt, f'{classes[self.pred_class[idx]]}', f'{classes[cl]}', f'{self.losses[idx]:.2f}',
                   f'{self.preds[idx][cl]:.2f}']
            items.append(tmp)
        items = np.array(items)
        names = ['Text', 'Prediction', 'Actual', 'Loss', 'Probability']
        df = pd.DataFrame({n:items[:,i] for i,n in enumerate(names)}, columns=names)
        with pd.option_context('display.max_colwidth', pd_max_colwidth()):
            display(HTML(df.to_html(index=False)))

Originally mentioned here:

It would be really useful in fastai2.

If not, no problem. We all appreciate your work @sgugger and @muellerzr !

It already works in fastai v2, across applications.
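
For example, with any trained learn:

interp = Interpretation.from_learner(learn)
interp.plot_top_losses(9)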

1 Like

I’ll debug this in a bit.

1 Like

FYI, just fixed a critical bug in WeightDropout (it basically was not working), so if you get unexpected changes in AWD LSTMs, it probably comes from that.

2 Likes

I am trying to replicate the fastai2 text classification notebook in a Kaggle kernel, and the TextDataLoaders creation tends to run on the CPU even when a GPU is enabled. Is CPU the default for the text API?

Additionally, the kernel dies because TextDataLoaders tries to use too much memory and too many processes. Is there a way to limit memory and core usage in fastai2?
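
One workaround I’m considering, though I’m not sure it’s the supported way: fastai2’s parallel tokenization takes its worker count from defaults.cpus (which comes in with the star imports), so capping that might help:

defaults.cpus = 2  # guess: limit the number of tokenization worker processes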

Finished fixing all the mess due to this bug and made things cleaner (for instance, there are no longer two parameters, one a duplicate of the other). That also means models trained previously don’t have the same parameters, so loading them might be harder across versions of fastai. I added a function that should automatically convert those weights, but it may fail.

In any case, since WeightDropout wasn’t working properly, models should ideally be retrained (it was also creating issues with predictions).

3 Likes

Just updated my editable installs of fastai2 and fastcore, currently getting this message when trying to import all from fastai2.text:

from fastai2.text.all import *
from fastai2.tabular.all import *
pd.set_option("display.max_columns", 50)
import seaborn as sns
sns.set(style='whitegrid')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-360028788016> in <module>
----> 1 from fastai2.text.all import *
      2 from fastai2.tabular.all import *
      3 pd.set_option("display.max_columns", 50)
      4 import seaborn as sns
      5 sns.set(style='whitegrid')

~/development/_training/fastai2/fastai2/text/all.py in <module>
----> 1 from ..basics import *
      2 from ..callback.all import *
      3 from .core import *
      4 from .data import *
      5 from .models import *

~/development/_training/fastai2/fastai2/basics.py in <module>
----> 1 from .data.all import *
      2 from .optimizer import *
      3 from .callback.core import *
      4 from .learner import *
      5 from .metrics import *

~/development/_training/fastai2/fastai2/data/all.py in <module>
      1 from ..torch_basics import *
----> 2 from .core import *
      3 from .load import *
      4 from .external import *
      5 from .transforms import *

~/development/_training/fastai2/fastai2/data/core.py in <module>
     33 
     34 # Cell
---> 35 @log_args(but_as=DataLoader.__init__)
     36 @delegates()
     37 class TfmdDL(DataLoader):

TypeError: log_args() got an unexpected keyword argument 'but_as'

Anyone else getting this? Going to revert to an earlier fastai2/fastcore version for now.

It looks like you don’t have the latest version of fastcore.

3 Likes

Ah yes, I forgot to do a git pull :man_facepalming:

1 Like

Could someone help me with this?

I am trying to load the following data into a TextDataLoader:

id comment_text toxic severe_toxic obscene threat insult identity_hate
0 0000997932d777bf Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren’t vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don’t remove the template from the talk page since I’m retired now.89.205.38.27 0 0 0 0 0 0
1 000103f0d9cfb60f D’aww! He matches this background colour I’m seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC) 0 0 0 0 0 0
2 000113f07ec002fd Hey man, I’m really not trying to edit war. It’s just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info. 0 0 0 0 0 0
3 0001b41b1c6bb37e "\nMore\nI can’t make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of ““types of accidents”” -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know.\n\nThere appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up. It’s listed in the relevant form eg Wikipedia:Good_ar… 0 0 0 0 0 0
4 0001d958c54c6e35 You, sir, are my hero. Any chance you remember what page that’s on? 0 0 0 0 0 0

I use the following line of code:

dls = TextDataLoaders.from_csv(data_drive, csv_fname="train.csv",valid_pct=0.1,text_col="comment_text",label_col=["toxic","severe_toxic","obscene","threat","insult","identity_hate"])

I end up getting an error and I can’t figure out why:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-5-3c53bcdc9583> in <module>()
      1 default_device(None)
      2 
----> 3 dls = TextDataLoaders.from_csv(data_drive, csv_fname="train.csv",valid_pct=0.1,text_col="comment_text",label_col=["toxic","severe_toxic","obscene","threat","insult","identity_hate"])

16 frames
/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py in __getattr__(self, name)
   5272             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5273                 return self[name]
-> 5274             return object.__getattribute__(self, name)
   5275 
   5276     def __setattr__(self, name: str, value) -> None:

AttributeError: 'Series' object has no attribute 'comment_text'

Check the examples here:

After the dataloader is done tokenizing the text, the text column is always labelled “text”.

It’s asking for the tokenized text column, not the column in your original csv that has the text you want to analyze. That can be confusing, because in this example the actual untokenized text was ALSO in a “text” column originally! I’ve never done this with .from_csv; when I import data from a csv, I first read it into a dataframe:

df = pd.read_csv('myamazing.csv', low_memory=False)

Then throw that into a datablock:

imdb_clas = DataBlock(blocks=(TextBlock.from_df(['description'], vocab=dbunch.vocab), RegressionBlock),
                          get_x=attrgetter('text'),
                          get_y=attrgetter('number')
                          )

And throw that into the dataloader:

dbunch_class = imdb_clas.dataloaders(df, bs=64, seq_len=80)

In this example, my csv has two columns:

description | number

Hope this helps!

1 Like

Thanks for the reply! Is this an error in the library, then? It seems strange that TextDataLoaders.from_csv accepts a text_col argument when the header of the text column should always be “text”.

1 Like

The default tokenizer sets the column to “text”, but you can write a custom one that sets it differently. Or you can import a csv with text that has already been tokenized outside of fastai, and it might have a different column name.

I don’t think it’s an error. Little things like this should become clearer once the latest courses (which use fastai2) are released and the fastai2 documentation is fleshed out.

1 Like

Hello!

Could someone help me with this?

What is the equivalent of label_from_df for fastai2?
I have a multicategory classification problem with data looking like this:

sample_submission.csv test.csv test_labels.csv train.csv

id comment_text toxic severe_toxic obscene threat insult identity_hate
0 0000997932d777bf Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren’t vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don’t remove the template from the talk page since I’m retired now.89.205.38.27 0 0 0 0 0 0

I try to load the data with the following code:

dls = TextDataLoaders.from_df(df,data_drive,valid_pct=0.1,text_col="comment_text",label_col=["toxic","severe_toxic","obscene","threat","insult","identity_hate"])

But the library fails to detect the labels correctly.

@hackerbear - Have a look at this: https://dev.fast.ai/tutorial.datablock#Text-classification. It’s not the exact answer you are looking for, but it may help:

imdb_clas = DataBlock(blocks=(TextBlock.from_df('text', seq_len=72, vocab=dls.vocab), CategoryBlock),
                      get_x=ColReader('text'),
                      get_y=ColReader('label'),
                      splitter=ColSplitter())
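
For the multi-label columns in your data, a MultiCategoryBlock with encoded=True might be closer to what you want (an untested sketch; toxic_clas is just a made-up name):

label_cols = ['toxic','severe_toxic','obscene','threat','insult','identity_hate']
toxic_clas = DataBlock(blocks=(TextBlock.from_df('comment_text', seq_len=72),
                               MultiCategoryBlock(encoded=True, vocab=label_cols)),
                       get_x=ColReader('text'),    # the tokenized text column
                       get_y=ColReader(label_cols),
                       splitter=RandomSplitter(0.1))
dls = toxic_clas.dataloaders(df, bs=64)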

1 Like

Thank you for this excellent intermediate wiki tutorial.

This is related to extending the wikitext tutorial to use SentencePiece and customizing it (e.g. setting model_type='bpe').

I am following the exact steps described and it works great with the mid-level API, but I’m facing an issue setting up the Transforms when I customize SentencePiece.

Phase 1
tfms = [attrgetter('text'), Tokenizer.from_df(text_cols=0), Numericalize()] => works fine, since this uses the default SpacyTokenizer

Phase 2
sent_tfm = Tokenizer.from_df(text_cols=0, tok_func=SentencePieceTokenizer) => works fine as well; it uses SentencePieceTokenizer as the tokenizer function

Phase 3
sent_tfm = Tokenizer.from_df(text_cols=0, tok_func=partial(SentencePieceTokenizer, model_type='bpe')) => fails with AttributeError: 'NoneType' object has no attribute 'EncodeAsPieces'

This colab (commenting enabled) demonstrates the issue with a starter example. My guess is that either I am not using partial correctly or I don’t know how to customize SentencePiece.

Any help is appreciated.

1 Like

Answering my own question. In order to customize the Tokenizer transform:

sent_tfm = Tokenizer.from_df(text_cols=0, tok_func=SentencePieceTokenizer, model_type='bpe', vocab_sz=1000)
tfms = [attrgetter('text'), sent_tfm, Numericalize()]

Updated the colab as well.
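
And for anyone else adapting the wikitext tutorial, the customized transforms plug into Datasets the same way as the default ones (a sketch using a random split instead of the tutorial’s fixed one):

splits = RandomSplitter(valid_pct=0.1)(range_of(df))
dsets = Datasets(df, [tfms], splits=splits, dl_type=LMDataLoader)
dls = dsets.dataloaders(bs=64, seq_len=72)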

2 Likes