I think an extension module is best suited (will work on that example tomorrow). The ClassificationInterpretation class is mostly for plot_top_losses or confusion_matrix. I think it’s fine to have this one have its own class.
Sounds good, thanks!
Thanks for the quick turnaround!!
In this code:
def intrinsic_attention(learn, text, class_id=None):
    "Calculate the intrinsic attention of the input w.r.t to an output `class_id`, or the classification given by the model if `None`."
    learn.model.train()
    _eval_dropouts(learn.model)
    learn.model.zero_grad()
    learn.model.reset()
    dl = dls.test_dl([text])
    ids = dl.one_batch()[0]
    emb = learn.model[0].module.encoder(batch).detach().requires_grad_(True)
    lstm = learn.model[0].module(emb, True)
    learn.model.eval()
    cl = learn.model[1]((lstm, torch.zeros_like(batch).bool(),))[0].softmax(dim=-1)
    if class_id is None: class_id = cl.argmax()
    cl[0][class_id].backward()
    attn = emb.grad.squeeze().abs().sum(dim=-1)
    attn /= attn.max()
    tok, _ = learn.dls.decode_batch((*tuplify(batch), *tuplify(cl)))[0]
    return tok, attn
I got this error:
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-62-139f078376c4> in <module>
----> 1 show_intrinsic_attention(learn,"Superman is the best superhero! No one will ever defeat him!")
<ipython-input-61-ec8033bf013f> in show_intrinsic_attention(learn, text, class_id, **kwargs)
55
56 def show_intrinsic_attention(learn, text:str, class_id:int=None, **kwargs)->None:
---> 57 text, attn = intrinsic_attention(learn, text, class_id)
58 show_piece_attn(text.split(), to_np(attn), **kwargs)
<ipython-input-61-ec8033bf013f> in intrinsic_attention(learn, text, class_id)
15 learn.model.zero_grad()
16 learn.model.reset()
---> 17 dl = dls.test_dl([text])
18 ids = dl.one_batch()[0]
19 emb = learn.model[0].module.encoder(batch).detach().requires_grad_(True)
NameError: name 'dls' is not defined
Which I fixed by changing this line:
dl = dls.test_dl([text])
to:
dl = learn.dls.test_dl([text])
Then I got:
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-64-139f078376c4> in <module>
----> 1 show_intrinsic_attention(learn,"Superman is the best superhero! No one will ever defeat him!")
<ipython-input-63-86ff1b978f06> in show_intrinsic_attention(learn, text, class_id, **kwargs)
55
56 def show_intrinsic_attention(learn, text:str, class_id:int=None, **kwargs)->None:
---> 57 text, attn = intrinsic_attention(learn, text, class_id)
58 show_piece_attn(text.split(), to_np(attn), **kwargs)
<ipython-input-63-86ff1b978f06> in intrinsic_attention(learn, text, class_id)
17 dl = learn.dls.test_dl([text])
18 ids = dl.one_batch()[0]
---> 19 emb = learn.model[0].module.encoder(batch).detach().requires_grad_(True)
20 lstm = learn.model[0].module(emb, True)
21 learn.model.eval()
NameError: name 'batch' is not defined
Which I fixed by changing line 18:
ids = dl.one_batch()[0]
to:
batch = dl.one_batch()[0]
I made that change because I didn’t see “ids” used anywhere. Now the output has all the words highlighted the same, and it’s showing nan outputs.
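One possible explanation for the nan values, offered as an assumption rather than a confirmed diagnosis: the learner in this post is built for regression (MSELossFlat with y_range), so if its head emits a single output, softmax over that single value is always 1.0. Its gradient with respect to the input is then zero, the attention vector is all zeros, and attn /= attn.max() divides zero by zero. The sketch below reproduces that arithmetic in plain Python; softmax, numeric_grad and safe_normalize are illustrative helpers, not fastai functions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def numeric_grad(f, x, eps=1e-6):
    """Central finite-difference estimate of df/dx."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# Single-output head (regression-like): softmax is constant, so the gradient is zero.
grad_single = numeric_grad(lambda x: softmax([x])[0], 0.7)

# Two-output head (classification-like): the gradient is non-zero.
grad_two = numeric_grad(lambda x: softmax([x, 0.0])[0], 0.7)

def safe_normalize(attn):
    """Scale by the maximum, but leave an all-zero vector untouched instead of producing nan."""
    peak = max(attn)
    if peak == 0:
        return list(attn)
    return [a / peak for a in attn]

print(grad_single, grad_two, safe_normalize([0.0, 0.0, 0.0]))
```

If this is indeed the cause, backpropagating from the raw model output rather than from a softmax probability would be one way to get a non-zero gradient for a regression head.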
Here’s how I created the learners for both (I only use one of the DataBlocks at a time):
# This is for a normal category prediction, where only one can be correct.
imdb_clas = DataBlock(blocks=(TextBlock.from_df(['names'], vocab=dbunch.vocab), CategoryBlock),
                      get_x=attrgetter('text'),
                      get_y=attrgetter('number'),
                      splitter=TrainTestSplitter(test_size = 0.2, stratify=df_numbers['number'], random_state = 12))
# This is a regression. Use this to predict a floating point number.
imdb_clas = DataBlock(blocks=(TextBlock.from_df(['names'], vocab=dbunch.vocab), RegressionBlock),
                      get_x=attrgetter('text'),
                      get_y=attrgetter('number'),
                      splitter=TrainTestSplitter(test_size = 0.1, stratify=df_scores['number'], df=df_numbers, random_state = 24)
                      )
#For regressions
callbacks = [SaveModelCallback(),EarlyStoppingCallback(patience=3)]
learn = text_classifier_learner(dbunch_class, AWD_LSTM, drop_mult=0.5, loss_func=MSELossFlat(), wd = 0.1, y_range=(-3,3), cbs=callbacks).to_fp16()
learn = learn.load_encoder('finetuned6_208.pkl')
Thanks for the help!
If y’all will be working on an official implementation of this tomorrow, we’d also love to see this brought over from https://github.com/fastai/fastai/blob/master/fastai/text/interpret.py
def show_top_losses(self, k:int, max_len:int=70)->None:
    """
    Create a tabulation showing the first `k` texts in top_losses along with their prediction, actual,loss, and probability of
    actual class. `max_len` is the maximum number of tokens displayed.
    """
    from IPython.display import display, HTML
    items = []
    tl_val,tl_idx = self.top_losses()
    for i,idx in enumerate(tl_idx):
        if k <= 0: break
        k -= 1
        tx,cl = self.data.dl(self.ds_type).dataset[idx]
        cl = cl.data
        classes = self.data.classes
        txt = ' '.join(tx.text.split(' ')[:max_len]) if max_len is not None else tx.text
        tmp = [txt, f'{classes[self.pred_class[idx]]}', f'{classes[cl]}', f'{self.losses[idx]:.2f}',
               f'{self.preds[idx][cl]:.2f}']
        items.append(tmp)
    items = np.array(items)
    names = ['Text', 'Prediction', 'Actual', 'Loss', 'Probability']
    df = pd.DataFrame({n:items[:,i] for i,n in enumerate(names)}, columns=names)
    with pd.option_context('display.max_colwidth', pd_max_colwidth()):
        display(HTML(df.to_html(index=False)))
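The two library-independent steps in the function above, picking the k largest losses and truncating each text to max_len tokens, can be sketched in plain Python. This is only an illustration of the logic, not a port to fastai2; top_k_losses and truncate_tokens are hypothetical helper names, and the losses and texts are made-up sample values.

```python
def top_k_losses(losses, k):
    """Return (index, loss) pairs for the k largest losses, largest first."""
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return [(i, losses[i]) for i in order[:k]]

def truncate_tokens(text, max_len=70):
    """Keep at most max_len space-separated tokens; None disables truncation."""
    if max_len is None:
        return text
    return ' '.join(text.split(' ')[:max_len])

# Made-up sample data, for illustration only.
losses = [0.10, 2.50, 0.70]
texts = ["fine movie", "worst film I have ever seen", "it was okay"]

rows = [(truncate_tokens(texts[i], max_len=3), f"{loss:.2f}")
        for i, loss in top_k_losses(losses, k=2)]
print(rows)
```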
Originally mentioned here:
It would be really useful in fastai2.
If not, no problem. We all appreciate your work @sgugger and @muellerzr !
It already works in fastai v2, across applications.
I’ll debug this in a bit.
FYI, just fixed a critical bug in WeightDropout (it basically was not working), so if you get unexpected changes in AWD LSTMs, it probably comes from that.
I am trying to replicate the fastai2 text classification notebook in a Kaggle kernel. The TextDataLoaders generation tends to run on the CPU even when the GPU is enabled. Is CPU the default setup for the text API?
Additionally, the kernel dies because TextDataLoaders tries to use too much memory and too many processes. Is there a way to limit memory and core usage in fastai2?
Finished fixing all the mess due to this bug and made things cleaner (for instance, there are no longer two parameters, one being a duplicate of the other). That also means models trained previously don’t have the same parameters, so loading might be harder across versions of fastai. I added a function that should automatically convert those weights, but it may fail.
In any case, since WeightDropout wasn’t working properly, models should be retrained, ideally (also it was creating issues with predictions).
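To make the idea of converting old weights concrete, here is a minimal sketch of dropping a duplicated entry from a saved state dictionary. It is not the conversion function mentioned above: drop_duplicated_weights is a hypothetical helper, and the key names and the "_raw" suffix are invented for illustration and should not be read as fastai's real parameter names.

```python
def drop_duplicated_weights(state, raw_suffix="_raw"):
    """Return a copy of `state` without entries that also exist under `<name><raw_suffix>`."""
    return {k: v for k, v in state.items() if k + raw_suffix not in state}

# Illustrative state dict: key names are hypothetical.
old_state = {
    "rnn.weight_hh": [0.1, 0.2],      # duplicate of the entry below
    "rnn.weight_hh_raw": [0.1, 0.2],  # the copy that is kept
    "rnn.bias": [0.0],
}

new_state = drop_duplicated_weights(old_state)
print(sorted(new_state))
```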
Just updated my editable installs of fastai2 and fastcore, currently getting this message when trying to import all from fastai2.text:
from fastai2.text.all import *
from fastai2.tabular.all import *
pd.set_option("display.max_columns", 50)
import seaborn as sns
sns.set(style='whitegrid')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-2-360028788016> in <module>
----> 1 from fastai2.text.all import *
2 from fastai2.tabular.all import *
3 pd.set_option("display.max_columns", 50)
4 import seaborn as sns
5 sns.set(style='whitegrid')
~/development/_training/fastai2/fastai2/text/all.py in <module>
----> 1 from ..basics import *
2 from ..callback.all import *
3 from .core import *
4 from .data import *
5 from .models import *
~/development/_training/fastai2/fastai2/basics.py in <module>
----> 1 from .data.all import *
2 from .optimizer import *
3 from .callback.core import *
4 from .learner import *
5 from .metrics import *
~/development/_training/fastai2/fastai2/data/all.py in <module>
1 from ..torch_basics import *
----> 2 from .core import *
3 from .load import *
4 from .external import *
5 from .transforms import *
~/development/_training/fastai2/fastai2/data/core.py in <module>
33
34 # Cell
---> 35 @log_args(but_as=DataLoader.__init__)
36 @delegates()
37 class TfmdDL(DataLoader):
TypeError: log_args() got an unexpected keyword argument 'but_as'
Anyone else getting this? Going to revert to an earlier fastai2/fastcore version for now.
It looks like you don’t have the latest version of fastcore.
Ah yes, I forgot to do a git pull
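When an import fails like this, a quick first check is which versions of the two packages are actually installed. The sketch below uses only the standard library (Python 3.8+); installed_version is a hypothetical helper name. For an editable git checkout the reported version may not change with every commit, so treat it as a coarse check.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

# Report both packages side by side so a mismatch is easy to spot.
versions = {name: installed_version(name) for name in ("fastai2", "fastcore")}
print(versions)
```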
Could someone help me with this?
I am trying to load the following data into a TextDataLoader:
| | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|---|
0 | 0000997932d777bf | Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren’t vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don’t remove the template from the talk page since I’m retired now.89.205.38.27 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 000103f0d9cfb60f | D’aww! He matches this background colour I’m seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC) | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 000113f07ec002fd | Hey man, I’m really not trying to edit war. It’s just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info. | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 0001b41b1c6bb37e | "\nMore\nI can’t make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of ““types of accidents”” -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know.\n\nThere appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up. It’s listed in the relevant form eg Wikipedia:Good_ar… | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0001d958c54c6e35 | You, sir, are my hero. Any chance you remember what page that’s on? | 0 | 0 | 0 | 0 | 0 | 0 |
I use the following line of code:
dls = TextDataLoaders.from_csv(data_drive, csv_fname="train.csv",valid_pct=0.1,text_col="comment_text",label_col=["toxic","severe_toxic","obscene","threat","insult","identity_hate"])
I end up getting an error and I can’t find why:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-5-3c53bcdc9583> in <module>()
1 default_device(None)
2
----> 3 dls = TextDataLoaders.from_csv(data_drive, csv_fname="train.csv",valid_pct=0.1,text_col="comment_text",label_col=["toxic","severe_toxic","obscene","threat","insult","identity_hate"])
16 frames
/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py in __getattr__(self, name)
5272 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5273 return self[name]
-> 5274 return object.__getattribute__(self, name)
5275
5276 def __setattr__(self, name: str, value) -> None:
AttributeError: 'Series' object has no attribute 'comment_text'
Check the examples here:
After the dataloader is done tokenizing the text, the text column is always labelled “text”.
It’s asking for the tokenized text column, not the column in your original csv that has the text you want to analyze. That can be confusing, because in this example, the actual untokenized text was ALSO in a “text” column originally! I’ve never done this with .from_csv. When I import data from a csv, I first read it into a dataframe:
df = pd.read_csv('myamazing.csv', low_memory=False)
Then throw that into a datablock:
imdb_clas = DataBlock(blocks=(TextBlock.from_df(['description'], vocab=dbunch.vocab), RegressionBlock),
get_x=attrgetter('text'),
get_y=attrgetter('number')
)
And throw that into the dataloader:
dbunch_class = imdb_clas.dataloaders(df, bs=64, seq_len=80)
In this example, my csv has two columns:
description | number
Hope this helps!
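The renaming described in this reply can be mimicked with a toy example. This is not fastai's tokenizer: toy_tokenize is a hypothetical stand-in that reads the original column and writes its output to a field called text, which is enough to show why a getter for the old column name fails afterwards.

```python
from operator import attrgetter
from types import SimpleNamespace

def toy_tokenize(row, text_col):
    """Toy stand-in for a tokenization step: reads `text_col`, writes tokens to `text`."""
    fields = {k: v for k, v in vars(row).items() if k != text_col}
    fields["text"] = getattr(row, text_col).split()
    return SimpleNamespace(**fields)

raw = SimpleNamespace(comment_text="You, sir, are my hero.", toxic=0)
tokenized = toy_tokenize(raw, text_col="comment_text")

# Reading the tokenized column works.
tokens = attrgetter("text")(tokenized)

# Reading the original column name no longer works after tokenization.
try:
    attrgetter("comment_text")(tokenized)
    lookup_failed = False
except AttributeError:
    lookup_failed = True

print(tokens, lookup_failed)
```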
Thanks for the reply! Is this an error in the library then? It seems strange that the TextDataLoaders.from_csv command accepts a text_col argument when the header of the text column is always expected to be “text”.
The default tokenizer sets the column to “text”, but you can write a custom one that sets it differently. Or you can import a csv with text that has already been tokenized outside of fastai, and it might have a different column name.
I don’t think it’s an error. Little things like this should become more clear once the latest courses are released, which use fastai2, and the fastai2 documentation is fleshed out.
Hello!
Could someone help me with this?
What is the equivalent of label_from_df for fastai2?
I have a multicategory classification problem with data looking like this:
sample_submission.csv test.csv test_labels.csv train.csv
| | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|---|
0 | 0000997932d777bf | Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren’t vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don’t remove the template from the talk page since I’m retired now.89.205.38.27 | 0 | 0 | 0 | 0 | 0 | 0 |
I try to load the data with the following code:
dls = TextDataLoaders.from_df(df,data_drive,valid_pct=0.1,text_col="comment_text",label_col=["toxic","severe_toxic","obscene","threat","insult","identity_hate"])
But the library fails to correctly detect the labels.
@hackerbear - Have a look at https://dev.fast.ai/tutorial.datablock#Text-classification. It is not the exact answer you are looking for.
imdb_clas = DataBlock(blocks=(TextBlock.from_df('text', seq_len=72, vocab=dls.vocab), CategoryBlock),
                      get_x=ColReader('text'),
                      get_y=ColReader('label'),
                      splitter=ColSplitter())
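The thread does not confirm how fastai2 expects multi-label targets spread over several 0/1 columns. One library-independent way to make the targets explicit before building a DataBlock is to collapse the six indicator columns into a list of active label names per row. The sketch below assumes rows are available as dictionaries; active_labels is a hypothetical helper and the two rows are made-up placeholders.

```python
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def active_labels(row, labels=LABELS):
    """Return the names of the label columns whose value is 1 in this row."""
    return [name for name in labels if int(row[name]) == 1]

# Made-up placeholder rows with the same columns as train.csv.
rows = [
    {"comment_text": "first comment", "toxic": 0, "severe_toxic": 0, "obscene": 0,
     "threat": 0, "insult": 0, "identity_hate": 0},
    {"comment_text": "second comment", "toxic": 1, "severe_toxic": 0, "obscene": 1,
     "threat": 0, "insult": 1, "identity_hate": 0},
]

label_lists = [active_labels(r) for r in rows]
# A space-delimited string is a common alternative representation.
label_strings = [" ".join(names) for names in label_lists]
print(label_lists, label_strings)
```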
Thank you for this excellent intermediate wiki tutorial.
This is related to extending the wiki text tutorial to use SentencePiece and customizing it (e.g. setting model_type to bpe).
I am following the exact steps described, and it works great with the mid-level API. I am facing an issue setting up the Transforms when I customize SentencePiece.
Phase 1
tfms = [attrgetter('text'), Tokenizer.from_df(text_cols=0), Numericalize()]
=> Works fine since this uses default SpacyTokenizer
Phase 2
sent_tfm = Tokenizer.from_df(text_cols=0, tok_func=SentencePieceTokenizer)
=> Works fine as well. It uses tokenizer function as SentencePieceTokenizer
Phase 3
sent_tfm = Tokenizer.from_df(text_cols=0, tok_func=partial(SentencePieceTokenizer, model_type='bpe'))
=> Fails with AttributeError: 'NoneType' object has no attribute 'EncodeAsPieces'
This colab (commenting enabled) demonstrates the above issue with a starter example. My guess is that I am either not using partial correctly or I don’t know how to customize SentencePiece.
Any help is appreciated.
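Answering my own question: to customize the Tokenizer transform, pass the SentencePiece options as keyword arguments to Tokenizer.from_df instead of wrapping the tokenizer in partial: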
sent_tfm = Tokenizer.from_df(text_cols=0, tok_func=SentencePieceTokenizer, model_type='bpe', vocab_sz=1000)
tfms = [attrgetter('text'), sent_tfm, Numericalize()]
Updated the colab as well.
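Why the partial version fails is not explained in the thread. One general Python fact that may be relevant, stated only as a possibility: a functools.partial wrapping a class is callable but is not itself a class, so library code that inspects tok_func could treat the two forms differently. The sketch below shows that difference with a hypothetical ToyTokenizer standing in for SentencePieceTokenizer; build is likewise a made-up helper that forwards keyword arguments, in the spirit of the working call above.

```python
from functools import partial
import inspect

class ToyTokenizer:
    """Hypothetical tokenizer class standing in for SentencePieceTokenizer."""
    def __init__(self, model_type="unigram", vocab_sz=None):
        self.model_type = model_type
        self.vocab_sz = vocab_sz

wrapped = partial(ToyTokenizer, model_type="bpe")

# The class is a class; the partial is only a callable that builds one.
is_class_plain = inspect.isclass(ToyTokenizer)
is_class_wrapped = inspect.isclass(wrapped)

def build(tok_cls, **kwargs):
    """Instantiate the tokenizer class, forwarding options as keyword arguments."""
    return tok_cls(**kwargs)

tok = build(ToyTokenizer, model_type="bpe", vocab_sz=1000)
print(is_class_plain, is_class_wrapped, tok.model_type, tok.vocab_sz)
```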