Hi, I’m also working on trying to get multi-label text classification to work. This is what I have done so far, any help would be really appreciated
My data looks like this:

| text | label | is_valid |
| --- | --- | --- |
| Lorem ipsum | tag_01\|tag_02 | False |
I initialise the tokenizer and model as follows, along with a few modifications to your FastHugsTokenizer and FastHugsModel functions:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig
transformer_tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name)
transformer_model = AutoModelForSequenceClassification.from_pretrained(pretrained_model_name)
Following the rest of your notebook works for single labels, but I'm having trouble adapting it to multi-label data. I first collect the set of all labels:
import itertools

a = [x.split('|') for x in df.label]  # per-row lists of pipe-delimited tags
chain = itertools.chain(*a)           # flatten into one stream of tags
b = sorted(set(chain))                # unique label vocab, sorted for a stable order
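As a sanity check, here is what that label extraction does on a hypothetical mini-sample of the `label` column (sorting the set makes the vocab order reproducible between runs):

```python
import itertools

# Hypothetical mini-sample of the pipe-delimited label column
labels_col = ["tag_01|tag_02", "tag_02|tag_03", "tag_01"]

a = [x.split('|') for x in labels_col]  # per-row lists of tags
flat = itertools.chain(*a)              # flatten into one stream of tags
b = sorted(set(flat))                   # unique vocab, stable order
print(b)  # → ['tag_01', 'tag_02', 'tag_03']
```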
And then I create a dataset with
splits = ColSplitter()(df)
x_tfms = [attrgetter("text"),
          Tokenizer.from_df('text', fasthugstok),
          Numericalize(vocab=transformer_vocab)]
dsets = Datasets(df, splits=splits,
                 tfms=[x_tfms, [ColReader('label', label_delim='|'), MultiCategorize(vocab=b)]],
                 dl_type=SortedDL)
bs = 16
dls = dsets.dataloaders(bs=bs, device='cuda',
                        before_batch=transformer_padding(transformer_tokenizer))
The problem is that, although dls.train_ds[0] does return something that looks right:
(TensorText([ 102, 4078, 30952,
...
157, 12264, 3937, 103]),
TensorMultiCategory([3]))
the DataLoader itself fails and I can't start any training:

Could not do one pass in your dataloader, there is something wrong in it
My guess is that something is going wrong between tokenization and Numericalize, because the following works (with fastai's own tokenizer):
text_cols = ['text']
dsets = DataBlock(blocks=(TextBlock.from_df(text_cols), MultiCategoryBlock(vocab=b)),
                  get_x=[attrgetter('text')],
                  get_y=ColReader('label', label_delim='|'),
                  splitter=RandomSplitter(valid_pct=0.2),
                  dl_type=SortedDL,
                  )
bs = 16
dls = dsets.dataloaders(df,
                        bs=bs,
                        seq_len=80,
                        device='cuda',
                        before_batch=transformer_padding(transformer_tokenizer),
                        )
Update:
This seems to work:
new_df = df.copy()
new_df = pd.concat([new_df, new_df['label'].str.get_dummies(sep='|')], axis=1)
x_tfms = [
    attrgetter('text'),
    Tokenizer.from_df(text_cols='text', res_col_name='text', tok_func=fasthugstok),
    Numericalize(vocab=transformer_vocab),
]
y_tfms = [
    ColReader(b),
    EncodedMultiCategorize(vocab=b),
]
dsets = Datasets(items=new_df,
                 tfms=[x_tfms, y_tfms],
                 splits=ColSplitter(col='is_valid')(new_df),
                 dl_type=SortedDL)
bs = 16
dls = dsets.dataloaders(
    bs=bs,
    # device='cuda',
    device='cpu',
    before_batch=transformer_padding(transformer_tokenizer),
)
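The get_dummies step is what makes this version work: it turns the pipe-delimited column into one 0/1 column per tag, which EncodedMultiCategorize can consume directly. On a hypothetical two-row frame:

```python
import pandas as pd

# Hypothetical two-row frame mirroring the text/label layout above
df = pd.DataFrame({"text": ["Lorem ipsum", "dolor sit"],
                   "label": ["tag_01|tag_02", "tag_02"]})

one_hot = df["label"].str.get_dummies(sep="|")   # one 0/1 column per tag
new_df = pd.concat([df, one_hot], axis=1)
print(new_df[["tag_01", "tag_02"]].values.tolist())  # → [[1, 1], [0, 1]]
```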
...
opt_func = partial(Adam, decouple_wd=True)
cbs = [MixedPrecision(clip=0.1), SaveModelCallback()]
# loss = CrossEntropyLossFlat() #LabelSmoothingCrossEntropy
loss = nn.BCEWithLogitsLoss()
splitter = splitters[transformer_model.config.model_type]
learn = Learner(dls,
                fasthugs_model,
                opt_func=opt_func,
                splitter=splitter,
                loss_func=loss,
                cbs=cbs,
                metrics=[accuracy_multi],  # plain accuracy is single-label; accuracy_multi matches the BCE targets
                )
learn.fit_one_cycle(3, lr_max=1e-2)
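As a quick check on the loss choice: with multi-label targets each tag is an independent 0/1 flag, so BCEWithLogitsLoss (a sigmoid per tag) fits, whereas CrossEntropyLossFlat assumes exactly one class per sample. A minimal sketch with made-up logits:

```python
import torch
import torch.nn as nn

# A sigmoid is applied per tag rather than a softmax over all tags
loss_fn = nn.BCEWithLogitsLoss()
logits = torch.tensor([[2.0, -1.0, 0.5]])   # raw model scores for 3 tags
targets = torch.tensor([[1.0, 0.0, 1.0]])   # this sample has tags 0 and 2
loss = loss_fn(logits, targets)
print(round(loss.item(), 4))  # → 0.3048
```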
But training happens on the CPU. What might be the issue with CUDA here?
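One thing worth verifying first (an assumption on my side, since I can't see your environment) is whether PyTorch can see the GPU at all:

```python
import torch

# If this prints False, PyTorch can't see the GPU and fastai
# will quietly fall back to the CPU
print(torch.cuda.is_available())

# If it prints True, try moving the DataLoaders and the model
# explicitly rather than relying on the device= argument:
# dls.to('cuda')
# learn.model.cuda()
```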
Thank you!
Regards