Hi, I apologize in advance if these questions have already been asked elsewhere. I’m using the DataBlock API to combine 3 columns of my dataset with mark_fields=True, and then do binary classification. Two questions:
- I found a way to make this work (it predicts something), but it feels like a hack. Is there a better way to create the data with the processed column?
- I see different metrics on dev during training compared to when I use dev as a test set. Could this be related to dropout?
Any input will be appreciated, and please feel free to comment on anything else beyond my specific questions. I’m still learning the new API.
Thanks
My code
text_block = TextBlock.from_df(
    text_cols=INPUT_COLUMNS,
    is_lm=False,
    seq_len=1_000,
    tok=None,
    # add xxfld between fields
    mark_fields=True,
    # name for the output column
    tok_text_col='ulmfit_text',
    vocab=torch.load("data_lm_vocab")
)
data_cls = DataBlock(
    blocks=(text_block, CategoryBlock),
    get_x=ColReader("ulmfit_text"),
    get_y=ColReader("label"),
    splitter=ColSplitter("is_dev"),
)
data_cls = data_cls.dataloaders(
    pd.concat([
        df_train.assign(is_dev=False),
        df_dev.assign(is_dev=True),
    ]),
    shuffle_train=True,
    bs=64,
)
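To make the effect of mark_fields=True concrete, here is a minimal pure-Python sketch of how the text columns are joined before tokenization. It is an approximation written from my understanding of fastai's behaviour, not the library's code: `join_fields` is a hypothetical helper, and the `xxfld` marker and 1-based field numbering are assumptions.

```python
# Approximate illustration of field marking; not fastai's actual implementation.
FLD = "xxfld"  # assumed to match fastai's field-marker special token


def join_fields(values, mark_fields=True):
    """Join the text columns of one row, optionally prefixing each with 'xxfld <n>'."""
    parts = []
    for i, value in enumerate(values, start=1):
        parts.append(f"{FLD} {i} {value}" if mark_fields else str(value))
    return " ".join(parts)


row = ["some title", "some body", "some tag"]  # made-up values for three text columns
print(join_fields(row))
```

The joined string is what then gets tokenized and stored in the column named by tok_text_col, which is why that column only exists after the tokenization step has run.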
Then I train the model as in
cls_learner = text_classifier_learner(
    data_cls,
    AWD_LSTM,
    drop_mult=0.5,
    metrics=[Precision(), Recall(), F1Score(), RocAucBinary()]
)
cls_learner.load_encoder("1epoch_encoder")
cls_learner.fine_tune(4, lr_max=0.5)
cls_learner.export("classifier_with_finetune")
Now if I restart the kernel and want to predict again on the dev set, I can do
# this loads the learner but not the dls
cls_learner = load_learner("classifier_with_finetune")
cls_learner.dls = data_cls
probas, targets, preds = cls_learner.get_preds(ds_idx=1, with_decoded=True)
print(classification_report(targets, preds))
and I get the same metrics I saw during training. Good.
I now want to predict on a given test set. To verify that I’m doing everything right, I will use the dev set as the test set and check that I get exactly the same metrics as in the two reports above. The problem is that when doing
dl = cls_learner.dls.test_dl(df_dev, with_labels=True)
it complains with
ulmfit_text column not found
so it seems like it’s not doing the operations defined in the DataBlock. I wonder if there’s a better way to do this. The workaround (hack) I found is
data_cls_test = DataBlock(
    blocks=(text_block, CategoryBlock),
    get_x=ColReader("ulmfit_text"),
    get_y=ColReader("label"),
)
# TODO: is there a better way to do this? split to then concat seems absurd ...
dl_test = data_cls_test.dataloaders(df_dev, bs=128)
tmp = pd.concat([dl_test[0].items, dl_test[1].items]).sort_index()
dl = cls_learner.dls.test_dl(tmp, with_labels=True)
test_probas, test_targets, test_preds = cls_learner.get_preds(dl=dl, with_decoded=True)
print(classification_report(test_targets.numpy().astype(int), test_preds.numpy().astype(int)))
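One untested alternative to the split-and-concat step would be to tokenize the test frame directly, so that the ulmfit_text column exists before calling test_dl. The sketch below assumes fastai v2's `tokenize_df(df, text_cols, mark_fields=..., tok_text_col=...)`, which as far as I can tell returns the tokenized frame plus a token counter. `add_tokenized_column` is a hypothetical helper name, and fastai is imported inside the function so that the definition itself has no dependencies.

```python
def add_tokenized_column(df, text_cols, tok_text_col="ulmfit_text"):
    """Return a copy of `df` with the text columns tokenized into `tok_text_col`.

    Assumption: the default tokenizer and rules used here match the ones
    used when the training DataLoaders were built (tok=None in both cases).
    """
    # Lazy import: only needed when the function is actually called.
    from fastai.text.all import tokenize_df

    tokenized_df, _token_counts = tokenize_df(
        df,
        text_cols,
        mark_fields=True,           # same setting as in TextBlock.from_df
        tok_text_col=tok_text_col,  # same output column name as in training
    )
    return tokenized_df


# Intended usage (not run here):
# dl = cls_learner.dls.test_dl(add_tokenized_column(df_dev, INPUT_COLUMNS), with_labels=True)
```

This only replaces the extra DataBlock and the concat; it would not by itself explain a difference in metrics.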
Unfortunately, with the split-and-concat workaround I don’t get the same result, but I can see that test_targets is the same as targets, so I’m quite confident the data is the same and in the same order.
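To narrow down where the two runs diverge, it may help to compare the predicted probabilities directly rather than only the final reports. The following is a small diagnostic sketch in plain Python; `compare_predictions` and its tolerance are hypothetical choices, and it expects plain lists of floats (for tensors, something like `probas[:, 1].tolist()`).

```python
def compare_predictions(reference, candidate, tol=1e-6):
    """Summarize element-wise differences between two equally long score lists."""
    if len(reference) != len(candidate):
        raise ValueError(
            f"length mismatch: {len(reference)} vs {len(candidate)}"
        )
    diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    mismatched = [i for i, d in enumerate(diffs) if d > tol]
    return {
        "max_abs_diff": max(diffs, default=0.0),
        "n_mismatched": len(mismatched),
        "first_mismatched": mismatched[:5],  # a few row indices to inspect by hand
    }


# Intended usage with the variables above (not run here):
# compare_predictions(probas[:, 1].tolist(), test_probas[:, 1].tolist())
```

If only a handful of rows differ, inspecting the tokenized text of those rows in both DataLoaders would show whether the inputs really are identical; if almost every row differs, the preprocessing or the model state is the more likely suspect.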