FastAI2 ULMFiT model training reports a precision that I cannot reproduce on my test set

I’ve been training a classification model using FastAI2’s ULMFiT approach. When creating the learner, I pass metrics to report the accuracy, precision, recall and F-beta scores, like this:

learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, pretrained = True, metrics=[accuracy, Precision(), Recall(), FBeta(beta=1)]).to_fp16()
learn.load_encoder('finetuned_lm')

As I look at the performance of each epoch during training, it reports precision scores of around 85% and recall of around 95%, which I suppose are metrics on the validation set, not the training set. However, if I then run the model on my test set, I only reach a precision of 17.4%, while the recall is much closer to the one during training, at about 98%.
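To make sure I’m reading the progress table correctly, I figured I could also recompute the validation metrics outside the training loop and compare them to what the table prints. This is just a sketch, using the fitted learn object from the snippet below and sklearn for the report:

from sklearn.metrics import classification_report

# ds_idx=1 -> predictions on the validation split used during training
val_preds, val_targets = learn.get_preds(ds_idx=1)
val_labels = val_preds.argmax(dim=1)
print(classification_report(val_targets, val_labels, digits=3))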

Does anyone know what might be going on here? Both the training and test sets are preprocessed in exactly the same way. For reference, here is a bigger code snippet:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from fastai.text.all import *

df = pd.read_csv("Aggregated_Dataset_KEB_03-01_sent(corr).csv")
df = df.dropna()
df = df.reset_index(drop=True)
df = df.drop(["Unnamed: 0"], axis=1)
df['Class'] = df['Class'].astype(int)

# Stratified 80/20 split into a training pool and a held-out test set
temp_df, df_test = train_test_split(df[["Filename", "Class", "Sentence"]],
                                    stratify=df['Class'], test_size=0.2, random_state=314)

### some code to rebalance the training classes to a 3:1 ratio ###

### training of the LM ###

df_trn["Class"].value_counts()

>> 0    2023
>> 1     697
>> Name: Class, dtype: int64
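For comparison, the class balance of the held-out test set can be checked the same way; it keeps the original distribution from the stratified split, since no rebalancing is applied there (sketch, output not shown):

# The test set was not rebalanced, unlike df_trn above
df_test["Class"].value_counts()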

# TextBlock.from_df tokenizes the 'Sentence' column and stores the result
# in a 'text' column, which is why get_x reads 'text' below.
blocks = (TextBlock.from_df('Sentence', seq_len=dls_lm.seq_len, vocab=dls_lm.vocab), CategoryBlock())
dls = DataBlock(blocks=blocks,
                get_x=ColReader('text'),
                get_y=ColReader('Class'),
                splitter=RandomSplitter(0.2))

dls = dls.dataloaders(df_trn, bs=64)
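Since the metrics reported during training are computed on whatever RandomSplitter holds out from df_trn, I can at least confirm the split sizes and that texts and labels line up (just a quick sanity-check sketch):

# How many rows ended up in the train/valid splits, and does a batch look sensible?
print(len(dls.train_ds), len(dls.valid_ds))
dls.show_batch(max_n=3)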


learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, pretrained = True, metrics=[accuracy, Precision(), Recall(), FBeta(beta=1)]).to_fp16()
learn.load_encoder('finetuned_lm')

learn.fit_one_cycle(1, 1e-2)

>> epoch	train_loss	valid_loss	accuracy	precision_score	recall_score	fbeta_score	time
>> 0	    0.513772	0.376785	0.897059	0.818841	    0.784722	    0.801418	00:03

learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-3/(2.6**4),1e-2))

>> epoch	train_loss	valid_loss	accuracy	precision_score	recall_score	fbeta_score	time
>> 0	    0.420189	0.289749	0.897059	0.752874	    0.909722	    0.823899	00:03

learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),1e-2))

>> epoch	train_loss	valid_loss	accuracy	precision_score	recall_score	fbeta_score	time
>> 0	    0.323645	0.150523	0.943015	0.879195	    0.909722	    0.894198	00:04

learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),3e-3))

>> epoch	train_loss	valid_loss	accuracy	precision_score	recall_score	fbeta_score	time
>> 0		0.215019	0.131739	0.944853	0.851852		0.958333		0.901961	00:04
>> 1		0.172947	0.136240	0.944853	0.847561		0.965278		0.902597	00:04

learn.fit_one_cycle(5, slice(1e-3/(2.6**4),3e-3))

>> epoch	train_loss	valid_loss	accuracy	precision_score	recall_score	fbeta_score	time
>> 0		0.115063	0.125721	0.957721	0.885350		0.965278		0.923588	00:04
>> 1		0.110957	0.155260	0.943015	0.846626		0.958333		0.899023	00:04
>> 2		0.090381	0.121803	0.959559	0.896104		0.958333		0.926174	00:04
>> 3		0.069215	0.123623	0.959559	0.891026		0.965278		0.926667	00:04
>> 4		0.056123	0.135880	0.952206	0.868750		0.965278		0.914474	00:04

# Build a test dataloader from the raw sentences and get the class probabilities;
# argmax over the class dimension gives the predicted label
dl = learn.dls.test_dl(df_test['Sentence'])
preds, targets = learn.get_preds(dl=dl)
df_test["Preds"] = np.argmax(preds, axis=1)

# Manual confusion-matrix counts, treating class 1 as the positive class
FP = 0
FN = 0
TP = 0
TN = 0

for index, row in df_test.iterrows():
  if row.Class == row.Preds:
    if row.Class == 1:
      TP += 1
    else:
      TN += 1
  else:
    if row.Class == 1:
      FN += 1
    else:
      FP += 1

print(FP)
print(FN)
print(TP)
print(TN) 
  
>> 810
>> 3
>> 171
>> 16375

print("recall: ", TP / (TP + FN))
print("precision: ", TP / (TP + FP))

>> recall:  0.9827586206896551
>> precision:  0.1743119266055046
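To rule out a mistake in my manual counting, the same numbers should also come straight out of sklearn (sketch, using the Preds column computed above):

from sklearn.metrics import confusion_matrix, classification_report

# Rows are true labels, columns are predicted labels
print(confusion_matrix(df_test["Class"], df_test["Preds"]))
print(classification_report(df_test["Class"], df_test["Preds"], digits=3))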

Any help with this is much appreciated! Thanks in advance.