Trash predictions on the test set despite great values on the dev set

Oh, maybe I did something wrong then…
I calculated and implemented the weights as described at the end of this thread.
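For reference, this is roughly what I mean by the weights, a minimal sketch using plain inverse-frequency class weights passed to PyTorch's CrossEntropyLoss (not necessarily the exact recipe from the thread; the toy label array is just a placeholder):

import numpy as np
import torch
import torch.nn as nn

# Toy stand-in for the real training labels (4 classes, imbalanced)
train_labels = np.array([0, 3, 3, 1, 2, 3, 0, 3])

counts = np.bincount(train_labels)                   # samples per class
weights = counts.sum() / (len(counts) * counts)      # inverse frequency, normalized
weights = torch.tensor(weights, dtype=torch.float32)

# Weighted loss; with a fastai 0.7-style learner this would typically be
# assigned to learner.crit (check your version)
weighted_loss = nn.CrossEntropyLoss(weight=weights)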

Judging by the losses alone, I would say it's not overfitting, but I don't know what other indicators of overfitting to look for. What else should I pay attention to?

Training without the weighted loss function:

Epoch
100% 14/14 [12:50<00:00, 55.08s/it]
epoch      trn_loss   val_loss   accuracy                                                                              
    0      0.54588    0.347217   0.881631  
    1      0.47348    0.305623   0.900531                                                                              
    2      0.394069   0.276413   0.916114                                                                              
    3      0.37852    0.261475   0.919098                                                                              
    4      0.371019   0.228726   0.923077                                                                              
    5      0.303949   0.271239   0.919761                                                                              
    6      0.296667   0.254326   0.920093                                                                              
    7      0.306901   0.236277   0.925729                                                                              
    8      0.316285   0.237575   0.926393                                                                              
    9      0.314254   0.255249   0.918435                                                                              
    10     0.256666   0.260636   0.925066                                                                              
    11     0.216264   0.26039    0.924403                                                                              
    12     0.26061    0.239201   0.924403                                                                              
    13     0.261526   0.239923   0.924735                                                                              
[0.23992310960428784, 0.9247347658761933]

Metrics:

Confusion Matrix =
[[ 528   20    1   31]
 [  23  499    0   13]
 [   1    0  383   22]
 [  46   24   46 1379]]
F-Score:  0.9249622347273663
             precision    recall  f1-score   support

          0       0.88      0.91      0.90       580
          1       0.92      0.93      0.93       535
          2       0.89      0.94      0.92       406
          3       0.95      0.92      0.94      1495

avg / total       0.93      0.92      0.92      3016
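
These look like the standard scikit-learn metrics; here is a minimal sketch of how they can be computed (the toy arrays just stand in for the real dev-set labels and predictions):

import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, classification_report

# Toy stand-ins for the real dev-set labels and model predictions
val_labels = np.array([0, 1, 2, 3, 3, 1, 0, 2])
val_preds  = np.array([0, 1, 2, 3, 0, 1, 0, 3])

print("Confusion Matrix =")
print(confusion_matrix(val_labels, val_preds))
print("F-Score: ", f1_score(val_labels, val_preds, average='micro'))
print(classification_report(val_labels, val_preds))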

With the weighted loss function:

Epoch
100% 14/14 [12:19<00:00, 52.60s/it]
epoch      trn_loss   val_loss   accuracy                                                                              
    0      0.388642   0.276555   0.893568  
    1      0.378791   0.267754   0.90252                                                                               
    2      0.326992   0.247019   0.911804                                                                              
    3      0.284415   0.22867    0.915119                                                                              
    4      0.309813   0.207554   0.919761                                                                              
    5      0.290539   0.222173   0.921088                                                                              
    6      0.256909   0.243978   0.923409                                                                              
    7      0.243467   0.211369   0.920756                                                                              
    8      0.218482   0.229611   0.923077                                                                              
    9      0.22439    0.217223   0.926393                                                                              
    10     0.219688   0.205196   0.920093                                                                              
    11     0.225435   0.212244   0.92374                                                                               
    12     0.194243   0.227458   0.92374                                                                               
    13     0.200282   0.220978   0.925066  

With the weighted loss function it seems to overfit a bit after epoch 11.
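As a rough check (just my own sketch on the logged losses above), I flag the epochs where the training loss keeps falling while the validation loss goes back up:

# Losses copied from the weighted-run log above
trn_loss = [0.388642, 0.378791, 0.326992, 0.284415, 0.309813, 0.290539,
            0.256909, 0.243467, 0.218482, 0.224390, 0.219688, 0.225435,
            0.194243, 0.200282]
val_loss = [0.276555, 0.267754, 0.247019, 0.228670, 0.207554, 0.222173,
            0.243978, 0.211369, 0.229611, 0.217223, 0.205196, 0.212244,
            0.227458, 0.220978]

for epoch in range(1, len(trn_loss)):
    if trn_loss[epoch] < trn_loss[epoch - 1] and val_loss[epoch] > val_loss[epoch - 1]:
        print(f"possible overfitting at epoch {epoch}: "
              f"trn {trn_loss[epoch]:.4f} (down), val {val_loss[epoch]:.4f} (up)")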

Metrics:

Accuracy = 0.9250663129973474 , 
Confusion Matrix =
[[ 527   17    1   35]
 [  19  502    0   14]
 [   1    0  369   36]
 [  39   26   38 1392]]
F-Score:  0.92512071253886
             precision    recall  f1-score   support

          0       0.90      0.91      0.90       580
          1       0.92      0.94      0.93       535
          2       0.90      0.91      0.91       406
          3       0.94      0.93      0.94      1495

avg / total       0.93      0.93      0.93      3016

Because I have no labels for the test set (the one that scores awfully on the submission website), I can only show results on the training and development sets. But when I submitted the predictions from the learner.predict(True) call, I got a micro-averaged F1 score of 0.061038.
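
For completeness, this is roughly how I turn the output of learner.predict(True) into class labels for the submission. A minimal sketch, assuming a fastai 0.7-style learner where predict(is_test=True) returns log-probabilities for the test set and data.classes (an assumed attribute) maps class indices back to label names:

import numpy as np

def predictions_to_labels(log_probs, classes):
    """Map an (n_items, n_classes) array of log-probabilities to label names."""
    pred_idx = np.argmax(log_probs, axis=1)
    return [classes[i] for i in pred_idx]

# Toy check: 3 items, 4 classes
toy_log_probs = np.log(np.array([[0.7, 0.1, 0.1, 0.1],
                                 [0.1, 0.1, 0.1, 0.7],
                                 [0.1, 0.6, 0.2, 0.1]]))
print(predictions_to_labels(toy_log_probs, classes=[0, 1, 2, 3]))   # -> [0, 3, 1]

# With the real model (fastai 0.7 assumption):
# test_labels = predictions_to_labels(learner.predict(is_test=True), data.classes)
# The row order of the test predictions also has to match the order the
# submission site expects, which is worth double-checking.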