Trash predictions on the test set despite great values on the dev set

So as the title says, I think my model gave back great results on the development set with:

Accuracy = 0.9230769230769231 , 
Confusion Matrix =
[[ 527   17    1   35]
 [  23  499    0   13]
 [   1    1  372   32]
 [  41   28   40 1386]]
F-Score:  0.9232024887594127
             precision    recall  f1-score   support

          0       0.89      0.91      0.90       580
          1       0.92      0.93      0.92       535
          2       0.90      0.92      0.91       406
          3       0.95      0.93      0.94      1495

avg / total       0.92      0.92      0.92      3016

But when I use learn.predict(True) to get predictions for the test set and check the classification, it’s absolutely useless. I already checked that the order of the predictions matches the order of the actual text. I’m using a standard 10x test to validation ratio.
Any ideas why this is? Is there any way to get a better result?
The test set has a ratio of 4% 4% 4% 88% while the train & dev sets had a ratio of 18% 18% 18% 46%. Could it have something to do with those ratios?

Thanks in advance

Leon, one possibility is the ratios - your test set is not from the same distribution as the validation and training sets.
That’s bad. Have a look at this video (Andrew Ng this time, not Jeremy); he explains well why the distributions should be the same and suggests things to do. S3w2:

Another possibility is that your training set has parts in common with the validation set. In that case your model is most likely overfitting, i.e. memorizing the train and val sets instead of generalising, so it does not work on the test set.

Piotr, thanks for your answer. Maybe I’m also using the terms wrong. The data I have is from a competition, which offers a training set with labels and a dev/test set without. So when I say test set, it’s just a set of example sentences to classify and then hand in as a result. And I was thinking, that a good classifier should be able to classify sentences no matter the effective class distribution right?

The validation set is split off from the training set, so the two contain different data.
So I thought maybe I’m using the library wrong to get the predictions…

So I made my own function to predict all of my test results and realised that when passing the np.array of a sentence in to the model, it returns a classification for every single word of the sentence instead of the sentence as a whole and something that seems like the hidden state?

The function looks something like this; it basically takes the one word with the strongest score for any class as the signal for the overall sentence class.

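def predict_sentence(string):
    # Tokenize the sentence and map every token to its vocabulary id
    tok = Tokenizer().proc_text(string)
    array = np.array([[stoi[p] for p in tok]])
    # argmax runs over the flattened per-word scores; % 4 maps the flat index back to one of the 4 classes
    return CLASSES[np.argmax(learn.predict_array(array)[0])%4]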
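
To make the % 4 step explicit, here is a small sketch in plain Python. The scores and the label order are invented for illustration, and it assumes the model returns four class scores per word. An argmax without an axis counts positions in the flattened scores, so the position modulo 4 is the class of the single strongest word.

```python
# Hypothetical per-word class scores for a 3-word sentence, 4 classes per word.
# These numbers are invented for illustration only.
CLASSES = ["happy", "sad", "angry", "others"]  # assumed label order
word_scores = [
    [0.1, 0.2, 0.3, 0.4],   # word 0
    [0.2, 2.5, 0.1, 0.3],   # word 1: holds the strongest single score (class 1)
    [0.0, 0.1, 0.2, 1.9],   # word 2
]

# Flatten row by row, the order in which an argmax without an axis counts.
flat = [score for row in word_scores for score in row]
flat_idx = max(range(len(flat)), key=flat.__getitem__)

# flat_idx = word_position * 4 + class_index, so the remainder is the class.
word_pos, class_idx = divmod(flat_idx, len(CLASSES))
sentence_class = CLASSES[class_idx]
```

Note that one high-scoring word decides the label for the whole sentence here, regardless of what the other words say.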

Digging through the code, I found that after several different prediction function calls, predict_with_targs also receives that classification with multiple entries (from def get_prediction(x): if is_listy(x): x=x[0] return ) and then puts those into a zip() function. From that, just the first tuple is taken (preda,_ = predict_with_targs_(m, dl) ) and then concatenated. I don’t quite understand what happens with the data from the zip point on. What exactly does this function return, and why does predict() take just the first element of it?

And I was thinking, that a good classifier should be able to classify sentences no matter the effective class distribution right?

Yes, you are right, and this might be automated with time. But currently no framework or library does this automatically, as there are different ways to address the problem and it isn’t super clear which one a framework should pick.

Try a weighted loss or WeightedRandomSampler. These are two different approaches, but they are more or less equivalent. WeightedRandomSampler might be a bit better though.

Re. zip: Jeremy is using a trick here. You are talking about this code, right?

def predict_with_targs_(m, dl):
    if hasattr(m, 'reset'): m.reset()
    res = []
    for *x,y in iter(dl): res.append([get_prediction(to_np(m(*VV(x)))),to_np(y)])
    return zip(*res)

Think about what is in res. It contains:
[ [Prediction1, Label1], [Prediction2, Label2], [Prediction3, Label3], [Prediction4, Label4], [Prediction5, Label5] ....]

When you pass res to zip with *, each element of the list is passed as a separate argument to zip. So this code:
zip (*res)
is equivalent to:
zip (res[0], res[1], res[2], .... , res[n])

So for res as in the example above, it will return 2 tuples:
(Prediction1, Prediction2, Prediction3, Prediction4, Prediction5, …) , (Label1, Label2, Label3, …)

So if the first value is taken from the zip it means that you take the array / tuple with predictions and ignore the labels.
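
As a quick illustration of the same regrouping, here is a tiny sketch with strings standing in for the per-batch prediction and label arrays (the values are placeholders, not real model output):

```python
# Toy stand-ins for per-batch predictions and labels (invented values).
res = [
    ["pred_batch1", "label_batch1"],
    ["pred_batch2", "label_batch2"],
    ["pred_batch3", "label_batch3"],
]

# zip(*res) is zip(res[0], res[1], res[2]): it regroups the pairs by position.
preds, labels = zip(*res)

# Taking only the first result keeps the predictions and drops the labels.
first_only, _ = zip(*res)
```

Here preds holds the three predictions and labels the three labels, each as a tuple in batch order.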

But don’t take my word for it: run a debugger, step into the code and see for yourself.

I’m using weighted loss already for the development set, which decreased my loss values, but didn’t help with the overall f1-Score & Accuracy.

Thanks for the explanation!

The weighted loss and the weighted random sampler should increase your loss, not lower it. By the way, are you sure your model isn’t overfitting? Maybe you could show how the training went and provide us with the prediction confusion matrix and F1?

Oh, maybe I did something wrong then…
I calculated and implemented the weights as described at the end of this thread

Just telling by the losses I would say it’s not overfitting, but I actually don’t know about other indicators for overfitting, what should I also pay attention to?

Training without weighted loss function:

100% 14/14 [12:50<00:00, 55.08s/it]
epoch      trn_loss   val_loss   accuracy                                                                              
    0      0.54588    0.347217   0.881631  
    1      0.47348    0.305623   0.900531                                                                              
    2      0.394069   0.276413   0.916114                                                                              
    3      0.37852    0.261475   0.919098                                                                              
    4      0.371019   0.228726   0.923077                                                                              
    5      0.303949   0.271239   0.919761                                                                              
    6      0.296667   0.254326   0.920093                                                                              
    7      0.306901   0.236277   0.925729                                                                              
    8      0.316285   0.237575   0.926393                                                                              
    9      0.314254   0.255249   0.918435                                                                              
    10     0.256666   0.260636   0.925066                                                                              
    11     0.216264   0.26039    0.924403                                                                              
    12     0.26061    0.239201   0.924403                                                                              
    13     0.261526   0.239923   0.924735                                                                              
[0.23992310960428784, 0.9247347658761933]


Confusion Matrix =
[[ 528   20    1   31]
 [  23  499    0   13]
 [   1    0  383   22]
 [  46   24   46 1379]]
F-Score:  0.9249622347273663
             precision    recall  f1-score   support

          0       0.88      0.91      0.90       580
          1       0.92      0.93      0.93       535
          2       0.89      0.94      0.92       406
          3       0.95      0.92      0.94      1495

avg / total       0.93      0.92      0.92      3016

With the weighted loss function:

100% 14/14 [12:19<00:00, 52.60s/it]
epoch      trn_loss   val_loss   accuracy                                                                              
    0      0.388642   0.276555   0.893568  
    1      0.378791   0.267754   0.90252                                                                               
    2      0.326992   0.247019   0.911804                                                                              
    3      0.284415   0.22867    0.915119                                                                              
    4      0.309813   0.207554   0.919761                                                                              
    5      0.290539   0.222173   0.921088                                                                              
    6      0.256909   0.243978   0.923409                                                                              
    7      0.243467   0.211369   0.920756                                                                              
    8      0.218482   0.229611   0.923077                                                                              
    9      0.22439    0.217223   0.926393                                                                              
    10     0.219688   0.205196   0.920093                                                                              
    11     0.225435   0.212244   0.92374                                                                               
    12     0.194243   0.227458   0.92374                                                                               
    13     0.200282   0.220978   0.925066  

It seems to overfit some with the weighted loss function after epoch 11


Accuracy = 0.9250663129973474 , 
Confusion Matrix =
[[ 527   17    1   35]
 [  19  502    0   14]
 [   1    0  369   36]
 [  39   26   38 1392]]
F-Score:  0.92512071253886
             precision    recall  f1-score   support

          0       0.90      0.91      0.90       580
          1       0.92      0.94      0.93       535
          2       0.90      0.91      0.91       406
          3       0.94      0.93      0.94      1495

avg / total       0.93      0.93      0.93      3016

Because I have no labels for the test set (the one that scores awfully on the submission website), I can only show the results from training against the development set. But using the results from the learner.predict(True) function, I got a micro-averaged F1 score of 0.061038.

You are right, your model doesn’t overfit too much. It does a bit after the 4th epoch, and after the 11th in the second example, but that’s rather okay as the accuracy goes up. We have a very similar problem with GermEval 2018, where the model looks like it overfits loss-wise but gets increased accuracy.

I didn’t get this part though, how could you estimate f1 if you don’t have true labels? Is this Kaggle or something similar so that you don’t have access to the test data?

I had a look at your code and it seems that you did the weight balancing in reverse order. You gave more weight to the more frequent class and less weight to the less frequent classes, which is the opposite of what you want to do. The whole point of adding weights is to tell SGD that this class is less frequent but I care about it, so treat it equally with the other classes.

Your current weights are:
weight=tensor([0.1815, 0.1816, 0.1414, 0.4956],

Here is how you should do it:

Basically, you should have a larger weight associated with the less frequent classes. You can either go with what is suggested in the article above or simply add multipliers. You said that your ratio is 18% 18% 18% 46%, so I would go for [2.5, 2.5, 2.5, 1].
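
For reference, one way to arrive at multipliers like these is plain inverse-frequency weighting. The sketch below is only an illustration of that idea using the ratio quoted above, not necessarily the method from the article; the variable names are made up.

```python
# Class shares reported in the thread for the train/dev data: 18% 18% 18% 46%.
class_share = [0.18, 0.18, 0.18, 0.46]

# Inverse-frequency weights, scaled so the most frequent class gets weight 1.
most_frequent = max(class_share)
weights = [round(most_frequent / share, 2) for share in class_share]
# weights is [2.56, 2.56, 2.56, 1.0], close to the suggested [2.5, 2.5, 2.5, 1]
```

A list like this could then be converted to a tensor and passed as the weight argument of a loss such as PyTorch's torch.nn.CrossEntropyLoss. That last step is not shown here and is an assumption about the setup.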

Yes exactly, it’s actually SemEval, so I have no access to test data.

But is there a way to give the classifier weights while predicting the outputs so it fits the ratio of 4,4,4,88 or should that happen automatically, because it will predict the results independent from the ratios in the train/dev set?

I fixed the weight function as you described in my other Thread and these are the new results I get:

100% 14/14 [12:26<00:00, 53.21s/it]
epoch      trn_loss   val_loss   accuracy                                                                              
    0      0.544569   0.33282    0.884284  
    1      0.489589   0.275779   0.903846                                                                              
    2      0.406101   0.262489   0.914456                                                                              
    3      0.402713   0.249655   0.914788                                                                              
    4      0.335877   0.261429   0.916777                                                                              
    5      0.359292   0.244448   0.920424                                                                              
    6      0.363325   0.237664   0.918767                                                                              
    7      0.288879   0.249687   0.921419                                                                              
    8      0.313193   0.241088   0.921751                                                                              
    9      0.307925   0.236088   0.922414                                                                              
    10     0.280689   0.240939   0.921088                                                                              
    11     0.297756   0.245683   0.920424                                                                              
    12     0.258553   0.242982   0.921751                                                                              
    13     0.250094   0.239602   0.923409                                                                              
[0.2396024797773804, 0.9234085071940638]

Confusion Matrix =
[[ 531   19    1   29]
 [  21  502    0   12]
 [   2    1  386   17]
 [  45   30   54 1366]]
F-Score:  0.9236959983933839
             precision    recall  f1-score   support

          0       0.89      0.92      0.90       580
          1       0.91      0.94      0.92       535
          2       0.88      0.95      0.91       406
          3       0.96      0.91      0.94      1495

avg / total       0.93      0.92      0.92      3016

I also made a run with something close to your suggested weights:

100% 14/14 [12:20<00:00, 52.84s/it]
epoch      trn_loss   val_loss   accuracy                                                                              
    0      0.493271   0.303197   0.872679  
    1      0.476202   0.312151   0.867042                                                                              
    2      0.457651   0.266932   0.908157                                                                              
    3      0.380498   0.257375   0.904509                                                                              
    4      0.347146   0.267776   0.911804                                                                              
    5      0.351558   0.236868   0.912467                                                                              
    6      0.334467   0.237001   0.916777                                                                              
    7      0.305457   0.225946   0.91744                                                                               
    8      0.320081   0.228317   0.91744                                                                               
    9      0.314233   0.217428   0.919761                                                                              
    10     0.354163   0.214708   0.920424                                                                              
    11     0.279616   0.220752   0.921751                                                                              
    12     0.302679   0.222988   0.92374                                                                               
    13     0.254521   0.218068   0.924403  

Confusion Matrix =
[[ 542   16    1   21]
 [  20  501    0   14]
 [   1    0  393   12]
 [  58   29   56 1352]]
F-Score:  0.9247502474178939
             precision    recall  f1-score   support

          0       0.87      0.93      0.90       580
          1       0.92      0.94      0.93       535
          2       0.87      0.97      0.92       406
          3       0.97      0.90      0.93      1495

avg / total       0.93      0.92      0.92      3016

Looking at the confusion matrix and the overall report, the weights don’t seem to change much.

Out of curiosity, which task?

From my experience with Kaggle, each time I got crappy results like this it meant that I had messed up the submission file (labels in the wrong order, wrong label ids, etc.).

if you don’t want to share your results publicly, talk to me on DM here.

I’m super curious how ULMFiT is doing on SemEval and I’m happy to help.

It’s pretty clear by now which task I’m working on, given the class imbalance I talked about, so I don’t mind :slight_smile:
It’s Task 3, EmoContext:

Yeah, I thought so too at first, but if learn.predict(True) gives the predictions out in the order the data goes in, then everything should be right. As I said before, I created my own prediction method by feeding my vector-converted test set into the predict_array method, where I got the results back in the form of prediction vectors for every single word instead of the whole sentence, and then just took the max of those to predict the whole sentence. That worked at least halfway decently, with a score around 0.53.

I see, so you have a 0.48 F1 score (or 0.53?), and this is a custom approach that predicts emotions on each word separately and then simply takes the maximum of each prediction to get a score for a sentence. The best score on the leaderboard so far is 0.72.

OK, now I understand what you mean by trash predictions. 0.06 looks like something that makes mistakes on purpose. I’m almost 100% sure that there has to be a bug somewhere in your pipeline.

I had a look at the evaluation details and I see that the class “other” isn’t taken into consideration when F1 is calculated. If my calculations are OK, you should get at least 0.25 if you submitted a file with the “happy” label on every sentence.

So the only way to get something below 0.25 is to predict “other” too often.

Maybe you could try to remove “Other” from the predictions and simply replace it with the next largest logit? That way you should hit something above 0.25, and we could check whether you are hitting all the emotions correctly.

Besides, I guess your F1 score is calculated wrongly if you simply use scikit-learn; I think it will take the “other” class into consideration, which will skew the score.
Try calculating it by hand, something along these lines:

6 / (1/precision_score(happy) + 1/precision_score(sad) + 1/precision_score(angry) +
     1/recall_score(happy) + 1/recall_score(sad) + 1/recall_score(angry))

And see what you get for your dev set.

0.48 was my last test for a custom prediction that turned out worse. So my current best is 0.53.


So I submitted a result with all entries of “others” exchanged with the second highest value and got 0.09773 as a result…

Oh yeah, I just calculated the standard F1 score, not the micro-averaged F1 score based on the classes they use in the competition. I will write a custom function for that too then!

Edit: I think I don’t need a custom function, since scikit-learn actually supports excluding labels. So with:

f1_score(val_lbls_sampled, predictions, labels=[0,1,2], average="micro")

I get the microaverage only on the three emotions (label 3 is for others), with which I get an f1 score of 0.909 on the development set.
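
To see what this restricted micro-average amounts to, here is a hand calculation in plain Python on the dev-set confusion matrix from the first post. It is a sketch of my understanding of micro-averaging over a label subset, not the competition's official scorer, and the helper name micro_f1 is made up.

```python
# Dev-set confusion matrix from the first post (rows: true label, columns: predicted).
conf_matrix = [
    [527, 17, 1, 35],
    [23, 499, 0, 13],
    [1, 1, 372, 32],
    [41, 28, 40, 1386],
]

def micro_f1(matrix, labels):
    """Micro-averaged F1 over `labels` only; other classes count solely as errors."""
    tp = sum(matrix[k][k] for k in labels)
    predicted = sum(row[k] for row in matrix for k in labels)   # TP + FP
    actual = sum(sum(matrix[k]) for k in labels)                # TP + FN
    return 2 * tp / (predicted + actual)

emotion_f1 = micro_f1(conf_matrix, labels=[0, 1, 2])   # label 3 ("others") excluded
all_f1 = micro_f1(conf_matrix, labels=[0, 1, 2, 3])    # equals plain accuracy
```

For that first matrix the three-emotion value comes out at about 0.910, and with all four labels the result is the 0.923 accuracy reported there.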

Thanks for your suggestions and your help so far!

Did you ensure that you aren’t shuffling your test set? I’ve had that issue before and that’s usually the first thing I check. Especially if validation scores look good.

I think you are incorrect here because when you create the data loader, if you are shuffling the test set, it will give them out randomly, but how would you match those back up to the actual record you are submitting? fname doesn’t shuffle when shuffle is done on the dataloader.


I just realized, after checking over and over, that I’m pushing my test set into the SortSampler before I throw it into the DataLoader, but if I don’t use a sampler it gets randomly sampled by the DataLoader:

    if batch_sampler is None:
        if sampler is None:
            sampler = RandomSampler(dataset) if shuffle else SequentialSampler(dataset)
        batch_sampler = BatchSampler(sampler, batch_size, drop_last)

But before that the dataset stays the way it is. So I’m not sure if the sampler might change the order when I output it? But checking the test_dl in the modelloader actually returns the first entry correctly with:

[itos[x] for x in np.swapaxes(next(iter(md.test_dl))[0].cpu().numpy(),1,0)[0]]

So I uploaded the testdataset with the fix and actually got a score of 0.657! Guess I just had to check often enough to actually find the mixup…

Are you specifying shuffle=False on your dataloader? That would make it use the SequentialSampler instead of RandomSampler.

I just removed the sampler setting, so it uses the default sampler. What is the difference between RandomSampler and SequentialSampler?
But I think the randomsampler is not mixing up the order anymore. At least with a weighted classifier I managed to get a score to rank 4th with my best result so far :slight_smile:

Edit: Made it to rank 1 by reducing the vocab size of the Target domain :smiley:


@KevinB Awesome find! I’m happy that this mystery is solved :slight_smile:

Funnily enough, I had exactly the same issue two days ago with the SortSampler that we use during predictions.
I even spent some time finding a way to revert the changes made by SortSampler, but somehow I hadn’t connected the two!

Here is how to revert the SortSampler if you are still interested (sorting sentences speeds up processing):

tst_ds = TextDataset(tst, lbl)
tst_samp = SortSampler(tst, key=lambda x: len(tst[x]))
tst_dl = DataLoader(tst_ds, bs, transpose=True, num_workers=1, pad_idx=1, sampler=tst_samp)

res = predict_with_targs(m, tst_dl)
order = np.array(list(tst_samp))  # save the order defined by the sort sampler
true_y = res[1]
preds = np.argmax(res[0], axis=1)

return (res[0])[np.argsort(order)]  # use it to reverse the sorting of the results
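
Why the argsort of the sampler order undoes the sorting can be checked on a toy example. The sketch below uses plain Python lists instead of fastai objects; the sentences and the "pred_for_" strings are invented, and the longest-first ordering is only an assumption about how a length-based sampler might iterate.

```python
# Invented token-id sequences standing in for the test sentences.
sentences = [[5, 8], [3, 9, 4, 7], [6], [2, 2, 2]]

# Visit indices from longest to shortest, as a length-based sort sampler might.
order = sorted(range(len(sentences)), key=lambda i: len(sentences[i]), reverse=True)

# The model sees the sentences in `order`, so the outputs come back in that order too.
outputs = ["pred_for_%d" % i for i in order]

# Inverse permutation: inverse[j] is the position in `outputs` of original item j.
# This is the pure-Python counterpart of np.argsort(order).
inverse = sorted(range(len(order)), key=order.__getitem__)
restored = [outputs[pos] for pos in inverse]
```

After this, restored lists the predictions in the original sentence order, which is the order a submission file needs.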

@piotr.czapla, I was digging through your fastai_scripts/ script the other day and the first thing I noticed was this:

test_samp = SortSampler(test_ids, key=lambda x: len(test_ids[x]))

I wasn’t sure but it did seem a little odd.

Isn’t a much simpler way to run inference to simply pass sampler=None along with shuffle=False to the DataLoader?