I get this in Google Colab, but once I submit to the Kaggle competition it becomes this:
The competition is: https://www.kaggle.com/c/ieee-fraud-detection/overview
I really don't know what to do now.
This is how I created the databunch:
Kaggle’s leaderboard score is based on data that you have not trained on. For most models it will differ from your own score to some extent. Since the disparity is so large for you, I would assume that you are overfitting your model to your training data.
What should I do to solve this issue? How can I confirm whether it is overfitting?
It is most likely overfitting, because you got a very good result on your own data (0.99 on your accuracy metric): the model did well on what you gave it. However, Kaggle's test set might be too different from the training set, so you got a bad result there.
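One way to check this, as a rough sketch: score the model on the rows it trained on and on rows it never saw, then compare. A large gap points to overfitting. The function names and the numbers below are made up for illustration and are not from the original notebook. In practice you would probably use `roc_auc_score` from scikit-learn instead of the small pure-Python AUC here.

```python
def auc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Labels are assumed to be 0 or 1.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def overfit_gap(train_labels, train_scores, valid_labels, valid_scores):
    """Training AUROC minus held-out AUROC; a large positive gap suggests overfitting."""
    return auc(train_labels, train_scores) - auc(valid_labels, valid_scores)


# Hypothetical predictions: perfect ranking on training rows, weaker on held-out rows.
train_labels, train_scores = [0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]
valid_labels, valid_scores = [0, 0, 1, 1], [0.4, 0.6, 0.5, 0.7]

gap = overfit_gap(train_labels, train_scores, valid_labels, valid_scores)
print(f"train AUROC minus validation AUROC: {gap:.2f}")
```

This only tells you something if the held-out rows really were kept out of training and are a fair sample of the data.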
Oh, so the accuracy and AUROC scores are based on the training data and not the test data that was also given?
Correct. Kaggle operates with three sets in total: a training set and two test sets. One goes to the public leaderboard, the other to the private one, and we don’t have access to the labels for either (to make it fair). So most likely, while you may have overfit your model on the training set they gave, it does not perform well on their test sets, which are meant to be a difficult and proper basis for judging. Does this help? Most likely you built your model correctly; Kaggle competitions are just designed to be challenging and tough.
Split_by_idx 800, 1000? I might be crazy, but that does not seem random. Are you sure rows 800-1000 are a good sample of your data? That would definitely throw off your results. Generally you want to split your non-test data into training and validation sets at random.
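As a minimal sketch of what a random split could look like: instead of a fixed block of rows, draw the validation indices at random. The helper name `random_valid_idx`, the 1000-row size, and the 20% fraction are assumptions for illustration. In fastai v1 a list like this could be handed to `split_by_idx`, and I believe `split_by_rand_pct` does much the same thing for you.

```python
import random


def random_valid_idx(n_rows, valid_pct=0.2, seed=42):
    """Pick a reproducible random subset of row indices for validation."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    n_valid = int(n_rows * valid_pct)
    # sample() draws without replacement, so no index appears twice
    return sorted(rng.sample(range(n_rows), n_valid))


# Hypothetical dataset of 1000 rows with 20% held out for validation.
valid_idx = random_valid_idx(1000, valid_pct=0.2)
print(len(valid_idx), valid_idx[:5])
```

Fixing the seed keeps the split the same between runs, so scores from different experiments stay comparable.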