I am using the Phishing Websites Dataset to build a phishing website detector. I had attempted this before transitioning to fastai, back when I was still using TF-Keras.
For now, I am using `tabular_learner`, and here is my experimentation notebook. The performance of the network does not seem to improve. I have used entity embeddings as well, as discussed in the v3 Part 1 course. I have tried different numbers of layers and different learning rates, and added weight decay too (to keep the validation loss from diverging).
Any directions would be much appreciated.
Choosing the validation split in the right way indeed helped; the score is much better now. Find the experiments here: https://github.com/sayakpaul/Phishing-Websites-Detection/blob/master/Experimentation_with_validation_splits.ipynb
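For anyone following along, the two splits being compared can be sketched in plain NumPy (the dataset size, the 20% fraction, and the seed are assumptions, not values from the notebook); in fastai v1 the resulting indices can then be passed to something like `split_by_idx` when building the `TabularList`:

```python
import numpy as np

n_rows = 11055  # assumed size of the UCI Phishing Websites dataset
rng = np.random.default_rng(42)  # illustrative seed

# Random split: validation rows drawn uniformly from the whole dataset
valid_random = rng.choice(n_rows, size=int(0.2 * n_rows), replace=False)

# "Last X rows" split: validation is simply the final 20% of the file
valid_tail = np.arange(int(0.8 * n_rows), n_rows)
```

Both index arrays have the same size, so any difference in validation accuracy comes from *which* rows are held out, not how many.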
It’s an interesting case that holding out the last X rows gave lower accuracy than a random split.
It looks like the ‘tail’ data contain some new information that the main data don’t have. Does the data have a temporal dependency, i.e. were later rows actually collected later in time?
I ask because, in that case, it would be hard to say what accuracy you will get on real data (which are even later in time).
I mention this because I ran into exactly this problem. In the end I tend to build my validation set from a random 50% of the last 20% of the data, to get the best of both worlds: the model still trains on some of the new information in the last rows, while accuracy is checked on the more recent data.
Thanks for your inputs @Pak. I checked with the authors of the dataset and they confirmed that there are no temporal dependencies in the data (I suspected this as well).
This one is staying with me for sure. Will keep experimenting with this little trick and will post if I find anything good.