For anyone who is somewhat new to data science, I’d strongly suggest checking out this wonderful new paper, Good Enough Practices in Scientific Computing. All the practices recommended there will save you a lot of time and trouble as you continue to work on your projects over the coming weeks.
It’s also critical that you fully understand the nature of overfitting and underfitting in machine learning - a couple of excellent resources for this are:
Kaggle has some very thoughtfully designed processes to ensure that you avoid (or at least are aware of) over-fitting. Take a look at the Kaggle member FAQ to learn about these important tools and processes.
If you have any other suggested resources that may help the community with general machine learning and data science issues, or have any questions, please post below!
This may deserve its own topic, but how would we create the validation set differently if we knew that we have imbalanced classes? In other words, what if we were given many more photos of dogs than cats? Or, more realistically, in biomedical imaging, what if most of our images were of healthy patients rather than of patients with cancer?
@jeff it’s not just the validation set that is impacted by unbalanced classes, but the training set too. Often the easiest thing to do is to over-sample your smaller class in both the training and test sets so that you have similar amounts of each class. It depends, however, on the details of whether you care more about false positives or false negatives.
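To make the over-sampling idea concrete, here is a minimal sketch using only NumPy. The dataset, class counts, and the `oversample` helper are all made up for illustration; in practice you might reach for a library such as imbalanced-learn, but the core idea is just sampling the minority class with replacement until the classes are balanced:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced dataset: 90 "healthy" images (label 0) vs 10 "cancer" (label 1).
X = rng.normal(size=(100, 4))
y = np.array([0] * 90 + [1] * 10)

def oversample(X, y, seed=0):
    """Duplicate minority-class rows (sampling with replacement)
    until every class appears as often as the majority class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = []
    for c in classes:
        c_idx = np.flatnonzero(y == c)
        idx.append(rng.choice(c_idx, size=n_max, replace=True))
    idx = np.concatenate(idx)
    return X[idx], y[idx]

X_bal, y_bal = oversample(X, y)
# After balancing, both classes appear 90 times each.
```

Note that if you instead care more about false positives or false negatives, adjusting class weights in the loss (rather than resampling) is another common option.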
Very helpful thread.
Actually, I have started a new company, and instead of hiring a highly paid data scientist I hired the data science company Quantilytics. They are doing awesome work.