General machine learning and data science tips

jeremy · October 30, 2016, 7:13pm

For anyone who is somewhat new to data science, I’d strongly suggest checking out this wonderful new paper, Good Enough Practices in Scientific Computing. All the practices recommended there will save you a lot of time and trouble as you continue to work on your projects over the coming weeks.

It’s also critical that you fully understand the nature of overfitting and underfitting in machine learning - a couple of excellent resources for this are:

Kaggle has some very thoughtfully designed processes to ensure that you avoid (or at least are aware of) over-fitting. Take a look at the Kaggle member FAQ to learn about these important tools and processes.

If you have any other suggested resources that may help the community with general machine learning and data science issues, or have any questions, please post below!

brendan · October 30, 2016, 7:29pm

How to approach machine learning problems

Classification/Regression Type Problems

Problem Definition: Understand and clearly describe the problem that is being solved.
Analyze Data: Understand the information available that will be used to develop a model.
Prepare Data: Discover and expose the structure in the dataset.
Evaluate Algorithms: Develop a robust test harness and baseline accuracy from which to improve and spot check algorithms.
Improve Results: Leverage results to develop more accurate models.
Present Results: Describe the problem and solution so that it can be understood by third parties.
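
As a rough illustration of the "Evaluate Algorithms" step, here is a minimal sketch of a test harness with a baseline. It assumes a simple holdout split and a majority-class baseline; the toy data and the helper names (`train_test_split`, `majority_baseline`, `accuracy`) are made up for this example and are not from any particular library.

```python
import random
from collections import Counter


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and hold out a fraction for testing."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def majority_baseline(train_labels):
    """Return a predictor that always outputs the most common training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda features: most_common


def accuracy(predict, rows):
    """Fraction of (features, label) rows the predictor gets right."""
    correct = sum(1 for features, label in rows if predict(features) == label)
    return correct / len(rows)


# Hypothetical toy data: (features, label) pairs, 70 "dog" and 30 "cat".
data = [((i,), "dog") for i in range(70)] + [((i,), "cat") for i in range(30)]
train, test = train_test_split(data)
baseline = majority_baseline([label for _, label in train])
baseline_accuracy = accuracy(baseline, test)
print(f"baseline accuracy: {baseline_accuracy:.2f}")
```

Any real model would then be scored with the same `accuracy` function on the same `test` rows, so its result can be compared directly against the baseline.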

jeff · October 30, 2016, 9:17pm

This may deserve its own topic, but how would we create the validation set differently if we knew that we have imbalanced classes? In other words, what if we were given many more photos of dogs than cats? Or, more realistically, in biomedical imaging, what if most of our images were of healthy patients rather than patients with cancer?

jeremy · October 31, 2016, 12:09am

@jeff it’s not just the validation set that is impacted by unbalanced classes, but the training set too. Often the easiest thing to do is to over-sample your smaller group in both the training and test sets so that you have similar amounts of each group. However, the details depend on whether you care more about false positives or false negatives.
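
A minimal sketch of random over-sampling, assuming the data is a list of `(features, label)` pairs. The `oversample` helper and the toy data are hypothetical, written only to illustrate the idea of duplicating rows from the smaller group until the classes are the same size.

```python
import random
from collections import Counter, defaultdict


def oversample(rows, seed=0):
    """Randomly duplicate rows of smaller classes until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for features, label in rows:
        by_label[label].append((features, label))
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap up to the target size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: 90 healthy images, 10 cancer images.
rows = [((i,), "healthy") for i in range(90)] + [((i,), "cancer") for i in range(10)]
balanced_rows = oversample(rows)
counts = Counter(label for _, label in balanced_rows)
print(counts)
```

If the function is applied to each split separately, after splitting, duplicated rows cannot end up on both sides of the split.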

patrick.renschler · October 31, 2016, 4:44am

For anyone getting started I highly recommend Chapter 2 of An Introduction to Statistical Learning:

What motivates statistical learning? (Prediction and/or Inference)
Tradeoff between prediction accuracy and interpretability
Supervised vs Unsupervised learning
Regression vs Classification problems
Assessing model accuracy (includes discussion on overfitting)
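
To make the overfitting point concrete, here is a small sketch (not taken from the book) using a hand-written nearest-neighbour classifier on noisy one-dimensional toy data. All names and the data are invented for illustration. With `k=1` the model memorises the training set, noise included, so its training accuracy is perfect by construction; only the held-out accuracy says anything about how well it generalises.

```python
import random


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0


def accuracy(train, rows, k):
    """Fraction of rows whose label matches the k-NN prediction."""
    return sum(knn_predict(train, x, k) == y for x, y in rows) / len(rows)


rng = random.Random(0)
points = []
for i in range(300):
    x = i / 300
    true_label = 1 if x > 0.5 else 0
    # Flip roughly 20% of the labels to simulate noise.
    label = 1 - true_label if rng.random() < 0.2 else true_label
    points.append((x, label))
rng.shuffle(points)
train, test = points[:200], points[200:]

for k in (1, 15):
    train_acc = accuracy(train, train, k)
    test_acc = accuracy(train, test, k)
    print(f"k={k}: train accuracy {train_acc:.2f}, test accuracy {test_acc:.2f}")
```

Comparing the two printed rows shows why training accuracy alone is a poor guide: the gap between training and test accuracy is the thing to watch.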

vshets · October 31, 2016, 5:54pm

For a visual intro to ML check this out: http://www.r2d3.us/visual-intro-to-machine-learning-part-1/
Intuitive reasoning behind DL (with lots of reference links) - https://adeshpande3.github.io/adeshpande3.github.io/

rachel · October 31, 2016, 8:45pm

@vshets That site is one of my favorites

ale · April 7, 2017, 4:46pm

In the case of heavily unbalanced classes, do you just copy and paste your smaller group several times, or what would you recommend?

jeremy · April 7, 2017, 4:50pm

If you’re using keras you can just use the ~~sample_weights~~ class_weight parameter.

ale · April 7, 2017, 6:56pm

Do I need to add a parameter to the compile function, like sample_weight_mode='temporal'? Or do you mean class_weight in the fit() function?

jeremy · April 7, 2017, 9:07pm

class_weight - sorry!
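
For reference, a minimal sketch of building a `class_weight` dictionary. It assumes integer class labels and the common "balanced" heuristic, `n_samples / (n_classes * count)`; the helper name `balanced_class_weight` and the toy labels are made up for this example. The Keras call appears only as a comment, since `model`, `x_train`, and `y_train` are hypothetical here.

```python
from collections import Counter


def balanced_class_weight(labels):
    """Weight each class inversely to its frequency:
    n_samples / (n_classes * count_of_that_class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical labels: 900 examples of class 0, 100 of class 1.
labels = [0] * 900 + [1] * 100
class_weight = balanced_class_weight(labels)
print(class_weight)

# With a compiled Keras model, the dict would be passed to fit(), e.g.:
# model.fit(x_train, y_train, class_weight=class_weight)
```

Errors on the rarer class then count for more in the loss, which has a similar effect to over-sampling without duplicating any rows.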

leviya · January 20, 2018, 5:55am

Thanks for the information. The information you provided is very helpful for ML users

zubairkhanzhk · November 12, 2018, 8:45pm

Very helpful thread.
Actually, I have started my new company and I have hired the data science company Quantilytics instead of hiring a highly paid data scientist. The guys are doing awesome work.