General machine learning and data science tips


(Jeremy Howard (Admin)) #1

For anyone who is somewhat new to data science, I’d strongly suggest checking out this wonderful new paper, Good Enough Practices in Scientific Computing. All the practices recommended there will save you a lot of time and trouble as you continue to work on your projects over the coming weeks.

It’s also critical that you fully understand the nature of overfitting and underfitting in machine learning - a couple of excellent resources for this are:

Kaggle has some very thoughtfully designed processes to ensure that you avoid (or at least are aware of) overfitting. Take a look at the Kaggle member FAQ to learn about these important tools and processes.

If you have any other suggested resources that may help the community with general machine learning and data science issues, or have any questions, please post below!


Criteria for Machine Learn-able Problems
(Brendan Fortuner) #2

How to approach machine learning problems

Classification/Regression Type Problems

  1. Problem Definition: Understand and clearly describe the problem that is being solved.
  2. Analyze Data: Understand the information available that will be used to develop a model.
  3. Prepare Data: Discover and expose the structure in the dataset.
  4. Evaluate Algorithms: Develop a robust test harness and baseline accuracy from which to improve and spot check algorithms.
  5. Improve Results: Leverage results to develop more accurate models.
  6. Present Results: Describe the problem and solution so that it can be understood by third parties.
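
As a rough illustration of step 4, here is a minimal sketch of a test harness with a baseline, assuming the data is held in NumPy arrays. The helper names (`train_valid_split`, `majority_baseline_accuracy`) and the toy data are made up for this example; they are not from any particular library.

```python
import numpy as np


def train_valid_split(X, y, valid_frac=0.2, seed=0):
    """Shuffle the rows once and hold out valid_frac of them for validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_valid = int(len(y) * valid_frac)
    valid_idx, train_idx = idx[:n_valid], idx[n_valid:]
    return X[train_idx], y[train_idx], X[valid_idx], y[valid_idx]


def majority_baseline_accuracy(y_train, y_valid):
    """Accuracy of always predicting the most common training label."""
    labels, counts = np.unique(y_train, return_counts=True)
    majority = labels[np.argmax(counts)]
    return float(np.mean(y_valid == majority))


# Toy data (hypothetical): 200 rows, 3 features, roughly 30% positives.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (rng.random(200) < 0.3).astype(int)

X_tr, y_tr, X_va, y_va = train_valid_split(X, y)
baseline = majority_baseline_accuracy(y_tr, y_va)
print(f"majority-class baseline accuracy: {baseline:.3f}")
```

Any model you spot check should beat this number on the same validation rows; if it does not, it has not learned anything useful yet.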


#3

This may deserve its own topic, but how would we create the validation set differently if we knew that we had imbalanced classes? In other words, what if we were given many more photos of dogs than cats? Or, more realistically, in biomedical imaging, what if most of our images were of healthy patients rather than of patients with cancer?


(Jeremy Howard (Admin)) #4

@jeff it’s not just the validation set that is affected by unbalanced classes, but the training set too. Often the easiest thing to do is to over-sample your smaller group in both the training and test sets so that you have a similar number of examples in each group. The right approach depends, however, on whether you care more about false positives or false negatives.
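
A minimal sketch of random over-sampling, assuming the features and labels are NumPy arrays. The function name `oversample_minority` and the toy data are hypothetical; the idea is simply to redraw rows from the smaller classes, with replacement, until every class matches the largest one.

```python
import numpy as np


def oversample_minority(X, y, seed=0):
    """Resample so every class has as many rows as the largest class."""
    rng = np.random.default_rng(seed)
    labels, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = []
    for label, count in zip(labels, counts):
        idx = np.flatnonzero(y == label)
        # Keep every original row, then draw the shortfall with replacement.
        extra = rng.choice(idx, size=target - count, replace=True)
        keep.append(np.concatenate([idx, extra]))
    # Shuffle so the duplicated rows are not grouped together.
    order = rng.permutation(np.concatenate(keep))
    return X[order], y[order]


# Toy data (hypothetical): 90 rows of class 0 and 10 rows of class 1.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

X_bal, y_bal = oversample_minority(X, y)
print(np.bincount(y_bal))
```

Note that the duplicated rows add no new information, so the smaller class can still be overfit; this only changes how much each class counts.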


(patrick.renschler) #5

For anyone getting started, I highly recommend Chapter 2 of An Introduction to Statistical Learning, which covers:

  • What motivates statistical learning? (Prediction and/or Inference)
  • Tradeoff between prediction accuracy and interpretability
  • Supervised vs Unsupervised learning
  • Regression vs Classification problems
  • Assessing model accuracy (includes discussion on overfitting)

(vedshetty) #6

For a visual intro to ML check this out: http://www.r2d3.us/visual-intro-to-machine-learning-part-1/
Intuitive reasoning behind DL (with lots of reference links) - https://adeshpande3.github.io/adeshpande3.github.io/


(Rachel Thomas) #7

@vshets That site is one of my favorites :)


(Alejandro) #8

In the case of heavily unbalanced classes, do you just copy and paste your smaller group several times, or what do you recommend doing?


(Jeremy Howard (Admin)) #9

If you’re using Keras, you can just use the class_weight parameter (corrected from sample_weights).


(Alejandro) #10

Do I need to add a parameter to the compile function, like this: sample_weight_mode='temporal'? Or do you mean class_weight in the fit() function?


(Jeremy Howard (Admin)) #11

class_weight - sorry!
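
For reference, a minimal sketch of building a class_weight dictionary, assuming integer class labels in a NumPy array. The helper `balanced_class_weights` is a made-up name, and the weighting rule (total rows divided by number of classes times the class count) is one common heuristic, not the only option. The Keras call is left as a comment because `model` and `x_train` are hypothetical here.

```python
import numpy as np


def balanced_class_weights(y):
    """Weight each class inversely to its frequency: n / (n_classes * count)."""
    labels, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(labels) * counts)
    return {int(label): float(w) for label, w in zip(labels, weights)}


# Toy labels (hypothetical): 90 examples of class 0 and 10 of class 1.
y_train = np.array([0] * 90 + [1] * 10)
class_weight = balanced_class_weights(y_train)
print(class_weight)

# With a compiled Keras model, the dictionary is passed to fit():
# model.fit(x_train, y_train, class_weight=class_weight)
```

With these weights, an error on the rare class costs more in the loss than an error on the common class, which has a similar effect to over-sampling without duplicating any rows.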


(Leviya Bl) #12

Thanks for the information. The information you provided is very helpful for ML users


(zubair khan) #13

Very helpful thread.
Actually, I have started a new company and have hired the data science company Quantilytics instead of hiring a highly paid data scientist. Those guys are doing awesome work.