Another treat! Early access to Intro To Machine Learning videos

ecdrid · January 30, 2018, 2:22am

proc_df always fills missing values with the median…

And I guess what you suggest isn’t actually done…
(Jeremy explained this later)

Just think: what your model learned about a particular year (let’s say it is the split point) in the training set
will be completely different from what it is validated on…

That does nothing beneficial and might make the model collapse… (especially if the sizes of the splits after that year are, say, in a 9:1 ratio, in which case our model will give wrong predictions)

Thanks…

jeremy · January 30, 2018, 4:19am

cadolphs:

Thanks for that. But I would like to comment on that. I don’t think proc_df should be used on the test data frame in the way it is used. If there is a value missing in the test set, it should be filled with the median of the values in the training set, because that’s what we’ve based the model on.

As a side note and not meant as a criticism towards your approach: There’s probably some “smarter” way to fill in certain values. For example, we see that “GarageBuilt” is one of the features that has missing values, but instead of just filling with the median, it would probably make more sense to look at the median difference between when a house was built and when the garage was built. In the notebook, the median for garage built is 1980. The funny thing is that a missing GarageBuilt value would still be set to 1980 even if the house itself was built in a later year.

Your concern is quite right, in a strict mathematical sense. For most real-world datasets (including this one) this won’t be an issue. If you do it at a more granular level however it can become an issue, and my friends Nina Zumel and John Mount have written an excellent paper and library about how to handle that situation if you’re interested: https://arxiv.org/abs/1611.09477

It’s always possible to do smarter feature engineering, but the trick is to know when it’s helpful and worth the investment of time. In this case, as you’ll see later in the course, creating a time difference variable doesn’t generally improve the predictive accuracy of a random forest, but can help with interpretation.
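The two ideas quoted above can be sketched on toy data. This is only an illustration, assuming made-up values and pandas; it is not what proc_df does internally. Option 1 fills missing test values with the median learned from the training set. Option 2 fills them using the median gap between house and garage construction years, also learned from the training set.

```python
import pandas as pd

# Toy data (hypothetical values; only the column idea comes from the thread).
train = pd.DataFrame({
    "YearBuilt":   [1950, 1975, 1990, 2005],
    "GarageBuilt": [1955, 1975, None, 2006],
})
test = pd.DataFrame({
    "YearBuilt":   [2010, 1960],
    "GarageBuilt": [None, 1962],
})

# Option 1: plain median fill, with the median taken from the training set only.
train_median = train["GarageBuilt"].median()
test_plain = test.copy()
test_plain["GarageBuilt"] = test_plain["GarageBuilt"].fillna(train_median)

# Option 2: fill with house year + median (garage year - house year) from training.
train_gap = (train["GarageBuilt"] - train["YearBuilt"]).median()
test_gap = test.copy()
test_gap["GarageBuilt"] = test_gap["GarageBuilt"].fillna(test_gap["YearBuilt"] + train_gap)

print(test_plain)
print(test_gap)
```

With option 1 the house built in 2010 gets a garage year earlier than the house itself, which is the oddity described in the quote; option 2 avoids that, at the cost of extra feature engineering.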

axelstram · January 30, 2018, 4:38am

Jeremy, in Lecture 7, approximately at minute 17:20, you talk about what to do when you have an unbalanced dataset, and you refer to a paper that found that oversampling the less common class was the best approach. Do you remember which paper it was?

jeremy · January 30, 2018, 4:39am

No I’m afraid not. If anyone digs it up let me know! It’s probably in my twitter favorites or retweets, so that would be a good place to search.

axelstram · January 30, 2018, 4:51am

I think I found the paper: https://arxiv.org/pdf/1710.05381.pdf

jeremy · January 30, 2018, 5:04am

Yes, that’s it! Nice search-fu
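For anyone wondering what oversampling the less common class looks like in code, here is a minimal sketch using NumPy. The function name `oversample_minority` and the toy arrays are made up for illustration, and this is plain random duplication, not necessarily the exact procedure evaluated in the paper.

```python
import numpy as np

def oversample_minority(X, y, seed=0):
    """Randomly duplicate rows of the rarer classes until every class
    has as many rows as the most common class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx_parts = []
    for cls, count in zip(classes, counts):
        cls_idx = np.flatnonzero(y == cls)
        # Draw the missing rows with replacement from this class.
        extra = rng.choice(cls_idx, size=target - count, replace=True)
        idx_parts.append(np.concatenate([cls_idx, extra]))
    idx = np.concatenate(idx_parts)
    return X[idx], y[idx]

# Toy unbalanced dataset: 8 rows of class 0, 2 rows of class 1.
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)

X_bal, y_bal = oversample_minority(X, y)
print(np.bincount(y_bal))
```

Oversampling is normally applied to the training rows only, so that the validation set keeps the original class balance.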

vinay · January 31, 2018, 1:16pm

Hi Jeremy, lesson 2 of this playlist is not working.

Brad_S · January 31, 2018, 2:27pm

Saw this and thought some of you might like it. Not sure of a better place to post it.

arminys · February 1, 2018, 7:40am

Hi everyone, I just got started in this ML course and I’m currently stuck at this Subsampling cell. Does anyone know how to resolve this error?

This error appears on both my Google VM and my local desktop .__.

alwc · February 1, 2018, 10:55am

Looking at the API of the library, I think you have to change it to

df_trn, y_trn, _ = proc_df(df_raw, 'SalePrice')

ecdrid · February 1, 2018, 9:20pm

What is regularised target encoding?
Any tips on this?
@radek @jamesrequa @alessa (sorry all)

Is it like subtracting the min value of each column?

arminys · February 2, 2018, 1:31am

I already tried that myself. The cell runs fine, but later cells yield different results from those in the videos.

Furthermore, the error stated that it expected 2 arguments, so it’s really confusing.

ecdrid · February 2, 2018, 2:12am

What exactly do you want to accomplish?

Have a look here

https://github.com/AdityaSoni19031997/Machine-Learning/blob/master/Classifying_datasets/Houses/house_kaggle.ipynb

alwc · February 2, 2018, 3:44am

Try

data = proc_df(df_raw, 'SalePrice')
df_trn, y_trn = data[0], data[1]

arminys · February 2, 2018, 11:40am

Hey, this works. In fact, df_trn, y_trn, _ = proc_df(df_raw, 'SalePrice') works too after I restarted my PC @_@. Thanks a lot.
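For later readers, the error in this exchange is the usual Python unpacking mismatch: a function returns three values while the calling line names only two. The sketch below reproduces it with `proc_df_stub`, a hypothetical stand-in written for this example (it is not the library's proc_df), and toy data.

```python
import pandas as pd

def proc_df_stub(df, y_fld):
    """Hypothetical stand-in that returns three values:
    features, target, and a dict of missing-value fills."""
    y = df[y_fld].values
    X = df.drop(columns=[y_fld])
    na_dict = {}  # the stub has nothing to record here
    return X, y, na_dict

# Toy data (made-up values).
df_raw = pd.DataFrame({"YearMade": [2004, 1996, 2001],
                       "SalePrice": [66000, 57000, 10000]})

# Two names on the left, three values on the right: this raises ValueError.
try:
    df_trn, y_trn = proc_df_stub(df_raw, "SalePrice")
except ValueError as err:
    message = str(err)
    print(message)

# Either name the third value explicitly or collect the extras with a star.
df_trn, y_trn, _ = proc_df_stub(df_raw, "SalePrice")
df_trn, y_trn, *rest = proc_df_stub(df_raw, "SalePrice")
```

The star form keeps working whether the function returns two values or more, which helps when notebook and library versions drift apart.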

SwethaGodi · February 2, 2018, 2:16pm

Hi everyone, I just got started in this ML course. Can someone help me understand the parameters passed to the functions below?

1. def fit(self, X, y, sample_weight=None): What are X and y here?

2. m.fit(df, y): What are df and y?

3. def print_score(m):
    res = [rmse(m.predict(X_train), y_train), rmse(m.predict(X_valid), y_valid),
           m.score(X_train, y_train), m.score(X_valid, y_valid)]
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)

What are X_train and y_train here?

Thanks in advance!

ecdrid · February 2, 2018, 3:39pm

I guess you should dig into some books a little, as these are standard notations used throughout ML… (don’t take it the wrong way)

df is a DataFrame
X_train, X_valid, etc. are the training, validation, and test features…
y here is the target variable as a NumPy ndarray
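To make the notation concrete, here is a minimal sketch on synthetic data, assuming scikit-learn's RandomForestRegressor as the model `m`. The column names and values are invented, `rmse` is a guess at the helper the question refers to, and `print_score` returns its list here only so the result can be inspected.

```python
import math
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# X (here a DataFrame) holds the input features, one row per example.
rng = np.random.default_rng(0)
df_all = pd.DataFrame({"YearMade": rng.integers(1990, 2010, 200),
                       "Hours": rng.integers(0, 5000, 200)})
# y is the target: the one value per row that the model should predict.
y_all = ((df_all["YearMade"] - 1990) * 1000.0 - 0.5 * df_all["Hours"]).values

# Training rows are used to fit the model; validation rows only to check it.
X_train, X_valid = df_all.iloc[:150], df_all.iloc[150:]
y_train, y_valid = y_all[:150], y_all[150:]

def rmse(pred, actual):
    """Root mean squared error between predictions and true targets."""
    return math.sqrt(((pred - actual) ** 2).mean())

def print_score(m):
    res = [rmse(m.predict(X_train), y_train), rmse(m.predict(X_valid), y_valid),
           m.score(X_train, y_train), m.score(X_valid, y_valid)]
    if hasattr(m, "oob_score_"):
        res.append(m.oob_score_)
    print(res)
    return res

m = RandomForestRegressor(n_estimators=40, oob_score=True, random_state=0)
m.fit(X_train, y_train)  # fit(X, y): learn to map features to target
res = print_score(m)
```

The printed list is, in order: training RMSE, validation RMSE, training R², validation R², and the out-of-bag score when the model has one.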

jeremy · February 2, 2018, 5:49pm

This is designed to be a standalone intro to machine learning - we shouldn’t be asking people to read other books to understand it! It sounds like we may need to add more information to the notebooks to help people interpret them.

ecdrid · February 2, 2018, 6:08pm

Actually I am working on collecting different shorthands fast.ai uses…

Will share once I gather enough…

rishaanp · February 3, 2018, 12:21am

Is the low accuracy when you one-hot encode (OHE) all variables caused by each tree selecting a random subset of the features? Each tree would then have less information to learn from.