Another treat! Early access to Intro To Machine Learning videos

(Joao Ramos) #768

I did a quick running example using the bulldozers dataset and RFs. I tried to show the before/after of each step.

(Naveed Unjum) #769

I apply apply_cats to the test set. How do I then convert those categorical values to numbers, given that proc_df is only for the train set with a target variable? Also, could you please share one of the kernels that does this?

(Aditya) #770

via and assigning them to the data frame…
It’s in the notebook.
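A minimal pandas-only sketch of the idea, in case it helps: reuse the train categories on the test column, then take the integer codes. The column name and values are made up for illustration, and the +1 (so that unseen or missing values become 0) is an assumption about how one might want the codes laid out, not a statement of what any particular fastai version does.

```python
import pandas as pd

# Toy data; "size" and its values are hypothetical.
train = pd.DataFrame({"size": ["small", "large", "medium", "small"]})
test = pd.DataFrame({"size": ["medium", "huge", "small"]})

# Turn the train column into a categorical (categories are sorted strings).
train["size"] = train["size"].astype("category")

# Give the test column exactly the same categories as train.
# Values not seen in train ("huge") become NaN, i.e. code -1.
test["size"] = pd.Categorical(test["size"],
                              categories=train["size"].cat.categories)

# Replace the categories with their integer codes, shifted by one
# so that unseen/missing values map to 0 instead of -1.
train_codes = train["size"].cat.codes + 1
test_codes = test["size"].cat.codes + 1

print(train_codes.tolist())  # [3, 1, 2, 3]
print(test_codes.tolist())   # [2, 0, 3]
```

The codes can then be assigned back to the data frame column, so the same string always maps to the same number in train and test.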


I am getting the same issue on Paperspace today; strangely, it ran fine yesterday. What is the issue, and what could be the fix?

My setup:
RAM: 30 GB
HD: 34.7 GB / 250 GB


Managed to get it to work; I had to comment out the save to feather for some reason…


I am getting a memory error when executing proc_df on the New York City Taxi Fare Prediction training data, which is about 55M rows and 7 columns. Is there any way to make it work on the machine I am using at Paperspace (spec below), or do I need to upgrade to a machine with more memory?

Paperspace machine setup:
RAM: 30 GB
HD: 34.7 GB / 250 GB


df, y, nas = proc_df(df_raw, 'fare_amount')

error details:

MemoryError                               Traceback (most recent call last)
~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/ in _getbool_axis(self, key, axis)
   1495         try:
-> 1496             return self.obj._take(inds, axis=axis)
   1497         except Exception as detail:

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/ in _take(self, indices, axis, is_copy)
   2784     def _take(self, indices, axis=0, is_copy=True):
-> 2785         self._consolidate_inplace()

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/ in _consolidate_inplace(self)
-> 4439         self._protect_consolidate(f)

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/ in _protect_consolidate(self, f)
   4427         blocks_before = len(self._data.blocks)
-> 4428         result = f()
   4429         if len(self._data.blocks) != blocks_before:

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/ in f()
   4436         def f():
-> 4437             self._data = self._data.consolidate()

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/ in consolidate(self)
   4097         bm._is_consolidated = False
-> 4098         bm._consolidate_inplace()
   4099         return bm

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/ in _consolidate_inplace(self)
   4102         if not self.is_consolidated():
-> 4103             self.blocks = tuple(_consolidate(self.blocks))
   4104             self._is_consolidated = True

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/ in _consolidate(blocks)
   5068         merged_blocks = _merge_blocks(list(group_blocks), dtype=dtype,
-> 5069                                       _can_consolidate=_can_consolidate)
   5070         new_blocks = _extend_blocks(merged_blocks, new_blocks)

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/ in _merge_blocks(blocks, dtype, _can_consolidate)
   5091         argsort = np.argsort(new_mgr_locs)
-> 5092         new_values = new_values[argsort]
   5093         new_mgr_locs = new_mgr_locs[argsort]


During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-23-35a73d8cb330> in <module>()
----> 1 df, y, nas = proc_df(df_raw, 'fare_amount')

~/fastai/courses/ml1/fastai/ in proc_df(df, y_fld, skip_flds, ignore_flds, do_scale, na_dict, preproc_fn, max_n_cat, subset, mapper)
    445     if do_scale: mapper = scale_vars(df, mapper)
    446     for n,c in df.items(): numericalize(df, c, n, max_n_cat)
--> 447     df = pd.get_dummies(df, dummy_na=True)
    448     df = pd.concat([ignored_flds, df], axis=1)
    449     res = [df, y, na_dict]

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/reshape/ in get_dummies(data, prefix, prefix_sep, dummy_na, columns, sparse, drop_first, dtype)
    840         if columns is None:
    841             data_to_encode = data.select_dtypes(
--> 842                 include=dtypes_to_encode)
    843         else:
    844             data_to_encode = data[columns]

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/ in select_dtypes(self, include, exclude)
   3090         dtype_indexer = include_these & exclude_these
-> 3091         return self.loc[com._get_info_slice(self, dtype_indexer)]
   3093     def _box_item_values(self, key, values):

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/ in __getitem__(self, key)
   1470             except (KeyError, IndexError):
   1471                 pass
-> 1472             return self._getitem_tuple(key)
   1473         else:
   1474             # we by definition only have the 0th axis

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/ in _getitem_tuple(self, tup)
    888                 continue
--> 890             retval = getattr(retval,, axis=i)
    892         return retval

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/ in _getitem_axis(self, key, axis)
   1866             return self._get_slice_axis(key, axis=axis)
   1867         elif com.is_bool_indexer(key):
-> 1868             return self._getbool_axis(key, axis=axis)
   1869         elif is_list_like_indexer(key):

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/ in _getbool_axis(self, key, axis)
   1496             return self.obj._take(inds, axis=axis)
   1497         except Exception as detail:
-> 1498             raise self._exception(detail)
   1500     def _get_slice_axis(self, slice_obj, axis=None):

KeyError: MemoryError()
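
One thing that may be worth trying before upgrading the machine is shrinking the numeric dtypes of the raw frame, since the traceback shows pandas running out of memory while copying blocks. The sketch below is only an illustration under assumptions: downcast_numeric is a hypothetical helper (not part of fastai), the small random frame stands in for the taxi data, and the column names other than fare_amount are assumed. Whether the saving is enough for 55M rows is not guaranteed.

```python
import numpy as np
import pandas as pd

def downcast_numeric(df):
    """Hypothetical helper: return a copy with numeric columns in smaller dtypes."""
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if pd.api.types.is_float_dtype(col):
            # e.g. float64 -> float32 where the values allow it
            out[name] = pd.to_numeric(col, downcast="float")
        elif pd.api.types.is_integer_dtype(col):
            # e.g. int64 -> int8 for small counts
            out[name] = pd.to_numeric(col, downcast="integer")
    return out

# Small stand-in for the real data (values are random).
rng = np.random.RandomState(0)
df_demo = pd.DataFrame({
    "fare_amount": rng.rand(1000) * 50,
    "pickup_longitude": -74 + rng.rand(1000),
    "passenger_count": rng.randint(1, 7, size=1000),
})

df_small = downcast_numeric(df_demo)
before = df_demo.memory_usage(deep=True).sum()
after = df_small.memory_usage(deep=True).sum()
print(before, after)
```

Note that float32 keeps only about 7 significant digits, so check that this precision is acceptable for coordinate columns before relying on it.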

(Luke Byrne) #775

Hi all,

Could someone, if they have the time, have a look at this Gist? I am using the random forest method and applying it to ‘Kaggle - House Prices: Advanced Regression Techniques’.

All the techniques work really well; I am just coming unstuck when I try to do an m.predict(test_df) at the very end.

I have done proc_df and apply_cats; however, it doesn’t seem that I am doing this correctly, as it blows up when I try to make predictions.

Any help/suggestions most welcome.

Kind regards,


(Luke Byrne) #776

Hi Chamin,

I am running into this exact same problem; I have posted my notebook here.

Were you able to solve this error?

Kind regards,


(Kieran) #777

Hey Luke

I think firstly there is an error on In[15]. I’m not quite sure what you are trying to do on this line.

I think the issue is that you ran proc_df() on df_test at the beginning. By the time you trained the most recent model on df_trn2 at In[37], the training set will have more features because of the one-hot encoding and other processing. If you check len(df_trn2.columns) on your last line, I would expect it to be 250 columns long. You need to run df_test through the same steps you ran df_train through so that it ends up with the same columns.
I hope that makes sense. The Rossmann example from the DL1 course is good for demonstrating this.
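
A minimal sketch of the column-mismatch problem and one possible way around it, using plain pandas rather than proc_df. The toy frames and column names are hypothetical; the point is only that one-hot encoding train and test separately can give different columns, and that reindexing the test frame to the train columns lines them up again.

```python
import pandas as pd

# Toy data; "area" and "zone" are made-up columns.
train = pd.DataFrame({"area": [50, 80, 120], "zone": ["A", "B", "C"]})
test = pd.DataFrame({"area": [60, 90], "zone": ["B", "D"]})

# One-hot encode the training features.
X_train = pd.get_dummies(train)

# Encoding the test set on its own gives different columns
# (zone_B and zone_D only), so align it to the training columns:
# missing columns are added as 0, unseen ones (zone_D) are dropped.
X_test = pd.get_dummies(test).reindex(columns=X_train.columns, fill_value=0)

print(list(X_train.columns))
print(list(X_test.columns))
```

After this step a model fitted on X_train sees the same number and order of features when predicting on X_test.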

(Luke Byrne) #778

Hi Kieran,

Thanks for taking a look, I will go through it in the next few days and see if that fixes it.



(Sumit) #779


df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice', subset=30000, na_dict=nas)
X_train, _ = split_vals(df_trn, 20000)
y_train, _ = split_vals(y_trn, 20000)

In the above snippet, we get 30k random rows from df_raw, which has 401125 rows.
Looking at how it fetches those 30k rows, I found that the code below picks 30k random indexes:

idxs = sorted(np.random.permutation(len(df))[:30000]) # in get_sample

Now I’m confused about why the indexes that are in X_valid

def split_vals(a,n): return a[:n].copy(), a[n:].copy()

n_valid = 12000  # same as Kaggle's test set size
n_trn = len(df)-n_valid
raw_train, raw_valid = split_vals(df_raw, n_trn)
X_train, X_valid = split_vals(df, n_trn)
y_train, y_valid = split_vals(y, n_trn)

X_train.shape, y_train.shape, X_valid.shape

are never picked up in the above idxs?

As a result, X_train has totally different indexes from X_valid.

Please correct me.
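
One way to check this empirically is sketched below. It assumes the subset is drawn exactly as in the get_sample line quoted above (a sorted random permutation) and that X_valid is the last 12000 rows of the full frame; the seed is arbitrary. Because the sampled indexes are sorted and only the first 20000 of them are kept for training, the largest training index should sit far below the start of the validation range.

```python
import numpy as np

def split_vals(a, n):
    return a[:n].copy(), a[n:].copy()

n_rows = 401125            # length of df_raw, as stated in the post
n_valid = 12000
n_trn = n_rows - n_valid   # validation rows start at this position

np.random.seed(0)          # arbitrary seed, for repeatability only
idxs = np.array(sorted(np.random.permutation(n_rows)[:30000]))

# Keep the first 20000 sampled positions for training, as in the snippet.
train_idxs, _ = split_vals(idxs, 20000)
valid_idxs = np.arange(n_trn, n_rows)

overlap = np.intersect1d(train_idxs, valid_idxs)
print(train_idxs.max(), n_trn, len(overlap))
```

The full set of 30000 sampled indexes can still contain validation rows; it is the cut at 20000 that discards them.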


I have just completed Lesson 1, so I can’t be 100% sure but can you check what is the shape of X_Train too? Because that is the input dataframe for your model and not df_train. Ideally, they should be the same but just verify once.


I am looking at the code for add_datepart and I am confused. What do these lines actually do?


if isinstance(fld_dtype, pd.core.dtypes.dtypes.DatetimeTZDtype):
    fld_dtype = np.datetime64


df[targ_pre + 'Elapsed'] = fld.astype(np.int64) // 10 ** 9
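
A small sketch of what these two pieces appear to do, on made-up dates. Casting a nanosecond datetime column to int64 gives nanoseconds since 1970-01-01, so integer division by 10 ** 9 turns that into whole seconds. The isinstance check detects timezone-aware columns, whose dtype is a pandas extension dtype rather than a plain numpy one; as far as I can tell, swapping it for np.datetime64 lets a later numpy dtype check treat the column as a datetime. The explicit cast to datetime64[ns] below is my own addition to make the unit unambiguous.

```python
import numpy as np
import pandas as pd

# Two example timestamps (hypothetical data), stored with nanosecond precision.
fld = pd.Series(
    pd.to_datetime(["1970-01-01 00:00:10", "2000-01-01 00:00:00"])
).astype("datetime64[ns]")

# int64 view = nanoseconds since the epoch; // 10 ** 9 converts to seconds.
elapsed = fld.astype(np.int64) // 10 ** 9
print(elapsed.tolist())  # [10, 946684800]

# A timezone-aware column has a DatetimeTZDtype, a naive one does not.
aware = fld.dt.tz_localize("UTC")
print(isinstance(aware.dtype, pd.DatetimeTZDtype))  # True
print(isinstance(fld.dtype, pd.DatetimeTZDtype))    # False
```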



I am trying my first Kaggle Competition after the first lesson. I chose House Prices prediction.

I have tried to replicate all that was done in the first lesson on machine learning, but when I got to fit my model, I got a ValueError: n_estimators must be an integer, got <class 'pandas.core.frame.DataFrame'>.

I don’t understand what I did wrong or what more I should have done. Shouldn’t the train_cats function take care of strings in the dataframe and convert them all to numeric automatically?


You can verify the data types of your df with df.dtypes (it is an attribute, not a method) and see if there are any non-numeric and non-categorical values.

(Joao Ramos) #784

Without seeing your code, it seems like you’re passing the pandas dataframe to the RandomForestRegressor constructor (the first argument is n_estimators and it expects an integer, but you’re giving it a dataframe). Remember that you first create the model, and only then fit it by calling its fit method with the training data (…, y_train).
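
A minimal sketch of the two-step pattern described above, on random stand-in data. The variable names m, X_train and y_train are assumptions chosen to match the usual lesson notebooks, not the poster's actual code.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Random stand-in data: 100 rows, 3 numeric features.
rng = np.random.RandomState(0)
X_train = rng.rand(100, 3)
y_train = X_train[:, 0] * 2 + rng.rand(100) * 0.1

# Step 1: create the model. The constructor takes hyperparameters only;
# the first positional argument is n_estimators, so no data goes here.
m = RandomForestRegressor(n_estimators=10, random_state=0)

# Step 2: fit the model by passing the data to fit().
m.fit(X_train, y_train)

preds = m.predict(X_train[:5])
print(preds.shape)  # (5,)
```

Writing RandomForestRegressor(X_train, y_train) instead would put the data where n_estimators is expected, which matches the error message quoted above.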

(Carlos Crespo) #785


In lesson 7, minute 17 a paper regarding resampling of unbalanced classes in training sets is mentioned. Could someone please help me find that paper (authors, title, or link)?

Thanks in advance for your help, and for this amazing set of lessons.

(Carlos Crespo) #786

I think I found it in this other thread How to duplicate training examples to handle class imbalance.

Copying the link in case it is of interest for anyone else


(Ramesh Kumar Singh) #788

Hi Utkarsh,

As I understood it, it’s a fastai function that is used to update the learning rate for the optimisation of weights and biases.

(Ramesh Kumar Singh) #789

I think min_samples_split is required to make sure you perform a split only if the number of samples/rows/objects at the current node is greater than or equal to min_samples_split. If there are fewer, the split won’t happen. So if max_depth is None and you have specified min_samples_split, the tree is not going to grow any further at a node that contains fewer than min_samples_split samples/rows.
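
This can be checked on a single scikit-learn decision tree. The sketch below uses random stand-in data and an arbitrary threshold of 20; it inspects the fitted tree and confirms that every node that was actually split held at least min_samples_split samples.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Random stand-in data: 200 rows, 2 features.
rng = np.random.RandomState(0)
X = rng.rand(200, 2)
y = X[:, 0] + rng.rand(200) * 0.1

min_split = 20  # arbitrary value for illustration
tree = DecisionTreeRegressor(max_depth=None,
                             min_samples_split=min_split,
                             random_state=0)
tree.fit(X, y)

t = tree.tree_
# Leaves are marked with a child index of -1; everything else was split.
is_internal = t.children_left != -1
smallest_split_node = t.n_node_samples[is_internal].min()

print(smallest_split_node >= min_split)  # True
```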