Another treat! Early access to Intro To Machine Learning videos

(Kieran) #762

I think he says somehwere that when you look at the dataframe it will still show you it in string format but actually when you m.predict it will use the number - I remember it being somewhere in the lectures I will look tonight.


(aekapong1) #763

ขอบคุนครับได้ความรู้เยอะเเยะเรยขอบคุนมากๆๆคับผมจะนำไปใช้ที่หลังคับ : M[size=1px]ufabet[/size]

(Naveed Unjum) #764

Uploading… This is after applying train_cats,and setting the UsageBand to codes

(Joao Ramos) #765

If I’m not mistaken you should use apply_cats() on the test set.

(Naveed Unjum) #766

It still doesnt work. proc_df changes the categorical values into numbers. But since we don’t have a y variable in the test set, that won’t work. So how do i change the categorical data into number data?
I will appreciate if you share a link to one of your kernels showing the same.

(Joao Ramos) #768

I did a quick running example using the bulldozers dataset and RFs. I tried to show the before/after of each step.

(Naveed Unjum) #769

I app-ly apply_cats to the test set. Then how do i change those categorical values to numbers as proc_df is only for the train set with target variable? Also if you could please share one of the kernels with this application

(Aditya) #770

via and assigning them to the data frame…
It’s in the notebook .


i am getting the same issue at paperspace today, strangely yesterday it run fine, hmm… what is the issue and what could be the fix?

my setup:
RAM: 30 GB
HD: 34.7 GB / 250 GB


managed to get it work, had to comment out the save to feather for some reason…


I am getting memory error when executing proc_df on New York City Taxi Fare Prediction training data, which is about 55M rows, 7 columns. Is there any way to make it work at the machine i am using at paperspace (spec below)? Or do i need to upgrade to machine with more memory?

Paperspace machine setup:
RAM: 30 GB
HD: 34.7 GB / 250 GB


df, y, nas = proc_df(df_raw, 'fare_amount')

error details:

MemoryError                               Traceback (most recent call last)
~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/ in _getbool_axis(self, key, axis)
   1495         try:
-> 1496             return self.obj._take(inds, axis=axis)
   1497         except Exception as detail:

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/ in _take(self, indices, axis, is_copy)
   2784     def _take(self, indices, axis=0, is_copy=True):
-> 2785         self._consolidate_inplace()

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/ in _consolidate_inplace(self)
-> 4439         self._protect_consolidate(f)

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/ in _protect_consolidate(self, f)
   4427         blocks_before = len(self._data.blocks)
-> 4428         result = f()
   4429         if len(self._data.blocks) != blocks_before:

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/ in f()
   4436         def f():
-> 4437             self._data = self._data.consolidate()

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/ in consolidate(self)
   4097         bm._is_consolidated = False
-> 4098         bm._consolidate_inplace()
   4099         return bm

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/ in _consolidate_inplace(self)
   4102         if not self.is_consolidated():
-> 4103             self.blocks = tuple(_consolidate(self.blocks))
   4104             self._is_consolidated = True

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/ in _consolidate(blocks)
   5068         merged_blocks = _merge_blocks(list(group_blocks), dtype=dtype,
-> 5069                                       _can_consolidate=_can_consolidate)
   5070         new_blocks = _extend_blocks(merged_blocks, new_blocks)

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/ in _merge_blocks(blocks, dtype, _can_consolidate)
   5091         argsort = np.argsort(new_mgr_locs)
-> 5092         new_values = new_values[argsort]
   5093         new_mgr_locs = new_mgr_locs[argsort]


During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-23-35a73d8cb330> in <module>()
----> 1 df, y, nas = proc_df(df_raw, 'fare_amount')

~/fastai/courses/ml1/fastai/ in proc_df(df, y_fld, skip_flds, ignore_flds, do_scale, na_dict, preproc_fn, max_n_cat, subset, mapper)
    445     if do_scale: mapper = scale_vars(df, mapper)
    446     for n,c in df.items(): numericalize(df, c, n, max_n_cat)
--> 447     df = pd.get_dummies(df, dummy_na=True)
    448     df = pd.concat([ignored_flds, df], axis=1)
    449     res = [df, y, na_dict]

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/reshape/ in get_dummies(data, prefix, prefix_sep, dummy_na, columns, sparse, drop_first, dtype)
    840         if columns is None:
    841             data_to_encode = data.select_dtypes(
--> 842                 include=dtypes_to_encode)
    843         else:
    844             data_to_encode = data[columns]

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/ in select_dtypes(self, include, exclude)
   3090         dtype_indexer = include_these & exclude_these
-> 3091         return self.loc[com._get_info_slice(self, dtype_indexer)]
   3093     def _box_item_values(self, key, values):

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/ in __getitem__(self, key)
   1470             except (KeyError, IndexError):
   1471                 pass
-> 1472             return self._getitem_tuple(key)
   1473         else:
   1474             # we by definition only have the 0th axis

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/ in _getitem_tuple(self, tup)
    888                 continue
--> 890             retval = getattr(retval,, axis=i)
    892         return retval

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/ in _getitem_axis(self, key, axis)
   1866             return self._get_slice_axis(key, axis=axis)
   1867         elif com.is_bool_indexer(key):
-> 1868             return self._getbool_axis(key, axis=axis)
   1869         elif is_list_like_indexer(key):

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/ in _getbool_axis(self, key, axis)
   1496             return self.obj._take(inds, axis=axis)
   1497         except Exception as detail:
-> 1498             raise self._exception(detail)
   1500     def _get_slice_axis(self, slice_obj, axis=None):

KeyError: MemoryError()

(Luke Byrne) #775

Hi all,

Can someone if they have the time have a look at this Gist. I am using the Random Forrest method and applying it to the ‘Kaggle - House Prices: Advanced Regression Techniques’.

All techniques work really well, I am just coming unstuck when I try do a m.predict(test_df) at the very end.

I have done proc_df and apply_cats, however it doesn’t seem that I am doing this correctly as it blows up when I try to make predictions.

Any help/suggestions most welcome.

Kind regards,


(Luke Byrne) #776

Hi Chamin,

I am running into this exact same problem, I have posted my notebook here.

Were you able to solve for this error?

Kind regards,


(Kieran) #777

Hey Luke

I think firstly there is an error on In[15]. Im not quite sure what you are trying to do on this line.

I think the issue is that you did your proc_df() on df_test at the beginning - by the time you trained the most recent model on df_trn2 at In[37] the features will be larger due to the one hot encoding and other processing. If you check on your last line len(df_trn2.columns) I would expect it to be 250 columns long. You need to run your df_test through the same protocols you ran your df_train through to have the same columns at the end.
I hope that makes sense - the example in the Rossman from the DL1 course is good for demonstrating this.

(Luke Byrne) #778

Hi Kieran,

Thanks for taking a look, I will go through it in the next few days and see if that fixes it.



(Sumit) #779


df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice', subset=30000, na_dict=nas)
X_train, _ = split_vals(df_trn, 20000)
y_train, _ = split_vals(y_trn, 20000)

In the above snippet, we get 30k random rows from df_raw which is of length 401125.
And I went on to see how it fetches 30k rows I found this code below fetches random 30k indexes

idxs = sorted(np.random.permutation(len(df))[:30000]) # in get_sample

Now, i’m confused why it’s not that the indexes which are in X_valid

def split_vals(a,n): return a[:n].copy(), a[n:].copy()

n_valid = 12000  # same as Kaggle's test set size
n_trn = len(df)-n_valid
raw_train, raw_valid = split_vals(df_raw, n_trn)
X_train, X_valid = split_vals(df, n_trn)
y_train, y_valid = split_vals(y, n_trn)

X_train.shape, y_train.shape, X_valid.shape

never picks up in the above idxs ?

As a result of which X_train has total different indexes as compared to X_valid.

Please correct me.


I have just completed Lesson 1, so I can’t be 100% sure but can you check what is the shape of X_Train too? Because that is the input dataframe for your model and not df_train. Ideally, they should be the same but just verify once.


I am looking at the code for add_datepart and I am confused about - What do these lines actually do?


if isinstance(fld_dtype, pd.core.dtypes.dtypes.DatetimeTZDtype):
    fld_dtype = np.datetime64


df[targ_pre + 'Elapsed'] = fld.astype(np.int64) // 10 ** 9



I am trying my first Kaggle Competition after the first lesson. I chose House Prices prediction.

I have tried to replicate all that was done in the first lesson on machine learning, but when I got to fit my model, I got a ValueError: n_estimators must be an integer, got <class 'pandas.core.frame.DataFrame'>.

I don’t understand what I did wrong or what more I should have done: shouldn’t the train_cats function take care of strings in the dataframe and convert them all automatically to numeric ?


You can verify the data types of your df by doing df.dtypes() and see if there are any non-numeric and non-categorical values.