Another treat! Early access to Intro To Machine Learning videos

I am getting a memory error when executing proc_df on the New York City Taxi Fare Prediction training data, which is about 55M rows by 7 columns. Is there any way to make it work on the machine I am using at Paperspace (specs below), or do I need to upgrade to a machine with more memory?

Paperspace machine setup:
MACHINE TYPE: P4000 HOURLY
REGION: CA1
RAM: 30 GB
CPUS: 8
HD: 34.7 GB / 250 GB
GPU: 8 GB

code:

df, y, nas = proc_df(df_raw, 'fare_amount')

error details:

MemoryError                               Traceback (most recent call last)
~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/indexing.py in _getbool_axis(self, key, axis)
   1495         try:
-> 1496             return self.obj._take(inds, axis=axis)
   1497         except Exception as detail:

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/generic.py in _take(self, indices, axis, is_copy)
   2784     def _take(self, indices, axis=0, is_copy=True):
-> 2785         self._consolidate_inplace()
   2786 

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/generic.py in _consolidate_inplace(self)
   4438 
-> 4439         self._protect_consolidate(f)
   4440 

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/generic.py in _protect_consolidate(self, f)
   4427         blocks_before = len(self._data.blocks)
-> 4428         result = f()
   4429         if len(self._data.blocks) != blocks_before:

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/generic.py in f()
   4436         def f():
-> 4437             self._data = self._data.consolidate()
   4438 

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/internals.py in consolidate(self)
   4097         bm._is_consolidated = False
-> 4098         bm._consolidate_inplace()
   4099         return bm

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/internals.py in _consolidate_inplace(self)
   4102         if not self.is_consolidated():
-> 4103             self.blocks = tuple(_consolidate(self.blocks))
   4104             self._is_consolidated = True

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/internals.py in _consolidate(blocks)
   5068         merged_blocks = _merge_blocks(list(group_blocks), dtype=dtype,
-> 5069                                       _can_consolidate=_can_consolidate)
   5070         new_blocks = _extend_blocks(merged_blocks, new_blocks)

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/internals.py in _merge_blocks(blocks, dtype, _can_consolidate)
   5091         argsort = np.argsort(new_mgr_locs)
-> 5092         new_values = new_values[argsort]
   5093         new_mgr_locs = new_mgr_locs[argsort]

MemoryError: 

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-23-35a73d8cb330> in <module>()
----> 1 df, y, nas = proc_df(df_raw, 'fare_amount')

~/fastai/courses/ml1/fastai/structured.py in proc_df(df, y_fld, skip_flds, ignore_flds, do_scale, na_dict, preproc_fn, max_n_cat, subset, mapper)
    445     if do_scale: mapper = scale_vars(df, mapper)
    446     for n,c in df.items(): numericalize(df, c, n, max_n_cat)
--> 447     df = pd.get_dummies(df, dummy_na=True)
    448     df = pd.concat([ignored_flds, df], axis=1)
    449     res = [df, y, na_dict]

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/reshape/reshape.py in get_dummies(data, prefix, prefix_sep, dummy_na, columns, sparse, drop_first, dtype)
    840         if columns is None:
    841             data_to_encode = data.select_dtypes(
--> 842                 include=dtypes_to_encode)
    843         else:
    844             data_to_encode = data[columns]

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/frame.py in select_dtypes(self, include, exclude)
   3089 
   3090         dtype_indexer = include_these & exclude_these
-> 3091         return self.loc[com._get_info_slice(self, dtype_indexer)]
   3092 
   3093     def _box_item_values(self, key, values):

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/indexing.py in __getitem__(self, key)
   1470             except (KeyError, IndexError):
   1471                 pass
-> 1472             return self._getitem_tuple(key)
   1473         else:
   1474             # we by definition only have the 0th axis

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
    888                 continue
    889 
--> 890             retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
    891 
    892         return retval

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   1866             return self._get_slice_axis(key, axis=axis)
   1867         elif com.is_bool_indexer(key):
-> 1868             return self._getbool_axis(key, axis=axis)
   1869         elif is_list_like_indexer(key):
   1870 

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/indexing.py in _getbool_axis(self, key, axis)
   1496             return self.obj._take(inds, axis=axis)
   1497         except Exception as detail:
-> 1498             raise self._exception(detail)
   1499 
   1500     def _get_slice_axis(self, slice_obj, axis=None):

KeyError: MemoryError()
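
One workaround worth trying (a sketch, assuming the numeric columns are pandas' default 64-bit dtypes; not a guaranteed fix for this exact traceback) is to shrink the frame before calling proc_df, or to let proc_df work on a sample via its subset argument:

import numpy as np

# Downcast 64-bit numerics to 32-bit, roughly halving the frame's footprint:
for col in df_raw.select_dtypes(include=['float64']).columns:
    df_raw[col] = df_raw[col].astype(np.float32)
for col in df_raw.select_dtypes(include=['int64']).columns:
    df_raw[col] = df_raw[col].astype(np.int32)

# Or process only a random sample of rows first; proc_df supports this directly
# (the same subset argument appears later in this thread):
df, y, nas = proc_df(df_raw, 'fare_amount', subset=1_000_000)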

Hi all,

Can someone, if they have the time, have a look at this Gist? I am using the Random Forest method and applying it to Kaggle's 'House Prices: Advanced Regression Techniques' competition.

All techniques work really well; I am just coming unstuck when I try to do m.predict(test_df) at the very end.

I have done proc_df and apply_cats, but it doesn't seem that I am doing this correctly, as it blows up when I try to make predictions.

Any help/suggestions most welcome.

Kind regards,

Luke

Hi Chamin,

I am running into this exact same problem; I have posted my notebook here.

Were you able to solve this error?

Kind regards,

Luke

Hey Luke

I think, firstly, there is an error in In[15]; I'm not quite sure what you are trying to do on that line.

I think the issue is that you ran proc_df() on df_test at the beginning. By the time you trained the most recent model on df_trn2 at In[37], the feature set had grown because of the one-hot encoding and other processing; if you check len(df_trn2.columns) on your last line, I would expect it to be around 250 columns. You need to run df_test through the same steps you ran df_train through so that both end up with the same columns.
I hope that makes sense; the Rossmann example from the DL1 course is good for demonstrating this.
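
Roughly what that looks like in code (a sketch only; the variable names df_raw, df_trn2, nas and m are assumed from the thread, and if your version of proc_df requires a y field, pass the target name and discard the returned y):

# Map the test set's string columns onto the training set's category codes:
apply_cats(df_test, df_raw)

# Re-run the same preprocessing, reusing the training na_dict so the same
# *_na indicator columns get created:
df_test_proc, _, _ = proc_df(df_test, na_dict=nas)

# Align to the training columns in case get_dummies produced a different set:
df_test_proc = df_test_proc.reindex(columns=df_trn2.columns, fill_value=0)

preds = m.predict(df_test_proc)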

Hi Kieran,

Thanks for taking a look, I will go through it in the next few days and see if that fixes it.

Cheers,

Luke

Hi,

df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice', subset=30000, na_dict=nas)
X_train, _ = split_vals(df_trn, 20000)
y_train, _ = split_vals(y_trn, 20000)

In the above snippet we get 30k random rows from df_raw, which has length 401125.
When I went to see how it fetches those 30k rows, I found that the code below (from get_sample) picks 30k random indexes:

idxs = sorted(np.random.permutation(len(df))[:30000]) # in get_sample

Now I'm confused: why is it that the indexes which end up in X_valid, created below,

def split_vals(a,n): return a[:n].copy(), a[n:].copy()

n_valid = 12000  # same as Kaggle's test set size
n_trn = len(df)-n_valid
raw_train, raw_valid = split_vals(df_raw, n_trn)
X_train, X_valid = split_vals(df, n_trn)
y_train, y_valid = split_vals(y, n_trn)

X_train.shape, y_train.shape, X_valid.shape

are never picked up in the above idxs?

As a result, X_train ends up with completely different indexes compared to X_valid.
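
To make that observation concrete: split_vals slices by position, not by index label, so the labels that land in each piece depend entirely on the row order of whatever frame you pass in. A toy illustration (hypothetical data):

import pandas as pd

def split_vals(a, n): return a[:n].copy(), a[n:].copy()

df_demo = pd.DataFrame({'x': range(5)}, index=[10, 3, 42, 7, 99])
train, valid = split_vals(df_demo, 3)
print(train.index.tolist())  # [10, 3, 42] - the first 3 positions, whatever their labels
print(valid.index.tolist())  # [7, 99]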

Please correct me.

I have just completed Lesson 1, so I can't be 100% sure, but can you check the shape of X_train too? That is the input dataframe for your model, not df_train. Ideally they should be the same, but it's worth verifying.

I am looking at the code for add_datepart and I am confused: what do these lines actually do?

First,

if isinstance(fld_dtype, pd.core.dtypes.dtypes.DatetimeTZDtype):
    fld_dtype = np.datetime64

Second,

df[targ_pre + 'Elapsed'] = fld.astype(np.int64) // 10 ** 9
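
For what it's worth, here is my reading of those two lines (annotated; based on how pandas represents datetimes, not on any documentation for structured.py):

# If the column has a timezone-aware dtype, treat it as a plain datetime64
# so the later np.issubdtype(..., np.datetime64) check still recognizes it
# as a date column:
if isinstance(fld_dtype, pd.core.dtypes.dtypes.DatetimeTZDtype):
    fld_dtype = np.datetime64

# pandas stores datetime64[ns] values as int64 nanoseconds since the Unix
# epoch; integer-dividing by 10**9 converts that to whole seconds since the
# epoch, giving a single monotonically increasing 'Elapsed' feature:
df[targ_pre + 'Elapsed'] = fld.astype(np.int64) // 10 ** 9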

Hello,

I am trying my first Kaggle Competition after the first lesson. I chose House Prices prediction.

I have tried to replicate all that was done in the first lesson on machine learning, but when I got to fitting my model, I got ValueError: n_estimators must be an integer, got <class 'pandas.core.frame.DataFrame'>.

I don't understand what I did wrong or what more I should have done: shouldn't the train_cats function take care of the strings in the dataframe and convert them all to numeric automatically?

You can verify the data types of your df with df.dtypes (it is an attribute, not a method) and see if there are any non-numeric and non-categorical columns.
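
For example (a quick sketch; df is your training DataFrame):

print(df.dtypes)                                      # dtype of every column
print(df.select_dtypes(include=['object']).columns)   # leftover string columns, if any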


Without seeing your code, it sounds like you're passing the pandas DataFrame to the RandomForestRegressor constructor (its first argument is n_estimators, which expects an integer, but you're giving it a DataFrame). Remember that you first create the model, and only then fit it with m.fit(X_train, y_train).
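
If that is what happened, the fix is just to separate construction from fitting (a sketch with the lesson's usual hyperparameters; your variable names may differ):

from sklearn.ensemble import RandomForestRegressor

# Wrong: RandomForestRegressor(X_train) would put the DataFrame into n_estimators.
# Right: build the model first, then fit it on the data:
m = RandomForestRegressor(n_estimators=40, n_jobs=-1)
m.fit(X_train, y_train)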

Hello,

In lesson 7, around minute 17, a paper on resampling unbalanced classes in training sets is mentioned. Could someone please help me find that paper (authors, title, or link)?

Thanks in advance for your help, and for this amazing set of lessons.

I think I found it in this other thread: How to duplicate training examples to handle class imbalance.

Copying the link here in case it is of interest to anyone else: https://arxiv.org/pdf/1710.05381.pdf

Thanks.

Hi Utkarsh,

As I understood it, it's a fastai function which is used to update the learning rate for the optimisation of the weights and biases.

I think min_samples_split is there to make sure you perform a split only if the number of samples/rows at the current node is greater than or equal to min_samples_split; with any fewer, the split won't happen. So if max_depth is None and you have specified min_samples_split, the tree is not going to grow any further once a node contains fewer than min_samples_split samples.
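
As a concrete sketch of that interaction (the parameter values are just for illustration):

from sklearn.ensemble import RandomForestRegressor

# Even with max_depth=None ("grow until pure"), a node holding fewer than
# min_samples_split samples is never split again and becomes a leaf:
m = RandomForestRegressor(n_estimators=40, max_depth=None,
                          min_samples_split=20, n_jobs=-1)
m.fit(X_train, y_train)  # X_train/y_train as in the lesson notebook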

I second @jpramos' reply. Try naming the parameters you are passing, and hopefully that will resolve it.

Hey Sashank,
I think it is fine to use categorical codes for random forest/ensemble/tree models and still explain the results. If we use categorical codes in linear or logistic regression, though, the arbitrary numeric values assigned to the categories can act as a spurious ordering and interfere with building the model. So wherever you are using decision/rule-based models it's OK to use either categorical codes or one-hot encoded variables; when you are using anything other than a rule/decision-based algorithm, I would suggest one-hot encoding.
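
A sketch of the two encodings being compared (the 'color' column is hypothetical):

import pandas as pd

df = pd.DataFrame({'color': pd.Categorical(['red', 'green', 'blue', 'green'])})

# 1) Integer category codes - fine for tree/rule-based models, which only
#    split on thresholds and don't treat the codes as meaningful magnitudes:
df['color_code'] = df['color'].cat.codes

# 2) One-hot encoding - safer for linear/logistic regression, where an
#    arbitrary ordering like blue=0 < green=1 < red=2 would otherwise act
#    as a fake numeric relationship:
df_onehot = pd.get_dummies(df[['color']], prefix='color')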

Hello,

I finished my first Kaggle competition and got a surprisingly good result for the dataset the first time I ran the model.

However, when I tried to separate the set into a training set and a validation set, I got worse results.

These are the results for the whole set:

[0.06640609395146409, 0.059749794949568814, 0.9724835235033371, 0.9731294015377561]

And these are the scores for the training and validation sets (I used a validation set of 43 rows, roughly 2 percent of the whole set):

[0.06867925757638324, 0.15292615297037612, 0.9705674334979815, 0.8239775636830122]

By separating the set into a validation set and a training set, I fell well behind on the leaderboard (around the 75th percentile), while using the whole set (I hope I didn't make any mistake) got me to first place on the leaderboard.

Why is there so much difference in the predictions? Is it because my training set is too small to divide it up into two sets?

What’s the conclusion? Should you only divide your set when it is large enough?

Hi,

I recently started ML part 1, and for that purpose created a GPU-enabled GCP instance. But I realize that the course (at least lesson 1) only uses the instance's CPU.
What setting should I tweak to get the lesson 1 notebook to execute cells on the GPU?

Also, following Jeremy's request at the end of lesson 1, I went to the first Kaggle competition I could find and tried to prepare the data to run a random forest on it, but I'm facing a big issue:
while the video describes a dataset where each row has its own target value, the dataset I'm playing with has several rows per user, and the target to estimate is the log of the sum of a column over all rows grouped by user.

Now, to continue, I can only think of 2 options:

  • recreate a new dataset with only one row per user, with the data merged/averaged/etc. (it feels like we would lose information this way)
  • create a new column logTotalRevenue on each row, containing the correct target value (I can get this done, although it's an extremely slow function; see the sketch after this list), but it feels like a random forest cannot work this way.
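
For the second option, the per-row target can be built without a slow Python loop by using a grouped transform (a sketch; the 'user_id' and 'revenue' column names are hypothetical, so substitute your competition's columns, and swap np.log1p for plain np.log if your metric doesn't want the +1):

import numpy as np

# transform('sum') computes each group's total once and broadcasts it back
# onto every row of that group in a single vectorized pass:
df['logTotalRevenue'] = np.log1p(
    df.groupby('user_id')['revenue'].transform('sum')
)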

Can someone give me some pointers on the proper way to apply random forests to this competition?

What do you mean by results from the whole set? If you score the whole set you will have only one RMSE and score value.

When you have two values, that means there are two data sets: train and validation.

Without the code it is difficult to know what you did, but if you are running Jeremy's code as-is, then the second result happens after you sample the data; Jeremy had taken out 30k rows for faster processing.
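
For reference, the four numbers in those lists come from the lesson's print_score helper, which looks roughly like this (reproduced from memory, so treat it as approximate): training RMSE, validation RMSE, training R², validation R², plus the OOB score when it is available.

import math

def rmse(x, y): return math.sqrt(((x - y) ** 2).mean())

def print_score(m):
    res = [rmse(m.predict(X_train), y_train),   # training RMSE
           rmse(m.predict(X_valid), y_valid),   # validation RMSE
           m.score(X_train, y_train),           # training R^2
           m.score(X_valid, y_valid)]           # validation R^2
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)  # OOB score when enabled
    print(res)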