Another treat! Early access to Intro To Machine Learning videos

I don’t think this is the same issue I had. However, on a different dataset, when I ran proc_df on my training and validation sets, I ended up with a validation set of 92 columns versus a training set of 80 columns.

After a LOT of head scratching, I discovered that a number of columns in my validation set had missing values, while the same columns in my training data had none. This resulted in extra columns in my validation set with a “_na” extension on the name (e.g. “column_na”).
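To make the hunt concrete, here is a minimal sketch of how the mismatch can arise and how to spot it. It uses plain pandas, and add_na_flags is a hypothetical stand-in I wrote for illustration, not the library's own code; the tiny frames and column names are made up as well.

```python
import numpy as np
import pandas as pd

def add_na_flags(df):
    """Hypothetical stand-in for the missing-value step: for every numeric
    column containing NaN, add a boolean '<col>_na' flag and fill with the median."""
    out = df.copy()
    for name in df.columns:
        col = df[name]
        if pd.api.types.is_numeric_dtype(col) and col.isnull().any():
            out[name + "_na"] = col.isnull()
            out[name] = col.fillna(col.median())
    return out

# Made-up data: 'b' has a gap only in training, 'a' has a gap only in validation.
train = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, np.nan, 6.0]})
valid = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

train_p = add_na_flags(train)   # columns: a, b, b_na
valid_p = add_na_flags(valid)   # columns: a, b, a_na

# Columns present on only one side are the ones to hunt for.
extra_in_valid = sorted(set(valid_p.columns) - set(train_p.columns))
missing_in_valid = sorted(set(train_p.columns) - set(valid_p.columns))

# One possible fix: force the validation frame onto the training columns,
# dropping extras and filling absent flag columns with False.
valid_aligned = valid_p.reindex(columns=train_p.columns, fill_value=False)
```

Comparing the two column sets is usually the quickest way to see which “_na” columns differ. Whether dropping the extra flags is acceptable depends on your data, so treat the reindex step as one option rather than the fix.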

As mentioned, not the same issue, but might give you some ideas on where / how to hunt.

Todd

Personal opinion, but if your dataset is small enough to run very quickly when developing your model, then I would not use set_rf_samples. Looked at from the opposite side, if your dataset is so large that every time you run it you have to stare at your screen waiting for it to process (i.e. you cannot easily interact with it), then creating your initial model with a sample is really useful.

How big is your dataset?

Hi Everyone,

In Lesson 3, @jeremy discusses the concept of feature importance. Around 1:16:00, he shows us two plots. The first plot shows the feature importance for all the variables, and the second shows it for only the more important variables. I don’t understand why the feature importance value of the variable Coupler System is lower in the second plot than in the first.

Regarding random forests, why is it that uncorrelated errors, when averaged out, lead to a low overall error? Why can’t it be that averaging out uncorrelated errors would lead to a high error? Can someone please explain?
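A small simulation might help with the intuition here. This is only a sketch under simplifying assumptions: each "tree" is modelled as the true value plus zero-mean Gaussian noise, which is not how real trees err, and all the names and numbers are made up for illustration. When errors are independent, some are positive and some negative, so they tend to cancel in the average; when every tree shares the same error, there is nothing to cancel.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trees, n_points, sigma = 100, 10_000, 1.0

truth = np.zeros(n_points)

# Uncorrelated case: each tree makes its own independent error.
independent = truth + rng.normal(0.0, sigma, size=(n_trees, n_points))

# Correlated case: every tree shares the same error, plus a little of its own.
shared = rng.normal(0.0, sigma, size=(1, n_points))
correlated = truth + shared + rng.normal(0.0, 0.1 * sigma, size=(n_trees, n_points))

def rmse(pred):
    """Root mean squared error against the known truth."""
    return float(np.sqrt(np.mean((pred - truth) ** 2)))

single_tree_rmse = rmse(independent[0])                 # about sigma
avg_independent_rmse = rmse(independent.mean(axis=0))   # about sigma / sqrt(n_trees)
avg_correlated_rmse = rmse(correlated.mean(axis=0))     # stays close to sigma
```

Under these assumptions the variance of the mean of N independent errors is sigma^2 / N, so the averaged error shrinks as trees are added, while the shared part of a correlated error survives averaging untouched.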

For anyone who plays with Cython from lecture 7, here are a couple of tricks/tips which I learnt the slow way:

  1. You cannot have a comment preceding the %%cython declaration (note it can come AFTER the %%cython declaration)

  2. You cannot run %timeit in the same cell as the %%cython code, as it will produce an error. It needs to go in a separate cell.

There are no doubt some other ‘quirks’ with %%cython, but these were the ones which tripped me up initially.

Todd

P.S. anyone wondering why I have n = 2000**2000, I was just playing with larger numbers to see the impact.


Hoping to get advice/guidance on how to handle large files so that I can run a random forest.

The data is 7GB and it’s from a Kaggle competition called TalkingData AdTracking Fraud Detection Challenge. I was able to load the data by specifying the data types in a dictionary and passing that to read_csv(), but as soon as I started trying to process the data, I started hitting memory errors. Specifically, I tried running add_datepart() and to_feather(). For additional context, I am using Gradient on Paperspace with a GPU machine which has 30GB RAM and 8 cores. Given this, I was wondering what the best way is to process large files and run Random Forests.

From what I found in other forum threads, people seem to be splitting the files, but I was hoping someone had a specific example they could share here. Thank you!

Update!! - Found the following post which gave me the answers. Not sure why I didn’t find it earlier: Most effective ways to merge “big data” on a single machine
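For anyone landing here with the same problem, below is a rough sketch of the dtype-dictionary idea combined with reading in chunks, so the date processing happens on small pieces rather than the whole frame. The column names and dtypes are my assumptions about the competition file, the tiny in-memory CSV stands in for the real 7GB file, and click_hour is just an illustrative feature, not what add_datepart produces.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the real file; column names are assumed.
csv_text = """ip,app,device,os,channel,click_time,is_attributed
83230,3,1,13,379,2017-11-06 14:32:21,0
17357,3,1,19,379,2017-11-06 14:33:34,0
35810,3,1,13,379,2017-11-06 14:34:12,1
45745,14,1,13,478,2017-11-06 14:34:52,0
"""

# Small unsigned integer types instead of the default int64.
dtypes = {"ip": "uint32", "app": "uint16", "device": "uint16",
          "os": "uint16", "channel": "uint16", "is_attributed": "uint8"}

pieces = []
# chunksize makes read_csv yield DataFrames of at most that many rows.
for chunk in pd.read_csv(io.StringIO(csv_text), dtype=dtypes,
                         parse_dates=["click_time"], chunksize=2):
    # Do the date work on the small chunk, then drop the bulky datetime column.
    chunk["click_hour"] = chunk["click_time"].dt.hour.astype("uint8")
    pieces.append(chunk.drop(columns=["click_time"]))

df = pd.concat(pieces, ignore_index=True)
```

With the real file you would pass the path instead of the StringIO object and use a much larger chunksize; the right value depends on your RAM, so it is worth experimenting.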

ML Lesson 1: I perform the same steps on the provided test data, including train_cats, but while predicting, the model still finds some string data in the test set. How do I get over that?

How do I change the actual test set into categorical variables? I apply train_cats() on the test set, but when I run m.predict(test), it shows that the strings are unchanged.

I think he says somewhere that when you look at the dataframe it will still show the values in string format, but when you call m.predict it will actually use the numbers. I remember it being somewhere in the lectures; I will look tonight.

:slight_smile:

Thank you, I learned a lot here. Thanks very much, I will put this to use later.

This is after applying train_cats, and setting the UsageBand to codes

If I’m not mistaken you should use apply_cats() on the test set.

It still doesn’t work. proc_df changes the categorical values into numbers. But since we don’t have a y variable in the test set, that won’t work. So how do I change the categorical data into numeric data?
I would appreciate it if you could share a link to one of your kernels showing this.

I did a quick running example using the bulldozers dataset and RFs. I tried to show the before/after of each step.

I apply apply_cats to the test set. Then how do I change those categorical values to numbers, as proc_df is only for the train set with a target variable? Also, could you please share one of your kernels with this application?

Via .cat.codes, and assigning them back to the data frame…
It’s in the notebook.
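In case the notebook is not to hand, here is a minimal pandas-only sketch of the idea. It imitates what I understand train_cats and apply_cats to do rather than calling them, the data is made up, and the +1 shift is my recollection of how the library numbers categories, so treat those details as assumptions.

```python
import pandas as pd

# Made-up frames; "Unseen" is a value that never appears in training.
train = pd.DataFrame({"UsageBand": ["Low", "High", "Medium", "Low"]})
test = pd.DataFrame({"UsageBand": ["Medium", "Low", "Unseen"]})

# Roughly what train_cats does: turn string columns into ordered categoricals.
train["UsageBand"] = train["UsageBand"].astype("category").cat.as_ordered()

# Roughly what apply_cats does: reuse the TRAINING categories on the test set,
# so the same string maps to the same integer code in both frames.
test["UsageBand"] = pd.Categorical(
    test["UsageBand"],
    categories=train["UsageBand"].cat.categories,
    ordered=True,
)

# The frame still displays strings; the numbers live in .cat.codes.
# Adding 1 shifts the codes so that unknown/missing (-1) becomes 0.
train_codes = train["UsageBand"].cat.codes + 1
test["UsageBand"] = test["UsageBand"].cat.codes + 1
```

The important part is that the test set borrows its categories from the training set. Calling train_cats separately on the test set would build a fresh mapping, and the same string could then end up with a different number.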

I am getting the same issue on Paperspace today. Strangely, it ran fine yesterday. What is the issue, and what could be the fix?

my setup:
MACHINE TYPE: P4000 HOURLY
REGION: CA1
RAM: 30 GB
CPUS: 8
HD: 34.7 GB / 250 GB
GPU: 8 GB

Managed to get it to work; I had to comment out the save to feather for some reason…

I am getting a memory error when executing proc_df on the New York City Taxi Fare Prediction training data, which is about 55M rows and 7 columns. Is there any way to make it work on the machine I am using at Paperspace (spec below)? Or do I need to upgrade to a machine with more memory?

Paperspace machine setup:
MACHINE TYPE: P4000 HOURLY
REGION: CA1
RAM: 30 GB
CPUS: 8
HD: 34.7 GB / 250 GB
GPU: 8 GB

code:

df, y, nas = proc_df(df_raw, 'fare_amount')

error details:

MemoryError                               Traceback (most recent call last)
~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/indexing.py in _getbool_axis(self, key, axis)
   1495         try:
-> 1496             return self.obj._take(inds, axis=axis)
   1497         except Exception as detail:

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/generic.py in _take(self, indices, axis, is_copy)
   2784     def _take(self, indices, axis=0, is_copy=True):
-> 2785         self._consolidate_inplace()
   2786 

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/generic.py in _consolidate_inplace(self)
   4438 
-> 4439         self._protect_consolidate(f)
   4440 

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/generic.py in _protect_consolidate(self, f)
   4427         blocks_before = len(self._data.blocks)
-> 4428         result = f()
   4429         if len(self._data.blocks) != blocks_before:

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/generic.py in f()
   4436         def f():
-> 4437             self._data = self._data.consolidate()
   4438 

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/internals.py in consolidate(self)
   4097         bm._is_consolidated = False
-> 4098         bm._consolidate_inplace()
   4099         return bm

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/internals.py in _consolidate_inplace(self)
   4102         if not self.is_consolidated():
-> 4103             self.blocks = tuple(_consolidate(self.blocks))
   4104             self._is_consolidated = True

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/internals.py in _consolidate(blocks)
   5068         merged_blocks = _merge_blocks(list(group_blocks), dtype=dtype,
-> 5069                                       _can_consolidate=_can_consolidate)
   5070         new_blocks = _extend_blocks(merged_blocks, new_blocks)

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/internals.py in _merge_blocks(blocks, dtype, _can_consolidate)
   5091         argsort = np.argsort(new_mgr_locs)
-> 5092         new_values = new_values[argsort]
   5093         new_mgr_locs = new_mgr_locs[argsort]

MemoryError: 

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-23-35a73d8cb330> in <module>()
----> 1 df, y, nas = proc_df(df_raw, 'fare_amount')

~/fastai/courses/ml1/fastai/structured.py in proc_df(df, y_fld, skip_flds, ignore_flds, do_scale, na_dict, preproc_fn, max_n_cat, subset, mapper)
    445     if do_scale: mapper = scale_vars(df, mapper)
    446     for n,c in df.items(): numericalize(df, c, n, max_n_cat)
--> 447     df = pd.get_dummies(df, dummy_na=True)
    448     df = pd.concat([ignored_flds, df], axis=1)
    449     res = [df, y, na_dict]

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/reshape/reshape.py in get_dummies(data, prefix, prefix_sep, dummy_na, columns, sparse, drop_first, dtype)
    840         if columns is None:
    841             data_to_encode = data.select_dtypes(
--> 842                 include=dtypes_to_encode)
    843         else:
    844             data_to_encode = data[columns]

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/frame.py in select_dtypes(self, include, exclude)
   3089 
   3090         dtype_indexer = include_these & exclude_these
-> 3091         return self.loc[com._get_info_slice(self, dtype_indexer)]
   3092 
   3093     def _box_item_values(self, key, values):

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/indexing.py in __getitem__(self, key)
   1470             except (KeyError, IndexError):
   1471                 pass
-> 1472             return self._getitem_tuple(key)
   1473         else:
   1474             # we by definition only have the 0th axis

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
    888                 continue
    889 
--> 890             retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
    891 
    892         return retval

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   1866             return self._get_slice_axis(key, axis=axis)
   1867         elif com.is_bool_indexer(key):
-> 1868             return self._getbool_axis(key, axis=axis)
   1869         elif is_list_like_indexer(key):
   1870 

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/indexing.py in _getbool_axis(self, key, axis)
   1496             return self.obj._take(inds, axis=axis)
   1497         except Exception as detail:
-> 1498             raise self._exception(detail)
   1499 
   1500     def _get_slice_axis(self, slice_obj, axis=None):

KeyError: MemoryError()
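The traceback suggests the memory runs out while pandas is copying the frame inside get_dummies, so one thing that might be worth trying before proc_df is shrinking the numeric columns. Below is a sketch using plain pandas; shrink_numeric is a hypothetical helper, and the small frame and its column names are stand-ins I made up for the taxi data.

```python
import numpy as np
import pandas as pd

def shrink_numeric(df):
    """Return a copy with numeric columns downcast to the smallest dtype
    that pandas considers safe (float64 -> float32, int64 -> int8/16/32)."""
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if pd.api.types.is_float_dtype(col):
            out[name] = pd.to_numeric(col, downcast="float")
        elif pd.api.types.is_integer_dtype(col):
            out[name] = pd.to_numeric(col, downcast="integer")
    return out

# Small stand-in for the taxi data; column names are assumed.
df_raw = pd.DataFrame({
    "fare_amount": np.array([4.5, 16.9, 5.7, 7.7]),
    "pickup_longitude": np.array([-73.844311, -74.016048, -73.982738, -73.98713]),
    "passenger_count": np.array([1, 1, 2, 1], dtype="int64"),
})

small = shrink_numeric(df_raw)
before = df_raw.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
```

Note that float32 keeps only about seven significant digits, which costs some precision on coordinates, so check that this is acceptable for your features. The proc_df signature in the traceback also lists a subset argument, so working on a sample of the rows first may be another route.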

Hi all,

Could someone, if they have the time, have a look at this Gist? I am using the Random Forest method and applying it to the ‘Kaggle - House Prices: Advanced Regression Techniques’ competition.

All the techniques work really well; I am just coming unstuck when I try to do an m.predict(test_df) at the very end.

I have done proc_df and apply_cats, however it doesn’t seem that I am doing this correctly as it blows up when I try to make predictions.

Any help/suggestions most welcome.

Kind regards,

Luke