Fit() for Random Forest keeps crashing on lesson 1


(Jonathan bechtel) #1

Hello everyone,

I just started the Intro to ML course, and am having problems with the bulldozers dataset used for the first ML problem.

Every time I call the fit() method on the Random Forest, everything freezes.

Here’s what I’ve done so far:

  • I had some problems getting fast.ai installed on my computer, so I just used straight pandas instead
  • I don’t think the above should matter, since from what I can tell fast.ai mostly repurposes functions from sklearn and pandas.
  • I saved the csv file to feather, and reloaded it.
  • I switched all columns w/ dtype ‘object’ to dtype ‘category’
  • I manually extracted all the time parts from the ‘saledate’ column, similar to add_datepart
  • I filled all empty values for numeric columns with their median value
  • I used pd.get_dummies(df, sparse=True) to generate dummy variables. The resulting dataframe has about 7700 columns and a memory usage of about 50 MB.
  • I used train_test_split from sklearn to generate a test set with 12000 rows.

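For reference, the steps above can be sketched in plain pandas as a single function. This is an illustrative sketch, not my exact notebook: the column names ('saledate', 'Enclosure', etc.) follow the bulldozers dataset, and only a few date parts are extracted here.

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Rough plain-pandas equivalent of the preprocessing steps above."""
    df = df.copy()

    # Extract date parts from 'saledate' (similar to fastai's add_datepart)
    df["saledate"] = pd.to_datetime(df["saledate"])
    df["saleYear"] = df["saledate"].dt.year
    df["saleMonth"] = df["saledate"].dt.month
    df["saleDay"] = df["saledate"].dt.day
    df = df.drop(columns=["saledate"])

    # Switch all 'object' columns to 'category'
    for col in df.select_dtypes("object").columns:
        df[col] = df[col].astype("category")

    # Fill empty values in numeric columns with their median
    for col in df.select_dtypes("number").columns:
        df[col] = df[col].fillna(df[col].median())

    return df
```

After this, pd.get_dummies() and train_test_split() are applied to the result as described above.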
I first tried this on my laptop, which has 12 GB of RAM, with about 9 GB free when I called fit() on the Random Forest. I’ve tried it both with n_jobs=-1 and without setting it at all.

I also then tried uploading everything to Crestle, figuring that I was using up my available RAM, but this didn’t work either.

I created a new folder for this class in the home directory, uploaded the dataset, and the notebook terminated itself shortly after calling fit() again.

I’m not sure whether I did something incorrect in my preprocessing, although as far as I can tell I performed the same operations on my dataframe as the original Jupyter notebook, even though I didn’t use the fastai library for it.

Does anyone know what I might be doing wrong?

Thank you.


(Jonathan bechtel) #2

Okay, so I figured out what was causing the problem.

I was creating dummy variables using pd.get_dummies(), which was creating a 7000+ column dataframe.

proc_df just uses .cat.codes on each categorical column, so the total number of columns stays the same, which had a big impact on performance.
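The integer-coding approach looks roughly like this (a sketch with made-up example columns, keeping one column per categorical rather than one per level):

```python
import pandas as pd

df = pd.DataFrame({
    "Enclosure": ["OROPS", "EROPS", "OROPS"],
    "ProductSize": ["Small", None, "Large"],
})

# Replace each categorical column with its integer codes.
# Adding 1 shifts NaN (code -1) to 0, as proc_df does.
for col in df.select_dtypes("object").columns:
    df[col] = df[col].astype("category").cat.codes + 1

print(df.shape)  # still (3, 2): one column per categorical, not per level
```

Compare this with pd.get_dummies(), which would expand these two columns into one column per distinct level.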

Which leads to another question…why is there such a big performance difference between the two?

The amount of memory the two dataframes took up was not that different, so what is it about having more columns that makes the fit so much more difficult to compute?
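To make the comparison concrete, here is a small synthetic benchmark (not the bulldozers data) that times the same Random Forest on the two encodings of a single high-cardinality categorical:

```python
import time

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 2000
cat = pd.Series(rng.integers(0, 500, n).astype(str), name="cat")
y = rng.normal(size=n)

# Same information, two shapes:
X_wide = pd.get_dummies(cat)                            # ~500 dummy columns
X_narrow = cat.astype("category").cat.codes.to_frame()  # 1 integer column

for name, X in [("one-hot", X_wide), ("codes", X_narrow)]:
    t0 = time.perf_counter()
    RandomForestRegressor(n_estimators=10, n_jobs=-1).fit(X, y)
    print(name, X.shape, round(time.perf_counter() - t0, 2), "s")
```

The tree builder considers candidate splits per column at every node, so the cost of fitting scales with the number of columns even when the total memory footprint is similar.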