Homework - RF Interpretation

Hi,

I’ve started the homework and need help optimizing the PDP plot; it’s very slow.

Thanks

Here is the nb link:
https://github.com/KeremTurgutlu/machinelearning/blob/master/INTERPRET%20RANDOM%20FOREST%20-%20SIMPLE%20%26%20EFFECTIVE.ipynb

Update: All done. Still waiting for any feedback on improvements. Thanks!

Nice! Suggestions for PDP speed:

(The last suggestion may not help, if there’s too much process communication overhead.)

I feel like I need to use np.linspace in order to draw this contour plot, but since the features are discrete in this example it kind of doesn’t make sense. Is it for visualization purposes, or am I missing something?

Thanks

Sorry I don’t follow what you mean by “features are discrete” or your question in general… Can you explain?

PS: np.meshgrid along with broadcasting is handy for this kind of thing.
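
As a small illustration of the np.meshgrid and broadcasting tip, here is a minimal sketch. The grid values and the placeholder function (a plain sum standing in for model predictions) are assumptions made up for the example, not anything from the notebook.

```python
import numpy as np

# Hypothetical grid values for two features (illustrative only).
year_vals = np.array([1990, 2000, 2010])
hours_vals = np.array([0.0, 500.0])

# meshgrid returns two 2-D arrays that together hold every (x, y) combination.
# With the default indexing, both have shape (len(hours_vals), len(year_vals)).
xx, yy = np.meshgrid(year_vals, hours_vals)

# Placeholder for "the value at each grid point": here simply x + y.
z_meshgrid = xx + yy

# Broadcasting gives the same pairing without building the full grid:
# a (2, 1) column combined with a (3,) row yields a (2, 3) result.
z_broadcast = hours_vals[:, None] + year_vals

print(np.array_equal(z_meshgrid, z_broadcast))
```

The resulting (rows = y, columns = x) layout is the one plt.contourf(x, y, z) expects for z.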

1. Yep, I will use np.meshgrid and feed it to plt.contourf(x, y, z). But to get x and y we need to use np.linspace(…).

For example, let’s say x = YearMade, which is an ordinal variable with levels. Then my intention is to do this:

x = np.linspace(YearMade.min(), YearMade.max(), 1000)  # roughly 1000 points
y = …  # same as above, for a different feature

Then z would be calculated by putting every combination of (x, y) pairs into the dataframe and, for each pair, calculating the mean of the predictions, since that’s how expectation is defined with integrals. But this definition of expectation is valid for continuous variables. For discrete variables, which have levels, the definition becomes a probability-weighted sum over the possible values. For this reason I thought applying np.linspace might not be a very intuitive option, but maybe a good one for visualizing decision boundaries and the different heights of z.

2. For the multiprocessing part I did something like this. It might be helpful for others as well, and I am sure there are many parts that can be optimized with numba/cython or multiprocessing too:

Edit: This solution is problematic since it’s not allowed to pickle class methods or instances…
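
One possible way around the pickling problem, sketched here under assumptions, is to swap processes for threads: a thread pool shares memory, so nothing has to be pickled. This is not the approach from the notebook. The function names and the toy predict function are hypothetical, and threads only give a real speedup if the model's predict call releases the GIL, which is an assumption to check for your model.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def mean_prediction_at(predict, X, col, value):
    """Set one column to a fixed value and average the predictions."""
    X_mod = X.copy()
    X_mod[:, col] = value
    return float(predict(X_mod).mean())


def pdp_threaded(predict, X, col, grid, max_workers=4):
    """Compute one PDP point per grid value, each in its own thread task."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(mean_prediction_at, predict, X, col, v)
                   for v in grid]
        # Collect results in grid order.
        return np.array([f.result() for f in futures])


# Toy stand-ins for the data and the fitted model (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))


def toy_predict(X):
    return 3.0 * X[:, 0] + X[:, 1] ** 2


grid = np.linspace(-1.0, 1.0, 5)
pdp_parallel = pdp_threaded(toy_predict, X, col=0, grid=grid)
pdp_sequential = np.array([mean_prediction_at(toy_predict, X, 0, v)
                           for v in grid])
print(np.allclose(pdp_parallel, pdp_sequential))
```

If processes are needed after all, the usual pattern is to keep the worker a plain module-level function like mean_prediction_at and pass it only simple arguments.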

YearMade is not ordinal, but continuous. The difference between 1991 and 1992 (for example) has some meaning.

Also, I’d suggest using a few percentiles of the variable, rather than linspace. That way you’ll definitely be looking at values that are actually in the dataset, and will focus on the areas that have plenty of data.

You shouldn’t look at ~1000 levels, but more like ~10. A chart with 1000 levels isn’t going to look any different, but will take 100x longer to create.

I don’t understand your concern still about discrete variables. The expectation you defined sounds fine to me for discrete variables. I don’t see why you’d need any weighting - what kind of weights were you thinking you might need, and why?
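
To make the percentile suggestion concrete, here is a minimal sketch of a two-feature PDP grid built from about 10 observed values per feature. The helper names, the toy data, and the toy predict function are all assumptions for illustration; a fitted random forest's predict method would take the place of toy_predict.

```python
import numpy as np


def quantile_grid(values, n_points=10):
    """Return up to n_points distinct observed values at evenly spaced ranks."""
    ordered = np.sort(np.asarray(values))
    idx = np.round(np.linspace(0, len(ordered) - 1, n_points)).astype(int)
    return np.unique(ordered[idx])


def partial_dependence_2d(predict, X, col_a, col_b, grid_a, grid_b):
    """Mean prediction for every (a, b) pair, averaged over all rows of X."""
    z = np.empty((len(grid_b), len(grid_a)))
    for j, a in enumerate(grid_a):
        for i, b in enumerate(grid_b):
            X_mod = X.copy()
            X_mod[:, col_a] = a   # overwrite feature a for every row
            X_mod[:, col_b] = b   # overwrite feature b for every row
            z[i, j] = predict(X_mod).mean()
    return z


# Toy data (hypothetical): columns are [YearMade, hours, other].
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(1960, 2011, size=500),
    rng.uniform(0, 5000, size=500),
    rng.normal(size=500),
]).astype(float)


def toy_predict(X):
    return 2.0 * X[:, 0] + 0.01 * X[:, 1] + X[:, 2]


grid_year = quantile_grid(X[:, 0], n_points=10)
grid_hours = quantile_grid(X[:, 1], n_points=10)
z = partial_dependence_2d(toy_predict, X, 0, 1, grid_year, grid_hours)
# z has shape (len(grid_hours), len(grid_year)), the layout that
# plt.contourf(grid_year, grid_hours, z) expects.
print(z.shape)
```

The plain mean over rows already weights the remaining features by how often each value occurs in the data, which is why no extra weighting is needed for discrete variables.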

You are right about YearMade. My only concern was that plt.contourf wouldn’t give a good visualization without a decent number of coordinates to fill. I think it shouldn’t be a problem, and I will try with a few percentile points.

I’m confused about implementing the recursion for the tree interpreter part.

Here is a snapshot of the code I tried. The idea was to modify the predict_row part of the code and extract the predictions and split names until is_leaf becomes true.

But I am not sure why it keeps appending values to contr every time I call the function, even though I tried to re-initialize it with an empty list.

list.append modifies the list in place. Also, you should never use [] as a default parameter, since it uses the same list object each time. (Really horrible Python design issue IMO)
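
A minimal sketch of the pitfall, using a made-up function name (collect) rather than the code from the snapshot: the default list is created once, when the function is defined, so every call that relies on the default appends to the same object.

```python
def collect(value, acc=[]):  # the default list is built once, at definition time
    acc.append(value)
    return acc


first = collect("a")
second = collect("b")

print(second)           # ['a', 'b']
print(first is second)  # True: both names point at the shared default list
```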

Actually, I wanted to append to the list in place so that it appends the predictions after each node and spits out the final list with all predictions + split names once it finds a leaf node. But I was expecting it to restart with an empty list each time I call the function.

Re “Also, you should never use [] as a default parameter, since it uses the same list object each time”: I used this because I was confused about how to append stuff recursively, starting from an empty list.

I think it’s better to use a while loop to do this rather than recursion.

You should instead use None as the default parameter, and then use an if statement in the body of the function to replace it with an empty list.

It’ll work fine with recursion once you fix this minor issue.
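
Here is a minimal sketch of the None-default fix combined with recursion. The dict-based tree, its keys, and the function name trace_path are hypothetical stand-ins for the class in the snapshot; only the idea of collecting split names and predictions until is_leaf is true comes from the thread.

```python
def trace_path(node, row, contribs=None):
    """Collect (split name, node prediction) pairs from the root down to a leaf."""
    if contribs is None:       # a fresh list for each top-level call
        contribs = []
    contribs.append((node["split"], node["value"]))
    if node["is_leaf"]:
        return contribs
    go_left = row[node["split"]] <= node["threshold"]
    child = node["left"] if go_left else node["right"]
    # Pass the same list down so every level appends to it.
    return trace_path(child, row, contribs)


# Tiny hand-built tree (illustrative values only).
tree = {
    "split": "YearMade", "threshold": 2000, "value": 10.0, "is_leaf": False,
    "left": {"split": None, "value": 9.5, "is_leaf": True},
    "right": {
        "split": "ProductSize", "threshold": 2, "value": 10.4, "is_leaf": False,
        "left": {"split": None, "value": 10.1, "is_leaf": True},
        "right": {"split": None, "value": 10.8, "is_leaf": True},
    },
}

row = {"YearMade": 2005, "ProductSize": 1}
path_1 = trace_path(tree, row)
path_2 = trace_path(tree, row)

print(path_1)  # [('YearMade', 10.0), ('ProductSize', 10.4), (None, 10.1)]
print(path_1 == path_2, path_1 is path_2)  # True False
```

The second call returns an equal but separate list, so results no longer accumulate across calls.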

It worked. Thanks!
