Feature importance, redundant features

Hi all

In the ML course (2017), lesson 3, in the rf_interpretation notebook:

I tried plotting the hierarchical clustering without using feature importance first (i.e. without first finding the most effective independent features). I mean, instead of keeping just the 30 most important columns (first image), I used all columns (second image).

Then I got this error: “Distance matrix ‘X’ must be symmetric”.

Why is that?

I’m not sure, but I think the error was raised because one or several columns have some problem,
or because there are too many columns.
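For reference, this is more or less the code I’m running, a minimal sketch following the lesson notebook (df_trn is the frame produced by proc_df, with all columns kept):

import numpy as np
import scipy.stats
import scipy.cluster.hierarchy as hc
import matplotlib.pyplot as plt

# Spearman rank correlation between every pair of columns in df_trn
corr = np.round(scipy.stats.spearmanr(df_trn).correlation, 4)

# Convert correlations into a condensed distance matrix; this squareform
# call is where "Distance matrix 'X' must be symmetric" gets raised
# (my guess: if corr contains NaNs the symmetry check fails, since NaN != NaN)
corr_condensed = hc.distance.squareform(1 - corr)

z = hc.linkage(corr_condensed, method='average')
plt.figure(figsize=(16, 10))
hc.dendrogram(z, labels=df_trn.columns, orientation='left', leaf_font_size=16)
plt.show()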

@sgugger
@PierreO
@init_27

Could you please answer this question?

Full stack trace:
https://github.com/mahdirezaey/as/blob/master/forums.ipynb

Can you share the entire stacktrace?

Sure!
https://github.com/mahdirezaey/as/blob/master/forums.ipynb

This is my guess at what is happening. I’m not sure, though.

I see a lot of “invalid value encountered” warnings in the code block that raised the error. There is a lot of non-numeric data in your data frame. You could try encoding those columns as categorical data.
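Something along these lines might help narrow it down; it’s just a sketch, with df_trn standing in for whatever frame you pass to spearmanr:

import numpy as np

# Which columns are not numeric yet? These would need encoding before
# spearmanr can produce a clean correlation matrix.
non_numeric = df_trn.select_dtypes(exclude=[np.number]).columns
print(non_numeric.tolist())

# Columns that are all NaN or constant are another likely source of the
# "invalid value encountered" warnings (zero variance in the ranks).
print(df_trn.isnull().sum()[df_trn.isnull().sum() > 0])
print(df_trn.columns[df_trn.nunique() <= 1].tolist())

# One simple way to encode the non-numeric columns as categorical codes:
for col in non_numeric:
    df_trn[col] = df_trn[col].astype('category').cat.codes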

@BBloggsbott, I think proc_df will do that automatically:

turning string columns into categoricals,
treating NaN in categorical columns as 0,
and filling missing numeric values with the column median.
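At least, that’s how I understand the old fastai structured module. Roughly (df_raw and the 'target' column name are placeholders for whatever dataset is being used):

from fastai.structured import train_cats, proc_df

# train_cats: turn the string columns into pandas categoricals (in place)
train_cats(df_raw)

# proc_df: replace each categorical with its numeric code + 1 (so NaN
# becomes 0), fill missing numeric values with the column median, and
# add a *_na indicator column for numeric columns that had NaNs
df_trn, y_trn, nas = proc_df(df_raw, 'target')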

I have updated that (it’s one-hot encoded now).
Could you please check this link again: https://github.com/mahdirezaey/as/blob/master/forums.ipynb

And I still get that error.

corr = np.round(scipy.stats.spearmanr(df_trn).correlation, 4)
Based on the docs, scipy.stats.spearmanr returns a correlation matrix, but you’ve accessed a .correlation attribute here. That might be because you are using a different version.
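For what it’s worth, this is how I understand the return value on the scipy version I have (just a sketch, and the exact result type may differ between versions):

import numpy as np
import scipy.stats

X = np.random.rand(100, 5)   # 100 rows, 5 numeric columns

res = scipy.stats.spearmanr(X)
# With more than two columns, .correlation is the full 5x5
# rank-correlation matrix and .pvalue holds the matching p-values
print(res.correlation.shape)   # (5, 5)

# The matrix should be symmetric as long as it contains no NaNs
print(np.allclose(res.correlation, res.correlation.T))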

P.S. I haven’t done the fastai course or used the package before, so I’m just pointing out possible errors.

Thanks anyway!

Do you see the same error if you set labels=None? I’m not sure what’s going on. In my case I have faced the same error.
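I mean something like this, reusing the linkage z from the snippet above (just a sketch):

import scipy.cluster.hierarchy as hc
import matplotlib.pyplot as plt

# Same dendrogram call as before, but without passing the column labels
plt.figure(figsize=(16, 10))
hc.dendrogram(z, labels=None, orientation='left', leaf_font_size=16)
plt.show()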