Should `cluster_columns` use absolute value?

gsganden · August 4, 2021, 12:10am

book_nbs/utils.py at master · fastai/book_nbs · GitHub has corr_condensed = hc.distance.squareform(1-corr). I think corr_condensed = hc.distance.squareform(1-abs(corr)) would be better, but I wanted to get feedback on my thinking.

hc.distance.squareform is not important here – it simply rearranges the non-redundant off-diagonal elements of the correlation matrix corr.

1-corr seems to be intended to derive a measure of distance from corr, so that pairs of variables with higher correlation get higher distance. The problem with this approach, I think, is that if you are trying to identify pairs of variables that are potentially redundant for the purposes of creating a predictive model, negative correlation should be treated the same as positive correlation. Consider the extreme case of a binary variable and its opposite. Those variables are perfectly anti-correlated (correlation value -1) and perfectly redundant as far as a model is concerned. With the code as it stands, they will count as maximally distant (distance = 1 - (-1) = 2), but we should be counting them as maximally close (distance = 1 - (1) = 0). We can solve this problem by replacing 1-corr with 1-abs(corr).