book_nbs/utils.py at master · fastai/book_nbs · GitHub has
corr_condensed = hc.distance.squareform(1-corr). I think
corr_condensed = hc.distance.squareform(1-abs(corr)) would be better, but I wanted to get feedback on my thinking.
hc.distance.squareform is not important here – it simply rearranges the non-redundant off-diagonal elements of the correlation matrix
1-corr seems to be intended to derive a measure of distance from
corr, so that pairs of variables with higher correlation get higher distance. The problem with this approach, I think, is that if you are trying to identify pairs of variables that are potentially redundant for the purposes of creating a predictive model, negative correlation should be treated the same as positive correlation. Consider the extreme case of a binary variable and its opposite. Those variables are perfectly anti-correlated (correlation value -1) and perfectly redundant as far as a model is concerned. With the code as it stands, they will count as maximally distant (distance = 1 - (-1) = 2), but we should be counting them as maximally close (distance = 1 - (1) = 0). We can solve this problem by replacing