book_nbs/utils.py at master · fastai/book_nbs · GitHub has `corr_condensed = hc.distance.squareform(1-corr)`

. I think `corr_condensed = hc.distance.squareform(1-abs(corr))`

would be better, but I wanted to get feedback on my thinking.

`hc.distance.squareform`

is not important here – it simply rearranges the non-redundant off-diagonal elements of the correlation matrix `corr`

.

`1-corr`

seems to be intended to derive a measure of distance from `corr`

, so that pairs of variables with higher correlation get higher distance. The problem with this approach, I think, is that if you are trying to identify pairs of variables that are potentially redundant for the purposes of creating a predictive model, *negative correlation should be treated the same as positive correlation*. Consider the extreme case of a binary variable and its opposite. Those variables are perfectly anti-correlated (correlation value -1) and perfectly redundant as far as a model is concerned. With the code as it stands, they will count as maximally distant (distance = 1 - (-1) = 2), but we should be counting them as maximally close (distance = 1 - (1) = 0). We can solve this problem by replacing `1-corr`

with `1-abs(corr)`

.