Impute missing data using nearest neighbors

I’m working on a project that has a lot of missing data. To impute it, I’m using a k-nearest-neighbors approach based on distances computed from the columns that rows have in common. Several online resources suggest that this is a reasonable method for data imputation.
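
For concreteness, here is roughly what I’m doing, sketched with scikit-learn’s KNNImputer (which measures distance only over the features two rows both have observed). The toy array is just a stand-in for my actual data:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix; NaN marks the missing entries (stand-in for my real data)
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# KNNImputer fills each NaN with the (distance-weighted) average of the
# k nearest rows, using a NaN-aware Euclidean metric over shared columns.
imputer = KNNImputer(n_neighbors=2, weights="distance")
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```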

1: First, if anyone would like to suggest an alternative to this approach or expand on an idea not covered in these previous resources, I would appreciate your feedback.

2: Second, I’ve gone ahead and started estimating my missing data with k-nearest-neighbors, but I’m uncertain about how to choose the best parameters for the nearest neighbors. Namely: how large should the neighborhood be, i.e. what should k be?

I’ve taken a trial-and-error approach and found that the neighborhood size does impact my model’s accuracy when I use the imputed data for training. Can someone recommend an approach for finding the optimal neighborhood size, or k, for imputing my missing data? Should I just try some number of different neighborhood sizes, say 10, and pick the one that gives the best predictions?

3: Third, I get the sense that I can do some feature engineering with the k-nearest-neighbors approach, too. Even if I don’t need to estimate any missing values, I could calculate neighbor features for every data point from statistics such as the mean, max, min, or standard deviation. My question is: would these features add anything that a neural-network model wouldn’t be able to “figure out” on its own during training?

Some of my uncertainty comes from the fact that one could go crazy with neighborhood feature engineering. For one column, pick three neighborhood sizes (k = 2, 5, 10) and calculate the mean and standard deviation for each: from one number I’ve just created six additional numbers. Will I suffer from the curse of dimensionality with this approach?
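
To make that concrete, here is a small sketch of the kind of neighbor features I have in mind, using the column and sizes from the example above (the random data and feature names are just placeholders):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))   # placeholder feature matrix, no missing values
target_col = X[:, 0]            # the one column to summarize over neighborhoods

features = {}
for k in (2, 5, 10):
    # Ask for k + 1 neighbors because each point is its own nearest neighbor,
    # then drop that first column of indices.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    neighbor_vals = target_col[idx[:, 1:]]            # shape (n_samples, k)
    features[f"col0_mean_k{k}"] = neighbor_vals.mean(axis=1)
    features[f"col0_std_k{k}"] = neighbor_vals.std(axis=1)

# 1 column x 3 neighborhood sizes x 2 statistics = the 6 extra numbers per row
print(list(features))
```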

Thanks!


I have an idea. I am not sure how well it will perform. If you try it out, let me know if it worked.

Say you have a feature missing in some of the records in your dataset.

  • Cluster all the records based on all the features except the missing one (the feature you’re trying to impute).
  • Find the cluster to which each record with a missing value belongs.
  • Impute each missing value from the data of the other records in that cluster.
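
A rough sketch of that idea with k-means, assuming a single numeric column is missing in some rows and the remaining columns are fully observed (the function and variable names are just for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_impute(X, missing_col, n_clusters=5, random_state=0):
    """Fill NaNs in one column with that column's mean inside each cluster,
    where the clusters are built from all the *other* columns."""
    X = X.copy()
    missing = np.isnan(X[:, missing_col])
    other_cols = [c for c in range(X.shape[1]) if c != missing_col]

    # Cluster every record on the fully observed columns
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=random_state).fit_predict(X[:, other_cols])

    # Impute each missing entry from the observed values in its own cluster
    for cluster in range(n_clusters):
        in_cluster = labels == cluster
        observed = in_cluster & ~missing
        if observed.any():
            X[in_cluster & missing, missing_col] = X[observed, missing_col].mean()
    return X
```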

Interesting post, @bitfrosting!

Question #2: It would be good to proceed systematically:
(A) Use a grid over the number of neighbors, say [5, 10, 15, 20, 25], and see whether one value optimizes your validation accuracy.
(B) Use a grid over Euclidean distances instead: take the mean (or median) of all neighbors within a radius of [d1, d2, d3, d4, d5]. Again, see whether you can find a distance that optimizes validation accuracy.

Then: which gave the best result, (A) or (B)?
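
For (A), one way to keep the comparison honest is to put the imputer inside a cross-validated pipeline, so each candidate k is scored by downstream validation accuracy on folds it wasn’t fitted on. A minimal sketch with scikit-learn’s KNNImputer; the toy data and LogisticRegression are placeholders for your real matrix and model, and for (B) you would swap in a radius-based aggregation instead of a fixed k:

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Placeholder data: X has NaNs for the missing entries, y is the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan    # knock out ~10% of the entries

for k in [5, 10, 15, 20, 25]:
    # Keeping the imputer inside the pipeline means it is refit on the
    # training folds only, so the score reflects honest validation accuracy.
    pipe = make_pipeline(KNNImputer(n_neighbors=k),
                         LogisticRegression(max_iter=1000))
    score = cross_val_score(pipe, X, y, cv=5).mean()
    print(f"k={k:>2}  mean CV accuracy: {score:.3f}")
```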

#3 This seems like a great idea! I would keep it simple: try adding only the standard deviation. I think the standard deviation carries a lot of information.

I’d love to hear what you find out!

Is your data purely static? I’ve asked a similar question regarding time-series imputation.