I’m working on a project that has a lot of missing data. To impute this data I’m using a k-nearest-neighbors approach using distances from columns that my rows have in common. There are several online resources that suggest that this is a reasonable method for data imputation:
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4959387
- https://stats.stackexchange.com/questions/327074/k-nearest-neighbor-imputation-explanation
- https://towardsdatascience.com/the-use-of-knn-for-missing-values-cf33d935c637
1: First, if anyone would like to suggest an alternative to this approach or expand on an idea not covered in these previous resources, I would appreciate your feedback.
2: Second, I’ve gone ahead and started making estimates for my missing data using k-nearest-neighbors, but I am feeling uncertain about how to go about choosing the best parameters for the nearest neighbors. Notably, what should be the size of the neighborhood, or what should k be?
I’ve taken a trial and error approach and found that the neighborhood size does impact my model’s accuracy when I use the imputed data for training. Therefore, can someone recommend an approach I can use to find the optimal neighborhood size, or k, to impute my missing data? Should I just try some number, like 10, of different neighborhood sizes and pick the one that gives the best predictions?
3: Third, I get the sense I can do some feature engineering with the k-nearest-neighbor approach, too. Even if I don’t need to estimate the values of any missing information, I could calculate neighbor features for every data point from statistics such as mean, max, min, or standard-deviation. My question is, would these features add anything that a neural-network model wouldn’t be able to “figure out” while it is being trained?
Some of my uncertainty comes from the fact that one could go crazy feature engineering with neighborhoods. For 1 column, Pick 3 sizes (k = 2, 5, 10) and calculate the mean and standard deviation. From 1 number I’ve just created 6 additional numbers. Will I suffer from the curse of dimensionality taking this approach?
Thanks!