Sam Stites

K-nearest neighbors

Just a little refresher on KNNs as I swing back into thinking about statistics and data science.


KNNs are fairly compute-intensive since they do a full search through your entire training set to find a single datapoint. Usually the mean for regression and the mode for classification. Usually one would use euclidean distance in your calculations, however you must always be aware of whether or not your attributes are all normalized and of the same scale.

In high dimensions, you might witness the curse dimensionality and “closeness” might start to break down - so be wary of this and, in such a case, extract only the most important variables for your inputs.