The K-Nearest Neighbour (KNN) classifier makes predictions by measuring the proximity of new observations to existing data points whose labels are known. It rests on the assumption that points which lie close together are likely to belong to the same class. Although KNN works on distance and proximity, it can be applied successfully to non-geometric features (such as the amino acid measurements in an E. coli protein dataset). Nonetheless, the selected features should support meaningful distance measurement for the classification task at hand. For example, classifying books by the alphabetical distance between authors' names would be a very good predictor of where a book sits on a bookshop shelf, but a poor predictor of anything else.
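The proximity-based prediction described above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the toy points, labels, and the `knn_predict` name are invented for the example, and it uses plain Euclidean distance with a simple majority vote.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every labelled training point
    dists = sorted(
        (math.dist(p, query), label)
        for p, label in zip(train_points, train_labels)
    )
    # Majority vote among the k closest neighbours
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy two-dimensional dataset with two classes
points = [(1.0, 1.0), (1.2, 0.8), (4.0, 4.0), (4.2, 3.9)]
labels = ["a", "a", "b", "b"]
print(knn_predict(points, labels, (1.1, 1.0)))  # prints "a"
```

With k=3, two of the query's three nearest neighbours belong to class "a", so the vote resolves to "a".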
The KNN approach is an implementation of instance-based learning: the algorithm does not build a generalised model from the training data, but instead stores the training instances themselves for reference. A KNN classifier's behaviour is governed by how distance is computed, by how many neighbours are considered when making a prediction (the parameter k, which gives the model its name), and by whether all neighbours are weighted equally or closer neighbours count for more. Furthermore, the prediction will be skewed if some features have much larger numerical ranges than others, so the data should be normalised and scaled before distances are computed (for example, standardised so that every feature has zero mean and unit variance).