Nearest Neighbour

Algorithm

The nearest neighbour algorithm classifies items based on the classifications of their closest neighbours in a multidimensional space.

The algorithm is used when new items must be classified on the basis of existing, already-classified items in the same space. It works by examining the k nearest neighbours of the new item and assigning it the classification that occurs most frequently among them.
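
As a minimal sketch of this idea, assuming numeric feature vectors and Euclidean distance (the function name knn_classify and the data layout are illustrative, not taken from any particular library):

    from collections import Counter
    import math

    def knn_classify(new_item, training_data, k=3):
        """Classify new_item by majority vote among its k nearest neighbours.

        training_data is a list of (feature_vector, label) pairs.
        """
        # Distance from the new item to every existing item (Euclidean here).
        distances = [(math.dist(new_item, features), label)
                     for features, label in training_data]
        # Keep the k closest neighbours.
        nearest = sorted(distances, key=lambda pair: pair[0])[:k]
        # Return the most frequent label among them.
        labels = [label for _, label in nearest]
        return Counter(labels).most_common(1)[0][0]

Sorting every distance is the simplest possible approach; practical implementations usually rely on spatial index structures such as k-d trees or ball trees so that the neighbours can be found without scanning the whole training set.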

There are several hyperparameters that the data scientist can vary:

  • The definition of distance: classically the Euclidean (straight-line) distance between the points is used, but alternative definitions suited to the problem at hand are also possible, e.g. Levenshtein distance for word similarity.
  • The relative weighting of the neighbours whose classifications are examined: especially when one class occurs more frequently than the others, better results are typically obtained when a neighbour’s contribution to the classification of the new item is weighted so that it decreases as the distance between the two points increases (see the sketch after this list).
  • The number of neighbours whose classifications are examined, i.e. the value of k. Smaller values allow for more sensitivity to local differences within the data but also make the algorithm more vulnerable to overfitting. Where there are only two classes, an odd value of k should be used to prevent tied results.
  • The number of dimensions/features: dimensionality reduction techniques such as principal component analysis (PCA) are often used to select the most relevant features in advance.
  • Whether the data is cleansed in advance, for example by removing outliers detected with the local outlier factor algorithm.
  • Whether new data items added to the model act as neighbours when later items are classified: including them is normally not appropriate because classification errors could become self-reinforcing over time. The exception to this rule is when new classifications are somehow confirmed or rejected, giving them the same status as the original training data; this is then a form of active learning.
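
To illustrate the distance-weighting option mentioned in the list above, the sketch below replaces the simple majority vote with votes weighted by inverse distance; the function name weighted_knn_classify and the 1 / (distance + eps) weighting are illustrative assumptions, one common choice among several.

    from collections import defaultdict
    import math

    def weighted_knn_classify(new_item, training_data, k=3, eps=1e-9):
        """Classify new_item, weighting each neighbour's vote by 1 / (distance + eps)."""
        distances = [(math.dist(new_item, features), label)
                     for features, label in training_data]
        nearest = sorted(distances, key=lambda pair: pair[0])[:k]
        # Accumulate a total weight per class: closer neighbours count for more.
        votes = defaultdict(float)
        for distance, label in nearest:
            votes[label] += 1.0 / (distance + eps)
        # Return the class with the largest total weight.
        return max(votes, key=votes.get)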

For example, in a dataset of animals, if a new animal is introduced, the nearest neighbour algorithm will classify it based on the most common classification among its closest neighbours, such as “mammal” or “reptile”.
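
Using the illustrative knn_classify sketch above, such a call might look like the following; the feature vectors (body temperature in °C, number of legs) are invented purely to show the calling convention.

    # Invented feature vectors (body temperature in °C, number of legs),
    # used only to show the calling convention of the sketch above.
    training_data = [
        ((37.0, 4), "mammal"),
        ((36.5, 2), "mammal"),
        ((30.0, 4), "reptile"),
        ((28.0, 0), "reptile"),
    ]

    print(knn_classify((36.0, 4), training_data, k=3))  # prints "mammal"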

In summary, the nearest neighbour algorithm is a simple yet powerful tool for classification tasks, especially when the relationships between data points are complex and multidimensional.

Alias
k-Nearest-Neighbour, kNN
Related terms
Classification, Supervised Learning, Dimensionality Reduction