The nearest neighbour algorithm classifies items within a multidimensional space according to the classifications of pre-existing items within that space. It works by examining the k nearest neighbours of the new item and assigning it the classification that occurs most frequently among them.
There are several hyperparameters that the data scientist can vary (a code sketch illustrating the first three follows the list):
- The definition of distance: classically the Euclidean distance between the points is used, but alternative metrics suited to the problem at hand are also possible, e.g. Levenshtein distance for word similarity;
- The relative weighting of the neighbours whose classifications are examined: especially when one class occurs more frequently than the others, better results are typically obtained when the contribution of each neighbour is weighted so that it decreases as its distance from the new point increases;
- The number of neighbours whose classifications are examined, i.e. the value of k. Smaller values allow for more sensitivity to local differences within the data but also make the algorithm more vulnerable to overfitting. Where there are only two classes, an odd value of k should be used to prevent tied results;
- The number of dimensions (features): dimensionality reduction techniques such as principal component analysis are often applied in advance to reduce the data to its most informative components;
- Whether data is cleansed in advance using procedures such as the local outlier factor algorithm;
- Whether new data items added to the model will act as neighbours when other new data items are classified in the future: it is normally not appropriate to include them because doing so might cause errors to become self-reinforcing over time. The exception to this rule is when new classifications are somehow confirmed or rejected, giving them the same status as the original training data. This is then a type of active learning.
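To make the first three hyperparameters concrete, here is a minimal from-scratch sketch in Python. It assumes NumPy is available; the function name knn_classify and the toy data are invented for illustration, and Euclidean distance stands in for whatever metric suits the problem:

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=5, weighted=True):
    """Classify x_new by a vote among its k nearest neighbours."""
    # Euclidean distance from the new point to every training point;
    # any metric suited to the data could be substituted here.
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k nearest neighbours.
    nearest = np.argsort(dists)[:k]
    votes = Counter()
    for i in nearest:
        if weighted:
            # Inverse-distance weighting: closer neighbours count for
            # more, which helps when one class dominates the data.
            votes[y_train[i]] += 1.0 / (dists[i] + 1e-9)
        else:
            votes[y_train[i]] += 1.0
    return votes.most_common(1)[0][0]

# Toy data: two classes in two dimensions.
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y = np.array(["a", "a", "b", "b", "b"])
print(knn_classify(X, y, np.array([0.2, 0.1]), k=3))  # -> "a"
```

The same behaviour is available off the shelf, e.g. via scikit-learn's KNeighborsClassifier and its n_neighbors, weights and metric arguments.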
Condensed Nearest Neighbour
The Condensed Nearest Neighbour (Hart) algorithm examines a set of training data and removes those items likely to have little effect on classification because they lie in the middle of a cluster where all of their own neighbours share the same class. This leaves only the relevant training items, so classifications of new data can proceed more rapidly.
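A minimal sketch of the procedure in Python, assuming NumPy arrays for the training data (the function name and toy data are invented for illustration). Note that Hart's classic formulation builds the reduced set up incrementally rather than deleting interior points directly; the effect is the same, since points whose nearest stored neighbour already shares their class are never added:

```python
import numpy as np

def condensed_nearest_neighbour(X, y):
    """Return indices of a subset S of the training data such that
    1-NN over S still classifies every training point correctly."""
    keep = [0]  # seed the store with an arbitrary first point
    changed = True
    while changed:  # repeat passes until a full scan adds nothing
        changed = False
        for i in range(len(X)):
            if i in keep:
                continue
            # 1-NN classification of point i against the current store.
            dists = np.linalg.norm(X[keep] - X[i], axis=1)
            nearest = keep[int(np.argmin(dists))]
            if y[nearest] != y[i]:  # misclassified, so it must be kept
                keep.append(i)
                changed = True
    return np.array(keep)

# Toy data: the interior points of each cluster are discarded.
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0], [1.0, 1.0], [1.1, 0.9]])
y = np.array(["a", "a", "a", "b", "b"])
print(condensed_nearest_neighbour(X, y))  # -> [0 3]
```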
- alias
  - k-Nearest-Neighbour (kNN)
- subtype
  - Condensed Nearest Neighbour (Hart algorithm)
- has functional building block
  - FBB_Classification
- has input data type
  - IDT_Vector of quantitative variables
- has internal model
- has output data type
  - ODT_Classification
- has learning style
  - LST_Supervised
- has parametricity
  - PRM_Nonparametric with hyperparameter(s)
- has relevance
  - REL_Relevant
- uses
- sometimes supports
  - ALG_Least Squares Regression
  - ALG_Logistic regression
- mathematically similar to