The nearest neighbour algorithm classifies items within a multidimensional space according to the classifications of pre-existing items within that space. It works by examining the k nearest neighbours of the new item and assigning it the classification that occurs most frequently among them.
There are several hyperparameters that the data scientist can vary (a code sketch illustrating the first three follows the list):
- The definition of distance: classically the Euclidean distance between the points is used, but alternative metrics suited to the problem at hand are also possible, e.g. Levenshtein distance for word similarity;
- The relative weighting of the neighbours whose classifications are examined: especially when one class occurs more frequently than the others, better results are typically obtained when the contribution of each neighbour is weighted so that it decreases as its distance from the new point increases;
- The number of neighbours whose classifications are examined, i.e. the value of k. Smaller values allow for more sensitivity to local differences within the data but also make the algorithm more vulnerable to overfitting. Where there are only two classes, an odd value of k should be used to prevent tied results;
- The number of dimensions (features): dimensionality reduction techniques such as principal component analysis are often applied in advance to reduce the data to its most informative components;
- Whether data is cleansed in advance using procedures such as the local outlier factor algorithm;
- Whether new data items added to the model will act as neighbours when other new data items are classified in the future: it is normally not appropriate to include them because doing so might cause errors to become self-reinforcing over time. The exception to this rule is when new classifications are somehow confirmed or rejected, giving them the same status as the original training data. This is then a type of active learning.
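To make the first three hyperparameters concrete, here is a minimal from-scratch sketch in Python. It assumes NumPy is available; the function name knn_classify and the toy data are invented for illustration, and Euclidean distance stands in for whatever metric suits the problem:

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=5, weighted=True):
    """Classify x_new by a vote among its k nearest neighbours."""
    # Euclidean distance from the new point to every training point;
    # any metric suited to the data could be substituted here.
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k nearest neighbours.
    nearest = np.argsort(dists)[:k]
    votes = Counter()
    for i in nearest:
        if weighted:
            # Inverse-distance weighting: closer neighbours count for
            # more, which helps when one class dominates the data.
            votes[y_train[i]] += 1.0 / (dists[i] + 1e-9)
        else:
            votes[y_train[i]] += 1.0
    return votes.most_common(1)[0][0]

# Toy data: two classes in two dimensions.
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y = np.array(["a", "a", "b", "b", "b"])
print(knn_classify(X, y, np.array([0.2, 0.1]), k=3))  # -> "a"
```

The same behaviour is available off the shelf, e.g. via scikit-learn's KNeighborsClassifier and its n_neighbors, weights and metric arguments.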
Condensed Nearest Neighbour
The Condensed Nearest Neighbour (Hart) algorithm examines a set of training data and removes those items likely to have little effect on classification because they lie in the middle of a cluster where all of their own neighbours share the same class. This leaves only the relevant training items, so classifications of new data can proceed more rapidly.
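A minimal sketch of the procedure in Python, assuming NumPy arrays for the training data (the function name and toy data are invented for illustration). Note that Hart's classic formulation builds the reduced set up incrementally rather than deleting interior points directly; the effect is the same, since points whose nearest stored neighbour already shares their class are never added:

```python
import numpy as np

def condensed_nearest_neighbour(X, y):
    """Return indices of a subset S of the training data such that
    1-NN over S still classifies every training point correctly."""
    keep = [0]  # seed the store with an arbitrary first point
    changed = True
    while changed:  # repeat passes until a full scan adds nothing
        changed = False
        for i in range(len(X)):
            if i in keep:
                continue
            # 1-NN classification of point i against the current store.
            dists = np.linalg.norm(X[keep] - X[i], axis=1)
            nearest = keep[int(np.argmin(dists))]
            if y[nearest] != y[i]:  # misclassified, so it must be kept
                keep.append(i)
                changed = True
    return np.array(keep)

# Toy data: the interior points of each cluster are discarded.
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0], [1.0, 1.0], [1.1, 0.9]])
y = np.array(["a", "a", "a", "b", "b"])
print(condensed_nearest_neighbour(X, y))  # -> [0 3]
```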
- alias
  - k-Nearest-Neighbour (kNN)
- subtype
  - Condensed Nearest Neighbour (Hart algorithm)
- has functional building block
  - FBB_Classification
- has input data type
  - IDT_Vector of quantitative variables
- has internal model
- has output data type
  - ODT_Classification
- has learning style
  - LST_Supervised
- has parametricity
  - PRM_Nonparametric with hyperparameter(s)
- has relevance
  - REL_Relevant
- uses
- sometimes supports
  - ALG_Least Squares Regression
  - ALG_Logistic regression
- mathematically similar to