msg.Machine Learning Catalogue

Principal component analysis (PCA) is applied to a set of vectors of quantitative variables to find a linear function, or set of linear functions, that maps them to new vectors with fewer dimensions while retaining as much of the variation in the data as possible.

PCA starts by finding the direction (a linear combination of the input dimensions) along which the input vectors show the greatest variance. Projecting the data onto this direction yields the first output dimension. If you imagine points scattered around a best-fit line within a three-dimensional space, PCA finds that line and expresses each point by its position along it. If the same procedure is then repeated on what is left of the points once their component along this first direction has been subtracted, a second direction is found that captures the next-largest share of the variance at right angles to the first, and so on for further dimensions. An excellent, short, intuitive video can be found here.
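The iterative procedure described above can be sketched in a few lines of standard-library Python. This is a minimal illustration and not a reference implementation: it assumes power iteration as the way to find each direction of maximum variance (production libraries normally use a singular value decomposition instead), and the function names `centre`, `first_direction` and `pca`, as well as the synthetic data scattered around the line through (1, 2, 2), are hypothetical choices made for this example.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def centre(points):
    """Subtract the per-dimension mean so the data is centred on the origin."""
    n, d = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(d)]
    return [[p[j] - means[j] for j in range(d)] for p in points]


def first_direction(points, iterations=500, seed=0):
    """Approximate the unit direction of greatest variance by power iteration."""
    rng = random.Random(seed)
    d = len(points[0])
    w = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(dot(w, w))
    w = [c / norm for c in w]
    for _ in range(iterations):
        # Apply X^T X to w without building the matrix: sum of (x . w) * x.
        scores = [dot(p, w) for p in points]
        w_new = [sum(s * p[j] for s, p in zip(scores, points)) for j in range(d)]
        norm = math.sqrt(dot(w_new, w_new))
        if norm == 0.0:  # no variance left in the data
            break
        w = [c / norm for c in w_new]
    return w


def pca(points, n_components):
    """Return (components, projected points), found one direction at a time."""
    centred = centre(points)
    residual = centred
    components = []
    for _ in range(n_components):
        w = first_direction(residual)
        components.append(w)
        # Deflation: remove from every point the part that lies along w,
        # so the next pass looks only at what this direction did not explain.
        residual = [[x - dot(p, w) * wj for x, wj in zip(p, w)] for p in residual]
    projected = [[dot(p, w) for w in components] for p in centred]
    return components, projected


# Synthetic example: 50 points scattered around the line through (1, 2, 2).
rng = random.Random(42)
points = []
for _ in range(50):
    t = rng.uniform(-5.0, 5.0)
    points.append([t + rng.gauss(0, 0.1),
                   2 * t + rng.gauss(0, 0.1),
                   2 * t + rng.gauss(0, 0.1)])

components, projected = pca(points, n_components=2)
print("first direction:", [round(c, 3) for c in components[0]])
print("second direction:", [round(c, 3) for c in components[1]])
```

The first direction should lie close to the best-fit line (up to sign), and because each pass works only on the residual of the previous one, the second direction comes out at right angles to the first.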

There are two important conditions that must hold for PCA to give useful results:

1) Most crucially, PCA is sensitive to the scaling of the input dimensions. An input dimension with a large variance plays a more important role in determining the functions than an input dimension with a small variance. In the typical case where the scaling of the input dimensions is unrelated to their relative importance, it is therefore important to normalise all the dimensions (for example, by rescaling each to zero mean and unit variance) before performing PCA.

2) PCA also presumes that the data is arranged according to a linear pattern that can be efficiently expressed using mutually orthogonal principal components. However, non-linear (higher-order) versions of the procedure have also been proposed that relax this assumption.
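The effect of scaling described in point 1 can be seen in a small two-variable sketch. The height and weight figures below are made up for illustration, and `first_axis_2d` and `standardise` are hypothetical helper names. The sketch relies on the closed-form angle of the leading eigenvector of a 2x2 covariance matrix, which only applies to the two-variable case.

```python
import math
import statistics


def first_axis_2d(xs, ys):
    """Unit vector of the first principal axis for two variables (closed form)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    n = len(xs)
    var_x = sum((x - mx) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - my) ** 2 for y in ys) / (n - 1)
    cov_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Angle of the leading eigenvector of [[var_x, cov_xy], [cov_xy, var_y]].
    theta = 0.5 * math.atan2(2.0 * cov_xy, var_x - var_y)
    return math.cos(theta), math.sin(theta)


def standardise(values):
    """Rescale to zero mean and unit (sample) standard deviation."""
    mean, sd = statistics.fmean(values), statistics.stdev(values)
    return [(v - mean) / sd for v in values]


# Made-up measurements: the same heights expressed in two different units.
heights_m = [1.50, 1.60, 1.65, 1.70, 1.80, 1.90]
heights_mm = [h * 1000 for h in heights_m]
weights_kg = [52, 58, 63, 70, 79, 90]

axis_metres = first_axis_2d(heights_m, weights_kg)
axis_millimetres = first_axis_2d(heights_mm, weights_kg)
axis_standardised = first_axis_2d(standardise(heights_m), standardise(weights_kg))

print("heights in m: ", axis_metres)        # dominated by the weight dimension
print("heights in mm:", axis_millimetres)   # dominated by the height dimension
print("standardised: ", axis_standardised)  # both dimensions contribute equally
```

Changing nothing but the unit of measurement swings the first principal axis from almost pure weight to almost pure height, whereas after standardisation both variables carry equal weight regardless of the original units.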

PCA could theoretically be used to explain the relationships between the variables that make up the vector dimensions (feature discovery), but factor analysis, which typically generates very similar results, is normally preferred for this job.

alias: PCA
subtype: Multilinear PCA N-way PCA
has functional building block: FBB_Dimensionality reduction
has input data type: IDT_Vector of quantitative variables
has internal model: INM_Function
has output data type: ODT_Vector of quantitative variables
has learning style: LST_Unsupervised
has parametricity: PRM_Nonparametric
has relevance: REL_Relevant
uses
sometimes supports: ALG_Least Squares Regression ALG_Nearest Neighbour
mathematically similar to: ALG_Factor analysis