Principal component analysis (PCA) takes a set of vectors of variables and finds one or more linear functions that map them to new vectors with fewer dimensions while preserving as much of the variation in the data as possible.
PCA is used in scenarios where reducing the dimensionality of data is important, such as in exploratory data analysis, visualization, and feature extraction. It helps in simplifying the data while retaining the most significant structures and patterns.
PCA works by finding the linear function of the input variables whose output has the greatest variance across the set of input vectors. This function yields the first output dimension. If you imagine points scattered around a best-fit line within a three-dimensional space, PCA will find a function that expresses the position of each point along that line. If the same procedure is then repeated on what remains of the points once the part captured by this first function has been subtracted, a second function is derived that expresses the second best-fit dimension, at right angles to the first one, and so on for further dimensions. An excellent, short, intuitive video can be found here.
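The iterative procedure described above can be sketched in plain Python. This is a minimal illustration under stated assumptions rather than a reference implementation: it approximates each direction by power iteration, and the helper names (`center`, `first_direction`, `deflate`, `principal_directions`) and the synthetic points scattered around a line are invented for the example. Libraries usually compute all components at once through an eigendecomposition or singular value decomposition instead.

```python
import math
import random


def center(rows):
    """Subtract the per-column mean so every variable has mean zero."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def first_direction(rows, iterations=500, seed=0):
    """Approximate the unit vector along which the centered rows vary most."""
    rng = random.Random(seed)
    d = len(rows[0])
    w = [rng.uniform(-1, 1) for _ in range(d)]
    for _ in range(iterations):
        # Multiply w by X^T X without forming the matrix: X^T (X w).
        scores = [sum(r[j] * w[j] for j in range(d)) for r in rows]
        w = [sum(s * r[j] for s, r in zip(scores, rows)) for j in range(d)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    return w


def deflate(rows, w):
    """Remove from every row its projection onto the unit vector w."""
    out = []
    for r in rows:
        s = sum(a * b for a, b in zip(r, w))
        out.append([a - s * b for a, b in zip(r, w)])
    return out


def principal_directions(rows, k):
    """Find k directions one at a time, subtracting each before the next."""
    rows = center(rows)
    directions = []
    for _ in range(k):
        w = first_direction(rows)
        directions.append(w)
        rows = deflate(rows, w)
    return directions


# Hypothetical data: points scattered around the line through (1, 2, -1).
rng = random.Random(1)
points = []
for i in range(50):
    t = i / 10.0
    points.append([t + rng.gauss(0, 0.1),
                   2 * t + rng.gauss(0, 0.1),
                   -t + rng.gauss(0, 0.1)])

first, second = principal_directions(points, 2)
print("first direction: ", [round(v, 3) for v in first])
print("second direction:", [round(v, 3) for v in second])
```

The first direction should lie close to the line the points were generated around (up to sign), and the second is orthogonal to it because it is built only from what is left after the first has been subtracted. Power iteration can converge slowly when two directions carry nearly equal variance, which is one reason practical implementations take a different route.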
For example, in a dataset with three variables, PCA can reduce the data to two principal components that capture the most variance in the data, making it easier to visualize and analyze.
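As a rough illustration of the three-variable case, the sketch below builds a synthetic dataset in which the third variable is almost the sum of the other two, extracts the two leading components from the covariance matrix, and reports the share of total variance they retain. The data, the helper names (`centered_and_cov`, `top_eigenpairs`) and the choice of power iteration with Hotelling deflation are all assumptions made for the example.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def centered_and_cov(rows):
    """Return the mean-centered rows and their sample covariance matrix."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return centered, cov


def top_eigenpairs(cov, k, iterations=500, seed=0):
    """Approximate the k largest eigenpairs of a symmetric matrix."""
    rng = random.Random(seed)
    size = len(cov)
    m = [row[:] for row in cov]
    pairs = []
    for _ in range(k):
        v = [rng.uniform(-1, 1) for _ in range(size)]
        for _ in range(iterations):
            v = [dot(row, v) for row in m]
            norm = math.sqrt(dot(v, v))
            v = [x / norm for x in v]
        lam = dot(v, [dot(row, v) for row in m])  # Rayleigh quotient
        pairs.append((lam, v))
        # Hotelling deflation: subtract the component that was just found.
        m = [[m[i][j] - lam * v[i] * v[j] for j in range(size)]
             for i in range(size)]
    return pairs


# Hypothetical data: three variables, the third nearly the sum of the others.
rng = random.Random(42)
data = []
for _ in range(200):
    a, b = rng.gauss(0, 1), rng.gauss(0, 1)
    data.append([a, b, a + b + rng.gauss(0, 0.05)])

centered, cov = centered_and_cov(data)
pairs = top_eigenpairs(cov, k=2)
components = [v for _, v in pairs]

# Each three-dimensional row becomes a two-dimensional score vector.
scores = [[dot(r, v) for v in components] for r in centered]

total_variance = sum(cov[i][i] for i in range(3))
retained = sum(lam for lam, _ in pairs) / total_variance
print(f"variance retained by two components: {retained:.3f}")
```

Because the third variable carries almost no information beyond the first two, the two-dimensional scores keep nearly all of the variance. With less redundant data the retained share would be lower, and it is worth checking before discarding components.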
There are two important conditions that must hold for PCA to give meaningful results:
1) PCA is sensitive to the scaling of the input dimensions. An input dimension with a large variance plays a more important role in determining the functions than an input dimension with a small variance. It is therefore important to normalize all the dimensions (for example, by rescaling each one to unit variance) before performing PCA.
2) PCA presumes that the data is arranged according to a linear pattern that can be efficiently expressed using mutually orthogonal principal components. Non-linear (higher-order) versions of the procedure have also been proposed that relax this assumption.
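The sensitivity to scaling can be seen in a small two-variable sketch. It relies on the closed-form principal-axis angle of a 2x2 covariance matrix, so no iteration is needed. The measurements, the units and the function names (`first_direction_2d`, `standardize`) are hypothetical and chosen only to make the effect visible.

```python
import math


def first_direction_2d(xs, ys):
    """First principal direction of two variables, from the closed-form
    principal-axis angle of their 2x2 sample covariance matrix."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    cyy = sum((y - my) ** 2 for y in ys) / (n - 1)
    cxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)
    return (math.cos(theta), math.sin(theta))


def standardize(values):
    """Rescale a variable to zero mean and unit sample variance."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


# Hypothetical measurements of the same objects in very different units.
length_m = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
width_mm = [1100.0, 1900.0, 3200.0, 3900.0, 5100.0, 5800.0]

raw = first_direction_2d(length_m, width_mm)
scaled = first_direction_2d(standardize(length_m), standardize(width_mm))

print("raw units:    ", [round(v, 3) for v in raw])
print("standardized: ", [round(v, 3) for v in scaled])
```

On the raw values the first direction points almost entirely along the millimetre variable, simply because its numbers are larger. After standardization both variables contribute equally, which better reflects the fact that they are strongly correlated.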
PCA could theoretically be used to explain the relationships between the variables that make up the vector dimensions, but factor analysis, which typically generates very similar results, is normally preferred for this job.
- Alias: PCA
- Related terms: Multilinear PCA, N-way PCA, t-SNE