A dimensionality reduction function takes input data where each item consists of a large number of variables and outputs data that has a smaller number of variables. The output variables can either be directly selected from the input variables or derived from input variables using functions that combine them.
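The two ways of obtaining output variables described above can be illustrated with a minimal sketch. The toy data, the function names `select_variables` and `derive_variables`, and the weight matrix are assumptions made up for illustration; derivation is shown here only as a linear combination, which is one of many possible combining functions.

```python
import numpy as np

# Toy data: 4 items, each described by 5 input variables (values are made up).
X = np.array([
    [1.0, 2.0, 0.5, 3.0, 1.5],
    [2.0, 4.1, 0.4, 2.0, 1.0],
    [3.0, 6.2, 0.6, 1.0, 0.5],
    [4.0, 7.9, 0.5, 0.0, 0.0],
])


def select_variables(data, columns):
    """Reduce dimensionality by keeping a subset of the input variables."""
    return data[:, columns]


def derive_variables(data, weights):
    """Reduce dimensionality by combining input variables linearly.

    `weights` has shape (n_inputs, n_outputs); each output variable is a
    weighted sum of the input variables.
    """
    return data @ weights


# Selection: keep input variables 0 and 3 unchanged.
selected = select_variables(X, [0, 3])

# Derivation with hypothetical weights: output 0 averages variables 0 and 1,
# output 1 averages variables 3 and 4.
W = np.zeros((5, 2))
W[[0, 1], 0] = 0.5
W[[3, 4], 1] = 0.5
derived = derive_variables(X, W)

print(selected.shape, derived.shape)
```

Both calls turn five variables per item into two. The selected columns remain directly interpretable as original variables, whereas the derived columns are new quantities.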
The output of a dimensionality reduction function is often used as the input to further machine learning algorithms, which typically go on to perform classification or value prediction. A good dimensionality reduction procedure outputs the variables best suited to serve as input to that subsequent stage. Criteria for judging suitability include:
- especially in supervised learning, correlation with training classifications or values;
- especially in unsupervised learning, “interestingness” or “salience” defined as difference from “expected” or “average” values.
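As a rough illustration of the two criteria listed above, the sketch below scores each input variable in two ways: by its absolute Pearson correlation with a training target (a supervised criterion), and by its variance, that is, its mean squared difference from the average value. Variance is used here only as one simple stand-in for “interestingness”. The synthetic data and the helper names `correlation_scores`, `variance_scores` and `top_k` are assumptions for illustration.

```python
import numpy as np

# Synthetic data (an assumption for illustration): three input variables,
# only the first of which actually drives the target.
rng = np.random.default_rng(0)
n = 200
informative = rng.normal(size=n)
X = np.column_stack([
    informative,                  # drives the target
    rng.normal(size=n),           # unrelated noise
    0.01 * rng.normal(size=n),    # nearly constant
])
y = 3.0 * informative + 0.1 * rng.normal(size=n)


def correlation_scores(X, y):
    """Supervised criterion: absolute Pearson correlation of each column with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return np.abs(Xc.T @ yc) / denom


def variance_scores(X):
    """Unsupervised criterion: mean squared difference from the column average."""
    return X.var(axis=0)


def top_k(scores, k):
    """Indices of the k highest-scoring variables, in ascending index order."""
    return np.sort(np.argsort(scores)[::-1][:k])


print(top_k(correlation_scores(X, y), 1))   # variable most correlated with y
print(top_k(variance_scores(X), 2))         # two most variable inputs
```

The two criteria can disagree: the noise variable has high variance but almost no correlation with the target, so it is kept by the unsupervised score and dropped by the supervised one.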
Dimensionality reduction is strongly associated with the need to consolidate heavily correlated (“collinear”) input variables before the data can be fed to certain algorithms, most notably linear regression and related techniques. More generally, it can also be used to improve accuracy and to decrease training time. See also:
- Feature selection
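The consolidation of collinear input variables mentioned above can be sketched with principal component analysis (PCA), computed here from a singular value decomposition. PCA is only one possible technique, chosen for this example rather than named by the text, and the synthetic data and the function name `pca_reduce` are assumptions for illustration.

```python
import numpy as np

# Synthetic data (an assumption for illustration): column 1 is almost an
# exact multiple of column 0, so the two are heavily collinear.
rng = np.random.default_rng(1)
n = 300
base = rng.normal(size=n)
X = np.column_stack([
    base,
    2.0 * base + 0.01 * rng.normal(size=n),   # nearly collinear with column 0
    rng.normal(size=n),                        # independent variable
])


def pca_reduce(X, n_components):
    """Project centered data onto its leading principal directions."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T


Z = pca_reduce(X, 2)

corr_before = np.corrcoef(X, rowvar=False)
corr_after = np.corrcoef(Z, rowvar=False)
print(abs(corr_before[0, 1]))   # close to 1: inputs 0 and 1 are collinear
print(abs(corr_after[0, 1]))    # close to 0: derived variables are uncorrelated
```

The two collinear inputs are merged into a single derived variable, and the resulting columns are mutually uncorrelated, which is the property that linear regression and similar techniques benefit from.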