Dimensionality Reduction

Supporting Technique

Dimensionality reduction is the process of reducing the number of variables under consideration by obtaining a set of principal variables.

Dimensionality reduction is commonly used in machine learning to simplify models, reduce overfitting, and decrease computational cost by transforming high-dimensional data into a lower-dimensional form.

Dimensionality reduction works by either selecting a subset of the original variables (feature selection) or transforming the original variables into a new set of variables (feature extraction) that retain the most important information.
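As a minimal illustration of the two approaches, the following sketch applies univariate feature selection and PCA-based feature extraction to the same toy dataset. It assumes scikit-learn is available; the dataset and parameter choices are illustrative, not prescriptive.

    # Minimal sketch contrasting feature selection and feature extraction.
    # scikit-learn and the iris dataset are illustrative choices only.
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import SelectKBest, f_classif

    X, y = load_iris(return_X_y=True)  # 150 samples, 4 original features

    # Feature selection: keep a subset of the original variables.
    X_selected = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

    # Feature extraction: transform the originals into new variables.
    X_extracted = PCA(n_components=2).fit_transform(X)

    print(X_selected.shape, X_extracted.shape)  # (150, 2) (150, 2)

Both results have two columns, but the selected features are two of the original measurements, while the extracted features are new variables built from combinations of all four.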

For example, Principal Component Analysis (PCA) is a widely used dimensionality reduction technique that transforms the original variables into a new set of uncorrelated variables called principal components. These principal components capture the maximum variance in the data, allowing for a more compact representation. PCA is particularly useful for visualizing high-dimensional data and reducing noise.
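A brief sketch of PCA in practice may help; it assumes scikit-learn and uses the digits dataset purely for illustration.

    # Hedged PCA sketch: project 64-dimensional digit images onto the two
    # principal components that capture the most variance.
    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA

    X, _ = load_digits(return_X_y=True)  # 1797 samples, 64 pixel features

    pca = PCA(n_components=2)
    X_2d = pca.fit_transform(X)

    print(X_2d.shape)                     # (1797, 2)
    print(pca.explained_variance_ratio_)  # variance share per component

The explained_variance_ratio_ attribute reports how much of the total variance each retained component captures, which is a common basis for choosing the number of components.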

Another example is t-Distributed Stochastic Neighbor Embedding (t-SNE), a nonlinear dimensionality reduction technique used for visualizing high-dimensional data. t-SNE converts the similarities between data points into joint probabilities and minimizes the Kullback-Leibler divergence between the joint probability distributions defined over the high-dimensional data and the low-dimensional embedding. The result is a map in which similar objects are modeled by nearby points and dissimilar objects by distant points.
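The following is a hedged sketch of t-SNE for visualization, again assuming scikit-learn; the perplexity value is an arbitrary illustrative choice.

    # Illustrative t-SNE sketch: embed 64-dimensional digits into 2-D.
    # t-SNE minimizes the KL divergence between the similarity
    # distributions of the original data and the embedding.
    from sklearn.datasets import load_digits
    from sklearn.manifold import TSNE

    X, _ = load_digits(return_X_y=True)

    tsne = TSNE(n_components=2, perplexity=30, random_state=0)
    X_embedded = tsne.fit_transform(X)

    print(X_embedded.shape)  # (1797, 2), ready for a scatter plot

Note that, unlike PCA, t-SNE does not learn a reusable transform for new data points; it is primarily a visualization tool.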

The output from a dimensionality reduction function is often used as the input to further machine learning algorithms, which typically go on to perform classification and value prediction (see the sketch after this list). A good dimensionality reduction procedure outputs variables that are well suited to serve as inputs in the subsequent stage. Bases for determining this include:

  • Especially in supervised learning, correlation with training classifications or values.
  • Especially in unsupervised learning, “interestingness” or “salience” defined as difference from “expected” or “average” values.
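To make the handoff concrete, here is a minimal sketch, assuming scikit-learn, in which PCA output feeds a downstream classifier; the component count and choice of classifier are illustrative, not a recommended recipe.

    # Sketch of dimensionality reduction as a preprocessing stage: PCA
    # output becomes the input to a classifier inside one pipeline.
    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = make_pipeline(PCA(n_components=20),
                          LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))  # accuracy on held-out data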

Dimensionality reduction is an essential technique in machine learning for building efficient, interpretable, and robust models by reducing the complexity of the data.

Alias

Related terms
Feature selection