Latent semantic indexing

Algorithm

The input to latent semantic indexing is a term-frequency matrix relating w words to d documents, where each value in the matrix records how often a given word occurs in a given document. This matrix is decomposed, via truncated singular value decomposition, into three smaller matrices that approximate the original matrix when multiplied together. Two of these three matrices have their own individual uses:

1) The first matrix relates the w words to x dimensions (x being a user-supplied hyperparameter) and the row for each word is a word embedding. LSI thus represents one of the ways of generating word embeddings.

2) The second matrix is a diagonal matrix with x rows and x columns. It essentially forms a list of x weights that facilitate the approximate reproduction of the original matrix when the three matrices are recombined using matrix multiplication. It has no use on its own.

3) The third matrix relates the x dimensions to the d documents and can be used as the input for document clustering using a standard algorithm like k-means.
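The decomposition above can be sketched with NumPy's SVD routine. This is a minimal illustration, not a production LSI implementation; the toy matrix, the choice of x = 2, and all variable names are assumptions made for the example.

```python
import numpy as np

# Hypothetical toy term-frequency matrix: w = 4 words, d = 3 documents;
# entry [i, j] is how often word i occurs in document j.
A = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 3.0, 1.0],
    [0.0, 1.0, 2.0],
])

x = 2  # user-supplied number of latent dimensions (hyperparameter)

# Full SVD, then keep only the top-x singular values/vectors.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
word_embeddings = U[:, :x]   # first matrix: one x-dimensional row per word
weights = np.diag(s[:x])     # second matrix: diagonal, x rows and x columns
doc_vectors = Vt[:x, :].T    # third matrix: one x-dimensional row per document

# Recombining the three matrices yields a rank-x approximation of A.
A_approx = word_embeddings @ weights @ doc_vectors.T
```

The rows of `doc_vectors` are the document representations that could then be fed to a clustering algorithm such as k-means.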

See also probabilistic latent semantic indexing.

alias
Latent semantic analysis, LSI, LSA
subtype
has functional building block
FBB_Dimensionality reduction
has input data type
IDT_Vector of quantitative variables
has internal model
has output data type
ODT_Vector of quantitative variables
has learning style
LST_Unsupervised
has parametricity
PRM_Nonparametric with hyperparameter(s)
has relevance
REL_Relevant
uses
sometimes supports
ALG_k-means
mathematically similar to