Latent semantic indexing

Algorithm

The input to latent semantic indexing is a term-frequency matrix relating w words to d documents, where each value in the matrix records how often a given word occurs in a given document. This matrix is decomposed, via truncated singular value decomposition, into three smaller matrices that approximate the original matrix when multiplied together. Two of these three matrices have their own individual uses:

1) The first matrix relates the w words to x dimensions (x being a user-supplied hyperparameter) and the row for each word is a word embedding. LSI thus represents one of the ways of generating word embeddings.

2) The second matrix is a diagonal matrix with x rows and x columns. It essentially forms a list of x weights that facilitate the approximate reproduction of the original matrix when the three matrices are recombined using matrix multiplication. It has no use on its own.

3) The third matrix relates the x dimensions to the d documents and can be used as the input for document clustering using a standard algorithm like k-means.
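The decomposition above can be sketched with NumPy's SVD routine. This is a minimal illustration, not a production LSI implementation; the toy matrix, the choice of x = 2, and all variable names are assumptions made for the example.

```python
import numpy as np

# Hypothetical toy term-frequency matrix: w = 4 words, d = 3 documents;
# entry [i, j] is how often word i occurs in document j.
A = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 3.0, 1.0],
    [0.0, 1.0, 2.0],
])

x = 2  # user-supplied number of latent dimensions (hyperparameter)

# Full SVD, then keep only the top-x singular values/vectors.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
word_embeddings = U[:, :x]   # first matrix: one x-dimensional row per word
weights = np.diag(s[:x])     # second matrix: diagonal, x rows and x columns
doc_vectors = Vt[:x, :].T    # third matrix: one x-dimensional row per document

# Recombining the three matrices yields a rank-x approximation of A.
A_approx = word_embeddings @ weights @ doc_vectors.T
```

The rows of `doc_vectors` are the document representations that could then be fed to a clustering algorithm such as k-means.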

See also probabilistic latent semantic indexing.

alias
Latent semantic analysis, LSI, LSA
subtype
has functional building block
FBB_Dimensionality reduction
has input data type
IDT_Vector of quantitative variables
has internal model
has output data type
ODT_Vector of quantitative variables
has learning style
LST_Unsupervised
has parametricity
PRM_Nonparametric with hyperparameter(s)
has relevance
REL_Relevant
uses
sometimes supports
ALG_k-means
mathematically similar to