The input to latent semantic indexing (LSI) is a term-frequency matrix relating w words to d documents, where each value records how often a given word occurs in a given document. This matrix is decomposed into three smaller matrices that, when multiplied together, give an approximation of the original matrix. Two of these three matrices have their own individual uses:
1) The first matrix relates the w words to x dimensions (x being a user-supplied hyperparameter), and the row for each word is a word embedding. LSI is thus one way of generating word embeddings.
2) The second matrix is a diagonal matrix with x rows and x columns. Its diagonal entries essentially form a list of x weights, one per dimension, that allow the original matrix to be approximately reproduced when the three matrices are recombined using matrix multiplication. It has no use on its own.
3) The third matrix relates the x dimensions to the d documents and can be used as the input for document clustering using a standard algorithm like k-means.
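The three matrices described above can be illustrated with a minimal sketch. It assumes the decomposition is a truncated singular value decomposition (the usual choice for LSI) computed with NumPy. The toy counts and the variable names (`X`, `word_matrix`, `weights`, `doc_matrix`) are made up for illustration and do not come from any particular implementation.

```python
import numpy as np

# Toy term-frequency matrix: w = 5 words (rows), d = 4 documents (columns).
# The counts are invented purely for illustration.
X = np.array([
    [2, 0, 1, 0],
    [1, 0, 2, 0],
    [0, 3, 0, 1],
    [0, 1, 0, 2],
    [1, 1, 1, 1],
], dtype=float)

x = 2  # number of latent dimensions (user-supplied hyperparameter)

# Compute the SVD, then keep only the x largest singular values.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
word_matrix = U[:, :x]      # w-by-x: one row per word (a word embedding)
weights = np.diag(s[:x])    # x-by-x diagonal matrix of weights
doc_matrix = Vt[:x, :]      # x-by-d: one column per document

# Multiplying the three matrices gives a rank-x approximation of X.
X_approx = word_matrix @ weights @ doc_matrix

# One row per document, ready to pass to a clustering algorithm
# such as k-means.
doc_vectors = doc_matrix.T  # d-by-x
```

Raising x makes `X_approx` closer to `X` at the cost of larger matrices. Some implementations also scale the document vectors by the weights before clustering; the sketch leaves them unscaled.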
See also probabilistic latent semantic indexing.
- alias
- Latent semantic analysis, LSI, LSA
- subtype
- has functional building block
- FBB_Dimensionality reduction
- has input data type
- IDT_Vector of quantitative variables
- has internal model
- has output data type
- ODT_Vector of quantitative variables
- has learning style
- LST_Unsupervised
- has parametricity
- PRM_Nonparametric with hyperparameter(s)
- has relevance
- REL_Relevant
- uses
- sometimes supports
- ALG_k-means
- mathematically similar to