Latent Semantic Indexing

Supporting Technique

Latent Semantic Indexing (LSI) is a technique used for analyzing relationships between a set of documents and the terms they contain.

LSI is commonly used in information retrieval and text mining to identify patterns in the relationships between terms and concepts contained in unstructured text. For example, consider a collection of research papers where each paper is represented by the frequency of terms it contains. LSI can be used to reduce the dimensionality of this data, making it easier to identify clusters of papers that discuss similar topics.

LSI works by constructing a term-document matrix that describes the occurrences of terms in documents. The matrix relates w words to d documents, and each value records how often a given word occurs in a given document. This matrix is then decomposed using singular value decomposition (SVD) into three smaller matrices that give an approximation of the original matrix when multiplied together.

  • The first matrix relates the w words to x dimensions, where x is a user-supplied hyperparameter. The row for each word is a word embedding, so LSI is one of the ways of generating word embeddings.

  • The second matrix is a diagonal matrix with x rows and x columns. Its diagonal forms a list of x weights, one per dimension, that allow the original matrix to be approximately reproduced when the three matrices are recombined using matrix multiplication. It is not normally used on its own.

  • The third matrix relates the x dimensions to the d documents and can be used as the input for document clustering.
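
As a minimal sketch of the decomposition described above, the following Python example applies NumPy's `numpy.linalg.svd` to a small made-up term-document matrix and keeps only the first x dimensions. The vocabulary, the counts, the choice of x = 2 and all variable names are assumptions for illustration and do not come from a real corpus.

```python
import numpy as np

# Toy term-document matrix: w = 5 words (rows) by d = 4 documents (columns).
# The counts are invented for illustration only.
terms = ["matrix", "vector", "algebra", "protein", "cell"]
A = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

x = 2  # number of latent dimensions (user-supplied hyperparameter)

# Full SVD; s holds the singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Truncate to the x largest singular values.
word_embeddings = U[:, :x]   # first matrix:  w by x, one row per word
weights = np.diag(s[:x])     # second matrix: x by x, diagonal
doc_dims = Vt[:x, :]         # third matrix:  x by d, one column per document

# Recombining the three matrices approximates the original matrix.
A_approx = word_embeddings @ weights @ doc_dims

# One weighted x-dimensional vector per document, usable for clustering.
doc_vectors = (weights @ doc_dims).T   # d by x


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


print("word embeddings shape:", word_embeddings.shape)
print("reconstruction error:", np.linalg.norm(A - A_approx))
print("similarity doc 0 vs doc 1:", cosine(doc_vectors[0], doc_vectors[1]))
print("similarity doc 0 vs doc 2:", cosine(doc_vectors[0], doc_vectors[2]))
```

In this toy data the first two documents share one vocabulary and the last two share another, so their vectors in the reduced space separate into two groups. Weighting the document columns by the second matrix before comparing them is one common choice rather than a requirement.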

Alias
Latent semantic analysis, LSI, LSA
Related terms
Probabilistic Latent Semantic Indexing, Dimensionality Reduction, Word Embeddings, Document Clustering