Probabilistic latent semantic indexing (PLSI) is used to cluster documents on the basis of term frequencies, i.e. how often each document contains each word.
PLSI is used in scenarios where discovering hidden topics in a collection of documents is important, such as in text mining, information retrieval, and natural language processing. It helps in identifying the underlying thematic structure of a document corpus.
PLSI works by modeling a probability distribution where the observed document and term frequency variables are mediated by a hidden topic variable. Expectation maximization is used to find the topic model that best predicts the observed data. The number of topics is a user-supplied hyperparameter. The output of the trained algorithm for a given document is a list of probabilities that each of the unlabelled topics discovered during training pertains to the document.
For example, given a set of news articles, PLSI can identify topics such as “politics,” “sports,” and “technology,” and assign probabilities to each article indicating the relevance of these topics.
PLSI uses internal parameters that can be seen as equivalent to the values derived by standard latent semantic indexing. Nevertheless, the two algorithms are too different for PLSI to be successfully categorised as a simple sub-type of LSI:
- The functional building block of PLSI is always classification, while standard LSI is first and foremost used for dimensionality reduction, even if its output is often used as the input to some other classification algorithm in a second step.
- LSI uses linear algebra, while PLSI is an iterative process based on expectation maximization.
Latent Dirichlet Allocation (LDA) is a subtype of PLSI with two additional hyperparameters that cause the algorithm to prefer solutions where documents have relatively fewer topics (α hyperparameter) and where topics are characterised by relatively few words (β hyperparameter). The hyperparameters have values between 0 and 1 where lower values place a greater restriction on the number of topics/words. LDA typically yields better results than vanilla PLSI, corresponding to the intuitions that most documents are indeed about only a small number of topics and that most topics are indeed characterised by only a small number of words.
- Alias
- Probabilistic latent semantic analysis PLSI PLSA
- Related terms
- Latent Dirichlet Allocation Topic Modeling Expectation Maximization