Probabilistic latent semantic indexing

Algorithm

Probabilistic latent semantic indexing is used to cluster documents on the basis of term frequencies (how often each document contains each word). It works by modelling a probability distribution where the observed document and term frequency variables are mediated by a hidden topic variable. Expectation maximization is used to find the topic model that best predicts the observed data. The number of topics is a user-supplied hyperparameter. The output of the trained algorithm for a given document is list of probabilities that each of the unlabelled topics discovered during training pertains to the document.

PLSI uses internal parameters that can be seen as equivalent to the values derived by standard LSI. Nevertheless, the two algorithms are too different for PLSI to be successfully categorised as a simple sub-type of LSI:

  • The functional building block of PLSI is always classification, while standard LSI is first and foremost used for dimensionality reduction, even if its output is often used as the input to some other classification algorithm in a second step.
  • LSI uses linear algebra, while PLSI is an iterative process based on expectation maximization.

Latent Dirichlet Allocation (LDA) is a subtype of PLSI with two additional hyperparameters that cause the algorithm to prefer solutions where documents have relatively fewer topics (α hyperparameter) and where topics are characterised by relatively few words (β hyperparameter). The hyperparameters have values between 0 and 1 where lower values place a greater restriction on the number of topics/words. LDA typically yields better results than vanilla PLSI, corresponding to the intuitions that most documents are indeed about only a small number of topics and that most topics are indeed characterised by only a small number of words.

alias
Probabilistic latent semantic analysis PLSI PLSA
subtype
Latent Dirichlet Allocation LDA
has functional building block
FBB_Classification
has input data type
has internal model
INM_Probability
has output data type
ODT_Classification ODT_Probability
has learning style
LST_Unsupervised
has parametricity
PRM_Nonparametric with hyperparameter(s)
has relevance
REL_Relevant
uses
ALG_Expectation maximization
sometimes supports
mathematically similar to