Vector of categorical variables

Input data type

A vector of categorical variables is a one-dimensional array or set of values in which each value assigns its variable to one of a finite set of categories (cf. vector of quantitative variables). For example, if a house is white and terraced, these two facts make up a vector of categorical variables that describes it.

In this catalogue, we also treat as categorical variables the probabilities that an input data value belongs to a given class. Each probability is expressed as a number between 0 and 1 and is thus fundamentally different from a quantitative variable with its potentially infinite range.

Quantitative data can be converted to categorical data by assigning labels to sections of the value range, itself a classification task. This is normally only recommended when the data is clearly clustered around certain positions within the range.
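
The conversion described above can be sketched in Python as follows. This is a minimal illustration only: the floor-area example, the bin edges and the label names are assumptions made for the sketch and do not come from the catalogue.

```python
from bisect import bisect_right

# Hypothetical bin edges (square metres) and labels; both are assumptions
# chosen for illustration, not values taken from the catalogue.
EDGES = [50.0, 100.0]
LABELS = ["small", "medium", "large"]


def to_category(value, edges=EDGES, labels=LABELS):
    """Return the label of the section of the value range that contains value.

    A value equal to an edge is assigned to the section above that edge.
    """
    return labels[bisect_right(edges, value)]


areas = [32.5, 48.0, 75.0, 99.9, 100.0, 140.0]
categories = [to_category(area) for area in areas]
print(categories)
```

Fixed edges like these are only sensible when the values genuinely cluster within the chosen sections; otherwise neighbouring values on either side of an edge end up with different labels for no good reason.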

Importantly, a set of categories should only be captured as a single numerically coded categorical variable where there is an intrinsic order or hierarchy to the choices (e.g. 1 = terraced, 2 = semi-detached, 3 = detached). The results will be unsatisfactory when such a variable is used to capture an unordered range of choices (e.g. 1 = Bavaria, 2 = Hessen, 3 = Berlin), because the numeric codes suggest an order that does not exist; in such cases, separate binary variables should be used instead:

  • isBavaria = 0/1; isHessen = 0/1; isBerlin = 0/1, or
  • isBavaria = 0/1; isHessen = 0/1, with Berlin defined as both variables being set to zero.
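
The two binary encodings listed above can be sketched in Python as follows. The function names one_hot and dummy, and the choice of Berlin as the default reference category, are assumptions made for this illustration.

```python
# The three unordered choices from the example above.
STATES = ["Bavaria", "Hessen", "Berlin"]


def one_hot(state, categories=STATES):
    """First option: one binary variable per category."""
    if state not in categories:
        raise ValueError(f"unknown category: {state}")
    return {f"is{c}": int(state == c) for c in categories}


def dummy(state, categories=STATES, reference="Berlin"):
    """Second option: omit one category (the reference), which is then
    represented by all remaining variables being zero."""
    if state not in categories:
        raise ValueError(f"unknown category: {state}")
    return {f"is{c}": int(state == c) for c in categories if c != reference}


print(one_hot("Hessen"))  # isHessen is 1, the other two are 0
print(dummy("Berlin"))    # both remaining variables are 0
```

The second form uses one variable fewer and avoids the redundancy of the first, in which any one variable can be derived from the others.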

Used by

  • ALG_Actor-critic
  • ALG_Adaptive resonance theory network
  • ALG_Association rule learning
  • ALG_Averaged one-dependence estimators
  • ALG_Bayesian network
  • ALG_Decision tree
  • ALG_Deep Q-network
  • ALG_Long short-term memory network
  • ALG_Markov random field
  • ALG_Monte-Carlo tree search
  • ALG_Naive Bayesian Classifier
  • ALG_Neural actor-critic
  • ALG_One Rule
  • ALG_Perceptron
  • ALG_Q-learning
  • ALG_Random forest
  • ALG_SARSA
  • ALG_Temporal difference learning