A vector of categorical variables is a one-dimensional array of values where each value is an assignment to one of a finite set of categories.
For example, if a house is white and terraced, these two facts make up a vector of categorical variables that describe it.
Note: If more than one categorical variable is predicted by a machine learning model, this is referred to as multi-label or multi-output classification.
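The house example can be written down as a small data structure. The following is a minimal sketch, assuming hypothetical category sets (`colour`, `house_type` and their allowed values are illustrative names, not part of this catalogue); it only shows that each entry of the vector must come from a finite set.

```python
# Hypothetical category sets for the house example (illustrative values only).
CATEGORIES = {
    "colour": ("white", "red", "grey"),
    "house_type": ("detached", "semi-detached", "terraced"),
}


def is_valid(vector: dict) -> bool:
    """Check that the vector has exactly the expected variables and that
    every value is one of the allowed categories for its variable."""
    return vector.keys() == CATEGORIES.keys() and all(
        value in CATEGORIES[name] for name, value in vector.items()
    )


# A white, terraced house described as a vector of two categorical variables.
house = {"colour": "white", "house_type": "terraced"}
```

A model that predicts both entries of `house` at once would be performing multi-output classification in the sense of the note above.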
Depending on the use case, categorical data can be converted to quantitative data, reducing the classification to a regression task.
Importantly, this is only recommended for ordinal classification tasks with an intrinsic order or hierarchy to the choices (e.g., 1 = excellent, 2 = good, 3 = poor).
The results will be unsatisfactory when a categorical variable is used to capture an unordered range of choices (e.g., 1 = Germany, 2 = France, 3 = UK), as the model will assume that the UK is larger than France and three times as large as Germany, while France would be treated as numerically between Germany and the UK.
These hidden assumptions can lead to unexpected effects and should be avoided.
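The difference between the two cases can be illustrated with a short sketch. It assumes one-hot encoding as the alternative for unordered categories; this is a common choice, but it is an assumption here rather than something this entry prescribes, and all function names are hypothetical.

```python
# Ordinal variable: the integer codes carry a meaningful order.
QUALITY_CODES = {"excellent": 1, "good": 2, "poor": 3}

# Nominal variable: no order, so each category gets its own indicator slot.
COUNTRIES = ("Germany", "France", "UK")


def encode_ordinal(label: str) -> int:
    """Map an ordered category to its rank."""
    return QUALITY_CODES[label]


def encode_one_hot(label: str) -> list:
    """Map an unordered category to a 0/1 indicator vector."""
    if label not in COUNTRIES:
        raise ValueError(f"unknown category: {label}")
    return [1 if label == country else 0 for country in COUNTRIES]


def squared_distance(a: list, b: list) -> int:
    """Squared Euclidean distance between two encoded vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


germany, france, uk = (encode_one_hot(c) for c in COUNTRIES)
```

With integer codes, Germany and the UK are two units apart while Germany and France are one unit apart. With the indicator vectors above, every pair of countries is equally far apart, so no ordering or magnitude is implied.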
In this catalogue, we include probabilities that a data point belongs to a given class as categorical variables. These soft labels or prediction probabilities are expressed as numbers between 0 and 1 and are thus fundamentally different from quantitative variables, which may have infinite ranges.
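As a rough illustration of soft labels, the sketch below turns raw model scores into class probabilities. It assumes a softmax over a single categorical variable, which is one common way to obtain such probabilities and not necessarily the method used by any particular model; the scores are made-up numbers.

```python
import math


def softmax(scores: list) -> list:
    """Convert raw scores into probabilities that lie in [0, 1] and sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


CLASSES = ("Germany", "France", "UK")
raw_scores = [2.0, 1.0, 0.1]  # made-up model outputs, for illustration only

# Soft label: one probability per class.
soft_label = dict(zip(CLASSES, softmax(raw_scores)))

# Hard label: the single most probable class.
hard_label = max(soft_label, key=soft_label.get)
```

Each probability is bounded, unlike a general quantitative variable. In a multi-label setting the per-class probabilities would typically be computed independently and need not sum to 1.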
- Alias
- Categorical Vector
- Related terms
- Binary Vectors, Multi-label Classification, Multi-output