Least Squares Regression

Algorithm

Least Squares Regression is used to model the effect of 1…n predictor variables on a dependent variable. It works by finding the optimal set of coefficients by which to multiply the predictor variables; the sum of these products yields an estimate of the dependent variable.

For example, let us presume that the gross national product of a country depends on the size of its population, the mean number of years spent in education and the unemployment rate. Least Squares Regression uses training data to determine the optimal weighting for each of the three factors. These weightings can then be used for one or both of the following:

  • estimating the gross national product of some new country for which the other three variables are known (value prediction). This is the main use of least squares regression, especially in business contexts;
  • explaining the model itself. However, it is important to remember that the fact that one variable is correlated with another does not imply causation: it could be that both variables are being affected by a third, possibly hidden, one. Only once the causal factors for a given phenomenon have been established can least squares regression be used to investigate their relative importance.
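
To make value prediction concrete, here is a minimal sketch of the gross national product example in Python with NumPy. All figures, and the idea that five countries would suffice, are purely illustrative assumptions.

    import numpy as np

    # Hypothetical training data: one row per country.
    # Columns: population (millions), mean years in education, unemployment rate (%).
    X = np.array([
        [50.0, 11.2, 7.5],
        [82.0, 13.1, 5.0],
        [10.5, 12.4, 6.1],
        [143.0, 10.8, 9.3],
        [35.2, 12.9, 5.8],
    ])
    y = np.array([1500.0, 3900.0, 550.0, 2100.0, 1700.0])   # GNP in billions (invented)

    # Add a column of ones so the model can learn an intercept alongside the three weightings.
    A = np.column_stack([np.ones(len(X)), X])

    # Least squares: find the coefficients that minimise the squared prediction error.
    coef, residuals, rank, singular_values = np.linalg.lstsq(A, y, rcond=None)
    print("intercept and weightings:", coef)

    # Value prediction for a new country whose three predictor values are known.
    new_country = np.array([1.0, 60.0, 12.0, 6.5])
    print("estimated GNP:", new_country @ coef)

In a real application there should, of course, be far more training examples than the handful shown here.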

Least Squares Regression procedures are often sensitive to outliers: individual pieces of training data that do not conform to the general pattern being described, typically because they result either from one-off events or from mismeasurement. It can be helpful to eliminate outliers from the training data, but only if there is a theoretical basis for explaining them. Simply removing them because they are outliers introduces a dangerous bias into the learning calculation!
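
Where outliers are suspected, one simple way to find candidates for investigation is to flag training examples whose residual from an initial fit is unusually large. The sketch below (Python with NumPy, using a hypothetical flag_outliers helper) does exactly that; in keeping with the warning above, it flags rows for human inspection rather than removing them automatically.

    import numpy as np

    def flag_outliers(A, y, threshold=3.0):
        """Flag rows whose residual from an initial least squares fit lies more
        than `threshold` standard deviations from the mean residual. The flags
        are a prompt for investigation, not for automatic removal."""
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        residuals = y - A @ coef
        z = (residuals - residuals.mean()) / residuals.std()
        return np.abs(z) > threshold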

Regression is performed and described using matrix mathematics. This is because matrices are the most efficient way of modelling the relationships between corresponding sets of variable values. However, there is no need to understand the details in order to use least squares regression.
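
For readers who do want a glimpse of the details: in matrix form, ordinary least squares has a closed solution, usually written beta_hat = (XᵀX)⁻¹Xᵀy. A minimal NumPy sketch of this formula follows; it is offered as background only.

    import numpy as np

    # Closed-form ("normal equations") solution for ordinary least squares:
    #     beta_hat = (X^T X)^(-1) X^T y
    # X holds one row per training example (including a column of ones for the
    # intercept); y holds the observed values of the dependent variable.
    def ols_normal_equations(X, y):
        return np.linalg.solve(X.T @ X, X.T @ y)

    # Production libraries use numerically more stable decompositions (QR, SVD),
    # but the result is the same whenever X^T X is invertible.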

There is a spectrum of least squares regression procedures. At one extreme are mathematically simple procedures that place a large number of constraints on the input data but can learn effectively from a relatively small training set. At the other extreme are mathematically more complex procedures that place fewer restrictions on their input data but generally need much more training data to build an effective model. The more complex procedures also tend to be more difficult to use successfully, and tracking down the source of any errors that occur is more challenging. For these reasons, the simpler procedures should be preferred wherever possible. The procedures below are arranged starting with the simplest and moving on to the more complex.

Ordinary Least Squares Regression

Ordinary Least Squares Regression (OLSR) is the oldest type of regression. Where all the prerequisites are fulfilled, it can learn effectively with 10-15 training inputs for each predictor variable in the model (including any interaction terms, see below). OLSR places the following constraints on input data:

  1. the factors that genuinely determine the dependent variable are contained within the list of predictor variables. If you do not have a relatively solid understanding of the interplay of the various factors, you are unlikely to be successful using OLSR. In the example above: if gross national product were really determined mainly by some other economic factor not listed, the procedure would have little hope of yielding a working model.

    Preprocessing can play an important role in creating predictor variables that are mathematically suitable for inclusion in a regression model. For example, if all the values for a certain variable lie between 1000000 and 1000005, the difference between 1000000 and each value is likely to be a more appropriate predictor variable and yield better results than the original variable.
     

  2. interdependencies between the predictor variables are:

    a) too slight to be significant (the variables are practically mutually independent); or

    b) understood and expressed using additional factors called interaction terms. If the number of hours of sunlight and the number of mm of rain to which a plant has been exposed are both predictor variables, the relationship between the two variables could be captured using a third variable equal to their product (a small sketch follows this list). If interactions between predictor variables exist but are not captured in this way, least squares regression is liable to generate models that are too closely modelled on the training data, i.e. to overlearn;

    c) eliminated using principal component analysis. Regression using principal components rather than the original input variables is referred to as principal component regression.

    When one predictor variable has an exact linear relationship to another, the two are said to be perfectly correlated, a situation known as perfect multicollinearity. In practice, this usually occurs because the same variable has mistakenly been added to the model twice. While less extreme interdependencies between predictor variables can lead to overlearning and erroneous results, perfect multicollinearity makes the OLSR calculation mathematically impossible.
     

  3. Least squares regression presumes that the errors of the model (the differences between the predicted and the observed values of the dependent variable) are normally distributed (Gaussian distribution). In practice this can often not be guaranteed, but things will normally still work as long as the overall degree of error is not too great and the departure from the normal distribution is not too severe.
     
  4. the relationship between each predictor variable and the dependent variable is linear, meaning that a change in the predictor variable produces a proportional change in the dependent variable. Crucially, many non-linear relationships can still be captured in a linear fashion by redefining the predictor variable in preprocessing steps as described in point 1 above. For example, a predictor variable whose effect on the dependent variable is logarithmic would not be a valid candidate for linear forms of least squares regression, but the logarithm of the same predictor variable would be (also shown in the sketch after this list).
     
  5. The errors of the model are mutually independent from one observation to the next, and there are no repeating effects or autocorrelations within the errors over time (e.g. more errors at night than during the day).
     

  6. The error of the model is homoscedastic, meaning that its magnitude does not vary with the values of the predictor variables. An example of heteroscedasticity would be a tax office asking employees to estimate their total income for the following financial year: the margin of error will obviously be much greater for a high earner like a board member than for somebody receiving the minimum wage.

    In practice, OLSR can still deliver usable results if this prerequisite is not fulfilled as long as the heteroscedasticity is not too great. Where it works, OLSR should then be preferred over more complex methods. One rule of thumb mentioned on the internet is that the error may vary up to fourfold before OLSR ceases to be useful. I found no way of verifying this figure, but it may still serve as a useful starting point.
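
The sketch below (Python with NumPy, invented plant-growth figures) illustrates two of the points above: an interaction term as described under point 2 b) and a logarithmically transformed predictor as described under point 4. The numbers, and the assumption that growth responds logarithmically to rainfall, are purely illustrative.

    import numpy as np

    # Hypothetical plant-growth data: hours of sunlight, mm of rain, observed growth.
    sunlight = np.array([4.0, 6.0, 8.0, 5.0, 7.0, 9.0, 3.0, 6.5])
    rain     = np.array([20.0, 35.0, 15.0, 40.0, 25.0, 30.0, 10.0, 45.0])
    growth   = np.array([2.1, 4.0, 2.9, 4.4, 3.6, 4.8, 1.2, 5.1])

    # Interaction term (point 2b): a third predictor equal to the product of the other two.
    interaction = sunlight * rain

    # Transformed predictor (point 4): if growth responds logarithmically to rain,
    # log(rain) rather than rain itself gives the linear relationship OLSR needs.
    log_rain = np.log(rain)

    # Design matrix: intercept, sunlight, log(rain) and the interaction term.
    A = np.column_stack([np.ones(len(growth)), sunlight, log_rain, interaction])
    coef, *_ = np.linalg.lstsq(A, growth, rcond=None)
    print(coef)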

Weighted Least Squares Regression

In Weighted Least Squares Regression, prerequisite 6 (homoscedasticity) is relaxed for the special case in which the error increases in proportion to the value of a predictor variable. This is offset by giving observations with lower values of the relevant predictor variable, and hence smaller errors, a greater weight in the model.

Weighted Least Squares Regression works well provided that:

  • you get the weights right. If there is no theoretical basis for modelling how the error varies with the variable value, there are various techniques for deriving this information from the data. However, these can be vulnerable to overfitting.
  • you deal with any outliers in the lower portions of the variable range, as these will have a disproportionately deleterious effect on the model.
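
As a concrete illustration, the following sketch uses the statsmodels library on synthetic data in which the error grows in proportion to the predictor value. The weights of 1/x² reflect the assumption that the error variance grows with the square of the predictor; all figures are invented.

    import numpy as np
    import statsmodels.api as sm

    # Synthetic data where the spread of the error grows with the predictor value.
    rng = np.random.default_rng(0)
    x = np.linspace(1.0, 50.0, 80)
    y = 3.0 + 0.5 * x + rng.normal(size=x.size) * 0.1 * x   # error proportional to x

    X = sm.add_constant(x)

    # If the error (standard deviation) is proportional to x, its variance is
    # proportional to x**2, so each observation is weighted by 1 / x**2.
    wls = sm.WLS(y, X, weights=1.0 / x**2).fit()
    print(wls.params)   # intercept and slope, with low-x observations weighted up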

Generalized Least Squares Regression

In Generalized Least Squares Regression, prerequisites 5 (error independence) and 6 (homoscedasticity) are removed and a matrix (the error covariance matrix) is added into the equation. This matrix expresses both the ways in which the errors of different observations are correlated with one another (prerequisite 5) and the ways in which the magnitude of the error varies from observation to observation (prerequisite 6).
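
The sketch below shows what supplying such a matrix can look like in practice, using statsmodels and an invented covariance matrix in which the errors of neighbouring observations are correlated in an AR(1)-like pattern. The structure of the matrix is an assumption made purely for illustration.

    import numpy as np
    import statsmodels.api as sm

    n, rho = 50, 0.6
    # Error covariance matrix: entry (i, j) is rho**|i - j|, i.e. neighbouring
    # observations have correlated errors (prerequisite 5); unequal variances
    # (prerequisite 6) would appear on the diagonal.
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))

    rng = np.random.default_rng(1)
    x = rng.uniform(0.0, 10.0, n)
    y = 2.0 + 1.5 * x + rng.multivariate_normal(np.zeros(n), Sigma)

    X = sm.add_constant(x)
    gls = sm.GLS(y, X, sigma=Sigma).fit()
    print(gls.params)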

Unless there is a theoretical basis for supplying this matrix as given to the regression model, it is normally estimated from the residuals of an initial ordinary least squares regression (OLSR) fit, i.e. from the deviations of the observed values from the best-fit line. This is called Feasible Generalized Least Squares (FGLS) Regression or Estimated Generalized Least Squares (EGLS) Regression. Because there are an enormous number of ways in which the errors of different variables could influence one another, performing feasible generalized least squares regression for all possible combinations of predictor variables would need a very large amount of training data to yield a usable model. Instead, common sense is normally applied to determine in advance which variables are likely to be heteroscedastic and which pairs of variables are likely to affect each other's error. Feasible generalized least squares regression is then performed for these terms only.
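
A minimal two-stage sketch of this idea, assuming statsmodels and modelling heteroscedasticity only (no cross-correlation between errors), might look as follows. The choice to model the error variance by regressing the log of the squared residuals on the predictors is one common approach, not the only one.

    import numpy as np
    import statsmodels.api as sm

    def fgls(X, y):
        # X is expected to already contain a constant column.
        # Stage 1: an ordinary least squares fit, purely to obtain residuals.
        ols = sm.OLS(y, X).fit()
        resid = ols.resid

        # Model the error variance: regress the log of the squared residuals on
        # the predictors, then exponentiate to get a positive estimate per row.
        var_fit = sm.OLS(np.log(resid**2 + 1e-12), X).fit()
        est_var = np.exp(var_fit.fittedvalues)

        # Stage 2: refit, weighting each observation by the inverse of its
        # estimated error variance.
        return sm.WLS(y, X, weights=1.0 / est_var).fit()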

Non-linear Regression

Relaxing prerequisite 4 (linearity) as well leads us into the realm of non-linear regression. This is mathematically more complex than the linear regression we have been discussing up until now. While linear regression can be solved directly with equations, non-linear regression has to rely on iterative procedures that gradually approach the optimal values. It also tends to require much more training data to work.

For these reasons, non-linear regression should only ever be used as a last resort, after it has been definitively ascertained that there is no way of pre-processing variables to yield linear relationships as described under point 4 above.
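
For completeness, here is a small sketch of iterative non-linear regression using SciPy's curve_fit on invented data that follow a saturating curve, which is awkward to linearise by transforming the predictor alone. The functional form and the starting guess are assumptions made for the sake of the example.

    import numpy as np
    from scipy.optimize import curve_fit

    # Hypothetical saturating relationship: y = a * x / (b + x).
    def saturating(x, a, b):
        return a * x / (b + x)

    x = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0])
    y = np.array([0.9, 1.6, 2.4, 3.1, 3.5, 3.8, 3.9])

    # curve_fit adjusts a and b iteratively to minimise the squared error,
    # starting from the initial guess p0; a poor guess can prevent convergence.
    params, covariance = curve_fit(saturating, x, y, p0=[4.0, 4.0])
    print(params)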

alias
Linear Regression
subtype
Ordinary Least Squares Regression (OLSR)
Weighted Least Squares
Generalized Least Squares
Feasible Generalized Least Squares (FGLS)
Estimated Generalized Least Squares (EGLS)
Non-Linear Regression
Principal component regression
has functional building block
FBB_Value prediction
has input data type
IDT_Vector of quantitative variables
has internal model
INM_Function
has output data type
ODT_Quantitative variable
has learning style
LST_Supervised
has parametricity
PRM_Parametric
has relevance
REL_Relevant