Bagging (short for bootstrap aggregating) is a technique that can be applied to a wide variety of learning algorithms. The procedure is:
- Multiple training sets are generated from the original training data by random sampling with replacement, so the same record may appear more than once within a sampled set. The sampled sets may be smaller than or the same size as the original training data. The number of sampled sets that yields the best result is largely a matter of trial and error, although a rule of thumb when using bagging for classification is to start with as many sampled sets as there are class labels.
- The underlying machine learning procedure is carried out separately on each sampled set, producing one model per set.
- When the model is used to classify new data or predict a value for it, the data is run through each of the generated models separately and the results are combined to yield the final result. The arithmetic mean is typically used for value-prediction use cases and the mode (the most frequently predicted class, i.e. a majority vote) for classification use cases. Weight-adjusted bagging is a variant that measures the accuracy of each generated model against a second set of training data and then weights each model's result accordingly when processing new input.
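The procedure above can be sketched in a few lines of Python. This is a minimal illustration and not a reference implementation: all function names are hypothetical, the base learner (a one-dimensional nearest-neighbour classifier) is an arbitrary stand-in for "the main machine learning procedure", and the weighting scheme (each model's vote counts in proportion to its accuracy on a held-out set) is only one plausible reading of weight-adjusted bagging.

```python
import random
from collections import Counter
from statistics import mean


def bootstrap_sample(data, size=None, rng=random):
    """Draw `size` records with replacement, so duplicates are allowed."""
    size = len(data) if size is None else size
    return [rng.choice(data) for _ in range(size)]


def bagging_fit(data, train, n_models, sample_size=None, rng=random):
    """Train one model per sampled set; `train` maps a dataset to a predict function."""
    return [train(bootstrap_sample(data, sample_size, rng)) for _ in range(n_models)]


def predict_regression(models, x):
    """Value prediction: arithmetic mean of the individual predictions."""
    return mean([m(x) for m in models])


def predict_classification(models, x):
    """Classification: mode (majority vote) of the individual predictions."""
    return Counter(m(x) for m in models).most_common(1)[0][0]


def accuracy(model, holdout):
    """Fraction of held-out (x, label) pairs that the model labels correctly."""
    return sum(model(x) == y for x, y in holdout) / len(holdout)


def predict_weighted(models, weights, x):
    """Weight-adjusted vote: each model's vote counts in proportion to its weight."""
    totals = Counter()
    for m, w in zip(models, weights):
        totals[m(x)] += w
    return max(totals, key=totals.get)


def train_1nn(sample):
    """Hypothetical base learner: 1-nearest-neighbour on (x, label) pairs."""
    def predict(x):
        return min(sample, key=lambda p: abs(p[0] - x))[1]
    return predict


# Illustrative data only; the values carry no meaning beyond the example.
rng = random.Random(0)
data = [(0.1, "a"), (0.2, "a"), (0.3, "a"), (0.7, "b"), (0.8, "b"), (0.9, "b")]
holdout = [(0.15, "a"), (0.85, "b")]

models = bagging_fit(data, train_1nn, n_models=5, rng=rng)
weights = [accuracy(m, holdout) for m in models]
print(predict_classification(models, 0.25))
print(predict_weighted(models, weights, 0.25))
```

Passing the learner in as a function keeps the sketch independent of any particular algorithm, which mirrors the point that bagging wraps around an existing procedure instead of replacing it.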
The advantage of bagging over training a single model on the original training data is that it tends to be less prone to overfitting (also called overlearning), especially where models are unstable, that is, where small changes in the input lead to large changes in the output, either because this is an inherent property of the model or because the training data is inaccurate.
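One standard way to see why averaging helps with unstable models is the variance of a mean of correlated predictions. The symbols here are assumptions introduced for illustration and do not come from the text above: $B$ is the number of generated models, $\hat f_b(x)$ is the prediction of model $b$ for input $x$, each prediction is assumed to have the same variance $\sigma^2$, and any two predictions are assumed to have the same pairwise correlation $\rho$.

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B}\hat f_b(x)\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

Under these assumptions the second term shrinks as $B$ grows, so the averaged prediction varies less than any single model's prediction whenever $\rho < 1$. The reduction is largest when the individual models have high variance (are unstable) and are only weakly correlated with one another. The identity applies directly to averaged value predictions; for majority-vote classification it is best read as an intuition only.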