Cross-validation

Supporting Technique

Cross-validation is a technique used to evaluate the performance of a machine learning model by partitioning the data into subsets and training/testing the model multiple times.

Cross-validation is commonly used in machine learning to assess the generalization ability of a model and to prevent overfitting by ensuring that the model performs well on unseen data.

Cross-validation works by dividing the dataset into k subsets or folds. The model is trained on k-1 folds and tested on the remaining fold. This process is repeated k times, with each fold being used as the test set once. The results are then aggregated to provide a more reliable estimate of the model’s performance.
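As a minimal sketch of this procedure (assuming scikit-learn and a synthetic dataset in place of real data), 5-fold cross-validation might look like this:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold, cross_val_score

    # Synthetic data stands in for a real dataset.
    X, y = make_classification(n_samples=200, n_features=10, random_state=0)

    # 5 folds: each iteration trains on 4 folds and tests on the held-out fold.
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

    # Aggregate the k per-fold scores into a single performance estimate.
    print(scores.mean(), scores.std())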

Stratified cross-validation is a variation of k-fold cross-validation that ensures each fold contains approximately the same percentage of samples of each target class as the complete dataset. This is particularly useful for imbalanced datasets, as it ensures that each fold is representative of the overall class distribution.
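A sketch of stratified splitting on an imbalanced dataset (the 90/10 class weights are chosen purely for illustration):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import StratifiedKFold

    # Imbalanced synthetic data: roughly 90% class 0, 10% class 1.
    X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, test_idx in skf.split(X, y):
        # Each test fold mirrors the ~10% minority share of the full dataset.
        print("minority share in fold:", np.mean(y[test_idx] == 1))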

Group k-fold cross-validation ensures that the same group is not represented in both the training and testing sets. The folds are not necessarily of equal size and do not share the same label distribution. This is useful when the data is obtained from different study subjects with several samples per subject, because it helps detect overfitting situations where the model learns subject-specific features that do not generalize to new contexts. For example, if you train a model on source code from 10 different projects, a project-wise cross-validation is advisable to estimate how the model will behave when applied to new, unseen projects.
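A sketch of such a project-wise split, where the hypothetical groups array assigns each sample to one of four toy projects:

    import numpy as np
    from sklearn.model_selection import GroupKFold

    # Toy data: 12 samples drawn from 4 projects (3 samples each).
    X = np.arange(24).reshape(12, 2)
    y = np.array([0, 1] * 6)
    groups = np.repeat(["proj_a", "proj_b", "proj_c", "proj_d"], 3)

    gkf = GroupKFold(n_splits=4)
    for train_idx, test_idx in gkf.split(X, y, groups=groups):
        # No project appears in both the training and the test indices.
        print("test projects:", set(groups[test_idx]))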

One challenge is aggregating the evaluation metrics of the iterations into a single performance measure (a small sketch follows this list):

  • Macro-average computes the metric independently for each class and then takes the average, treating all classes equally.
  • Micro-average pools the individual data points across all classes and runs and computes the metric once on the pooled results, which is useful when classes are imbalanced. For more detailed information, refer to, among other sources, the scikit-learn user guide.
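A sketch of the difference using scikit-learn's f1_score; the labels are illustrative, pooled as if collected across folds:

    from sklearn.metrics import f1_score

    # Predictions pooled across folds; class 2 is rare.
    y_true = [0, 0, 0, 0, 1, 1, 1, 2]
    y_pred = [0, 0, 0, 1, 1, 1, 0, 2]

    # Macro: F1 per class, then unweighted mean (rare classes count equally).
    print("macro:", f1_score(y_true, y_pred, average="macro"))
    # Micro: pool all decisions first, then compute one global F1.
    print("micro:", f1_score(y_true, y_pred, average="micro"))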

Cross-validation is an essential technique in machine learning to build robust and generalizable models by providing a reliable estimate of model performance on unseen data.

Alias
Related terms
Overfitting, Stratified folds, Macro-average, Micro-average