msg.Machine Learning Catalogue

Data leakage describes the use, during model selection or training, of information that will not be available at prediction time.

Data leakage often leads to an overestimation of the model's performance during evaluation, followed by poor results when the model is applied to unseen data in real-world scenarios.

A common cause of data leakage is the failure to keep training and test datasets separate. Test data must not be used when making any decision about the model, including model selection and hyperparameter tuning.

Even so, data leakage is easily overlooked, especially in pre-processing steps such as normalizing the data. Such transformations should therefore be derived from the training data only. Including test data when calculating the parameters of scalers or filters is enough to leak information about the test data to the model.
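The fit-on-training-data rule can be sketched as follows. This is a minimal illustration using only the Python standard library; the class name `TrainOnlyStandardizer` and the sample values are hypothetical and not part of the catalogue entry.

```python
from statistics import mean, pstdev


class TrainOnlyStandardizer:
    """Hypothetical helper that learns its statistics from training data only."""

    def fit(self, train_values):
        # The mean and (population) standard deviation are computed once,
        # from the training subset alone.
        self.mean_ = mean(train_values)
        self.std_ = pstdev(train_values)
        return self

    def transform(self, values):
        # Reuse the statistics learned in fit(); nothing is recomputed here.
        return [(v - self.mean_) / self.std_ for v in values]


train = [2.0, 4.0, 6.0, 8.0]
test = [10.0, 12.0]

scaler = TrainOnlyStandardizer().fit(train)  # statistics come from train only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)         # test data is transformed, never fitted
```

Separating `fit` from `transform` makes the leak hard to introduce by accident: the test subset only ever passes through `transform`, so it cannot influence the stored statistics.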

For example, if the pre-processing step normalizes the data by dividing by the mean, the mean should be calculated from the training subset only and then applied unchanged to the test subset, so that the test subset has no influence on it.
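A small sketch of this example, contrasting the correct and the leaky calculation. The sample values and variable names are made up for illustration.

```python
from statistics import mean

train = [1.0, 2.0, 3.0]
test = [10.0]

# Correct: the mean comes from the training subset only
# and is reused as-is for the test subset.
train_mean = mean(train)
train_norm = [x / train_mean for x in train]
test_norm = [x / train_mean for x in test]

# Leaky: the mean is computed over train and test together,
# so the test value changes how the training data is scaled.
leaky_mean = mean(train + test)
leaky_train_norm = [x / leaky_mean for x in train]
```

With these values the training-only mean is 2.0, while the leaky mean is 4.0. The single test value therefore halves every normalized training value, which is exactly the kind of influence the test subset should not have.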

A comprehensive user guide on how to avoid data leakage can be found here

Related terms: Normalization, Standardization

Data Leakage