Data Normalization is a technique used to scale numerical features to a common range without distorting differences in the ranges of values.
Normalization is commonly used in the preprocessing steps of machine learning pipelines, especially when features have different scales and units. It prevents features with larger numeric ranges from dominating the model and helps speed up the convergence of gradient-based algorithms.
Normalization typically scales the data to a range of [0, 1] or [-1, 1] using the formula x' = (x - min) / (max - min).
This transformation maintains the relationships between the data points while adjusting the scale.
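As a minimal sketch, the formula above can be applied directly with NumPy; the sample values are illustrative:

```python
import numpy as np

def min_max_normalize(x: np.ndarray) -> np.ndarray:
    """Scale values to [0, 1] using (x - min) / (max - min)."""
    x_min, x_max = x.min(), x.max()
    if x_max == x_min:
        # All values are identical: return zeros to avoid division by zero.
        return np.zeros_like(x, dtype=float)
    return (x - x_min) / (x_max - x_min)

values = np.array([10.0, 20.0, 30.0, 40.0])
print(min_max_normalize(values))  # [0.         0.33333333 0.66666667 1.        ]
```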
For example, if you have a dataset with features height in centimeters and weight in kilograms, normalization will scale both features to a common range, ensuring that height and weight contribute comparably to the model.
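The key detail is that each feature is scaled independently, column by column. A small sketch with made-up height and weight values (not from any real dataset):

```python
import numpy as np

# Illustrative height (cm) and weight (kg) values.
data = np.array([
    [150.0, 50.0],
    [160.0, 60.0],
    [170.0, 80.0],
    [180.0, 90.0],
])

# Normalize each column (feature) independently to [0, 1].
mins = data.min(axis=0)
maxs = data.max(axis=0)
normalized = (data - mins) / (maxs - mins)
print(normalized)
```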
It is important to fit the normalization or standardization parameters (the minimum, maximum, mean, and standard deviation) on the training data only, and then apply that same fitted transformation to the test data. Computing these statistics on the test set would leak information from the test set into the training process.
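A sketch of this fit-on-train, transform-only-on-test pattern, assuming scikit-learn is available (the values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[150.0], [160.0], [170.0], [180.0]])
X_test = np.array([[155.0], [175.0]])

scaler = MinMaxScaler()
scaler.fit(X_train)                       # learn min/max from training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse the training min/max; no refit

print(X_test_scaled)  # values scaled relative to the training range
```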
Standardization is a technique closely related to normalization: both adjust the scale of features, but they differ in approach. Normalization scales the data to a fixed range, typically [0, 1] or [-1, 1], while standardization transforms the data to have a mean of 0 and a standard deviation of 1. Standardization is useful when the data follows a Gaussian distribution and is often used with algorithms that assume normally distributed data. Normalization is preferred when the data does not follow a Gaussian distribution or when the algorithm makes no assumptions about the distribution of the data.
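As a minimal sketch of the difference, z-score standardization can be written as follows (the sample values are illustrative):

```python
import numpy as np

def standardize(x: np.ndarray) -> np.ndarray:
    """Transform values to mean 0 and standard deviation 1: (x - mean) / std."""
    return (x - x.mean()) / x.std()

values = np.array([10.0, 20.0, 30.0, 40.0])
z = standardize(values)
print(z.mean(), z.std())  # approximately 0.0 and 1.0
```

Unlike min-max normalization, the output is not bounded to a fixed range; it is centered on zero with unit spread.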
- Alias
  - Feature Scaling
  - Min-Max Scaling
- Related terms
  - Standardization
  - Data Leakage