Why is it required?
- Many machine learning algorithms, such as linear regression, fit a model by optimizing the parameters to minimize a cost function, typically with a method called Gradient Descent. This algorithm finds the optimal parameter values by repeatedly stepping in the direction of steepest descent of the cost function (the negative of the gradient points in that direction). The step size also depends on the parameter values themselves, so if the parameters are not on the same scale, each parameter takes a different effective step size and the algorithm converges very slowly.
- Algorithms like KNN use Euclidean distance. If the features are not on the same scale, one feature simply outweighs the others and the algorithm falters; see the sketch after this list.
- Many statistical tests require the data to be normally distributed, which is rarely the case in practice.
- Some variables are not in the format we need for a certain question, e.g. car manufacturers report fuel consumption in miles/gallon, but for comparing car models we are more interested in the reciprocal, gallons/mile. Such a transformation increases interpretability.
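To make the KNN point concrete, here is a minimal base-R sketch (with made-up salary/age values) showing how a feature on a larger scale dominates the Euclidean distance until both features are rescaled:

```r
# Two hypothetical people described by salary (large scale) and age (small scale).
a <- c(salary = 52000, age = 25)
b <- c(salary = 51000, age = 60)

# The salary difference (1000) dwarfs the age difference (35).
sqrt(sum((a - b)^2))  # ~1000.6, driven almost entirely by salary

# After min-max scaling both features (assumed ranges), age matters again.
rescale <- function(x, lo, hi) (x - lo) / (hi - lo)
a_s <- c(rescale(52000, 30000, 90000), rescale(25, 18, 70))
b_s <- c(rescale(51000, 30000, 90000), rescale(60, 18, 70))
sqrt(sum((a_s - b_s)^2))  # ~0.67, now dominated by the real difference: age
```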
Techniques of Transformation
1. Standardization: This is a technique in which the values are scaled so that their mean is 0 and their variance is 1.
Scaling is done using the following formula: z = (x − μ) / σ, where μ is the mean and σ is the standard deviation of the feature.
Standardization should be used when we know that our variable/feature follows a normal distribution. The process converts the feature into a standard normal variate; note that the transformed values are not bounded to any fixed range.
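As a minimal sketch, standardization can be done in base R either by hand or with scale(); the feature x below is an assumed toy example:

```r
x <- c(10, 12, 15, 18, 25)  # assumed toy feature

z <- (x - mean(x)) / sd(x)  # z = (x - mu) / sigma
mean(z)                     # ~0 (up to floating-point error)
sd(z)                       # 1

as.numeric(scale(x))        # scale() performs the same centering and scaling
```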
2. Normalization: This technique rescales the feature to the range 0 to 1. It uses the maximum and the minimum value of the feature to perform the transformation.
The transformation is performed as follows: x' = (x − min(x)) / (max(x) − min(x)).
This process is sensitive to outliers: a single extreme value sets one end of the [0, 1] range and compresses all remaining values, as the sketch below shows.
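A minimal sketch of min-max normalization in base R, again on an assumed toy feature; the last line shows how a single outlier squeezes all other scaled values:

```r
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

x <- c(10, 12, 15, 18, 25)   # assumed toy feature
normalize(x)                 # all values now lie in [0, 1]

# One extreme value sets the new maximum and squeezes the rest near 0.
normalize(c(x, 1000))
```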
3. Transformation: This is the application of the same mathematical function to every data point individually, e.g. taking the logarithm of each value.
How to transform data?
To get insights, data is most often transformed to follow a normal distribution more closely, either to meet statistical assumptions or to detect linear relationships with other variables. One of the first steps in these techniques is to check how closely the variables already follow a normal distribution.
How to check if your data follows a normal distribution?
It is common to inspect your data visually and/or check the assumption of normality with a statistical test.
To visually explore the distribution of your data, look at a density plot as well as a simple QQ-plot. The QQ-plot is an excellent tool for inspecting various properties of your data distribution and for assessing if and how you need to transform your data. Here the quantiles of a perfect normal distribution are plotted against the quantiles of your data. A quantile marks the data value below which a given percentage of the data falls.
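The following base-R sketch runs both visual checks plus a Shapiro-Wilk test on a simulated right-skewed sample (a stand-in for a real feature):

```r
set.seed(42)
x <- rexp(200, rate = 1)   # simulated right-skewed data

plot(density(x), main = "Density plot")  # visual shape of the distribution

qqnorm(x)   # quantiles of x against those of a perfect normal distribution
qqline(x)   # reference line; skewed data bends away from it

shapiro.test(x)  # formal normality test: a small p-value suggests non-normality
```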
Which transformation to pick?
Right (positive) skewed data:
- Root ⁿ√x : The weakest of the transformations. For negative numbers, special care needs to be taken with the sign when transforming.
- Logarithm log(x) : A popular and very commonly used transformation. It cannot be used on negative numbers or 0; in that case, shift the entire data by adding at least |min(x)| + 1 before applying it.
- Reciprocal 1/x : The strongest of the transformations. It should not be applied to negative numbers or numbers close to zero, so the data should be shifted in the same way as for the log transform (see the sketch below).
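A minimal sketch of the three right-skew transformations on simulated positive data, from weakest to strongest; the +1 shift stands in for the |min(x)| + 1 shift described above:

```r
set.seed(42)
x <- rexp(200, rate = 1)   # simulated right-skewed, non-negative data

x_root  <- sqrt(x)         # root: weakest
x_log   <- log(x + 1)      # log: shift by +1 to avoid log(0)
x_recip <- 1 / (x + 1)     # reciprocal: strongest; the shift also avoids 1/0

# With negative values, shift by abs(min(x)) + 1 first, e.g.:
# x_log <- log(x + abs(min(x)) + 1)
```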
Left (negative) skewed data
- Reflect the data and use the appropriate transformation for right skew: reflect every data point by subtracting it from the maximum value, and add 1 so that no data point becomes 0.
- Square x². Stronger with higher powers. Cannot be used with negative values.
- Exponential eˣ. The strongest transformation; it can be used with negative values and is stronger with a higher base. The sketch below applies these options to a left-skewed sample.
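A minimal sketch for left-skewed data: transform directly with a power, or reflect first and reuse a right-skew transformation (sample values are simulated):

```r
set.seed(42)
y <- 10 - rexp(200, rate = 1)  # simulated left-skewed sample

y_sq  <- y^2     # square: requires non-negative values
y_exp <- exp(y)  # exponential: also works with negative values

# Reflect to right skew (max(y) + 1 - y avoids zeros), then e.g. log-transform.
y_log <- log(max(y) + 1 - y)
```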
Automatic Transformations
There are various implementations of automatic transformations in R that choose the optimal transformation for you. They determine a lambda value, the power coefficient that transforms your data as close to a normal distribution as possible.
- Tukey’s Ladder of Powers. For skewed data, transformTukey() from the R package rcompanion iteratively applies Shapiro-Wilk tests to find the lambda value at which the data is closest to normality, and transforms it. Left-skewed data should first be reflected to right skew, and there should be no negative values.
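A minimal sketch of this workflow, assuming rcompanion is installed; transformTukey() reports the chosen lambda and returns the transformed data:

```r
library(rcompanion)

set.seed(42)
x <- rexp(200, rate = 1)  # right-skewed, no negative values

x_tukey <- transformTukey(x, plotit = FALSE)
```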
- Box-Cox Transformation. BoxCox.lambda() from the R package forecast iteratively finds the lambda value that maximizes the log-likelihood of a linear model.
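A minimal sketch of the Box-Cox workflow, assuming the forecast package is installed: estimate lambda first, then apply the transformation with BoxCox():

```r
library(forecast)

set.seed(42)
x <- rexp(200, rate = 1) + 1  # Box-Cox requires strictly positive data

lambda <- BoxCox.lambda(x, method = "loglik")  # lambda maximizing log-likelihood
x_bc   <- BoxCox(x, lambda)                    # transformed data
```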