Advanced Regression - Ridge, Lasso, Gradient Descent

Constrained Minimization

Constrained minimisation is a process in which we try to optimise an objective function with respect to some variables in the presence of constraints on those variables or on the function itself. In our case, the objective function is a cost function.

  • Ridge/Lasso Regression and SVM use Constrained Minimization
  • The solution to the overall minimisation problem is the point where the error is minimum while satisfying the additional constraint.

Ridge and Lasso Regression

  • Adding constraints over the weights is called regularization and ridge and lasso are two common techniques used for this purpose.

Signature of overfitting in Polynomial Regression

  • The magnitudes of the coefficients (weights) are very large
  • In regularization, we combat overfitting by controlling the model's complexity, i.e. by introducing an additional term in our cost function in order to penalize large weights. This biases our model towards being simpler, where "simpler" means weights of smaller magnitude (or even zero). We want to make the weights smaller because complex, overfit models are characterized by large weights.
  • Adding constraints over the weights / coefficients is called regularization. It helps in resolving the overfitting issue because large weights are one of the signs of overfitting; putting a constraint over the weights ensures that large weights are not assigned and hence the model does not overfit.

Ridge

  • Constraint over the sum of squared weights ($\sum w_i^2 < r^2$)
  • represents a region bounded by a circle in 2D space
  • Ordinary least squares with L2 regularisation is known as Ridge Regression.
  • In L2 regularisation, large weights are penalized much more heavily, since the penalty grows with the square of each weight
  • Ex: $\displaystyle \sum_{i=1}^{9}w_i^2 \leq 1000$

Lasso

  • Constraint over the sum of absolute weights ($\sum |w_i| < r^k$)
  • represents a region bounded by a square (diamond) in 2D space
  • Gives a sparse solution
  • In L1 regularization the model's parameters become sparse during optimization, i.e. it promotes a larger number of parameters w to be exactly zero. This is because the L1 penalty does not diminish for small weights: a small weight is penalized at the same rate as a large one, so optimization keeps pushing small weights all the way to zero.
  • This sparsity property is often quite useful. For example, it might help us identify which features are more important for making predictions, or it might help us reduce the size of a model (the zero values don't need to be stored).
  • Ex: $\displaystyle \sum_{i=1}^{9}|w_i| \leq 100$

Unconstrained Minimization

Unconstrained minimisation, on the other hand, is a process in which we try to optimise the objective function with respect to some variables without any constraints on those variables.

  • Linear Regression and Logistic Regression use Unconstrained Minimization
  • There is no explicit constraint on the parameters; they are determined solely by the requirement that the cost function be minimized.
  • Gradient Descent is one way of solving an unconstrained minimisation problem

$J(\theta) = 1.2(\theta - 2)^2 + 3.2$

Closed form

$J'(\theta) = 0 \implies 2.4(\theta - 2) = 0 \implies \theta_{opt} = 2$

Gradient Descent

1D Gradient Descent

$\displaystyle \theta_{new} = \theta_{old} - \eta\frac{\partial J}{\partial \theta}\Big|_{\theta = \theta_{old}}$

$\displaystyle \theta_{new} = \theta_{old} - \eta(2.4)(\theta_{old}-2)$

Let's assume the following starting conditions:

  • $\theta = 1$
  • $\eta = 0.1$

Substituting,

  • $\theta_{new} = 1 - 0.1(2.4)(-1) = 1.24$
  • $\theta_{new} = 1.24 - 0.1(2.4)(-0.76) \approx 1.42$

In gradient descent, you start with some initial value and then gradually approach the optimal solution.

Steps of Gradient Descent

  1. We try to find the optimal minimum by using the Gradient Descent algorithm
  2. Choose the value of eta (the learning rate) and an initial theta (the parameter whose optimal value is to be found)
  3. Use the formula

    • $\displaystyle \theta_{new} = \theta_{old} - \text{learning rate} \times (\text{partial derivative of the cost function at } \theta_{old})$
  4. Keep substituting the new value for the old one until the value stops changing appreciably over several iterations

In other words,

  1. Choose a starting point $X_0$
  2. Beginning at $X_0$, generate a sequence of iterates $\{X_k\}_{k=0}^{\infty}$ that progressively decreases the cost function $f$, stopping when a solution with sufficient accuracy is found or when no further progress can be made.
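A minimal Python sketch of the 1D update above, using the same cost $J(\theta) = 1.2(\theta - 2)^2 + 3.2$, $\eta = 0.1$ and starting point $\theta = 1$; the function names and stopping tolerance are our own illustrative choices.

```python
def grad_J(theta):
    """Derivative of J(theta) = 1.2 * (theta - 2)**2 + 3.2."""
    return 2.4 * (theta - 2)

def gradient_descent_1d(theta=1.0, eta=0.1, tol=1e-6, max_iters=1000):
    """Repeat theta_new = theta_old - eta * J'(theta_old) until the updates stall."""
    for _ in range(max_iters):
        theta_new = theta - eta * grad_J(theta)
        if abs(theta_new - theta) < tol:  # value no longer changing appreciably
            return theta_new
        theta = theta_new
    return theta

print(gradient_descent_1d())  # approaches 2.0, matching the closed-form optimum
# The first two iterates are 1.24 and ~1.42, as computed above.
```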

2D Gradient Descent

$\displaystyle \nabla J(\theta_{1}, \theta_2) = 0$

$\displaystyle \nabla J = \begin{bmatrix} \frac{\partial J}{\partial \theta_1} \\ \frac{\partial J}{\partial \theta_2} \end{bmatrix} = 0 \implies \begin{bmatrix} \theta_1^{new} \\ \theta_2^{new} \end{bmatrix} = \begin{bmatrix} \theta_1^{old} \\ \theta_2^{old} \end{bmatrix} - \eta \begin{bmatrix} \frac{\partial J}{\partial \theta_1} \\ \frac{\partial J}{\partial \theta_2} \end{bmatrix}_{\theta=\theta^{old}}$

For the case of Linear Regression

$\displaystyle J(m,c) = \sum_{i=1}^{n}(y_i-(mx_i+c))^2$

$\displaystyle \begin{bmatrix} m \\ c \end{bmatrix}^{new} = \begin{bmatrix} m \\ c \end{bmatrix}^{old} - \eta \begin{bmatrix} \frac{\partial J}{\partial m} \\ \frac{\partial J}{\partial c} \end{bmatrix}_{\begin{bmatrix}m\\c\end{bmatrix}^{old}}$

$\displaystyle \frac{\partial J}{\partial m} = 2\sum_{i=1}^{n}(y_i-(mx_i+c))(-x_i)$

$\displaystyle \frac{\partial J}{\partial c} = 2\sum_{i=1}^{n}(y_i-(mx_i+c))(-1)$
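A minimal NumPy sketch of this update on made-up data generated from y = 3x + 5. The gradients follow the formulas above; dividing the step by n (i.e. effectively minimising the mean rather than the sum of squared errors) is our own addition to keep the learning rate stable.

```python
import numpy as np

def gd_linear_regression(x, y, eta=0.01, n_iters=5000):
    """Gradient descent on J(m, c) = sum((y - (m*x + c))**2)."""
    m, c = 0.0, 0.0                          # arbitrary starting point
    n = len(x)
    for _ in range(n_iters):
        residual = y - (m * x + c)
        dJ_dm = 2 * np.sum(residual * (-x))  # dJ/dm from the formula above
        dJ_dc = 2 * np.sum(residual * (-1))  # dJ/dc from the formula above
        m -= eta * dJ_dm / n                 # /n keeps the step size stable
        c -= eta * dJ_dc / n
    return m, c

# Toy data drawn from y = 3x + 5 plus noise (illustrative only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3 * x + 5 + rng.normal(0, 1, 100)
print(gd_linear_regression(x, y))  # approximately (3, 5)
```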

Metrics

  • Metrics give an overall sense of the error of the model
  • The smaller the RSS, the closer the model fit

RSS

$\displaystyle y_i = \beta_0 + \beta_1x_{i} + \epsilon_i$

$\displaystyle \hat y_i = \beta_0 + \beta_1x_{i}$

$\displaystyle RSS = \sum_{i=1}^{N}\epsilon_i^2 = \sum_{i=1}^{N}(y_i - \hat y_i)^2 = \sum_{i=1}^{N}(y_i - \beta_0 - \beta_1x_i)^2$

Setting the partial derivatives of RSS to zero gives the closed-form estimates:

$\displaystyle \frac{\partial (RSS)}{\partial \beta_0} = 0 \implies \beta_0 = \bar y - \beta_1 \bar x$

$\displaystyle \frac{\partial (RSS)}{\partial \beta_1} = 0 \implies \beta_1 = \frac{\sum_{i=1}^{N}(x_i-\bar x)(y_i - \bar y)}{\sum_{i=1}^{N}(x_i-\bar x)^2}$
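As a quick check of these closed-form estimates, a NumPy sketch on made-up data (the numbers are illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# beta_1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
beta_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# beta_0 = ybar - beta_1 * xbar
beta_0 = y.mean() - beta_1 * x.mean()
print(beta_0, beta_1)  # intercept and slope of the least-squares line
```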

Mean Square Error

$\displaystyle MSE = \frac{RSS}{n}$

Root Mean Square Error

$\displaystyle RMSE = \sqrt{MSE}$
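All three metrics can be computed directly from the residuals; a minimal sketch on hypothetical predictions:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return RSS, MSE = RSS / n and RMSE = sqrt(MSE)."""
    residuals = y_true - y_pred
    rss = np.sum(residuals ** 2)
    mse = rss / len(y_true)
    rmse = np.sqrt(mse)
    return rss, mse, rmse

y_true = np.array([3.0, 5.0, 7.5, 9.0])
y_pred = np.array([2.8, 5.3, 7.1, 9.4])
print(regression_metrics(y_true, y_pred))
```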

SLR

for i = 1 to n, $\displaystyle y_{i} = \beta_0 + \beta_1x_{i} + \epsilon_i$

for n observations, equations can be written as:

  • $y_{1} = \beta_0 + \beta_1x_{1}$
  • $y_{2} = \beta_0 + \beta_1x_{2}$
  • $\ldots$
  • $y_{n} = \beta_0 + \beta_1x_{n}$

This can be written more efficiently in matrix notation:

$\displaystyle \begin{bmatrix}y_1\\y_2\\\vdots\\y_{n}\end{bmatrix} = \begin{bmatrix}1&x_1\\1&x_2\\\vdots&\vdots\\1&x_{n}\end{bmatrix} \begin{bmatrix}\beta_0\\\beta_1\end{bmatrix} + \begin{bmatrix}\epsilon_1\\\epsilon_2\\\vdots\\\epsilon_n\end{bmatrix}$

In more concise form:

$\displaystyle Y = X\beta + \epsilon$

here,

  • Y: Response Vector
  • X: Design matrix
  • $\beta$: Coefficient Vector
  • $\epsilon$: Error Vector

Benefits of Using Matrices

  • Formulae become simpler, and more compact and readable.
  • Code using matrices runs much faster than explicit ‘for’ loops.
  • Python libraries, such as NumPy, help us build n-dimensional arrays, which occupy less memory than Python lists and computation is also faster.

SLR Equation in Matrix Form

$\displaystyle \widehat{\beta}=(X^{T}X)^{-1}X^{T}Y$
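A minimal NumPy sketch of this normal-equation solution on a hypothetical single-predictor data set (in practice `np.linalg.lstsq` is the numerically safer route):

```python
import numpy as np

# Hypothetical single-predictor data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

# Design matrix: a column of ones (intercept) followed by the predictor
X = np.column_stack([np.ones_like(x), x])

# beta_hat = (X^T X)^(-1) X^T Y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta_hat)  # [intercept, slope]
```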

MLR

for i = 1 to n,

  • $\displaystyle y_{i} = \beta_0 + \beta_1x_{i,1} + \beta_2x_{i,2} + \ldots + \beta_kx_{i,k} + \epsilon_i$,
  • where k is the number of variables in the model

Matrix Notation:

$\displaystyle \begin{bmatrix}y_1\\y_2\\\vdots\\y_{n}\end{bmatrix} = \begin{bmatrix}1&x_{1,1}&x_{1,2}&\ldots&x_{1,k}\\1&x_{2,1}&x_{2,2}&\ldots&x_{2,k}\\\vdots&\vdots&\vdots&&\vdots\\1&x_{n,1}&x_{n,2}&\ldots&x_{n,k}\end{bmatrix} \begin{bmatrix}\beta_0\\\beta_1\\\beta_2\\\vdots\\\beta_k\end{bmatrix} + \begin{bmatrix}\epsilon_1\\\epsilon_2\\\vdots\\\epsilon_n\end{bmatrix}$

Dimensions: $\displaystyle Y_{(n \times 1)} = X_{(n \times (k+1))}\,\beta_{((k+1) \times 1)} + \epsilon_{(n \times 1)}$

It can still be written as:

$\displaystyle Y = X\beta + \epsilon$

here,

  • Y: Response Vector
  • X: Design matrix
  • $\beta$: Coefficient Vector
  • $\epsilon$: Error Vector

Residual: $\displaystyle \epsilon = Y - X\beta$

Questions

How will you identify the presence of heteroscedasticity in the residuals?

  • Plot the residuals vs the predicted values and see if there is a consistent change in the spread of the residuals as we move from the left of the x-axis to the right.

How would you check for the assumptions of Linear Regression?

  • Scatter plot of residuals vs y_pred (checks linearity and constant variance)
  • Histogram of the residuals (checks normality of the error terms)

Identifying Non-Linearity in Data

  • For SLR

    • draw a scatter plot of the predictor against the response and check for non-linear patterns
  • For MLR

    • check the residuals vs predictions plot for non-linearity
    • residuals should be scattered randomly around 0
    • the spread of the residuals should be constant
    • there should be no outliers in the data
    • If non-linearity is present, we may need to plot each predictor against the residuals to identify which predictor is non-linear.

Handling Non-Linear Data

  • Polynomial regression
  • Data Transformation
  • Non-Linear Regression
  • Polynomial regression and data transformation allow us to use the linear regression framework to estimate model coefficients

Polynomial Regression

$\displaystyle \hat y = \beta_0 + \beta_1x_i + \beta_2x_{i}^{2}$

Substitute the variable $x_{i}$ as $x_1$ and $x_{i}^{2}$ as $x_2$:

$\displaystyle \hat y = \beta_0 + \beta_1x_1 + \beta_2x_{2}$

This way, we can express a non-linear relationship within the linear regression framework.


The kth-order polynomial model in one variable is given by:

$y = \beta_0 + \beta_1x + \beta_2x^{2} + \beta_3x^3 + \ldots + \beta_kx^k + \epsilon$

If $x_j = x^j$ for $j = 1, 2, \ldots, k$, then the model is a multiple linear regression model with k predictor variables, $x_1, x_2, \ldots, x_k$. Thus, polynomial regression can be considered an extension of multiple linear regression and, hence, we can use the same technique used in multiple linear regression to estimate the model coefficients for polynomial regression.
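A brief scikit-learn sketch of this idea on a made-up quadratic data set (the coefficients 1.0, 2.0, 0.5 are illustrative assumptions): PolynomialFeatures builds the columns $x, x^2$, and an ordinary linear regression is then fitted on them.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Toy data with a quadratic relationship (illustrative only)
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, 200).reshape(-1, 1)
y = 1.0 + 2.0 * x[:, 0] + 0.5 * x[:, 0] ** 2 + rng.normal(0, 0.2, 200)

# Expand x into [x, x^2] and fit ordinary linear regression on the new columns
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(x)           # columns: x, x^2
model = LinearRegression().fit(X_poly, y)

print(model.intercept_, model.coef_)     # roughly 1.0 and [2.0, 0.5]
```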

Questions

On inspection of the relationship between one predictor variable (x) and the response variable (y), you identify that the two have a cubic relationship. In the final model, which predictors will you include?

  • $x, x^{2}, x^{3}$
  • Since this is a cubic fit, we need to include the third degree of the predictor. In polynomial regression, we need to include the lower-degree terms in the model as well. Hence, we include all three predictors as mentioned in the answer.
  • Model Equation: $y_{i} = \beta_0 + \beta_1x_{i} + \beta_2x_{i}^{2} + \beta_3x_{i}^{3} + \epsilon_i$

Data Transformation

  • Both the response and the predictors can be transformed
  • One can take a log transform of the data if there is a sharp upward trend; the transform compresses the large values and can make the relationship approximately linear
  • $\displaystyle \hat y_{i} = \beta_0 + \beta_1x_{i}$
  • After the $\log$ transform: $\displaystyle \hat y_{i} = \beta_0 + \beta_1\log(x_{i})$ (see the sketch after this list)
  • There are other transformations one can perform. Refer to this article for more information.
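A minimal sketch of the log transform of a predictor, on made-up data where y grows with log(x); the generating coefficients (3.0 and 2.0) are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data where y grows with log(x) (illustrative only)
rng = np.random.default_rng(2)
x = rng.uniform(1, 100, 300)
y = 3.0 + 2.0 * np.log(x) + rng.normal(0, 0.3, 300)

# Transform the predictor, then fit an ordinary linear model on log(x)
X_log = np.log(x).reshape(-1, 1)
model = LinearRegression().fit(X_log, y)
print(model.intercept_, model.coef_)     # approximately 3.0 and [2.0]
```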

[Figure: common transformation equations ("dance moves")]

When to do transformation

  • If there is a non-linear trend in the data, the first thing to do is transform the predictor values.
  • When the problem is non-normality of the error terms and/or unequal variances, consider transforming the response variable; this can also help with non-linearity.
  • When the regression function is not linear and the error terms are not normal and have unequal variances, then transform both the response and the predictor.
  • In short, generally:

    • Transforming the y values helps in handling issues with the error terms and may help with the non-linearity
    • Transforming the x values primarily corrects the non-linearity

Questions

What is the equation for exponential transformation?

  • $\displaystyle y_{i} = \beta_0 + \beta_1e^{x_{i}} + \epsilon_i$

Give Examples where Linear Regression (even with transformation) cannot be applied

  • $\displaystyle y = \beta_1x_1 + \beta_2e^{x_2+x_3}+\beta_3\sin(\beta_4x_4)$
  • $\displaystyle y_{i} = \frac{\beta_1}{1+e^{\beta_2+\beta_3x_{i}}} + \epsilon_i$

After transforming the data in case of nonlinear relationship between the predictor and response variable, how do we assess whether the data transformation was appropriate?

  • We assess this by checking the residual plots for any violation in assumptions.

Pit Falls of Linear Regression

Non-constant variance

Constant variance of the error terms is one of the assumptions of linear regression. Unfortunately, many times, we observe non-constant error variance. As discussed earlier, as we move from left to right on the residual plot, the variance of the error terms may show a steady increase or decrease. This is termed heteroscedasticity.

When faced with this problem, one possible solution is to transform the response Y using a function such as log or the square root of the response value. Such a transformation results in a greater amount of shrinkage of the larger responses, leading to a reduction in heteroscedasticity.

Autocorrelation

This happens when data is collected over time and the model fails to capture the time trends. Due to this, errors in the model are positively correlated over time, such that each error point is more similar to the previous error. This is known as autocorrelation, and it can sometimes be detected by plotting the model residuals versus time. Such correlations frequently occur in the context of time series data, which consists of observations for which measurements are obtained at discrete points in time.

In order to determine whether this is the case for a given data set, we can plot the residuals from our model as a function of time. If the errors are uncorrelated, then there should be no observable pattern. However, on the other hand, if the consecutive values appear to follow each other closely, then we may want to try an autoregression model.

Multicollinearity

If two or more of the predictors are linearly related to each other when building a model, then these variables are considered multicollinear. A simple method to detect collinearity is to look at the correlation matrix of the predictors. In this correlation matrix, if we have a high absolute value for any two variables, then they can be considered highly correlated. A better method to detect multicollinearity is to calculate the variance inflation factor (VIF).

When faced with the problem of collinearity, we can try a few different approaches. One is to drop one of the problematic variables from the regression model. The other is to combine the collinear variables together into a single predictor. Regularization helps here as well.
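As an illustrative sketch (assuming statsmodels is installed; the data and the 5-10 rule-of-thumb comment are our own assumptions), VIF can be computed per predictor as follows:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Toy predictors where x3 is almost a linear combination of x1 and x2
rng = np.random.default_rng(3)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = 0.7 * x1 + 0.3 * x2 + rng.normal(scale=0.05, size=200)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# VIF is computed for each column of the design matrix (intercept added)
exog = sm.add_constant(X)
for i, name in enumerate(exog.columns):
    print(name, variance_inflation_factor(exog.values, i))
# A VIF well above the usual 5-10 rule of thumb flags multicollinearity
```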

Overfitting

When a model is too complex, it may lead to overfitting. It means the model may produce good training results but would fail to perform well on the test data. One possible solution for overfitting is to increase the amount and diversity of the training data. Another solution is regularization.

Extrapolation

Extrapolation occurs when we use a linear regression model to make predictions for predictor values that are not present in the range of data used to build the model. For instance, suppose we have built a model to predict the weight of a child given its height, which ranges from 3 to 5 feet. If we now make predictions for a child with height greater than 5 feet or less than 3 feet, then we may get incorrect predictions. The predictions are valid only within the range of values that are used for building the model. Hence, we should not extrapolate beyond the scope of the model.


Overfitting

When a model performs really well on the data that is used to train it, but does not perform well with unseen data, we know we have a problem: overfitting. Such a model will perform very well with training data and, hence, will have very low bias; but since it does not perform well with unseen data, it will show high variance.

  • When the model is too complex, the bias is low while the variance is high. It has essentially memorized the whole dataset instead of learning a general pattern from it.
  • When the model is too simple, the bias is high while the variance is low. It means that the model has not learned the underlying pattern at all. This is called underfitting. Here both the training and testing errors are going to be high.
  • Refer to this for more details

Regularization

Regularization helps with managing model complexity by essentially shrinking the model coefficient estimates towards 0. This discourages the model from becoming too complex, thus avoiding the risk of overfitting.

For Linear Regression, Model Complexity depends on

  • Magnitude of the coefficients
  • Number of coefficients

In OLS,

  • minimizing the cost function alone leads to low bias or, in other words, overfitting.
  • The coefficients obtained can be highly unstable when

    • only a few predictors are significantly related to the target
    • multicollinearity is present

To solve this,

  • Add a penalty to the cost term, i.e. Cost function = RSS + Penalty
  • Error Function: $\displaystyle \sum_{i=1}^{N}(w^Tx_{i}-y_{i})^2+\lambda R(w)$
  • There are two common techniques which employ this method.

    • Ridge: $R(w) = \sum_{j}w_{j}^2$

    • Lasso: $R(w) = \sum_{j}|w_{j}|$

When we regularize, we accept a little additional bias in exchange for a significant reduction in variance.

One sign of overfitting is extreme values of model coefficients. Hence, regularization helps as it shrinks the coefficients towards 0.
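A small NumPy sketch of the regularised cost "RSS + penalty" for both choices of R(w); the data, weights and lambda here are made up for illustration.

```python
import numpy as np

def regularized_cost(w, X, y, lam, penalty="ridge"):
    """Cost = RSS + lambda * R(w); R(w) = sum(w^2) for ridge, sum(|w|) for lasso."""
    rss = np.sum((X @ w - y) ** 2)
    if penalty == "ridge":
        reg = np.sum(w ** 2)
    else:  # lasso
        reg = np.sum(np.abs(w))
    return rss + lam * reg

# Tiny illustrative example
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0]])
y = np.array([5.2, 3.9, 11.3])
w = np.array([1.0, 2.0])
print(regularized_cost(w, X, y, lam=0.5, penalty="ridge"),
      regularized_cost(w, X, y, lam=0.5, penalty="lasso"))
```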

Tradeoff between Error and Regularization

  • Suppose $\lambda_1 > \lambda_2$, and with $\lambda_1$ the optimal $\theta$ is $\theta_1$ while with $\lambda_2$ the optimal $\theta$ is $\theta_2$
  • It follows that $E(\theta_1) + \lambda_1R(\theta_1) \le E(\theta_2) + \lambda_1R(\theta_2)$ (1)
  • and $E(\theta_2) + \lambda_2R(\theta_2) \le E(\theta_1) + \lambda_2R(\theta_1)$ (2)
  • If we combine and rearrange (1) and (2):
  • $\lambda_2(R(\theta_2)-R(\theta_1)) \le E(\theta_1) - E(\theta_2) \le \lambda_1(R(\theta_2)-R(\theta_1))$
  • Writing $\delta R = R(\theta_2) - R(\theta_1)$, the outer inequality gives $\lambda_2\delta R \le \lambda_1\delta R$; since $\lambda_1 > \lambda_2 \ge 0$, this forces $\delta R \ge 0$, and then $E(\theta_1) - E(\theta_2) \ge \lambda_2\delta R \ge 0$
  • Hence, $E(\theta_1)\ge E(\theta_2) \;\&\; R(\theta_1)\le R(\theta_2)$
  • When we increase the value of λ, the error term increases and the regularisation term decreases; the opposite happens when we decrease the value of λ.

Ridge Regression

In OLS, we get the best coefficients by minimising the residual sum of squares (RSS). Similarly, with Ridge regression also, we estimate the model coefficients, but by minimising a different cost function. This cost function adds a penalty term to the RSS.

$\displaystyle \text{OLS Cost Function} = \sum_{i=1}^{N}(y_{i} - \hat y_{i})^2$

$\displaystyle \text{Ridge Cost Function} = \sum_{i=1}^{N}(y_{i} - \hat y_{i})^2 + \lambda\sum_{j=1}^{P}\beta_j^2$

  • Here, $\lambda\sum_{j=1}^{P}\beta_j^2$ is the penalty term
  • It has the effect of shrinking the $\beta$s towards zero to minimize the cost

One point to keep in mind is that we need to standardise the data whenever working with Ridge regression. Regularization puts a constraint on the magnitude of the model coefficients, and the penalty term depends on the magnitude of each coefficient, so variables measured on different scales would be penalized unevenly. This makes it necessary to centre or standardise the variables.
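A minimal scikit-learn sketch of this, assuming made-up data with predictors on very different scales; the pipeline standardises before fitting Ridge so the penalty treats all coefficients comparably.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Toy data with predictors on very different scales (illustrative only)
rng = np.random.default_rng(4)
X = np.column_stack([rng.normal(0, 1, 300), rng.normal(0, 100, 300)])
y = 2.0 * X[:, 0] + 0.05 * X[:, 1] + rng.normal(0, 0.5, 300)

# Standardise first, then apply the Ridge penalty
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)
print(model.named_steps["ridge"].coef_)  # coefficients on the standardised scale
```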

Role of Lambda

If lambda is 0, then the cost function does not contain the penalty term and there will be no shrinkage of the model coefficients; they would be the same as those from OLS. However, as lambda moves towards higher values, the shrinkage penalty increases, pushing the coefficients further towards 0, which may lead to model underfitting. Choosing an appropriate lambda becomes crucial: if it is too small, then we would not be able to solve the problem of overfitting, and with too large a lambda, we may actually end up underfitting.

Another point to note is that in OLS, we will get only one set of model coefficients when the RSS is minimised. However, in Ridge regression, for each value of lambda, we will get a different set of model coefficients.

Higher $\lambda \implies$ more regularization

  • if $\lambda$ is too high, it will lead to underfitting
  • if $\lambda$ is too low, it will not handle overfitting

Summary - Ridge Regression

  • Ridge regression has a particular advantage over OLS when the OLS estimates have high variance, i.e., when they overfit. Regularization can significantly reduce model variance while not increasing bias much.
  • The tuning parameter lambda determines how much we wish to regularize the model. The higher the value of lambda, the smaller the magnitudes of the model coefficients, and the greater the regularization.
  • Choosing the right lambda is crucial so as to reduce only the variance in the model, without compromising much on identifying the underlying patterns, i.e., the bias.
  • It is important to standardise the data when working with Ridge regression.
  • The model coefficients of ridge regression can shrink very close to 0 but do not become 0 and hence there is no feature selection with ridge regression. This can cause problems with interpretability of the model if the number of predictors is very large.

Lasso Regression

$\displaystyle \text{Lasso Cost Function} = \sum_{i=1}^{N}(y_{i} - \hat y_{i})^2 + \lambda\sum_{j=1}^{P}|\beta_j|$

  • Here, $\lambda\sum_{j=1}^{P}|\beta_j|$ is the penalty term
  • If $\lambda$ is large enough, the coefficients for some of the variables will become zero. Hence it performs variable selection.
  • Models generated from Lasso are generally easier to interpret than those produced by Ridge Regression
  • $\lambda \uparrow \implies \text{variance} \downarrow,\; \text{bias} \uparrow,\; RSS \uparrow$
  • Standardising the variables is necessary for lasso as well
  • Unlike ridge, no closed-form solution is possible
  • Gives a sparse solution, i.e. many of the model coefficients automatically become exactly zero, $w_{i}=0$
  • If $\theta^*$ is the best model that we end up getting, which is given as:

    • $\theta^* = \arg\min[E(\theta)+\lambda R(\theta)]$, then
    • $\lambda\uparrow \implies \text{Sparsity}(\theta^*) \uparrow$, where $\text{Sparsity}(\theta^*)$ of a model is defined by the number of parameters in $\theta^*$ that are exactly equal to zero.
  • Choosing the correct $\lambda$ is important. If it is too large, all coefficients will become zero (see the sketch after this list).
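A brief scikit-learn sketch of this sparsity effect, on made-up data where only 3 of 20 standardised predictors actually matter; the exact counts printed depend on the data and the alpha values chosen.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Toy data: only 3 of 20 standardised predictors actually matter (illustrative)
rng = np.random.default_rng(5)
X = StandardScaler().fit_transform(rng.normal(size=(300, 20)))
y = 3 * X[:, 0] - 2 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(0, 0.5, 300)

# Larger alpha (lambda) => more coefficients forced exactly to zero
for alpha in [0.001, 0.01, 0.1, 1.0]:
    lasso = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    n_zero = int(np.sum(lasso.coef_ == 0))
    print(f"alpha={alpha}: {n_zero} of 20 coefficients are exactly 0")
```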

Summary - Lasso Regression

  • The behaviour of Lasso regression is similar to that of Ridge regression.
  • With an increase in the value of lambda, variance reduces with a slight compromise in terms of bias.
  • Lasso also pushes the model coefficients towards 0 in order to handle high variance, just like Ridge regression. But, in addition to this, Lasso also pushes some coefficients to be exactly 0 and thus performs variable selection.
  • This variable selection results in models that are easier to interpret.

Generally, Lasso should perform better in situations where only a few among all the predictors that are used to build our model have a significant influence on the response variable. So, feature selection, which removes the unrelated variables, should help. But Ridge should do better when all the variables have almost the same influence on the response variable.

It is not the case that one of the techniques always performs better than the other – the choice would depend upon the data that is used for modelling.

We often have a large number of features in real-world problems, but we want the model to be able to pick up only the most useful ones (since we do not want unnecessarily complex models). Since lasso regularisation produces sparse solutions, it automatically performs feature selection.

Questions

As λ increases from 0 to infinity, what is the impact on Variance, RSS, Test Error, Ridge Coefficients, Lasso Coefficients and Bias of the model?

  • Variance decreases: When λ=0, the coefficients have their least-squares estimates. These estimates depend heavily on the training data and hence the variance is high. As we increase λ, the coefficients start shrinking and the model becomes simpler. In the limiting case of λ approaching infinity, all coefficients reduce to zero, the model predicts a constant and has no variance.
  • RSS increases: Minimising the cost function with λ=0 gives the coefficient values which minimize the RSS. Putting λ = infinity gives us a constant model with maximum RSS. Thus, the training RSS steadily increases as λ increases.
  • Bias increases: When λ=0, the coefficients have their least-squares estimates and hence the least bias. As λ increases, the coefficients shrink towards zero, the model fits the training data less accurately and hence the bias increases. In the limiting case of λ approaching infinity, the model predicts a constant and hence the bias is maximum.
  • Test error typically first decreases and then increases: for very large λ, even though the variance is very low, the test error is high because the underfit model has not captured the behaviour of the data correctly.
  • Ridge will cause some of the coefficients to become very close to 0, but not exactly 0.
  • Lasso will cause some of the coefficients to become exactly 0.

Ridge vs Lasso

Visualizing a function


  • For a function given as $f(x, y)$, a third axis, say Z, is used to plot the output values of the function, and once we connect all the values, we get a surface.

Visualizing Function using Contours


  • For contours, we join all the points where the function value is the same and we end up getting a contour for the function, which is the visual representation of the function.
  • No two contours can intersect, which means a function can't have two different values for a combination of x and y but two contours can meet tangentially.

Visualizing Difference between Ridge and Lasso

[Figure: Ridge vs Lasso - the circular and diamond constraint regions with the RSS contour ellipses]

Observing the plots above, we see that we can get the model coefficients to become 0 only if the ellipses touch the constraint region on either the x or the y axis. Since the Ridge regression constraint is circular, without any sharp points, the ellipses will generally not touch the circular constraint region on an axis.

Hence, the coefficients can become very small but would not become 0. In the case of Lasso regression, since the diamond constraint has a corner at each axis, the ellipse would touch the constraint at any of its corners often, resulting in that coefficient becoming 0. In higher dimensions, since there will be a higher number of corners, a higher number of coefficients can become 0 at the same time.

This is the reason Lasso regression can perform feature selection.

Takeaways

  • If hyperparameter lambda is high, it will lead to underfitting, while if it is low, it will not handle overfitting.
  • Ridge regression does not make the coefficient zero, while lasso regression does.
  • Both ridge and lasso share the same idea of adding a penalty to the cost function based on the weights of the coefficients. In ridge, the penalty is on the sum of squares of the coefficients, while in lasso it is on the sum of the absolute values of the coefficients.
