Machine & Deep Learning Compendium

Regularization

**A bigger lambda penalizes complexity more heavily, so high-complexity models (e.g., degree 3) are ruled out.**

**A smaller lambda rules out models with high training error, e.g., a linear (degree 1) model fit to non-linear data.**

**The optimum lies in between (e.g., degree 2).**
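The effect of lambda can be sketched on a deliberately tiny problem. This is an illustration only: instead of polynomial degree it uses a single slope with a squared (L2) penalty, and the data and the helper names `ridge_slope` and `training_mse` are made up for the example.

```python
def ridge_slope(xs, ys, lam):
    """Slope w minimizing sum((y - w*x)^2) + lam * w^2 (one feature, no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def training_mse(xs, ys, w):
    """Mean squared training error of the model y = w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical toy data generated by y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

for lam in (0.0, 3.0, 30.0):
    w = ridge_slope(xs, ys, lam)
    print(f"lambda={lam:5.1f}  slope={w:.3f}  training MSE={training_mse(xs, ys, w):.3f}")
```

As lambda grows, the slope shrinks toward 0 (a simpler model) and the training error rises, which is the trade-off described above.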

**Rehearsal on vector normalization - for L1, L2, L3, L4, etc., what is the norm? (absolute value in certain cases)**
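As a quick refresher, the p-norm of a vector is the p-th root of the sum of the absolute values of its components raised to the power p. A minimal sketch in plain Python follows; the function name `p_norm` and the sample vector are chosen for illustration only.

```python
def p_norm(v, p):
    """Return (sum_i |v_i|^p)^(1/p); defined here for p >= 1."""
    if p < 1:
        raise ValueError("p must be >= 1 for a valid norm")
    return sum(abs(x) ** p for x in v) ** (1.0 / p)


v = [3.0, -4.0]

l1 = p_norm(v, 1)  # sum of absolute values: 3 + 4
l2 = p_norm(v, 2)  # Euclidean length: sqrt(9 + 16)
l3 = p_norm(v, 3)  # cube root of 27 + 64

print(l1, l2, l3)
```

The absolute value matters for odd p (and for p = 1): without it, negative components would cancel positive ones.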

**L1 - drives coefficients toward zero faster and performs feature selection by making the coefficients sparse (zeroing them). It is computationally efficient with algorithms that exploit sparsity; with others it is not, so use L2 there.**

**L2 - shrinks coefficients more slowly, does not produce sparse solutions, and is computationally efficient.**

**L1 and L2 regularization add constraints to the optimization problem. The curve H0 is the hypothesis. The solution is the set of points where H0 meets the constraint region.**

**In L2, the hypothesis is tangential to the ||w||_2 ball. The point of intersection has both x1 and x2 components. In L1, on the other hand, because of the shape of the ||w||_1 ball, the solutions tend to fall on its corners, which lie on the axes, e.g., on the x1 axis, so that x2 = 0. This means that the solution has eliminated the role of x2, leading to sparsity.**

**This extends to higher dimensions, which shows why L1 regularization leads to solutions of the optimization problem where many of the variables have the value 0.**

**In other words, L1 regularization leads to sparsity.**

**This is also considered feature selection, although with LibSVM the recommendation is to select features before using the SVM and to use L2 instead.**
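The sparsity effect can be sketched numerically under a simplifying assumption that is not part of the text above: the data-fit term is taken to be 0.5 * ||w - w_ols||^2, which is the case when the features are orthonormal. Under that assumption the L1 penalty lam * ||w||_1 gives a per-coordinate soft-thresholding solution and the L2 penalty 0.5 * lam * ||w||_2^2 gives a uniform rescaling. The names `soft_threshold`, `l1_solution`, `l2_solution`, and the vector `w_ols` are illustrative.

```python
def soft_threshold(w, lam):
    """Minimizer of 0.5*(x - w)^2 + lam*|x|: shrink by lam, clip to exactly 0."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0


def l1_solution(w_ols, lam):
    """Coordinate-wise minimizer of 0.5*||w - w_ols||^2 + lam*||w||_1."""
    return [soft_threshold(w, lam) for w in w_ols]


def l2_solution(w_ols, lam):
    """Minimizer of 0.5*||w - w_ols||^2 + 0.5*lam*||w||_2^2."""
    return [w / (1.0 + lam) for w in w_ols]


# Hypothetical unregularized coefficients: x1 matters a lot, x2 only a little.
w_ols = [3.0, 0.4]
lam = 0.5

w_l1 = l1_solution(w_ols, lam)
w_l2 = l2_solution(w_ols, lam)

print("L1:", w_l1)  # the x2 coefficient is exactly 0
print("L2:", w_l2)  # both coefficients shrink, neither is 0
```

With these numbers, L1 removes x2 entirely while L2 keeps a small non-zero weight on it, matching the corner versus tangent picture.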

**For simplicity, let's just consider the 1-dimensional case.**

**L2:**

**The L2-regularized loss function F(x) = f(x) + λ‖x‖² is smooth.**

**This means that the optimum is the stationary point (the point where the derivative is 0).**

**The stationary point of F can get very small when you increase λ, but it still won't be 0 unless f′(0) = 0.**

**L1:**

**The L1-regularized loss function F(x) = f(x) + λ‖x‖ is non-smooth: it has a kink at 0.**

**It's not differentiable at 0.**

**Optimization theory says that the optimum of a function is either a point with 0 derivative or one of the irregularities (corners, kinks, etc.). So it's possible that the optimal point of F is 0 even if 0 isn't a stationary point of f.**

**In fact, it will be 0 if λ is large enough (stronger regularization effect). Below is a graphical illustration.**
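The 1-dimensional argument can also be checked numerically. The sketch below assumes a specific loss, f(x) = 0.5 * (x - 1)^2, which is not from the text, and finds the minimizer of F by brute force over a grid that contains 0 exactly. The helper `argmin_on_grid` is hypothetical.

```python
def f(x):
    """Assumed example loss; its own minimum is at x = 1, so f'(0) != 0."""
    return 0.5 * (x - 1.0) ** 2


def argmin_on_grid(F):
    """Brute-force minimizer of F over the grid -3.000, -2.999, ..., 3.000."""
    grid = [i / 1000.0 for i in range(-3000, 3001)]
    return min(grid, key=F)


def minimizers(lam):
    """Return (L1 minimizer, L2 minimizer) of the regularized loss for this lam."""
    x_l1 = argmin_on_grid(lambda x: f(x) + lam * abs(x))
    x_l2 = argmin_on_grid(lambda x: f(x) + lam * x ** 2)
    return x_l1, x_l2


for lam in (0.25, 2.0):
    x_l1, x_l2 = minimizers(lam)
    print(f"lambda={lam}: L1 minimizer={x_l1}, L2 minimizer={x_l2}")
```

For the larger lambda, the L1 minimizer lands exactly on 0, while the L2 minimizer is small but still non-zero, as the argument above predicts.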
