As you know, the literal meaning of regularization is to manage or control things, and machine learning models sometimes demand regularization too. Through this post, you will learn what regularization in machine learning is, why models need it, and different regularization techniques such as L1 and L2 regularization, dropout and data augmentation, along with how to implement them.
Table of contents:
- What is Regularization?
- Need of Regularization
- Overview of regularization methods
- L1 Regularization
- L2 Regularization
- Dropout
- Data Augmentation
- Early stopping
What is Regularization Technique?
As I mentioned above, regularization means to control or manage things.
But what does it mean when it comes to data science?
From a data science perspective, regularization is a technique used to control overfitting or to select features from high-dimensional data.
Why is regularization needed?
Let’s understand it with an example.
Whenever a machine learning model is trained, it sometimes shows a very low error on the training dataset but a much higher error on the testing set. If the difference is small, it doesn’t matter much. But a large difference means the model has not generalized well: it has mastered the training data so thoroughly that when data different from the training data arrives, the model fails to predict accurately. This problem is known as “overfitting”.
Using regularization methods, the training process is modified so that weights are updated in a more generalized way.
Methods to implement regularization techniques
The basic purpose of regularization techniques is to control the model training process. This can be done in the following ways:
- L1 Regularization (Lasso Regression)
- L2 Regularization (Ridge Regression)
- Dropout (used in deep learning)
- Data augmentation (in case of computer vision)
- Early stopping
Using the L1 regularization method, unimportant features can be removed entirely. That’s why L1 regularization is also used for “feature selection”.
L1 Regularization (Lasso Regression)
L1 and L2 regularization techniques add a penalty term to the loss function so that whenever weights or coefficients are updated, they are tweaked by this additional term.
How does L1 regularization (Lasso regression) work?
To increase prediction accuracy, the cost function should be minimized. To optimize the cost function we need to find its minimum, typically using the gradient descent algorithm. Let’s check how.
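The original equation appears to be missing here, so here is a hedged sketch of the L1-penalized cost function, assuming mean squared error as the base loss, m training samples, weights w_j, and regularization strength λ:

```latex
J(w) = \frac{1}{2m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2 \;+\; \lambda \sum_{j=1}^{n} \lvert w_j \rvert
```

Note that the gradient of the L1 penalty contributes a constant ±λ (the sign of each weight) to every update, which is what can push small weights all the way to exactly zero.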
As you can see, because of lambda the weight matrix can be penalized or modified in such a way that overfitting is prevented.
Intuitions from L1 regularization technique
- As you can see, lambda is independent of the weight matrix, so its value can be set so that an updated weight becomes exactly zero. Thus L1 regularization makes the model sparse, as feature coefficients can be reduced to zero. That’s why it can be used for “feature selection” too.
- By changing the value of lambda, weights or coefficients can be shrunk to make the model more generalized and reduce overfitting.
- In the case of a neural network, if some of the coefficients are reduced to zero, the network becomes simpler and may behave similarly to logistic regression.
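The sparsity effect described above can be sketched in plain NumPy. This is a minimal illustrative implementation (proximal gradient descent with a soft-thresholding step, not any particular library’s API); the data and function names are assumptions for the example:

```python
import numpy as np

def soft_threshold(w, t):
    # Proximal operator of the L1 penalty: shrinks weights toward
    # zero and sets small ones exactly to zero (sparsity).
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def lasso_fit(X, y, lam=0.1, lr=0.01, n_iter=2000):
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / m                 # gradient of the MSE part
        w = soft_threshold(w - lr * grad, lr * lam)  # L1 proximal step
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two features actually matter.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.01 * rng.normal(size=200)
w = lasso_fit(X, y, lam=0.5)
print(w)  # coefficients of the irrelevant features are driven to (near) zero
```

Notice that the three irrelevant features end up with coefficients at or near zero, which is exactly the feature-selection behavior discussed above.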
L2 Regularization (Ridge Regression)
In L2 regularization, the penalty term added is the square of the coefficients or weight matrix.
Now let’s check how L2 regularization helps to prevent overfitting:
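The original equation appears to be missing here too, so here is a hedged sketch of the L2-penalized cost function, using the same notation as the L1 case:

```latex
J(w) = \frac{1}{2m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2 \;+\; \lambda \sum_{j=1}^{n} w_j^2
```

Here the penalty’s gradient is 2λw_j, proportional to the weight itself, so each update shrinks the weights toward zero (this is why L2 is also called “weight decay”) but never makes them exactly zero.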
Intuitions based on L2 regularization
- As you can see, lambda helps reduce the updated weights to prevent overfitting.
- Also, with L2 regularization it is not possible to make any feature exactly zero, because the lambda term is multiplied by the weight itself; lambda is not independent of the weights.
Why use only the coefficients or weights in the penalty term?
Reason: you can add the bias term ‘b’ to the penalty as well if you want, but generally only the weight matrix is used, since the weight matrix is high-dimensional and controls most of the parameters of a machine learning or deep learning model.
Dropout
Dropout is a regularization technique used when training deep learning models.
While training a deep learning model, if all neurons are active at every step, the model becomes very complex and processing all the neuron activations takes a lot of computation. There is also a high chance that the model will overfit.
Therefore, as the name suggests, at each training step only some percentage of the neurons are kept active while the rest are dropped. Which neurons are dropped at any given step is decided randomly, and the fraction to drop is a hyper-parameter that is tuned.
What’s the benefit of doing this?
Suppose you have 30000 input neurons. Without dropout, all of them are processed at every step. But with 20% dropout, only 24000 of them are active at a time. The model then learns with fewer active neurons at each step, which helps resolve the overfitting problem.
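The mechanics can be sketched in a few lines of NumPy. This is a minimal illustration of “inverted” dropout (the variant most frameworks use); the function name and inputs are assumptions for the example:

```python
import numpy as np

def dropout(x, p=0.2, training=True, rng=None):
    """Randomly zero out a fraction p of the activations.

    Scaling the kept activations by 1/(1-p) keeps their expected
    value unchanged, so no rescaling is needed at test time.
    """
    if not training or p == 0.0:
        return x  # dropout is disabled at inference time
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape) >= p   # keep each unit with probability 1-p
    return x * mask / (1.0 - p)

rng = np.random.default_rng(0)
acts = np.ones(30000)                 # the 30000 activations from the example
out = dropout(acts, p=0.2, rng=rng)
print((out == 0).mean())              # roughly 0.2 of the units are dropped
```

Note the `training` flag: dropout is applied only during training and switched off at prediction time, which mirrors how deep learning frameworks behave.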
Data Augmentation
Data augmentation is the process of creating additional data from existing data. It is usually used with image data: in computer vision, new samples are created by rotating, cropping, flipping or blurring the existing images.
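Two of the simplest augmentations, flipping and random cropping, can be sketched with plain NumPy array operations. This is an illustrative sketch assuming images stored as (height, width, channels) arrays; the function names are assumptions:

```python
import numpy as np

def horizontal_flip(img):
    # Mirror the image left-to-right.
    return img[:, ::-1, :]

def random_crop(img, size, rng=None):
    # Cut out a random (size x size) patch of the image.
    rng = rng or np.random.default_rng()
    h, w, _ = img.shape
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return img[top:top + size, left:left + size, :]

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)  # fake 32x32 RGB image
flipped = horizontal_flip(img)
patch = random_crop(img, size=24, rng=rng)
print(flipped.shape, patch.shape)  # (32, 32, 3) (24, 24, 3)
```

Each augmented copy can be added to the training set as a new sample, giving the model more variety without collecting new data.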
Early Stopping
While training a model, when the testing (validation) error starts to increase while the training error keeps decreasing, training should be stopped to prevent overfitting.
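The idea can be sketched as a small loop with a “patience” counter, the way most frameworks’ early-stopping callbacks work. This is a minimal illustrative sketch; the function name and the loss values are assumptions:

```python
def train_with_early_stopping(val_losses, patience=2):
    """Return the epoch at which training stops.

    Stops once the validation loss has failed to improve for
    `patience` consecutive epochs.
    """
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss       # new best model; in practice, save its weights
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch  # stop here and restore the best weights
    return len(val_losses) - 1

# Validation loss falls, then starts rising: training stops early.
losses = [0.9, 0.7, 0.6, 0.65, 0.7, 0.8]
print(train_with_early_stopping(losses, patience=2))  # → 4
```

The `patience` parameter prevents stopping on a single noisy epoch: training only halts after the validation loss has stagnated or worsened for several epochs in a row.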