Overfitting in Machine Learning

On this page, we will learn about overfitting in machine learning: an example to understand overfitting, how to detect it, and ways to prevent it, including early stopping, training with more data, feature selection, cross-validation, data augmentation, regularization, and ensemble methods.

In the real world, no dataset is ever perfectly clean. Every dataset contains some impurities: noise, outliers, missing values, or class imbalance. These impurities cause various problems that affect the model's accuracy and performance. Overfitting is one of these problems.

“A statistical model is said to be overfitted if it can't generalize well with unseen data.”

Before understanding overfitting, we need to know some basic terms, which are:

Noise: It is data in a dataset that is meaningless or irrelevant. If it is not eliminated, it has an impact on the model's performance.

Bias: It is the prediction error introduced into the model by oversimplifying the machine learning algorithm. It can also be described as the difference between the predicted values and the actual values.

Variance: It is the sensitivity of a model to the particular training data it was given. A high-variance model performs well on the training dataset but not on the test dataset.

Generalization: It shows how well a model is trained to predict unseen data.

Overfitting and underfitting are the two main problems that cause poor performance in a machine learning model.
When a model is overfitted, it fits the training data more closely than is required and tries to capture every data point it is given. As a result, it starts to capture the noise and inaccurate values in the dataset, which lowers the model's performance on new data.
An overfitted model can't generalize effectively and performs poorly on the test/unseen dataset.
Low bias and high variance are characteristics of an overfitted model.

Example to Understand Overfitting

We can understand overfitting with a general example. Assume there are three students, X, Y, and Z, who are all studying for a test. X has read only three sections of the book and has skipped the rest. Y has an excellent memory, so he has memorized the entire book. Z, the third student, has gone through all of the questions and practiced them. As a result, X will be able to answer only the exam questions that come from the three sections he read. Student Y will be able to answer questions only if they are identical to those in the book. Student Z will be able to answer all of the exam questions correctly.

The same thing happens in machine learning: if the algorithm learns from only a tiny portion of the data, it will not be able to capture all of the essential patterns, resulting in underfitting.

Now assume the model, like student Y, memorizes the training dataset. It performs well on the known dataset, but not so well on new data or cases. In such cases the model is said to be overfitting.

And if, like student Z, the model performs well on both the training and the test/unseen datasets, it is considered a good fit.

How to detect Overfitting?

Overfitting can only be detected by testing the model on data it has not seen during training. We can use a train/test split to find the problem.

Using a train/test split, we divide the dataset at random into a training dataset and a test dataset. We train the model on the training dataset, which typically makes up around 80% of the overall dataset. After the model has been trained, we test it on the test dataset, which is the remaining 20%.

If the model performs well on the training dataset but not on the test dataset, it is most likely overfitting.

For example, if the model is 85% accurate on the training data but only 50% accurate on the test data, it is not generalizing well.
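The check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy dataset, the 20% label noise, and the helper names (`train_test_split`, `accuracy`, `memorizer`) are all hypothetical and not part of any library. The "model" deliberately memorizes the training pairs, so it behaves like student Y.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(predict, rows):
    """Fraction of (x, y) rows for which predict(x) equals y."""
    return sum(predict(x) == y for x, y in rows) / len(rows)

# Hypothetical toy data: the label is 1 when x >= 50,
# and roughly 20% of the labels are flipped to act as noise.
rng = random.Random(42)
data = []
for x in range(100):
    label = int(x >= 50)
    if rng.random() < 0.2:
        label = 1 - label
    data.append((x, label))

train, test = train_test_split(data)  # 80 training rows, 20 test rows

# A deliberately overfitted "model": it memorizes every training pair
# (noise included) and guesses 0 for any input it has never seen.
memory = dict(train)

def memorizer(x):
    return memory.get(x, 0)

train_acc = accuracy(memorizer, train)  # 1.0, since every x is memorized
test_acc = accuracy(memorizer, test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

A large gap between the two printed numbers is the symptom to look for; how large a gap is acceptable depends on the problem.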

Ways to prevent the Overfitting

Although overfitting affects the model's performance, there are several ways to reduce it. A simple model, such as a linear model, is less prone to overfitting; unfortunately, many real-world problems are non-linear, so more flexible models are often needed and must be kept from overfitting. The main techniques are listed below:

Early Stopping
Train with more data
Feature Selection
Cross-Validation
Data Augmentation
Regularization
Ensemble Methods

Early Stopping

In this technique, training is stopped before the model starts learning the noise in the data. While the model is trained iteratively, we measure its performance on held-out validation data after each iteration. Training continues, up to a given maximum number of iterations, for as long as new iterations keep improving that performance.

After that point, the model begins to overfit the training data, so we must stop the process before the learner goes past it.

Stopping the training process before the model starts capturing noise from the data is known as early stopping.

However, this technique may lead to underfitting if training is stopped too early. So it is very important to find the "sweet spot" between underfitting and overfitting.
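One common way to find that stopping point is a "patience" rule: stop once the validation loss has not improved for a fixed number of iterations, and keep the model from the best iteration. The sketch below assumes the validation losses are already available as a list; the function name `early_stopping_epoch`, the `patience` parameter, and the loss values are illustrative, not taken from a specific library.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch whose model should be kept.

    Training is treated as stopped once the validation loss has failed
    to improve for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember it and reset the counter.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # no improvement for `patience` epochs: stop training
    return best_epoch

# Hypothetical validation losses, one per epoch.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.40]
print(early_stopping_epoch(losses))  # → 3
```

Note the trade-off in the example: with `patience=3` the loop stops at epoch 6 and never sees the better loss at epoch 7. A small patience risks stopping too early, while a large one wastes training time.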

Train with More data

Increasing the training set by including more data can improve the model's performance by increasing the number of opportunities to identify the association between input and output variables.

It may not always prevent overfitting, but it helps the algorithm detect the signal better and reduces errors.

When a model is fed more training data, it becomes harder for it to fit every individual sample, and it is pushed to generalize instead.

However, in some circumstances the additional data may introduce more noise into the dataset; hence, before feeding data to the model, we must ensure that it is clean and free of inconsistencies.

Feature Selection

While building an ML model, we use a number of features to predict the outcome. However, some of these features are redundant or contribute little to the prediction, and a feature selection procedure is used to deal with this. During feature selection, we identify the most important features in the training data and discard the others. This also simplifies the model and reduces noise in the data. Some algorithms perform feature selection automatically; for those that don't, we can do it ourselves.
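As a minimal sketch of doing it ourselves, the example below ranks features by the absolute value of their Pearson correlation with the target and keeps the top k. This is just one simple filter criterion chosen for illustration; the helper names (`pearson`, `select_top_features`) and the toy columns are assumptions.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def select_top_features(columns, target, k):
    """Return the names of the k columns with the largest |correlation|."""
    scores = {name: abs(pearson(values, target))
              for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical toy data: two informative columns, one noisy, one constant.
columns = {
    "size": [1, 2, 3, 4, 5, 6],
    "rooms": [1, 1, 2, 2, 3, 3],
    "noise": [3, 1, 2, 3, 1, 2],
    "constant": [7, 7, 7, 7, 7, 7],
}
price = [10, 20, 30, 40, 50, 60]

selected = select_top_features(columns, price, k=2)
print(selected)  # → ['size', 'rooms']
```

Correlation only detects linear relationships with a single feature at a time, so it can miss features that matter only in combination with others.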

Cross-Validation

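Cross-validation is one of the most effective tools against overfitting, because it gives a more reliable estimate of how the model performs on unseen data.
In the general k-fold cross-validation procedure, we partition the dataset into k equal-sized subsets, known as folds. Each fold is used once as the validation set while the model is trained on the remaining k-1 folds, and the k scores are averaged.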
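The fold bookkeeping in k-fold cross-validation can be sketched as follows. In each round one fold is held out for validation and the other k-1 folds are used for training. The function name `k_fold_indices` is hypothetical, and the sketch assumes the data has already been shuffled, since it splits the indices in order.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(10, 5))
for train_idx, val_idx in folds:
    print("validate on", val_idx, "train on", train_idx)
```

Every sample appears in exactly one validation fold, so each one is used for both training and validation, but never in the same round.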

Data Augmentation

Data augmentation is a technique that offers an alternative to collecting more data as a way to avoid overfitting. Instead of adding fresh training data, it adds slightly modified copies of existing data to the dataset.

Data augmentation makes the data samples look slightly different each time the model processes them. As a result, each sample appears to be unique to the model, which helps prevent overfitting.
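For image data, "slightly changed copies" often means flips, small shifts, or pixel noise. The sketch below assumes a grayscale image stored as a list of rows with pixel values from 0 to 255; the function names and the jitter amount are illustrative choices, not a standard API.

```python
import random

def horizontal_flip(image):
    """Mirror a 2-D image (a list of rows) left to right."""
    return [row[::-1] for row in image]

def add_jitter(image, amount, rng):
    """Add small random noise to every pixel, clamped to the 0..255 range."""
    return [[min(255, max(0, pixel + rng.randint(-amount, amount)))
             for pixel in row]
            for row in image]

def augment(image, rng):
    """Return a slightly changed copy of image; the original is untouched."""
    if rng.random() < 0.5:
        copy = horizontal_flip(image)
    else:
        copy = [row[:] for row in image]
    return add_jitter(copy, amount=5, rng=rng)

rng = random.Random(0)
image = [[0, 50, 100],
         [150, 200, 255]]
# Each call produces a different variant of the same underlying sample.
variants = [augment(image, rng) for _ in range(3)]
```

The transformations must preserve the label: a mirrored cat is still a cat, but a mirrored digit or letter may not mean the same thing.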

Regularization

If overfitting happens because a model is too complex, we can reduce the number of features. However, overfitting can also happen with a simpler model, such as a linear model, and regularization techniques help in those situations.

Regularization is one of the most widely used methods for avoiding overfitting. It is a set of techniques that force the learning algorithm to keep the model simple. Applying regularization increases the bias slightly but reduces the variance. In this technique, we add a penalty term to the objective function; the penalty is larger for a more complex model.

L1 regularization and L2 regularization are the two most commonly used regularization techniques.
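To make the penalty term concrete, consider the simplest possible case: a linear model with one weight w and no intercept, fitted by least squares. L2 regularization adds lam * w^2 to the squared error and L1 adds lam * |w|, and for this one-weight case both have closed-form solutions. The sketch below is a toy under those assumptions (the function names and data are hypothetical, and at least one x must be non-zero); real models have many weights and are usually fitted numerically.

```python
def ridge_weight(xs, ys, lam):
    """L2: minimise sum((y - w*x)^2) + lam * w^2 for a single weight w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def lasso_weight(xs, ys, lam):
    """L1: minimise sum((y - w*x)^2) + lam * |w| for a single weight w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # Soft-thresholding: shrink |sxy| by lam / 2, but never below zero.
    shrunk = max(abs(sxy) - lam / 2, 0.0)
    return (shrunk if sxy >= 0 else -shrunk) / sxx

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # exactly y = 2x, so the unpenalised weight is 2

for lam in (0.0, 14.0, 28.0, 56.0):
    print(lam, ridge_weight(xs, ys, lam), lasso_weight(xs, ys, lam))
```

As lam grows, both penalties pull the weight toward zero. The L2 weight shrinks smoothly but stays non-zero, while the L1 weight reaches exactly zero once lam is large enough, which is why L1 regularization can also act as a form of feature selection.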

Ensemble Methods

Ensemble methods combine the predictions of several machine learning models, for example by taking the most common prediction.

Bagging and boosting are the most widely used ensemble methods.

In bagging, several sample datasets are drawn from the training data with replacement, so an individual data point may be selected more than once. A model is trained independently on each sample, and the predictions are combined, by averaging for regression or by majority vote for classification, to produce a more accurate result. Bagging also reduces the risk of overfitting in complex models.

Boosting trains a large number of weak learners in sequence, so that each learner learns from the mistakes of the one before it. It then combines all of the weak learners into one strong learner. This improves the predictive power of simple models.
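A minimal sketch of bagging is shown below. The base model is a 1-nearest-neighbour regressor on a single numeric input, chosen here only because it is a simple high-variance model; that choice, the helper names, and the toy data are all assumptions made for illustration.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement, so items may repeat."""
    return [rng.choice(data) for _ in data]

def fit_nearest_neighbour(sample):
    """Base model: predict the y of the training point closest to x."""
    def predict(x):
        nearest_x, nearest_y = min(sample, key=lambda pair: abs(pair[0] - x))
        return nearest_y
    return predict

def bagged_regressor(data, n_models, seed=0):
    """Train one base model per bootstrap sample and average their outputs."""
    rng = random.Random(seed)
    models = [fit_nearest_neighbour(bootstrap_sample(data, rng))
              for _ in range(n_models)]
    return lambda x: sum(model(x) for model in models) / len(models)

def majority_vote(labels):
    """For classification, combine predictions by taking the most common."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical (x, y) pairs for a regression task.
data = [(1, 1.0), (2, 2.0), (3, 3.0), (4, 4.0)]
model = bagged_regressor(data, n_models=25)
prediction = model(2.5)
```

Because each base model sees a different resampled dataset, their individual errors differ, and averaging them tends to cancel part of that variation.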
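One way to make "learning from the mistakes of the previous learner" concrete is to fit each new weak learner to the residuals, that is, the errors left by the ensemble so far. This is the idea behind gradient boosting for squared error; other boosting algorithms, such as AdaBoost, reweight the samples instead. The sketch below uses one-split decision stumps as weak learners on a single numeric input. The function names, learning rate, and data are assumptions, and `fit_stump` requires at least two distinct x values.

```python
def fit_stump(xs, residuals):
    """Weak learner: the single threshold split that best fits the residuals."""
    best = None
    # Skipping the largest x keeps the right-hand side of the split non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Fit stumps in sequence, each one on the errors left by the previous ones."""
    base = sum(ys) / len(ys)  # start from a constant prediction
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x)
                       for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mean_squared_error(predict, xs, ys):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data: y = x^2, which no single stump can fit well.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 4, 9, 16, 25, 36]

baseline_mse = mean_squared_error(lambda x: sum(ys) / len(ys), xs, ys)
boosted_mse = mean_squared_error(boost(xs, ys), xs, ys)
print(baseline_mse, boosted_mse)
```

The boosted training error is much lower than that of the constant baseline, but note that boosting for too many rounds can itself overfit, so the number of rounds is usually chosen with a validation set.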