In the previous article, we mentioned overfitting and underfitting but did not go into detail. Let’s take a deeper look at both.
When we work with a dataset to solve a prediction or classification problem, we build a model using the training data and evaluate it on the test data. To improve the results, we can adjust the features we use or the model itself.
When we modify the model, we can end up with one that is too simple or too complex. This is where the concepts of underfitting and overfitting come in.
As we can see in the image, underfitting refers to a model that can neither fit the training data nor generalize to new data. An underfit machine learning model is not a suitable model, and it is usually easy to spot because it performs poorly even on the training data.
It happens when we do not have enough data to build an accurate model or when the model is too simple for the data, for example when we fit a linear model to non-linear data.
There are a few techniques we can try to prevent underfitting:
- Sometimes the model underfits because it does not have enough features. In this case, we can add more features so the model can capture the patterns in the data.
- Add polynomial features, a common technique in machine learning. For example, a linear model becomes more flexible when we add quadratic or cubic terms.
- Reduce the regularization parameters. Regularization is meant to prevent overfitting, so if the model is underfitting, we should decrease the regularization strength.
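To make the polynomial-features idea concrete, here is a minimal sketch using NumPy. The synthetic quadratic dataset, the noise level, and the helper name `training_mse` are assumptions made for illustration only. It fits the same data twice with least squares, once with only a linear feature and once with an added squared feature, and compares the error on the training data.

```python
import numpy as np

# Synthetic, illustrative data: a quadratic relationship plus some noise.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 60)
y = x**2 + rng.normal(scale=0.5, size=x.size)

def training_mse(features, targets):
    """Fit ordinary least squares and return the mean squared error on the same data."""
    weights, *_ = np.linalg.lstsq(features, targets, rcond=None)
    residuals = targets - features @ weights
    return float(np.mean(residuals ** 2))

# Linear model: an intercept column and x.
linear_features = np.column_stack([np.ones_like(x), x])
# Same model with one extra polynomial feature, x squared.
quadratic_features = np.column_stack([linear_features, x**2])

mse_linear = training_mse(linear_features, y)
mse_quadratic = training_mse(quadratic_features, y)

print(f"training MSE, linear features:    {mse_linear:.2f}")
print(f"training MSE, with x^2 feature:   {mse_quadratic:.2f}")
```

The linear model has a high error even on the data it was trained on, which is the typical sign of underfitting. Adding the squared term gives the model enough flexibility to follow the curve.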
On the other hand, overfitting refers to a model that fits the training data too well. Overfitting happens when a model learns the detail and noise in the training data to the extent that it hurts the performance of the model on new data.
Overfitting is more likely in non-parametric and non-linear models, because they have more flexibility when learning from the data.
There are a few techniques we can try to prevent overfitting:
- Cross-validation: It uses our initial training data to generate multiple mini train-test splits, and it uses these splits to tune our model. Cross-validation allows us to tune hyperparameters with only our original training set. This allows us to keep our test set as a truly unseen dataset for selecting our final model.
- Train with more data: Training with more data can help algorithms detect the signal better but, if we just add more noisy data, this technique won’t help. That’s why we should always ensure our data is clean and relevant.
- Remove features: We can manually improve an algorithm’s generalizability by removing irrelevant input features. As a rule of thumb, if a feature does not make sense or is hard to justify, it is a good candidate for removal.
- Early stopping: When training an algorithm, we can measure how well each iteration of the model performs. Up until a certain number of iterations, new iterations improve the model. After that point, the model’s ability to generalize can weaken as it begins to overfit the training data. Early stopping refers to stopping the training process before the learner passes that point.
- Regularization: Regularization refers to a broad range of techniques for artificially forcing our model to be simpler, for example by adding a penalty on large coefficients to the cost function.
- Ensembling: Ensembles are machine learning methods for combining predictions from multiple separate models.
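Two of the techniques above, cross-validation and regularization, can be sketched together with plain NumPy. This is only an illustration under stated assumptions: the noisy sine data, the degree-9 polynomial features, the candidate penalty values, and the helper names `fit_ridge` and `cv_error` are all made up for this example, and the ridge penalty is applied to every coefficient (including the intercept) to keep the code short.

```python
import numpy as np

# Synthetic, illustrative data: a sine curve plus noise.
rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, size=40)
y = np.sin(np.pi * x) + rng.normal(scale=0.3, size=x.size)

# A deliberately flexible model: polynomial features 1, x, ..., x^9.
DEGREE = 9
X = np.vander(x, DEGREE + 1, increasing=True)

def fit_ridge(X_train, y_train, lam):
    """Closed-form ridge regression: solve (X^T X + lam * I) w = X^T y."""
    n_features = X_train.shape[1]
    A = X_train.T @ X_train + lam * np.eye(n_features)
    return np.linalg.solve(A, X_train.T @ y_train)

def cv_error(X, y, lam, k=5):
    """Average validation MSE over k folds, using only the training set."""
    indices = np.arange(len(y))
    folds = np.array_split(indices, k)
    errors = []
    for fold in folds:
        train = np.setdiff1d(indices, fold)  # every index not in this fold
        w = fit_ridge(X[train], y[train], lam)
        errors.append(np.mean((y[fold] - X[fold] @ w) ** 2))
    return float(np.mean(errors))

# Candidate regularization strengths (hypothetical values).
lambdas = [1e-6, 1e-3, 1e-1, 10.0]
cv_errors = [cv_error(X, y, lam) for lam in lambdas]
best_lambda = lambdas[int(np.argmin(cv_errors))]

for lam, err in zip(lambdas, cv_errors):
    print(f"lambda={lam:g}: cross-validated MSE={err:.3f}")
print(f"selected lambda: {best_lambda:g}")
```

The test set is never touched here: the regularization strength is chosen only from splits of the training data, so the test set stays unseen until the final evaluation.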
The good one
Finally, the middle graph shows a pretty good fitted line. It follows the majority of the points in the graph and maintains a balance between bias and variance.
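One way to look for that balance without a graph is to compare the error on the training data with the error on held-out data for models of different complexity. The following is a small sketch with NumPy; the data, the choice of polynomial degrees 1, 3 and 9, and the alternating train/validation split are assumptions for illustration.

```python
import numpy as np

# Synthetic, illustrative data: a sine curve plus noise.
rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 30)
y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=x.size)

# Alternate points between a training set and a validation set.
x_train, y_train = x[::2], y[::2]
x_val, y_val = x[1::2], y[1::2]

def mse(coeffs, xs, ys):
    """Mean squared error of a fitted polynomial on the given points."""
    return float(np.mean((ys - np.polyval(coeffs, xs)) ** 2))

# Fit a simple, a moderate, and a very flexible polynomial on the training set.
results = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_val, y_val))

for degree, (train_error, val_error) in results.items():
    print(f"degree {degree}: train MSE={train_error:.3f}, validation MSE={val_error:.3f}")
```

The training error can only go down as the degree grows, so on its own it says nothing about a good fit. A model that is too simple tends to show high error on both sets, and a model that is too complex tends to show a low training error with a larger gap to the validation error. The exact numbers depend on the data and the random seed.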
That is all for today.