**Introduction**

Before diving into the core of the bias vs variance trade-off, we’ll first introduce the concept of overfitting and underfitting, as the two notions are closely related.

When building statistical models, we want to avoid both overfitting and underfitting, because we are interested in models that make good predictions consistently.

In machine learning, we frequently partition the data into a training set and a test set. The training set is used to train our model and find the set of parameter values that best fits the data, while the test set is used to measure its performance. The problem arises when we train the model for too long on the same data set. When this happens, we say that it starts to learn the ‘noise’[1] in the data and fits the training set too closely. The model then becomes unable to generalize to new data.
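As a minimal sketch of the partitioning step described above, the snippet below shuffles a data set and splits it into a training and a test portion using only the Python standard library. The function name `train_test_split`, the 30% test fraction and the seed are illustrative assumptions, not something prescribed by this article.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = list(rows)              # copy, so the caller's data is untouched
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(10))                 # stand-in for ten observations
train, test = train_test_split(data, test_fraction=0.3, seed=42)
print(len(train), len(test))           # prints: 7 3
```

The model would only ever see `train` during fitting; `test` is held back to estimate how well it generalizes.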

**Theory**

Overfitting is a modeling error in statistics where the model is too closely aligned with a limited set of data points. In other words, the model becomes extremely good at predicting the data points it was trained on (training data), but when we test it on a new set of data points (test data) it performs noticeably worse.

Underfitting, on the other hand, is usually the worst-case scenario: the model can neither model the training data properly nor generalize to new test data.

In summary,

| | Overfitting | Underfitting |
| --- | --- | --- |
| Training data | Good performance | Poor performance |
| Test data | Poor generalization | Poor generalization |
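The summary above can be reproduced on toy data. The sketch below is an illustration under assumed settings (a sine curve with Gaussian noise, 30 training points, 200 test points): the "underfit" model ignores the input and always predicts the mean, while the "overfit" model memorises the training set and returns the target of the nearest training point. Both models are stand-ins chosen for simplicity, not the models discussed later in this article.

```python
import math
import random

rng = random.Random(0)

def make_data(n):
    """Sample n noisy points from y = sin(x) + Gaussian noise."""
    xs = [rng.uniform(0, 2 * math.pi) for _ in range(n)]
    return [(x, math.sin(x) + rng.gauss(0, 0.3)) for x in xs]

def mse(model, rows):
    """Mean squared error of model(x) against the observed y values."""
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)

train, test = make_data(30), make_data(200)

# Underfit: always predict the mean training target, ignoring x.
mean_y = sum(y for _, y in train) / len(train)
def underfit(x):
    return mean_y

# Overfit: memorise the training set and copy the nearest training target.
def overfit(x):
    return min(train, key=lambda row: abs(row[0] - x))[1]

for name, model in [("underfit", underfit), ("overfit", overfit)]:
    print(name, "train:", round(mse(model, train), 3),
          "test:", round(mse(model, test), 3))
```

The memorising model has zero error on the training set but a clearly larger error on the test set, while the mean predictor is poor on both.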

Now that you’ve understood the difference between overfitting and underfitting, we’ll introduce the bias-variance trade-off.

The bias is the expected difference between our model’s prediction and the true value it is trying to predict. In other words, it tells us how far our estimator lands from the true value on average.

Let *y* be the true value and let *ŷ* be our estimator (the value our model predicts for it).

The bias can be represented by the following equation:

$\mathrm{Bias}(\hat{y}) = E[\hat{y}] - y$
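To make the definition concrete, bias can be approximated by simulation: repeat the estimation many times, average the estimates, and subtract the true value. The sketch below uses a deliberately biased, hypothetical estimator (guessing the upper bound of a uniform distribution as the largest value in a sample of four); the estimator, sample size and number of repetitions are assumptions chosen for illustration only.

```python
import random
import statistics

rng = random.Random(0)
TRUE_UPPER = 1.0    # the true value y we are trying to estimate
SAMPLE_SIZE = 4

def estimate_upper(sample):
    """Hypothetical estimator: guess the upper bound as the largest value seen."""
    return max(sample)

# Repeat the experiment many times to approximate the expected estimate.
estimates = [
    estimate_upper([rng.uniform(0, TRUE_UPPER) for _ in range(SAMPLE_SIZE)])
    for _ in range(20000)
]

# Bias = average estimate - true value.
# In theory E[max] = 4/5 for four uniform draws, so the bias is about -0.2.
bias = statistics.fmean(estimates) - TRUE_UPPER
print("estimated bias:", round(bias, 3))
```

The estimator never overshoots the true bound, so its errors do not cancel out on average; that systematic shortfall is the bias.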

Bias is usually introduced when the model makes incorrect assumptions about the data.

The linear model is a classic example of a model that often introduces bias. When building our model, we might presume that the relationship between our dependent and independent variables is linear and apply linear regression to our data. If the relationship between the two variables turned out not to be linear, we would have introduced bias into our model, because the linear model we’ve just built cannot fit the data.
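The following sketch illustrates this with an assumed example: a straight line fitted by ordinary least squares to noise-free quadratic data. The data, the helper `fit_line` and the grid of inputs are all hypothetical choices made for the illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

xs = [i / 50 for i in range(-50, 51)]   # evenly spaced inputs in [-1, 1]
ys = [x ** 2 for x in xs]               # noise-free quadratic relationship

slope, intercept = fit_line(xs, ys)
residuals = [slope * x + intercept - y for x, y in zip(xs, ys)]
mse = sum(r ** 2 for r in residuals) / len(residuals)

print("slope:", round(slope, 6), "intercept:", round(intercept, 3))
print("training MSE:", round(mse, 3))
```

Even though the data contain no noise at all, the error does not go to zero: the line overestimates near the centre and underestimates at the edges. Adding more data would not help, because the error comes from the linear assumption itself.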

In contrast, variance is a measure of the variability in the results obtained by our model. When we train and test a model on different data sets and the results vary a lot from one data set to the next, we say that the model has high variance. If the results stay close to each other, we say that the model has low variance.

In an ideal world, we would have a model with both a small bias and a small variance.

Fundamentally, the bias-variance trade-off is based on the breakdown of the mean squared error:

1. $\mathrm{MSE}(\hat{y}) = E[(\hat{y} - y)^2]$

2. $\mathrm{MSE}(\hat{y}) = \mathrm{Bias}(\hat{y})^2 + \mathrm{Var}(\hat{y}) + \varepsilon$

3. $\mathrm{MSE} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$

For those more mathematically inclined, please refer to the section dealing with cost functions to see how we went from the first equation to the second.
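For convenience, here is a sketch of the standard argument behind that step. It rests on assumptions that are not stated in this article: the observed target is $y = f + \eta$, where $f$ is the noise-free value, $\eta$ is noise with $E[\eta] = 0$ and $\mathrm{Var}(\eta) = \sigma^2$, and $\eta$ is independent of $\hat{y}$. Under these assumptions the bias is measured against $f$, and the irreducible error $\varepsilon$ in the second equation is the noise variance $\sigma^2$.

$$
\begin{aligned}
E[(\hat{y} - y)^2] &= E[(\hat{y} - f - \eta)^2] \\
&= E[(\hat{y} - f)^2] - 2\,E[\eta]\,E[\hat{y} - f] + E[\eta^2] \\
&= E[(\hat{y} - f)^2] + \sigma^2, \\
E[(\hat{y} - f)^2] &= E[(\hat{y} - E[\hat{y}])^2] + 2\,(E[\hat{y}] - f)\,E[\hat{y} - E[\hat{y}]] + (E[\hat{y}] - f)^2 \\
&= \mathrm{Var}(\hat{y}) + \mathrm{Bias}(\hat{y})^2 .
\end{aligned}
$$

Both cross terms vanish, the first because $E[\eta] = 0$ and the second because $E[\hat{y} - E[\hat{y}]] = 0$, which leaves the three terms of the decomposition.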

To better illustrate this relationship, let’s analyse *Figure 1*.

We are presented with four targets, and on each target we have data points that represent different predictions made by a model. The red dot represents the true value that the model is trying to predict.

On the upper left, we have a model that has low bias and low variance. The model in this case makes predictions that are both accurate and consistent, hitting the red dot on the target on many occasions.

On the bottom left, we have a model that has low variance and high bias: its predictions are consistently close to each other, but they are all off-target.

On the upper right, we have a model that has low bias, because its predictions are centred on the red dot on average, but high variance, because the points are scattered far from each other.

Finally, on the bottom right, we have the worst-case scenario: a model with both high bias and high variance, whose predictions are neither accurate nor consistent.
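The four targets can also be imitated numerically. In the sketch below, each scenario is a cloud of simulated two-dimensional "shots" described by an offset from the bullseye and a spread; these numbers, and the helper names `throw` and `bias_and_variance`, are made up for illustration. Bias is measured as the distance from the average shot to the target, and variance as the average squared distance of the shots from their own centre.

```python
import math
import random

rng = random.Random(0)
TARGET = (0.0, 0.0)   # the red dot: the true value

def throw(offset, spread, n=1000):
    """Simulate n predictions shifted by `offset` and scattered by `spread`."""
    return [(offset + rng.gauss(0, spread), offset + rng.gauss(0, spread))
            for _ in range(n)]

def bias_and_variance(points):
    cx = sum(p[0] for p in points) / len(points)
    cy = sum(p[1] for p in points) / len(points)
    bias = math.dist((cx, cy), TARGET)   # how far the average shot is off
    variance = sum((x - cx) ** 2 + (y - cy) ** 2
                   for x, y in points) / len(points)
    return bias, variance

scenarios = {
    "low bias, low variance": (0.0, 0.1),
    "low bias, high variance": (0.0, 1.0),
    "high bias, low variance": (2.0, 0.1),
    "high bias, high variance": (2.0, 1.0),
}
results = {name: bias_and_variance(throw(*params))
           for name, params in scenarios.items()}

for name, (bias, variance) in results.items():
    print(f"{name}: bias={bias:.2f}, variance={variance:.2f}")
```

Note that the two quantities are independent of each other here: shifting the cloud changes only the bias, and widening it changes only the variance.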

You might now wonder how the bias-variance trade-off relates to the underfitting and overfitting of our models.

Let’s start by looking at the picture below.

On the right, you have a simple linear model that’s been built assuming that the data follow a linear relationship. However, by looking at the graph, we notice that the model is ‘biased’ because it ignores part of the data. If that ignored part of the data happens to be important, then the predictions will be consistently wrong.

In this case, the model is said to be underfit and will likely generalize poorly when fed new data. A high-bias model therefore often leads to an underfitting problem, where the model performs poorly on the training data and generalizes poorly to a new set of data.

Our second model uses every scrap of data it can get and fits the data as closely as possible. From the graph above, we see that there is almost no bias in the model on the training data, as it passes through every point, including the ‘noise’. Models that can capture every data point like this are often very complex or flexible in nature and have been fine-tuned for a particular dataset. However, because they have been fitted so closely to one sample of data, they are bound to generalize very poorly to new data.

As a rule, the more flexible or complex a model is, the more variance and the less bias it will have.
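This rule of thumb can be checked with a small simulation. The sketch below is one possible setup, not the method used in this article: it uses a k-nearest-neighbour average as the model, where a small k gives a very flexible model and k equal to the training set size collapses to a constant. The sine curve, noise level, training size and query point `X0` are all assumed values. For each k, the model is retrained on many fresh training sets and its prediction at `X0` is recorded, which gives estimates of the squared bias and the variance.

```python
import math
import random
import statistics

rng = random.Random(2)
NOISE_SD = 0.3
N_TRAIN = 40
X0 = 1.0   # point at which predictions are compared

def sample_training_set():
    """Draw a fresh noisy training set from y = sin(x) + noise."""
    xs = [rng.uniform(0, 2 * math.pi) for _ in range(N_TRAIN)]
    return [(x, math.sin(x) + rng.gauss(0, NOISE_SD)) for x in xs]

def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def bias_sq_and_variance(k, n_repeats=400):
    """Retrain on many training sets and summarise the predictions at X0."""
    preds = [knn_predict(sample_training_set(), X0, k)
             for _ in range(n_repeats)]
    bias_sq = (statistics.fmean(preds) - math.sin(X0)) ** 2
    return bias_sq, statistics.pvariance(preds)

# k = 1 is the most flexible model, k = N_TRAIN the most rigid one.
results = {k: bias_sq_and_variance(k) for k in (1, 5, N_TRAIN)}
for k, (bias_sq, variance) in results.items():
    print(f"k={k}: bias^2={bias_sq:.3f}, variance={variance:.3f}")
```

In this setup the most flexible model (k = 1) shows the largest variance and almost no bias, while the most rigid one (k = 40) shows the opposite pattern.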

Later, we will introduce concepts that allow us to mitigate the overfitting and underfitting problems that we experience in our models.

[1] Noisy data is data that contains meaningless information.